PIE-Bench comprises 700 images featuring 10 distinct editing types. Images are evenly distributed between natural and artificial scenes (e.g., paintings) and across four categories: animal, human, indoor, and outdoor. Each image in PIE-Bench includes five annotations: source image prompt, target image prompt, editing instruction, main editing body, and the editing mask. Notably, the editing mask annotation (indicating the anticipated editing region) is crucial for accurate metric computation, as the editing is expected to occur only within the designated area.
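Because the mask marks where edits are expected, a common way to use it is to restrict pixel-wise fidelity metrics to the region outside the mask (the part of the image that should stay unchanged). Below is a minimal sketch of such a mask-restricted PSNR; the array names, shapes, and mask convention are illustrative assumptions, not PIE-Bench's actual evaluation code:

```python
import numpy as np

def masked_psnr(source: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> float:
    """PSNR computed only over pixels where mask == 0 (outside the editing region).

    source, edited: uint8 images of shape (H, W, 3); mask: (H, W) with 1 inside
    the annotated editing region. Names and conventions are illustrative.
    """
    keep = mask == 0                                 # preserve-region pixels
    diff = source.astype(np.float64) - edited.astype(np.float64)
    mse = np.mean(diff[keep] ** 2)                   # error restricted to the preserve region
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```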
DSEC is a stereo camera dataset in driving scenarios that contains data from two monochrome event cameras and two global shutter color cameras in favorable and challenging illumination conditions. In addition, we collect Lidar data and RTK GPS measurements, both hardware synchronized with all camera data. One of the distinctive features of this dataset is the inclusion of VGA-resolution event cameras. Event cameras have received increasing attention for their high temporal resolution and high dynamic range performance. However, due to their novelty, event camera datasets in driving scenarios are rare. This work presents the first high-resolution, large-scale stereo dataset with event cameras.
Real3D-AD is the first point cloud anomaly detection dataset for industrial products. It comprises a total of 1,254 samples distributed across 12 distinct categories: Airplane, Car, Candybar, Chicken, Diamond, Duck, Fish, Gemstone, Seahorse, Shell, Starfish, and Toffees. Each training sample is a realistic, high-accuracy prototype scan without blind spots.
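Performance on point cloud anomaly detection datasets of this kind is commonly summarized with AUROC over per-object (or per-point) anomaly scores. A minimal sketch assuming scikit-learn; the score and label arrays are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative arrays: one anomaly score per test object and a binary
# ground-truth label (1 = anomalous, 0 = normal). Values are made up.
scores = np.array([0.12, 0.85, 0.40, 0.91, 0.08])
labels = np.array([0, 1, 0, 1, 0])

object_auroc = roc_auc_score(labels, scores)  # object-level AUROC
print(f"object-level AUROC: {object_auroc:.3f}")
```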
The ISNotes dataset is a corpus used for fine-grained Information Status (IS) classification. IS reflects the accessibility of a discourse entity based on the evolving discourse context and the speaker’s assumption about the hearer’s knowledge and beliefs. According to Markert et al. (2012), old mentions refer to entities that have been referred to previously; mediated mentions have not been mentioned before but are accessible to the hearer by reference to another old mention or to prior world knowledge; and new mentions refer to entities that are introduced to the discourse for the first time and are not known to the hearer before.
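As a rough illustration of this coarse three-way scheme (the fine-grained ISNotes labels subdivide these categories further), the labels and a toy example can be written out as follows; the example sentence and its mapping are invented for illustration, not taken from the corpus:

```python
from enum import Enum

class InfoStatus(Enum):
    OLD = "old"            # entity already referred to earlier in the discourse
    MEDIATED = "mediated"  # not mentioned yet, but inferable from an old mention or world knowledge
    NEW = "new"            # introduced for the first time, not previously known to the hearer

# Toy example (invented): "A car stopped. The driver got out."
#   "A car"              -> InfoStatus.NEW
#   "The driver"         -> InfoStatus.MEDIATED  (inferable from "a car")
#   a later "the car"    -> InfoStatus.OLD
```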
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-capture dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the fidelity and diversity of the results.
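The masked gesture reconstruction idea can be illustrated independently of the actual EMAGE architecture: hide a random subset of gesture frames and train the model to reconstruct them from the remaining frames and the audio. A minimal, hypothetical sketch of the masking step (not the authors' implementation):

```python
import torch

def mask_gesture_frames(gestures: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly mask whole frames of a gesture sequence.

    gestures: (batch, frames, pose_dim) tensor; returns the masked sequence and
    a boolean mask marking which frames were hidden. Illustrative only.
    """
    batch, frames, _ = gestures.shape
    hidden = torch.rand(batch, frames) < mask_ratio   # True = frame is masked
    masked = gestures.clone()
    masked[hidden] = 0.0                              # replace masked frames with zeros
    return masked, hidden

# Training would feed (masked, audio) to the model and compute a reconstruction
# loss only on the hidden frames, e.g. mse(pred[hidden], gestures[hidden]).
```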
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content, versus spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are weak in substantiating the answers despite their strong QA performance. This exposes a severe limitation of these models in making reliable predictions.
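Grounded VideoQA of this kind is typically scored by comparing a predicted temporal segment against the annotated one, e.g. via temporal IoU or IoP, and only crediting an answer whose evidence overlaps sufficiently. A minimal sketch of these overlap measures; the 0.5 threshold mentioned in the comment is illustrative rather than NExT-GQA's exact protocol:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_iop(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-prediction: fraction of the predicted segment inside the ground truth."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    length = pred[1] - pred[0]
    return inter / length if length > 0 else 0.0

# Example: a prediction might count as grounded when IoP >= 0.5 (threshold illustrative).
print(temporal_iou((3.0, 8.0), (5.0, 10.0)), temporal_iop((3.0, 8.0), (5.0, 10.0)))
```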
The NumGLUE dataset is a multi-task benchmark developed by the Allen Institute for AI. It evaluates the performance of AI systems on mathematical reasoning tasks that involve numbers embedded in natural language text.
CharXiv is a comprehensive evaluation suite for testing the chart understanding capabilities of Multimodal Large Language Models (MLLMs). It was proposed to address the limitations of existing datasets, which often focus on oversimplified and homogeneous charts with template-based questions.
We construct a fine-grained video dataset organized by both semantic and temporal structures, where each structure contains two-level annotations.
4D-DRESS is the first real-world 4D dataset of human clothing, capturing 64 human outfits in more than 520 motion sequences. These sequences include a) high-quality 4D textured scans; for each scan, we annotate b) vertex-level semantic labels, thereby obtaining c) the corresponding garment meshes and fitted SMPL(-X) body meshes. In total, 4D-DRESS captures the dynamic motions of 4 dresses, 28 lower, 30 upper, and 32 outer garments. For each garment, we also provide its canonical template mesh to support future research on human clothing.
Question answering over knowledge graphs (KG-QA) is a vital topic in IR. Questions with temporal intent are a special class of practical importance, but have not received much attention in research. We present EXAQT, the first end-to-end system for answering complex temporal questions that have multiple entities and predicates, and associated temporal conditions.
Text8 is a plain-text corpus consisting of roughly the first 100 MB of a cleaned English Wikipedia dump. It is widely used as a lightweight benchmark for word embeddings, language modeling, and text compression.
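As a usage example, gensim ships a Text8Corpus reader that streams the file as token lists, which makes Text8 a convenient quick benchmark for word embeddings. A minimal sketch, assuming the unzipped corpus sits at ./text8 (the path and hyperparameters are assumptions):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Point this at the unzipped text8 file (path is an assumption).
sentences = Text8Corpus("./text8")

# Small illustrative hyperparameters, not a tuned configuration.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
print(model.wv.most_similar("king", topn=5))
```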
The CUHK Face Sketch FERET (CUFSF) dataset is for research on face sketch synthesis and face sketch recognition. It contains two types of face images: photos and sketches. In total, 1,194 face photos (one per subject) were collected from the FERET dataset with lighting variations. For each subject, a sketch is drawn with shape exaggeration.
The iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition was organized by iPinYou from April 1, 2013 to December 31, 2013. The competition was divided into three seasons. For each season, a training dataset was released to the competition participants, while the testing dataset was reserved by iPinYou. The complete testing dataset was randomly divided into two parts: one part served as the leaderboard testing dataset used to score and rank the participating teams on the leaderboard, and the other part was reserved for the final offline evaluation. Each participant's last offline submission was evaluated on the reserved testing dataset to obtain the team's final offline score. This dataset contains the training datasets and leaderboard testing datasets of all three seasons; the reserved testing datasets are withheld by iPinYou. The training data consist of processed iPinYou DSP bidding, impression, click, and conversion logs.
The Django dataset is a dataset for code generation comprising 16,000 training, 1,000 development, and 1,805 test annotations. Each data point consists of a line of Python code paired with a manually created natural language description.
The ECGs in this collection were obtained using a non-commercial PTB prototype recorder.
A new dataset of diverse traffic accidents.
The AMR Bank is a set of English sentences paired with simple, readable semantic representations (Abstract Meaning Representations). Version 3.0, released in 2020, consists of 59,255 sentences.
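For concreteness, the textbook AMR for "The boy wants to go" is a rooted, directed graph written in PENMAN notation. The sketch below parses it with the third-party penman package (using that package is an assumption, not part of the AMR Bank release; the frame sense indices are approximate):

```python
import penman  # third-party PENMAN notation parser; its use here is an assumption

# "The boy wants to go" in PENMAN notation (a standard AMR example;
# sense indices such as -01 are approximate).
amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
"""

graph = penman.decode(amr)
for source, role, target in graph.triples:
    print(source, role, target)
```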
The SKU110K dataset provides 11,762 images with more than 1.7 million annotated bounding boxes captured in densely packed scenarios, including 8,233 images for training, 588 images for validation, and 2,941 images for testing. There are around 1,733,678 instances in total. The images were collected from thousands of supermarket stores and vary in scale, viewing angle, lighting conditions, and noise level. All images are resized to a resolution of one megapixel. Most instances in the dataset are tightly packed and typically oriented within the range of [−15°, 15°].
The LOCATA dataset is a dataset for acoustic source localization. It consists of real-world ambisonic speech recordings with optically tracked azimuth-elevation labels.
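Localization accuracy against such azimuth-elevation labels is commonly reported as the angular error between estimated and ground-truth directions of arrival. A minimal sketch that converts the angles to unit vectors and measures the angle between them; the coordinate convention and function name are illustrative:

```python
import numpy as np

def angular_error_deg(az_pred, el_pred, az_true, el_true):
    """Great-circle angle (degrees) between two directions given as azimuth/elevation in degrees."""
    def unit(az, el):
        az, el = np.radians(az), np.radians(el)
        return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
    cos_angle = np.clip(np.dot(unit(az_pred, el_pred), unit(az_true, el_true)), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))

print(angular_error_deg(30.0, 10.0, 35.0, 12.0))  # small error, a few degrees
```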