19,997 machine learning datasets
XD-Violence is a large-scale audio-visual dataset for violence detection in videos.
CCNet is a dataset extracted from Common Crawl with a different filtering process than OSCAR's. It was built using a language model trained on Wikipedia in order to filter out low-quality text such as code or tables. CCNet contains longer documents on average than OSCAR, with smaller (and often noisier) documents weeded out.
The Implicit Hate corpus is a dataset for hate speech detection with fine-grained labels for each message and its implication. This dataset contains 22,056 tweets from the most prominent extremist groups in the United States; 6,346 of these tweets contain implicit hate speech.
WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, for 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from the corresponding video captions.
We introduce a dataset of 147 object categories containing over 6,000 images suitable for the few-shot counting task. We collected and annotated the images ourselves. Our dataset consists of 6,135 images across a diverse set of 147 object categories, from kitchen utensils and office stationery to vehicles and animals. The object count in our dataset varies widely, from 7 to 3,731 objects, with an average count of 56 objects per image. In each image, each object instance is annotated with a dot at its approximate center. In addition, three object instances are selected randomly as exemplar instances; these exemplars are also annotated with axis-aligned bounding boxes.
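To make the annotation format concrete, here is a minimal sketch of what one image's record could look like under this scheme (the field names and values are illustrative assumptions, not the dataset's actual schema):

```python
# Hypothetical annotation record for one image in a few-shot counting
# dataset. Field names are illustrative, not the dataset's real schema.
annotation = {
    "image": "img_0001.jpg",
    "category": "apples",
    # one (x, y) dot at the approximate center of every object instance
    "points": [(34.5, 60.2), (102.0, 58.7), (77.3, 140.1)],
    # three randomly chosen exemplars as axis-aligned boxes (x1, y1, x2, y2)
    "exemplar_boxes": [(20, 50, 49, 72), (90, 45, 115, 70), (65, 128, 90, 152)],
}

# The ground-truth count is simply the number of dot annotations.
count = len(annotation["points"])
print(count)  # 3
```

The exemplar boxes tell a few-shot counting model what to count; the dots provide the per-instance ground truth used to supervise and score it.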
SQA3D is a dataset for embodied scene understanding, in which an agent must comprehend the scene it is situated in from a first-person perspective and answer questions about it. The questions are designed to be situated, embodied, and knowledge-intensive. We offer three different modalities to represent a 3D scene: 3D scan, egocentric video, and bird's-eye-view (BEV) picture.
BLINK is a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not covered by other evaluations.
The Stanford Dogs dataset contains 20,580 images of 120 classes of dogs from around the world, which are divided into 12,000 images for training and 8,580 images for testing.
The Slashdot dataset is a relational dataset obtained from Slashdot, a technology-related news website known for its specific user community. The website features user-submitted, editor-evaluated news that is primarily technology-oriented. In 2002, Slashdot introduced the Slashdot Zoo feature, which allows users to tag each other as friends or foes. The network contains friend/foe links between Slashdot users and was obtained in February 2009.
Composition-1K is a large-scale image matting dataset including 49,300 training images and 1,000 testing images.
WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by introducing synthetic reverberation to the speech sources in addition to the existing noise. Room impulse responses were generated and convolved using pyroomacoustics. Reverberation times were chosen to approximate domestic and classroom environments (expected to be similar to the restaurants and coffee shops where the WHAM! noise was collected), and further classified as high, medium, and low reverberation based on a qualitative assessment of the mixture’s noise recording.
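The reverberation step described above amounts to convolving each dry source with a room impulse response (RIR). WHAMR! generated its RIRs with pyroomacoustics; the sketch below uses a toy exponentially decaying noise tail as a stand-in RIR, which is purely an illustrative assumption:

```python
import numpy as np

# Sketch of adding synthetic reverberation to a dry source by convolving
# it with a room impulse response. The RIR here is a toy stand-in:
# white noise shaped by an exponential decay reaching -60 dB at t60.
fs = 16000
rng = np.random.default_rng(0)

dry = rng.standard_normal(fs)             # 1 s of stand-in "speech"
t60 = 0.3                                 # low reverberation time, seconds
n = int(t60 * fs)
decay = np.exp(-6.9 * np.arange(n) / n)   # ~60 dB of decay over t60
rir = rng.standard_normal(n) * decay
rir[0] = 1.0                              # direct path

wet = np.convolve(dry, rir)               # reverberant source
print(wet.shape)  # (len(dry) + len(rir) - 1,)
```

Varying t60 per mixture is what lets the corpus span the "high, medium, and low reverberation" conditions it describes.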
Includes 4,000 images: 200 from each of 20 categories covering different types of scenes, such as Cartoons, Art, Objects, Low resolution images, Indoor, Outdoor, Jumbled, Random, and Line drawings.
The EYEDIAP dataset is a dataset for gaze estimation from remote RGB and RGB-D (standard vision and depth) cameras. The recording methodology was designed to systematically include, and isolate, most of the variables that affect remote gaze estimation algorithms.
The TotalCapture dataset consists of 5 subjects performing several activities, such as walking, acting, a range-of-motion (ROM) sequence, and freestyle motions, recorded using 8 calibrated, static HD RGB cameras and 13 IMUs attached to the head, sternum, waist, upper arms, lower arms, upper legs, lower legs, and feet; the IMU data is not required for the authors' experiments. The dataset has publicly released foreground mattes and RGB images. Ground-truth poses are obtained using a marker-based motion capture system with markers less than 5 mm in size. All data is synchronised at a frame rate of 60 Hz, providing ground-truth poses as joint positions.
The Wireframe dataset consists of 5,462 images (5,000 for training, 462 for test) of indoor and outdoor man-made scenes.
The SciDocs evaluation framework consists of a suite of document-level evaluation tasks.
The PGM dataset serves as a tool for studying both abstract reasoning and generalisation in models. Generalisation is a multi-faceted phenomenon; there is no single, objective way in which models can or should generalise beyond their experience. The PGM dataset provides a means to measure the generalisation ability of models in different ways, each of which may be more or less interesting to researchers depending on their intended training setup and applications.
Europarl-ST is a multilingual Spoken Language Translation corpus containing paired audio-text samples for SLT from and into 9 European languages, for a total of 72 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012.
Lost and Found is a lost-cargo image sequence dataset comprising more than two thousand frames with pixel-wise annotations of obstacles and free space, together with a thorough comparison to several stereo-based baseline methods. The dataset has been made available to the community to foster further research on this important topic.
RELLIS-3D is a multi-modal dataset for off-road robotics. It was collected in an off-road environment and contains annotations for 13,556 LiDAR scans and 6,235 images. The data was collected on the Rellis Campus of Texas A&M University and presents challenges to existing algorithms related to class imbalance and environmental topography. The dataset also provides full-stack sensor data in ROS bag format, including RGB camera images, LiDAR point clouds, a pair of stereo images, high-precision GPS measurements, and IMU data.