19,997 machine learning datasets
19,997 dataset results
Created from endoscopic video feeds of real-world surgical procedures. Overall, the data consists of 307 images, each of which is annotated for the organs and different surgical instruments present in the scene.
A new dataset on the baseball domain.
The MLB-YouTube dataset is a new, large-scale dataset consisting of 20 baseball games from the 2017 MLB post-season available on YouTube with over 42 hours of video footage. The dataset consists of two components: segmented videos for activity recognition and continuous videos for activity classification. It is quite challenging as it is created from TV broadcast baseball games where multiple different activities share the camera angle. Further, the motion/appearance difference between the various activities is quite small.
The MLQE dataset is a dataset for sentence-level Machine Translation Quality Estimation. It consists of 6 language pairs representing NMT training in high, medium, and low-resource scenarios. The corpus is extracted from Wikipedia, and 10K segments per language pair are annotated.
Contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation.
A dataset which provides detailed annotations for activity recognition.
N-Digit MNIST is a multi-digit MNIST-like dataset.
Contains 13.6k masked-word-prediction probes, 10.5k for fine-tuning and 3.1k for testing.
The NYU Symmetry database contains 176 single-symmetry and 63 multiple-symmetry images (.png files) with accompanying ground-truth annotations (.mat files). Also included are a .m file to visualize the annotations on top of the images, and a .txt file with instructions on how to interpret the .mat annotations.
OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.
Provides consistent ID annotations across multiple days, making it suitable for the extremely challenging problem of person search, i.e., where no clothing information can be reliably used. Apart this feature, the P-DESTRE annotations enable the research on UAV-based pedestrian detection, tracking, re-identification and soft biometric solutions.
PedX is a large-scale multi-modal collection of pedestrians at complex urban intersections. The dataset provides high-resolution stereo images and LiDAR data with manual 2D and automatic 3D annotations. The data was captured using two pairs of stereo cameras and four Velodyne LiDAR sensors.
PerSenT is a dataset of crowd-sourced annotations of the sentiment expressed by the authors towards the main entities in news articles. The dataset also includes paragraph-level sentiment annotations to provide more fine-grained supervision for the task.
PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities.
Contains 10,000 fine-grained SKU-level products frequently bought by online customers in JD.com.
A dataset composed of 12 different 3D scenes and RGB sequences of 20 subjects moving in and interacting with the scenes.
A dataset which contains over 200 apartments.
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains, which is larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35%, respectively. Each dialogue is labelled with comprehensive dialogue annotations, including dialogue goal in the form of natural language description, domain, dialogue states and acts at both the user and system side. In addition to traditional dialogue annotations, it also includes linguistic annotations on discourse phenomena, e.g., ellipsis and coreference, in dialogues, which are useful for dialogue coreference and ellipsis resolution tasks.
RuSentRel is a corpus of analytical articles translated into Russian texts in the domain of international politics obtained from foreign authoritative sources. The collected articles contain both the author's opinion on the subject matter of the article and a large number of references mentioned between the participants of the described situations. In total, 73 large analytical texts were labeled with about 2000 relations.
Simitate is a hybrid benchmarking suite targeting the evaluation of approaches for imitation learning. It consists on a dataset containing 1938 sequences where humans perform daily activities in a realistic environment. The dataset is strongly coupled with an integration into a simulator. RGB and depth streams with a resolution of 960×540 at 30Hz and accurate ground truth poses for the demonstrator's hand, as well as the object in 6 DOF at 120Hz are provided. Along with the dataset the 3D model of the used environment and labelled object images are also provided.