19,997 machine learning datasets
A quantitative benchmark for video understanding: a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired.
Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots and metadata. Moviescope is based on the IMDB 5000 dataset, which consists of 5,043 movie records. It is augmented by crawling video trailers associated with each movie from YouTube and text plots from Wikipedia.
The Natural-Color Dataset (NCD) is an image colorization dataset where images are true to their colors. For example, a carrot will have an orange color in most images. Bananas will be either greenish or yellowish. It contains 723 images from the internet distributed in 20 categories. Each image has an object and a white background.
Phenotype-Gene Relations (PGR) is a corpus that consists of 1,712 abstracts, 5,676 human phenotype annotations, 13,835 gene annotations, and 4,283 relations.
Q-Traffic is a large-scale traffic prediction dataset, which consists of three sub-datasets: query sub-dataset, traffic speed sub-dataset and road network sub-dataset.
10,000 news items collected from a social network in Vietnam.
The Relative Size dataset contains 486 object pairs between 41 physical objects. Size comparisons are not available for all pairs of objects (e.g. bird and watermelon) because for some pairs humans cannot determine which object is bigger.
RELX is a benchmark dataset for cross-lingual relation classification in English, French, German, Spanish and Turkish.
A dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. The dataset comprises 1000 video clips of driving without any bias towards text and with annotations for text bounding boxes and transcriptions in every frame.
Dataset with 625,000 ethical judgments over 32,000 real-life anecdotes. Each anecdote recounts a complex ethical situation, often posing moral dilemmas, paired with a distribution of judgments contributed by the community members.
Includes 4,405 images with 111,251 heads annotated.
This corpus has been collected from free or free-for-research sources on the Internet:
A dataset of utterances, incorrect SQL interpretations and the corresponding natural language feedback.
A set of approximately 100K podcast episodes comprised of raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio, and is an order of magnitude larger than previous speech-to-text corpora.
Images of the 10 most popular clothing/wearable brand-name logos, captured in rich visual context.
TutorialBank is a publicly available dataset which aims to facilitate NLP education and research. The dataset consists of links to over 6,300 high-quality resources on NLP and related fields. The corpus’s magnitude, manual collection and focus on annotation for education in addition to research differentiate it from other corpora.
Twitter News URL Corpus is the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and the first cross-domain benchmark for automatic paraphrase identification.
The Urban Environments dataset is a dataset of 20 land use classes across 300 European cities paired with satellite imagery data.
Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.
Aims to facilitate research in caricature recognition. All the caricatures and face images were collected from the Web. Compared with two existing datasets, this dataset is much more challenging, with a much greater number of available images, artistic styles and larger intra-personal variations.