19,997 machine learning datasets
19,997 dataset results
VQG is a collection of datasets for visual question generation. VQG questions were collected by crowdsourcing the task on Amazon Mechanical Turk (AMT). The authors provided details on the prompt and the specific instructions for all the crowdsourcing tasks in this paper in the supplementary material. The prompt was successful at capturing nonliteral questions. Images were taken from the MSCOCO dataset.
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
WikiTableQuestions is a question answering dataset over semi-structured tables. It is comprised of question-answer pairs on HTML tables, and was constructed by selecting data tables from Wikipedia that contained at least 8 rows and 5 columns. Amazon Mechanical Turk workers were then tasked with writing trivia questions about each table. WikiTableQuestions contains 22,033 questions. The questions were not designed by predefined templates but were hand crafted by users, demonstrating high linguistic variance. Compared to previous datasets on knowledge bases it covers nearly 4,000 unique column headers, containing far more relations than closed domain datasets and datasets for querying knowledge bases. Its questions cover a wide range of domains, requiring operations such as table lookup, aggregation, superlatives (argmax, argmin), arithmetic operations, joins and unions.
VeRi-776 is a vehicle re-identification dataset which contains 49,357 images of 776 vehicles from 20 cameras. The dataset is collected in the real traffic scenario, which is close to the setting of CityFlow. The dataset contains bounding boxes, types, colors and brands.
CoNLL-2014 will continue the CoNLL tradition of having a high profile shared task in natural language processing. This year's shared task will be grammatical error correction, a continuation of the CoNLL shared task in 2013. A participating system in this shared task is given short English texts written by non-native speakers of English. The system detects the grammatical errors present in the input texts, and returns the corrected essays. The shared task in 2014 will require a participating system to correct all errors present in an essay (i.e., not restricted to just five error types in 2013). Also, the evaluation metric will be changed to F0.5, weighting precision twice as much as recall.
Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.
Reddit-5K is a relational dataset extracted from Reddit.
MPIIGaze is a dataset for appearance-based gaze estimation in the wild. It contains 213,659 images collected from 15 participants during natural everyday laptop use over more than three months. It has a large variability in appearance and illumination.
OPUS-100 is an English-centric multilingual corpus covering 100 languages. It was randomly sampled from the OPUS collection.
Over a period of three years (2009 - 2011) the daily news and weather forecast airings of the German public tv-station PHOENIX featuring sign language interpretation have been recorded and the weather forecasts of a subset of 386 editions have been transcribed using gloss notation. Furthermore, we used automatic speech recognition with manual cleaning to transcribe the original German speech. As such, this corpus allows to train end-to-end sign language translation systems from sign language video input to spoken language.
RadGraph is a dataset of entities and relations in radiology reports based on our novel information extraction schema, consisting of 600 reports with 30K radiologist annotations and 221K reports with 10.5M automatically generated annotations.
OPV2V is a large-scale open simulated dataset for Vehicle-to-Vehicle perception. It contains over 70 interesting scenes, 11,464 frames, and 232,913 annotated 3D vehicle bounding boxes, collected from 8 towns in CARLA and a digital town of Culver City, Los Angeles.
The BRATS2017 dataset. It contains 285 brain tumor MRI scans, with four MRI modalities as T1, T1ce, T2, and Flair for each scan. The dataset also provides full masks for brain tumors, with labels for ED, ET, NET/NCR. The segmentation evaluation is based on three tasks: WT, TC and ET segmentation.
Multilingual LibriSpeech is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages - English, German, Dutch, Spanish, French, Italian, Portuguese, Polish. It includes about 44.5K hours of English and a total of about 6K hours for other languages.
Open Catalyst 2020 is a dataset for catalysis in chemical engineering. Focusing on molecules that are important in renewable energy applications, the OC20 data set comprises over 1.3 million relaxations of molecular adsorptions onto surfaces, the largest data set of electrocatalyst structures to date.
PRW is a large-scale dataset for end-to-end pedestrian detection and person recognition in raw video frames. PRW is introduced to evaluate Person Re-identification in the Wild, using videos acquired through six synchronized cameras. It contains 932 identities and 11,816 frames in which pedestrians are annotated with their bounding box positions and identities.
Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).
The CMU CoNaLa, the Code/Natural Language Challenge dataset is a joint project from the Carnegie Mellon University NeuLab and Strudel labs. Its purpose is for testing the generation of code snippets from natural language. The data comes from StackOverflow questions. There are 2379 training and 500 test examples that were manually annotated. Every example has a natural language intent and its corresponding python snippet. In addition to the manually annotated dataset, there are also 598,237 mined intent-snippet pairs. These examples are similar to the hand-annotated ones except that they contain a probability if the pair is valid.
Kubric is a data generation pipeline for creating semi-realistic synthetic multi-object videos with rich annotations such as instance segmentation masks, depth maps, and optical flow.