19,997 machine learning datasets
This dataset provides a new split of VQA v2 (similar to VQA-CP v2), consisting of questions that are hard for biased models to answer.
BiSECT is a dataset for sentence simplification, which is the ability to take a long, complex sentence and split it into shorter sentences, rephrasing as necessary. BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. These were obtained by extracting 1-2 sentence alignments in bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language.
TIAGE is a topic-shift aware dialog benchmark constructed utilizing human annotations on topic shifts. Based on TIAGE, three tasks can be conducted to investigate different scenarios of topic-shift modeling in dialog settings: topic-shift detection, topic-shift triggered response generation and topic-aware dialog generation.
Panoptic nuScenes is a benchmark dataset that extends the popular nuScenes dataset with point-wise groundtruth annotations for semantic segmentation, panoptic segmentation, and panoptic tracking tasks.
CARL (context adaptive RL) provides highly configurable contextual extensions to several well-known RL environments. It's designed to test your agent's generalization capabilities in all scenarios where intra-task generalization is important.
The datasets introduced in Chapter 6 of my PhD thesis are below. See the thesis for more details. If you use any of these datasets for research purposes you should use the following citation in any resulting publications:
Five curated datasets of one-liner commits from open-source projects. In total, they comprise 58,069 one-liner commits.
NTU RGB+D 2D is a curated version of NTU RGB+D often used for skeleton-based action prediction and synthesis. It contains fewer actions than the original NTU RGB+D.
LSVTD is a large-scale video text dataset intended to advance video text spotting research. It contains 100 text videos from 22 different real-life scenarios: 13 indoor (e.g., bookstore, shopping mall) and 9 outdoor, which is more than 3 times the diversity of IC15.
ACAV100M processes 140 million full-length videos (total duration 1,030 years) which are used to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the current largest video dataset used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
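The clip count and durations quoted for ACAV100M can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming a 365.25-day year; the variable names are illustrative and not taken from the dataset's own tooling.

```python
# Assumption: one year = 365.25 days (Julian year).
SECONDS_PER_YEAR = 365.25 * 24 * 3600

num_clips = 100_000_000   # 100 million clips, as stated in the description
clip_seconds = 10         # each clip is 10 seconds long

# Total duration of all clips, expressed in years.
total_years = num_clips * clip_seconds / SECONDS_PER_YEAR

howto100m_years = 15      # duration quoted in the description for HowTo100M

print(f"ACAV100M duration: {total_years:.1f} years")
print(f"Ratio to HowTo100M: {total_years / howto100m_years:.1f}x")
```

Under these assumptions the total comes to roughly 31.7 years, in line with the quoted 31 years and about twice the 15 years quoted for HowTo100M.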
GINC (Generative In-Context learning Dataset) is a small-scale synthetic dataset for studying in-context learning. The pretraining data is generated by a mixture of HMMs and the in-context learning prompt examples are also generated from HMMs (either from the mixture or not). The prompt examples are out-of-distribution with respect to the pretraining data since every example is independent, concatenated, and separated by delimiters. The GitHub repository provides code to generate GINC-style datasets of varying vocabulary sizes, number of HMMs, and other parameters.
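The generation scheme described for GINC can be illustrated with a toy sampler. This is a minimal sketch under stated assumptions, not the code from the GINC repository: the function names, the random HMM parameters, the uniform choice over the mixture, and the use of an out-of-vocabulary integer as the delimiter are all hypothetical.

```python
import random

def make_hmm(num_states, vocab_size, rng):
    """Build a random HMM (hypothetical parameters, for illustration only)."""
    def random_dist(n):
        weights = [rng.random() + 1e-3 for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]
    return {
        "start": random_dist(num_states),
        "trans": [random_dist(num_states) for _ in range(num_states)],
        "emit": [random_dist(vocab_size) for _ in range(num_states)],
    }

def sample_sequence(hmm, length, rng):
    """Sample `length` tokens by walking the hidden-state chain."""
    states = list(range(len(hmm["start"])))
    tokens = list(range(len(hmm["emit"][0])))
    state = rng.choices(states, weights=hmm["start"])[0]
    out = []
    for _ in range(length):
        out.append(rng.choices(tokens, weights=hmm["emit"][state])[0])
        state = rng.choices(states, weights=hmm["trans"][state])[0]
    return out

def sample_pretraining_doc(mixture, length, rng):
    """Pretraining document: pick one HMM from the mixture, sample one long sequence."""
    return sample_sequence(rng.choice(mixture), length, rng)

def sample_prompt(hmm, num_examples, example_length, delimiter, rng):
    """Prompt: independent short examples, concatenated and separated by a delimiter."""
    prompt = []
    for _ in range(num_examples):
        prompt.extend(sample_sequence(hmm, example_length, rng))
        prompt.append(delimiter)
    return prompt

rng = random.Random(0)
vocab_size = 10
mixture = [make_hmm(num_states=4, vocab_size=vocab_size, rng=rng) for _ in range(3)]
doc = sample_pretraining_doc(mixture, length=50, rng=rng)
# Assumption: the delimiter is an extra token id just outside the vocabulary.
prompt = sample_prompt(mixture[0], num_examples=3, example_length=5,
                       delimiter=vocab_size, rng=rng)
```

Because each prompt example restarts the hidden-state chain independently, the concatenated prompt does not look like a single continuous pretraining document, which is the distribution mismatch the description refers to.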
The parameter space of deployed video is continuously growing to provide more realistic and immersive experiences to global streaming and social media viewers. However, increases in video parameters such as spatial resolution or frame rate are inevitably associated with larger data volumes. Transmitting increasingly voluminous videos through limited-bandwidth networks in a perceptually optimal way is a current challenge affecting billions of viewers. One recent practice adopted by video service providers is space-time resolution adaptation in conjunction with video compression. Consequently, it is important to understand how different levels of space-time subsampling and compression affect the perceptual quality of videos. Towards making progress in this direction, we constructed a large new resource, called the ETRI-LIVE Space-Time Subsampled Video Quality (ETRI-LIVE-STSVQ) database, containing 437 videos generated by applying various levels of combined space-time subsampling and
BOVText (Bilingual, Open World Video Text) is a new benchmark dataset, described as the first large-scale and multilingual benchmark for video text spotting in a variety of scenarios. All data are collected from KuaiShou and YouTube.
The IWSLT 2017 translation dataset.
Ubisoft La Forge Animation Dataset ("LAFAN1"): animation dataset and accompanying code for the SIGGRAPH 2020 paper "Robust Motion In-betweening".
PhoMT is a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs for machine translation.
The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions.
English subset of the SLAKE dataset, comprising 642 images and more than 7,000 question–answer pairs.
Type Inference dataset for TypeScript. Click on DOI tag for dataset files.
Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so this benchmark can not only test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper "Training and Evaluating a Jupyter Notebook Data Science Assistant" for more details about state-of-the-art results and other properties of the dataset.