3,148 machine learning datasets
3,148 dataset results
This is the small version of the MuMiN dataset.
This is the medium version of the MuMiN dataset.
This is the large version of the MuMiN dataset.
TraVLR is a synthetic dataset comprising four visio-linguistic reasoning tasks. Each example encodes the scene bimodally such that either modality can be dropped during training/testing with no loss of relevant information. TraVLR's training and testing distributions are also constrained along task-relevant dimensions, enabling the evaluation of out-of-distribution generalisation.
Sign Language Datasets for French Belgian Sign Language This dataset is built upon the work of Belgian linguists from the University of Namur. During eight years, they've collected and annotated 50 hours of videos depicting sign language conversation. 100 signers were recorded, making it one of the most representative sign language corpus. The annotation has been sanitized and enriched with metadata to construct two, easy to use, datasets for sign language recognition. One for continuous sign language recognition and the other for isolated sign recognition.
The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850 hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speaker post conversation reflections.
NELA-GT-2021 is the fourth installment of the NELA-GT datasets, NELA-GT-2021. The dataset contains 1.8M articles from 367 outlets between January 1st, 2021 and December 31st, 2021. Just as in past releases of the dataset, NELA-GT-2021 includes outlet-level veracity labels from Media Bias/Fact Check and tweets embedded in collected news articles.
This dataset is for evaluating the task of Black-box Multi-agent Integration which focuses on combining the capabilities of multiple black-box conversational agents at scale. It provides data to explore two main frameworks of exploration: question agent pairing and question response pairing.
This is a set of debiased Natural Language Inference (NLI) datasets produced by the paper Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets. The datasets are constructed by augmenting SNLI or MNLI with data samples that are generated to mitigate the spurious correlations in the original datasets. Please visit this repository for more details.
The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content characteristics, and will serve as evaluation basis for the Video Browser Showdown 2019-2021 and TREC Video Retrieval (TRECVID) Ad-Hoc Video Search tasks 2019-2021. The dataset comes with a shot segmentation (around 1 million shots) for which we analyze content specifics and statistics. Our research shows that the content of V3C1 is very diverse, has no predominant characteristics and provides a low self-similarity. Thus it is very well suited for video retrieval evaluations as well as for participants of TRECVID AVS or the VBS.
The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content characteristics, The test video collection for TRECVID-AVS2019-TRECVID-AVS2021, which contains 1,082,649 web video clips, with even more diverse content, no predominant characteristics and low self-similarity.
The Room environment - v0
1.9K Korean Online Hate Speech Comments for Multilabel Classification (Annotated by Three Independent Labelers per Data)
SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions written by native-speaking expert data annotators.
APEACH is the first crowd-generated Korean evaluation dataset for hate speech detection. Sentences of the dataset are created by anonymous participants using an online crowdsourcing platform DeepNatural AI.
This dataset contain online reviews gathered from google reviews written by north american casino users. explain motivations and summary of its content. Can be used to study user experience and relative research directions such as cultural impacts on latency of aspects, domain importance, sentiment analysis, opinion mining, aspect-based sentiment analysis, etc.
The dataset contains 60,000 Stack Overflow questions from 2016-2020, classified into three categories: