3,148 machine learning datasets
3,148 dataset results
iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic. Each sarcastic tweet is further labelled for one of the following types of ironic speech:
Extracted from the Tashkeela Corpus, the dataset consists of 55K lines containing about 2.3M words.
CLEVR-Ref+ is a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through sampling strategy), and the modular programs enable intermediate reasoning ground truth without human annotators.
DialoGLUE is a natural language understanding benchmark for task-oriented dialogue designed to encourage dialogue research in representation-based transfer, domain adaptation, and sample-efficient task learning. It consisting of 7 task-oriented dialogue datasets covering 4 distinct natural language understanding tasks.
FreebaseQA is a data set for open-domain QA over the Freebase knowledge graph. The question-answer pairs in this data set are collected from various sources, including the TriviaQA data set and other trivia websites (QuizBalls, QuizZone, KnowQuiz), and are matched against Freebase to generate relevant subject-predicate-object triples that were further verified by human annotators. As all questions in FreebaseQA are composed independently for human contestants in various trivia-like competitions, this data set shows richer linguistic variation and complexity than existing QA data sets, making it a good test-bed for emerging KB-QA systems.
Japanese-English Subtitle Corpus is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web.
LABR is a large sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars.
The Norwegian Review Corpus (NoReC) was created for the purpose of training and evaluating models for document-level sentiment analysis. More than 43,000 full-text reviews have been collected from major Norwegian news sources and cover a range of different domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, as provided by the rating of the original author.
Question-Answer Meaning Representation (QAMR) represents a predicate-argument structure of a sentence with a set of question-answer pairs, so that annotations can be easily provided by non-experts. QAMR is a dataset of over 5,000 sentences and 100,000 questions created by crowdsourcing workers.
VQA-E is a dataset for Visual Question Answering with Explanation, where the models are required to generate and explanation with the predicted answer. The VQA-E dataset is automatically derived from the VQA v2 dataset by synthesizing a textual explanation for each image-question-answer triple.
Quasimodo is commonsense knowledge base that focuses on salient properties of objects. We provide several subsets:
Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
ConvQuestions is the first realistic benchmark for conversational question answering over knowledge graphs. It contains 11,200 conversations which can be evaluated over Wikidata. They are compiled from the inputs of 70 Master crowdworkers on Amazon Mechanical Turk, with conversations from five domains: Books, Movies, Soccer, Music, and TV Series. The questions feature a variety of complex question phenomena like comparisons, aggregations, compositionality, and temporal reasoning. Answers are grounded in Wikidata entities to enable fair comparison across diverse methods. The data gathering setup was kept as natural as possible, with the annotators selecting entities of their choice from each of the five domains, and formulating the entire conversation in one session. All questions in a conversation are from the same Turker, who also provided gold answers to the questions. For suitability to knowledge graphs, questions were constrained to be objective or factoid in nature, but no other r
BUG is a large-scale gender bias dataset of 108K diverse real-world English sentences, sampled semiautomatically from large corpora using lexical syntactic pattern matching
MMDialog is a large-scale multi-turn dialogue dataset containing multi-modal open-domain conversations derived from real human-human chat content in social media. MMDialog contains 1.08M dialogue sessions and 1.53M associated images. On average, one dialogue session has 2.59 images, which can be located anywhere at any conversation turn.
How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. STAR Benchmark is a novel benchmark for Situated Reasoning, which provides 60K challenging situated questions in four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations. (Data Download, STAR Leaderboard)
Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has halted similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 768,826 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models), handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets, from other sources, including Twitter, research papers, and the internet in general, to create an even larger dat
This dataset is based on MS COCO that have 20% of data randomly shuffled to simulate noisy correspondence.
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
A temporal counterfactual dataset composing of 1000 short and natural video-caption pairs.