Datasets

3,148 machine learning datasets

3,148 dataset results

TweetQA

With social media becoming increasingly popular on which lots of news and real-time events are reported, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text like news and Wikipedia, the first large-scale dataset for QA over social media data is presented. To make sure the tweets are meaningful and contain interesting information, tweets used by journalists to write news articles are gathered. Then human annotators are asked to write questions and answers upon these tweets. Unlike other QA datasets like SQuAD in which the answers are extractive, the answer are allowed to be abstractive. The task requires model to read a short tweet and a question and outputs a text phrase (does not need to be in the tweet) as the answer.

19 papers2 benchmarksTexts

Qulac

A dataset on asking Questions for Lack of Clarity in open-domain information-seeking conversations. Qulac presents the first dataset and offline evaluation framework for studying clarifying questions in open-domain information-seeking conversational search systems.

19 papers0 benchmarksTexts

Torque

Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.

19 papers3 benchmarksTexts

Touchdown Dataset

Touchdown is a corpus for executing navigation instructions and resolving spatial descriptions in visual real-world environments. The task is to follow instruction to a goal position and there find a hidden object, Touchdown the bear.

19 papers1 benchmarksTexts

Taskmaster-1

Taskmaster-1 is a dialog dataset consisting of 13,215 task-based dialogs in English, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.

19 papers0 benchmarksDialog, Texts

Funcom

Funcom is a collection of ~2.1 million Java methods and their associated Javadoc comments. This data set was derived from a set of 51 million Java methods and only includes methods that have an associated comment, comments that are in the English language, and has had auto-generated files removed. Each method/comment pair also has an associated method_uid and project_uid so that it is easy to group methods by their parent project.

19 papers0 benchmarksTexts

BiasBios (Bias in Bios)

The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to find the names, pronouns, and occupations. Twenty-eight most frequent occupations were identified based on their appearances. The resulting dataset consists of 397,340 biographies spanning twenty-eight different occupations. Of these occupations, the professor is the most frequent, with 118,400 biographies, while the rapper is the least frequent, with 1,406 biographies. Important information about the biographies: 1. The longest biography is 194 tokens, while the shortest is eighteen; the median biography length is seventy-two tokens. 2. It should be noted that the demographics of online biographies’ subjects differ from those of the overall workforce and that this dataset does not contain all biographies on the Internet.

19 papers1 benchmarksTexts

SUTD-TrafficQA

SUTD-TrafficQA (Singapore University of Technology and Design - Traffic Question Answering) is a dataset which takes the form of video QA based on 10,080 in-the-wild videos and annotated 62,535 QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios. Specifically, the dataset proposes 6 challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate the reasoning capability over different kinds of complex yet practical traffic events.

19 papers2 benchmarksTexts, Videos

ChFinAnn

Ten years (2008-2018) ChFinAnn documents and human-summarized event knowledge bases to conduct the DS-based event labeling. Five event types included: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO) and Equity Pledge (EP), which belong to major events required to be disclosed by the regulator and may have a huge impact on the company value. To ensure the labeling quality, the authors set constraints for matched document-record pairs.

19 papers1 benchmarksTexts

Mr. TYDI

Mr. TyDi is a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data.

19 papers0 benchmarksTexts

UMLS (Unified Medical Language System)

The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding systems. Its purpose is to foster the creation of more effective and interoperable biomedical information systems and services, including electronic health records. Here are the key aspects of the UMLS:

19 papers2 benchmarksGraphs, Texts

CoAuthor

CoAuthor is a dataset designed for revealing GPT-3's capabilities in assisting creative and argumentative writing. CoAuthor captures rich interactions between 63 writers and four instances of GPT-3 across 1445 writing sessions.

19 papers1 benchmarksTexts

FoCus (Call for Customized Conversation: Customized Conversation Grounding Persona and Knowledge)

We introduce a new dataset, called FoCus, that supports knowledge-grounded answers that reflect user’s persona. One of the situations in which people need different types of knowledge, based on their preferences, occurs when they travel around the world.

19 papers0 benchmarksTexts

QAMPARI

QAMPARI is an ODQA benchmark, where question answers are lists of entities, spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.

19 papers0 benchmarksTexts

DDXPlus (DDXPlus: A New Dataset For Automatic Medical Diagnosis)

There has been a rapidly growing interest in Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the machine learning research literature, aiming to assist doctors in telemedicine services. These systems are designed to interact with patients, collect evidence about their symptoms and relevant antecedents, and possibly make predictions about the underlying diseases. Doctors would review the interactions, including the evidence and the predictions, collect if necessary additional information from patients, before deciding on next steps. Despite recent progress in this area, an important piece of doctors' interactions with patients is missing in the design of these systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset of roughly 1.3 million patients that includes a differential diagnosis, along with the ground truth

19 papers0 benchmarksMedical, Texts

OVEN (Open-domain Visual Entity Recognition)

In this project, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model need to link an image onto a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels.

19 papers1 benchmarksImages, Texts

WanJuan

WanJuan is a large-scale training corpus that includes multiple modalities. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB.

19 papers0 benchmarksImages, Texts, Videos

2010 i2b2/VA

2010 i2b2/VA is a biomedical dataset for relation classification and entity typing.

18 papers7 benchmarksTexts

TempEval-3 (TempEval-3: events, times, and temporal relations)

Within the SemEval-2013 evaluation exercise, the TempEval-3 shared task aims to advance research on temporal information processing. It follows on from TempEval-1 and -2, with: a three-part structure covering temporal expression, event, and temporal relation extraction; a larger dataset; and new single measures to rank systems – in each task and in general.

18 papers27 benchmarksTexts

Amazon Beauty (Amazon Beauty 5-core)

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

18 papers8 benchmarksImages, Texts

PreviousPage 33 of 158Next