Datasets

3,148 machine learning datasets

SciCite

SciCite is a dataset of citation intents that addresses multiple scientific domains and is more than five times larger than ACL-ARC.

40 papers | 5 benchmarks | Texts

CSQA

CSQA contains around 200K dialogs with a total of 1.6M turns. Unlike existing large-scale QA datasets, whose simple questions can be answered from a single tuple, the questions in these dialogs require a larger subgraph of the knowledge graph (KG).

40 papers | 0 benchmarks | Texts

BLURB (Biomedical Language Understanding and Reasoning Benchmark)

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, comprehensive benchmarks and leaderboards such as GLUE have greatly accelerated progress in open-domain NLP. In biomedicine, however, such resources are ostensibly scarce. There has been a plethora of shared tasks in biomedical NLP, such as BioCreative, BioNLP Shared Tasks, SemEval, and BioASQ, to name just a few. These efforts have played a significant role in fueling interest and progress in the research community, but they typically focus on individual tasks. The advent of neural language models such as BERT provides a unifying foundation for leveraging transfer learning from unlabeled text to support a wide range of NLP applications. To accelerate progress in biomedical pretraining strategies and task-specific methods, it is thus imperative to create a broad-coverage benchmark encompassing diverse biomedical tasks.

40 papers | 3 benchmarks | Biomedical, Texts

Samanantar

Samanantar is the largest publicly available collection of parallel corpora for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The corpus has 49.6M sentence pairs between English and Indian languages.

40 papers | 0 benchmarks | Images, Speech, Texts

RedCaps

RedCaps is a large-scale dataset of 12M image-text pairs collected from Reddit. Images and captions from Reddit depict and describe a wide variety of objects and scenes. The data is collected from a manually curated set of subreddits (350 total), which give coarse image labels and allow steering of the dataset composition without labeling individual instances.

40 papers | 0 benchmarks | Images, Texts

TheoremQA

We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning Math, EE&CS, Physics and Finance. The dataset was collected by human experts and is of very high quality. We provide it as a new benchmark to test the limits of large language models in applying theorems to solve challenging university-level questions. We also provide a pipeline to prompt LLMs and evaluate their outputs with WolframAlpha.

40 papers | 1 benchmark | Images, Texts

MemeTracker

The MemeTracker corpus contains articles from mainstream media and blogs from August 1 to October 31, 2008, with about 1 million documents per day. It has 10,967 hyperlink cascades among 600 media sites.

39 papers | 2 benchmarks | Texts

BREAK

Break is a question understanding dataset aimed at training models to reason over complex questions. It features 83,978 natural language questions annotated with a new meaning representation, the Question Decomposition Meaning Representation (QDMR). Each example pairs the natural question with its QDMR representation. Break contains human-composed questions sampled from 10 leading question-answering benchmarks over text, images and databases. The dataset was created by a team of NLP researchers at Tel Aviv University and the Allen Institute for AI.

39 papers | 0 benchmarks | Texts

WikiMovies

WikiMovies is a dataset for question answering over movie content. It contains ~100k questions in the movie domain, and was designed to be answerable by using either a perfect KB (based on OMDb),

39 papers | 0 benchmarks | Texts

BookSum

BookSum is a collection of datasets for long-form narrative summarization. It covers source documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human-written summaries at three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of this dataset pose a unique set of challenges for summarization systems, including processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures.

39 papers | 6 benchmarks | Texts

ALCE (Automatic LLMs' Citation Evaluation)

ALCE is a benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations.

39 papers | 0 benchmarks | Texts

BIOSSES (Biomedical Semantic Similarity Estimation System)

The BIOSSES data set comprises a total of 100 sentence pairs, all of which were selected from the "TAC2 Biomedical Summarization Track Training Data Set".

38 papers | 5 benchmarks | Medical, Texts

DeepFix

DeepFix is a program repair dataset for fixing compiler errors in C programs. It enables research on automatically fixing programming errors using deep learning.

38 papers | 2 benchmarks | Texts

ChID (Chinese IDiom dataset)

ChID is a large-scale Chinese IDiom dataset for cloze tests. ChID contains 581K passages and 729K blanks, and covers multiple domains. In ChID, the idioms in a passage are replaced with blank symbols. For each blank, a list of candidate idioms, including the golden idiom, is provided as choices.

38 papers | 0 benchmarks | Texts

Tatoeba

Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase “tatoeba” (例えば), meaning “for example”. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans.

38 papers | 0 benchmarks | Texts

InsuranceQA

InsuranceQA is a question answering dataset for the insurance domain, the data stemming from the website Insurance Library. There are 12,889 questions and 21,325 answers in the training set. There are 2,000 questions and 3,354 answers in the validation set. There are 2,000 questions and 3,308 answers in the test set.

38 papers | 0 benchmarks | Texts

CUAD (Contract Understanding Atticus Dataset)

Contract Understanding Atticus Dataset (CUAD) is a dataset for legal contract review. CUAD was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The task is to highlight salient portions of a contract that are important for a human to review.

38 papers | 0 benchmarks | Texts

AISHELL-3

AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers, totaling 88,035 utterances. Speaker attributes such as gender, age group and native accent are explicitly marked and provided in the corpus. Transcripts at the Chinese character level and the pinyin level are provided along with the recordings. The word and tone transcription accuracy is above 98%, achieved through professional speech annotation and strict quality inspection for tone and prosody.

38 papers | 0 benchmarks | Speech, Texts

ConvFinQA (Conversational Finance Question Answering)

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3,892 conversations with 14,115 questions; 2,715 of the conversations are simple conversations, and the remaining 1,177 are hybrid conversations.

38 papers | 4 benchmarks | Texts

SQA (SequentialQA)

The SQA dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has 6,066 sequences with 17,553 questions in total.

37 papers | 2 benchmarks | Texts
