Datasets

3,148 machine learning datasets

3,148 dataset results

GRIT (General Robust Image Task Benchmark)

The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for evaluating the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT hopes to encourage our research community to pursue the following research directions:

16 papers9 benchmarksImages, Texts

TripClick

TripClick is a large-scale dataset of click logs in the health domain, obtained from user interactions of the Trip Database health web search engine.

16 papers0 benchmarksTexts

FaithDial

FaithDial is a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark.

16 papers0 benchmarksTexts

HumanEval-X

HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.

16 papers0 benchmarksTexts

ToolQA

ToolQA is a question answering benchmark for Large Language Models (LLMs) which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. The development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions.

16 papers0 benchmarksTexts

SONAR

SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LabSE on the xsim and xsim++ multilingual similarity search tasks.

16 papers0 benchmarksAudio, Speech, Texts

GeneCIS

GeneCIS benchmark is designed for measuring models’ ability to adapt to a range of similarity conditions, which is zero-shot evaluation only.

16 papers4 benchmarksImages, Texts

MeetingBank

MeetingBank, a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets.

16 papers3 benchmarksAudio, Texts, Videos

TweepFake

The TweepFake dataset consists of 25,572 social media messages posted either by bots or humans on Twitter. Each bot imitated a human account and was based on various generative techniques, including Markov Chains, RNN, RNN+Markov, LSTM, and GPT-2.

16 papers2 benchmarksTexts

DUC 2004

The DUC2004 dataset is a dataset for document summarization. Is designed and used for testing only. It consists of 500 news articles, each paired with four human written summaries. Specifically it consists of 50 clusters of Text REtrieval Conference (TREC) documents, from the following collections: AP newswire, 1998-2000; New York Times newswire, 1998-2000; Xinhua News Agency (English version), 1996-2000. Each cluster contained on average 10 documents.

15 papers6 benchmarksTexts

CrossNER

CrossNER is a cross-domain NER (Named Entity Recognition) dataset, a fully-labeled collection of NER data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.

15 papers5 benchmarksTexts

CMU DoG (CMU Document Grounded Conversations Dataset)

This is a document grounded dataset for text conversations. "Document Grounded Conversations" are conversations that are about the contents of a specified document. In this dataset the specified documents are Wikipedia articles about popular movies. The dataset contains 4112 conversations with an average of 21.43 turns per conversation.

15 papers0 benchmarksTexts

CovidQA

The beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.

15 papers0 benchmarksTexts

Fakeddit

Fakeddit is a novel multimodal dataset for fake news detection consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision.

15 papers0 benchmarksImages, Texts

Humicroedit

Humicroedit is a humorous headline dataset. The data consists of regular English news headlines paired with versions of the same headlines that contain simple replacement edits designed to make them funny. The authors carefully curated crowdsourced editors to create funny headlines and judges to score a to a total of 15,095 edited headlines, with five judges per headline.

15 papers0 benchmarksTexts

QuerYD

A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.

15 papers6 benchmarksAudio, Texts, Videos

Opusparcus

Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

15 papers0 benchmarksTexts

Project CodeNet

Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as its code size, memory footprint, cpu run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate codes samples based on user criteria and to transform code samples into token sequences, simplified parse trees and other code graphs. A detailed discussion of Project CodeNet is available in this paper.

15 papers0 benchmarksTexts

FaVIQ (Fact Verification from Information-seeking Questions)

FaVIQ (Fact Verification from Information-seeking Questions) is a challenging and realistic fact verification dataset that reflects confusions raised by real users. We use the ambiguity in information-seeking questions and their disambiguation, and automatically convert them to true and false claims. These claims are natural, and require a complete understanding of the evidence for verification. FaVIQ serves as a challenging benchmark for natural language understanding, and improves performance in professional fact checking.

15 papers0 benchmarksTexts

PubTables-1M (PubMed Tables One Million)

The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains:

15 papers0 benchmarksImages, Texts

PreviousPage 37 of 158Next