Datasets

3,148 machine learning datasets

ICDAR 2017

ICDAR2017 is a dataset for scene text detection.

VAST (VAried Stance Topics)

VAST covers a large range of topics spanning broad themes, such as politics (e.g., ‘a Palestinian state’), education (e.g., ‘charter schools’), and public health (e.g., ‘childhood vaccination’). In addition, the data includes a wide range of similar expressions for the same topic (e.g., ‘guns on campus’ versus ‘firearms on campus’). This variation captures how humans might realistically describe the same topic and contrasts with the lack of variation in existing datasets.

18 papers | 1 benchmark | Images, Texts

DWIE (Deutsche Welle corpus for Information Extraction)

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.

18 papers | 6 benchmarks | Texts

InfoTabS

InfoTabS comprises human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes.

18 papers | 0 benchmarks | Texts

KorNLI

KorNLI is a Korean Natural Language Inference (NLI) dataset. The dataset is constructed by automatically translating the training sets of the SNLI, XNLI and MNLI datasets. To ensure translation quality, two professional translators with at least seven years of experience, who specialize in academic papers and books as well as business contracts, each post-edited half of the dataset and afterward cross-checked the other’s translation. It contains 942,854 training examples translated automatically and 7,500 evaluation (development and test) examples translated manually.

18 papers | 0 benchmarks | Texts

MED (Monotonicity Entailment Dataset)

MED is an evaluation dataset that covers a wide range of monotonicity reasoning. It was constructed by collecting naturally occurring examples through crowdsourcing and well-designed examples from linguistics publications. It consists of 5,382 examples.

18 papers | 1 benchmark | Texts

MMD (Multimodal Dialogs)

The MMD (MultiModal Dialogs) dataset is a dataset for multimodal domain-aware conversations. It consists of over 150K conversation sessions between shoppers and sales agents, annotated by a group of in-house annotators using a semi-automated, manually intensive, iterative process.

18 papers | 0 benchmarks | Images, Texts

United Nations Parallel Corpus

This is the first parallel corpus composed from United Nations documents to be published by the original data creator. The corpus consists of manually translated UN documents from the 25 years from 1990 to 2014 in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.

18 papers | 0 benchmarks | Texts

Violin (VIdeO-and-Language INference)

Video-and-Language Inference is the task of joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. The Violin dataset is a dataset for this task which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels.

18 papers | 0 benchmarks | Images, Texts

CLEVR-Humans

CLEVR-Humans is a dataset of human-posed, free-form natural language questions about CLEVR images. Many of these questions contain out-of-vocabulary words and require reasoning skills that are absent from the repertoire of the authors' model.

18 papers | 1 benchmark | Images, Texts

xSID (Cross-lingual Slot and Intent Detection)

xSID is an evaluation benchmark for cross-lingual (X) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. It covers Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr) and an Austro-Bavarian German dialect, South Tyrolean (de-st).

18 papers | 0 benchmarks | Texts

Enron Emails

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

18 papers | 3 benchmarks | Texts

X-CSQA

X-CSQA is a multilingual dataset for commonsense reasoning research, based on CSQA.

18 papers | 0 benchmarks | Texts

EXTREME CLASSIFICATION (Extreme Multi-label Classification)

The objective in extreme multi-label classification is to learn feature architectures and classifiers that can automatically tag a data point with the most relevant subset of labels from an extremely large label set. This repository provides resources that can be used for evaluating the performance of extreme multi-label algorithms including datasets, code, and metrics.

18 papers | 0 benchmarks | Texts

SCICAP

SCICAP is a large-scale image captioning dataset that contains real-world scientific figures and captions. SCICAP was constructed using more than two million images from over 290,000 papers collected and released by arXiv.

18 papers | 1 benchmark | Images, Texts

Weibo21

Weibo21 is a benchmark dataset for multi-domain fake news detection (MFND) with annotated domain labels. It consists of 4,488 fake news items and 4,640 real news items from 9 different domains.

18 papers | 0 benchmarks | Texts

CMB (Comprehensive Medical Benchmark in Chinese)

CMB is a comprehensive, multi-level Medical Benchmark in Chinese. It encompasses 280,839 multiple-choice questions and 74 complex case consultation questions, covering all clinical medical specialties and various professional levels. The platform aims to holistically evaluate a model's medical knowledge and clinical consultation capabilities.

18 papers | 0 benchmarks | Texts

SWE-bench-lite

SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

18 papers | 0 benchmarks | Texts

TURINGBENCH

TuringBench is a benchmark environment that contains :

18 papers | 0 benchmarks | Texts

SALAD-Bench (A Hierarchical and Comprehensive Safety Benchmark for Large Language Models)

In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose SALAD-Bench, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack and defense modifications and multiple-choice formats. To effectively manage the inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for QA pairs, with a particular focus on attack-enhanced queries, ensuring a seamless and reliable evaluation. These components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, ensuring joint-purpose utility. Our extensive

18 papers | 0 benchmarks | Texts
