Datasets

3,148 machine learning datasets

LaSCo

Large Scale Composed Image Retrieval (LaSCo) is a new dataset for Composed Image Retrieval (CoIR), 10 times larger than existing ones.

7 papers | 1 benchmark | Images, Texts

WikiDetox (Wikipedia Detox)

A dataset of 1M crowd-sourced annotations covering 100k talk page diffs (10 judgements per diff), labeled for personal attacks, aggression, and toxicity.

7 papers | 0 benchmarks | Texts

DBLP-QuAD (DBLP Question Answering Dataset)

In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an online reference for bibliographic information on major computer science publications that indexes over 4.4 million publications by more than 2.2 million authors. Our dataset consists of 10,000 question-answer pairs with the corresponding SPARQL queries, which can be executed over the DBLP KG to fetch the correct answer. To the best of our knowledge, this is the first QA dataset for scholarly KGs.

7 papers | 0 benchmarks | Texts

VNHSGE (VietNamese High School Graduation Examination Dataset for Large Language Models)

The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset covers nine subjects and was generated from the Vietnamese National High School Graduation Examination and comparable tests. It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics. Because it contains both textual data and accompanying images, the dataset assesses LLMs in multitask settings such as question answering, text generation, reading comprehension, and visual question answering. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and compared their performance with that of Vietnamese students. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have room to grow, …

7 papers | 0 benchmarks | Images, Texts

ViHOS (Hate Speech Spans Detection for Vietnamese)

The first human-annotated corpus for hate speech span detection in Vietnamese, containing 26k spans on 11k comments.

7 papers | 0 benchmarks | Texts

NExT-QA (Open-ended VideoQA)

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal and temporal actions and understand the rich object interactions in daily activities. This page records LLMs for answer evaluation.

7 papers | 2 benchmarks | Texts, Videos

SME (Standard Multimodal Explanation)

SME is a new dataset for Multi-modal Explanation for Visual Question Answering comprising 1,028,230 samples, with 1,656 visual objects requiring detection in explanations. To our knowledge, this is the first dataset where the explanations are in standard English with additional visual grounding tokens.

7 papers | 24 benchmarks | Images, Texts

L+M-24

Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. The datasets released so far are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, or 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the L+M-24 dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, L+M-24 is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.

7 papers | 6 benchmarks | Biomedical, Graphs, Texts

UPAR (Unified Pedestrian Attribute Recognition)

The Task: The challenge will use an extension of the UPAR Dataset [1], which consists of images of pedestrians annotated for 40 binary attributes. For deployment and long-term use of machine-learning algorithms in a surveillance context, the algorithms must be robust to domain gaps that occur when the environment changes. This challenge aims to spotlight the problem of domain gaps in a real-world surveillance context and highlight the challenges and limitations of existing methods to provide a direction for future research.

7 papers | 4 benchmarks | Images, Texts

Long-RVOS

This work proposes Long-RVOS, a large-scale benchmark for long-term video object segmentation. Long-RVOS is the first minute-level dataset in the RVOS field, designed to tackle various realistic long-video challenges such as frequent occlusion, disappearance-reappearance, and shot changing. Notably, Long-RVOS offers significantly longer video duration than existing datasets. In addition, it contains the largest number of object classes and mask annotations. The large scale of Long-RVOS supports comprehensive training and evaluation of RVOS models. Finally, we gather 24,689 high-quality descriptions for building Long-RVOS.

7 papers | 6 benchmarks | Texts, Videos

Spider2-V

A multimodal agent benchmark on professional data science and engineering:

* 494 real-world tasks, ranging from data warehousing to orchestration;
* 20 professional enterprise-level applications (e.g., BigQuery, dbt, Airbyte, etc.);
* both command line (CLI) and graphical user interfaces (GUI);
* an interactive executable computer environment;
* a document warehouse for agent retrieval.

7 papers | 0 benchmarks | Environment, Images, Interactive, Texts

k-qa (K-QA: A Real-World Medical Q&A Benchmark)

7 papers | 0 benchmarks | Texts

AQUAINT

The AQUAINT Corpus consists of English newswire text drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project, and will be used in official benchmark evaluations conducted by the National Institute of Standards and Technology (NIST).

6 papers | 1 benchmark | Texts

LCQMC (Large-scale Chinese Question Matching Corpus)

LCQMC is a large-scale Chinese question matching corpus. It is more general than a paraphrase corpus because it focuses on intent matching rather than paraphrase. The corpus contains 260,068 manually annotated question pairs.

6 papers | 0 benchmarks | Texts

BC4CHEMD (BioCreative IV Chemical compound and drug name recognition)

Introduced by Krallinger et al. in "The CHEMDNER corpus of chemicals and drugs and its annotation principles".

6 papers | 1 benchmark | Texts

SherLIiC

SherLIiC is a testbed for lexical inference in context (LIiC), consisting of 3985 manually annotated inference rule candidates (InfCands), accompanied by (i) ~960k unlabeled InfCands, and (ii) ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09. Each InfCand consists of one of these relations, expressed as a lemmatized dependency path, and two argument placeholders, each linked to one or more Freebase types.

6 papers | 0 benchmarks | Texts

DebateSum

DebateSum consists of 187,328 debate documents, arguments (which can also be thought of as abstractive summaries, or queries), word-level extractive summaries, citations, and associated metadata organized by topic-year. This data is ready for analysis by NLP systems.

6 papers | 2 benchmarks | Texts

SemEval-2014 Task-10

SemEval 2014 is a collection of datasets used for the Semantic Evaluation (SemEval) workshop, an annual event that focuses on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text. The datasets from SemEval 2014 are used for a variety of tasks.

6 papers | 0 benchmarks | Texts

MEDIA

The MEDIA French corpus is dedicated to semantic extraction from speech in the context of human/machine dialogues. The corpus has manual transcriptions and conceptual annotations of dialogues from 250 speakers. It is split into three parts: (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences), and (3) the test set (200 dialogues, 3K sentences).

6 papers | 0 benchmarks | Audio, Texts

OMICS (Open Mind Indoor Common Sense)

OMICS is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨“feel thirsty”, “by offering drink”⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨“serve a drink”, 0. “get a glass”, 1. “get a bottle”, 2. “fill glass from bottle”, 3. “give glass to person”⟩). OMICS therefore offers useful knowledge about the hierarchical structure of naturalistic instructions, where a high-level user request (e.g., “serve a drink”) can be reduced to lower-level tasks (e.g., “get a glass”, ⋯). Another feature of OMICS is that the elements of any tuple in an OMICS table are semantically related according to a predefined template, which facilitates the semantic interpretation of the tuples.

6 papers | 0 benchmarks | Texts
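
The two OMICS tables described above could be modeled roughly as follows. This is only an illustrative sketch: the class names `HelpTuple` and `TaskSteps`, the `expand` helper, and the in-memory layout are assumptions made for this example and do not reflect the actual OMICS file format. Only the example tuples come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class HelpTuple:
    """Hypothetical model of one Help-table row: a user desire and a task that may meet it."""
    desire: str
    task: str


@dataclass(frozen=True)
class TaskSteps:
    """Hypothetical model of one Tasks/Steps row: a task and its ordered steps."""
    task: str
    steps: Tuple[str, ...]


# Example tuples taken from the OMICS description.
HELP = [HelpTuple("feel thirsty", "by offering drink")]
TASK_STEPS: Dict[str, TaskSteps] = {
    "serve a drink": TaskSteps(
        "serve a drink",
        ("get a glass", "get a bottle", "fill glass from bottle", "give glass to person"),
    ),
}


def expand(request: str, table: Dict[str, TaskSteps]) -> List[str]:
    """Reduce a high-level request to lower-level steps.

    A request with no entry in the table is treated as primitive and
    returned unchanged. Assumes the table contains no cyclic decompositions.
    """
    entry = table.get(request)
    if entry is None:
        return [request]
    result: List[str] = []
    for step in entry.steps:
        result.extend(expand(step, table))
    return result


if __name__ == "__main__":
    for step in expand("serve a drink", TASK_STEPS):
        print(step)
```

Keying the Tasks/Steps rows by task name makes the reduction from a high-level request to lower-level tasks a simple recursive lookup.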