3,148 machine learning datasets
3,148 dataset results
Large Scale Composed Image Retrieval (LaSCo) is a new dataset for Composed Image Retrieval (CoIR), x10 times larger than current ones.
An annotated dataset of 1m crowd-sourced annotations that cover 100k talk page diffs (with 10 judgements per diff) for personal attacks, aggression, and toxicity.
In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an on-line reference for bibliographic information on major computer science publications that indexes over 4.4 million publications, published by more than 2.2 million authors. Our dataset consists of 10,000 question answer pairs with the corresponding SPARQL queries which can be executed over the DBLP KG to fetch the correct answer. To the best of our knowledge, this is the first QA dataset for scholarly KGs.
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, t
The first human-annotated corpus containing 26k spans on 11k comments
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal and temporal actions and understand the rich object interactions in daily activities. This page records LLMs for answer evaluation.
SME is a new dataset for Multi-modal Explanation for Visual Question Answering comprising 1,028,230 samples, with 1,656 visual objects requiring detection in explanations. To our knowledge, this is the first dataset where the explanations are in standard English with additional visual grounding tokens.
Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the L+M-24 dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, L+M-24 is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction
The Task: The challenge will use an extension of the UPAR Dataset [1], which consists of images of pedestrians annotated for 40 binary attributes. For deployment and long-term use of machine-learning algorithms in a surveillance context, the algorithms must be robust to domain gaps that occur when the environment changes. This challenge aims to spotlight the problem of domain gaps in a real-world surveillance context and highlight the challenges and limitations of existing methods to provide a direction for future research.
This work proposes Long-RVOS, a large-scale benchmark for long-term video object segmentation. Long-RVOS is the first minute-level dataset in the RVOS field, designed to tackle various realistic long-video challenges such as frequent occlusion, disappearance-reappearance, and shot changing. Notably, Long-RVOS offers significantly longer video duration than existing datasets. In addition, it contains the largest number of object classes and mask annotations. The large scale of Long-RVOS supports comprehensive training and evaluation of RVOS models. Finally, we gather 24,689 high-quality descriptions for building Long-RVOS.
A multimodal agent benchmark on professional data science and engineering. * 494 real-world tasks, ranging from data warehousing to orchestration; * 20 professional enterprise-level applications (e.g., BigQuery, dbt, Airbyte, etc.); * both command line (CLI) and graphical user interfaces (GUI); * an interactive executable computer environment; * a document warehouse for agent retrieval.
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
The AQUAINT Corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. It was prepared by the LDC for the AQUAINT Project, and will be used in official benchmark evaluations conducted by National Institute of Standards and Technology (NIST).
LCQMC is a large-scale Chinese question matching corpus. LCQMC is more general than paraphrase corpus as it focuses on intent matching rather than paraphrase. The corpus contains 260,068 question pairs with manual annotation.
Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles
SherLIiC is a testbed for lexical inference in context (LIiC), consisting of 3985 manually annotated inference rule candidates (InfCands), accompanied by (i) ~960k unlabeled InfCands, and (ii) ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09. Each InfCand consists of one of these relations, expressed as a lemmatized dependency path, and two argument placeholders, each linked to one or more Freebase types.
DebateSum consists of 187328 debate documents, arguments (also can be thought of as abstractive summaries, or queries), word-level extractive summaries, citations, and associated metadata organized by topic-year. This data is ready for analysis by NLP systems.
SemEval 2014 is a collection of datasets used for the Semantic Evaluation (SemEval) workshop, an annual event that focuses on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text. The datasets from SemEval 2014 are used for various tasks, including but not limited to:
The MEDIA French corpus is dedicated to semantic extraction from speech in a context of human/machine dialogues. The corpus has manual transcription and conceptual annotation of dialogues from 250 speakers. It is split into the following three parts : (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences, and (3) the test set (200 dialogues, 3K sentences).
OMICS is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨ “feel thirsty”, “by offering drink” ⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨ “serve a drink”, 0. “get a glass”, 1. “get a bottle”, 2. “fill class from bottle”, 3. “give class to person” ⟩). Given this, OMICS offers useful knowledge about hierarchism of naturalistic instructions, where a high-level user request (e.g., “serve a drink”) can be reduced to lower-level tasks (e.g., “get a glass”, ⋯). Another feature of OMICS is that elements of any tuple in an OMICS table are semantically related according to a predefined template. This facilitates the semantic interpretation of the OMICS tuples.