Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

TOPv2 (Task Oriented Parsing v2)

TOPv2 is a dataset of Task Oriented Parsing representations for intent-slot based dialog systems.

25 papers · 0 benchmarks · Texts

HONEST (Hurtful Sentence Completion in English Language Models)

The HONEST dataset is a template-based corpus for testing the hurtfulness of sentence completions produced by language models (e.g., BERT) in six languages (English, Italian, French, Portuguese, Romanian, and Spanish). HONEST comprises 420 instances per language, generated from 28 identity terms (14 male and 14 female) and 15 templates. It uses identity terms in singular and plural form (e.g., woman, women, girl, boys) and a series of predicates (e.g., “works as [MASK]”, “is known for [MASK]”). A language model is used to fill in the masked position, and the hurtfulness of the completion is then evaluated.

25 papers · 1 benchmark · Texts
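The 420-instance count described above follows directly from the construction: every identity term is crossed with every template. A minimal sketch of that expansion in Python (the term and template strings here are illustrative placeholders, not the dataset's actual entries):

```python
# Illustrative placeholders standing in for HONEST's real identity terms
# and predicate templates (28 terms x 15 templates -> 420 instances).
identity_terms = [f"identity_{i}" for i in range(28)]
templates = [f"[I] predicate_{j} [MASK]." for j in range(15)]

def instantiate(terms, templates):
    """Fill the [I] slot of every template with every identity term."""
    return [t.replace("[I]", term) for term in terms for t in templates]

sentences = instantiate(identity_terms, templates)
print(len(sentences))  # 28 * 15 = 420 masked sentences per language
```

Each resulting sentence would then be handed to a masked language model, and the filled-in completion scored for hurtfulness.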

ESOL (Estimated SOLubility)

ESOL is a water solubility prediction dataset consisting of 1128 samples.

25 papers · 1 benchmark

ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer)

The ELEVATER benchmark is a collection of resources for training, evaluating, and analyzing language-image models on image classification and object detection.

25 papers · 5 benchmarks · Images, Texts

BioRED

BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease, chemical–chemical) at the document level, over a set of 600 PubMed abstracts. Furthermore, BioRED labels each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.

25 papers · 2 benchmarks · Texts

DIOR-RSVG

DIOR-RSVG is a large-scale benchmark dataset for visual grounding in remote sensing (RSVG): localizing the objects referred to by a natural-language expression in remote sensing (RS) images. The dataset provides image/expression/box triplets for training and evaluating visual grounding models.

25 papers · 0 benchmarks · Images

OASST1 (OpenAssistant Conversations Dataset)

License: Apache-2.0. Tags: human-feedback. Size category: 100K<n<1M. Pretty name: OpenAssistant Conversations.

25 papers · 0 benchmarks · Texts

BDD-A (Berkeley DeepDrive Attention)

Dataset Statistics: The statistics of our dataset are summarized and compared with the largest existing dataset (DR(eye)VE) [1] in Table 1. Our dataset was collected using videos selected from a publicly available, large-scale, crowd-sourced driving video dataset, BDD100K [30, 31]. BDD100K contains human-demonstrated dashboard videos and time-stamped sensor measurements collected during urban driving in various weather and lighting conditions. To efficiently collect attention data for critical driving situations, we specifically selected video clips that both included braking events and took place in busy areas (see supplementary materials for technical details). We then trimmed the videos to cover 6.5 seconds before and 3.5 seconds after each braking event. Other driving actions, e.g., turning, lane switching, and accelerating, turned out to be included as well. In total, 1,232 videos (3.5 hours) were collected following these procedures.

25 papers · 0 benchmarks · Videos

ScanNet++ (ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes)

ScanNet++ is a large-scale dataset with 450+ 3D indoor scenes containing sub-millimeter-resolution laser scans, registered 33-megapixel DSLR images, and commodity RGB-D streams from an iPhone. The 3D reconstructions are annotated with long-tail and label-ambiguous semantics to benchmark semantic understanding methods, while the coupled DSLR and iPhone captures enable benchmarking of novel view synthesis methods in high-quality and commodity settings.

25 papers · 23 benchmarks

LargeST (LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting)

In this work, we propose LargeST as a new benchmark dataset (see Figure 1), with the goal of facilitating the development of accurate and efficient methods in the context of large-scale traffic forecasting. The distinguishing characteristic of LargeST lies not only in its extensive graph size, encompassing a total of 8,600 sensors in California, but also in its substantial temporal coverage and rich node information – each sensor contains 5 years of data and comprehensive metadata.

25 papers · 4 benchmarks · Time series

MotSynth


25 papers · 0 benchmarks

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.

24 papers · 1 benchmark · Images, Texts

DHF1K

DHF1K is a video saliency dataset which contains a ground-truth map of binary pixel-wise gaze fixation points and a continuous map obtained by blurring the fixation points with a Gaussian filter. DHF1K contains 1,000 videos in total: 700 are annotated, 600 of which are used for training and 100 for validation. The remaining 300 form the test set, which is evaluated on a public server.

24 papers · 5 benchmarks · Images, Videos
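The continuous map described for DHF1K is, in essence, the binary fixation map convolved with a Gaussian kernel. A minimal, dependency-free sketch of that step (function names are mine; a real pipeline would use an image-processing library at full frame resolution):

```python
import math

def gaussian_kernel(size, sigma):
    """Normalized size x size 2-D Gaussian kernel."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def fixations_to_saliency(fixations, kernel):
    """Splat a Gaussian at every binary fixation point (borders zero-padded)."""
    h, w = len(fixations), len(fixations[0])
    size = len(kernel)
    c = size // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if fixations[y][x]:
                for dy in range(size):
                    for dx in range(size):
                        yy, xx = y + dy - c, x + dx - c
                        if 0 <= yy < h and 0 <= xx < w:
                            out[yy][xx] += kernel[dy][dx]
    return out

# A single fixation in the middle of a small 9x9 frame.
fix = [[0] * 9 for _ in range(9)]
fix[4][4] = 1
sal = fixations_to_saliency(fix, gaussian_kernel(5, 1.0))
```

Because the kernel is normalized, each fixation contributes unit mass to the continuous map, with its peak at the fixated pixel.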

INRIA Person

The INRIA Person dataset is a dataset of images of persons used for pedestrian detection. It consists of 614 person detections for training and 288 for testing.

24 papers · 0 benchmarks · Images

REDDIT-12K

REDDIT-12K contains 11,929 graphs, each corresponding to an online discussion thread in which nodes represent users and an edge indicates that one of the two users responded to a comment of the other. Each of the 11,929 discussion graphs is assigned one of 11 labels, representing the category of the community.

24 papers · 2 benchmarks · Graphs

WMT 2016 News (WMT 2016 News Translation Task)

News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1,500 English sentences translated into six languages (Czech, German, Finnish, Romanian, Russian, and Turkish) and an additional 1,500 sentences from each of these languages translated into English. For Romanian, a third of the test set was released as a development set instead. For Turkish, an additional 500-sentence development set was released. The sentences were selected from dozens of news websites and translated by professional translators. The training data consists of parallel corpora to train translation models, monolingual corpora to train language models, and development sets for tuning. Some training corpora were identical to those of WMT 2015 (Europarl, United Nations, the French-English 10⁹ corpus, Common Crawl, Russian-English parallel data provided by Yandex, and Wikipedia Headlines provided by CMU) and some were updated (CzEng v1.6pre, News Commentary v11, monolingual news data).

24 papers · 0 benchmarks · Parallel, Texts

MCScript

MCScript is the official dataset of SemEval-2018 Task 11. It consists of a collection of text passages about daily life activities and a series of questions referring to each passage, with each question equipped with two answer choices. MCScript comprises 9,731, 1,411, and 2,797 questions in the training, development, and test sets, respectively.

24 papers · 0 benchmarks · Texts

TechQA (The TechQA Dataset)

TECHQA is a domain-adaptation question answering dataset for the technical support domain. The TECHQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size – 600 training, 310 dev, and 490 evaluation question/answer pairs – thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TECHQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote—a technical document that addresses a specific technical issue.

24 papers · 0 benchmarks · Texts

ROPES (Reasoning Over Paragraph Effects in Situations)

ROPES is a QA dataset which tests a system's ability to apply knowledge from a passage of text to a new situation. A system is presented with a background passage containing one or more causal or qualitative relations, a novel situation that uses this background, and questions that require reasoning about the effects of the relationships in the background passage in the context of the situation.

24 papers · 0 benchmarks · Texts

ARCD

ARCD is composed of 1,395 questions posed by crowdworkers on Wikipedia articles, together with a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD).

24 papers · 0 benchmarks
Page 91 of 1000