Datasets

3,148 machine learning datasets

Open Images V7

Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.

6 papers | 0 benchmarks | Images, Speech, Texts

Wild-Time

Wild-Time is a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.

6 papers | 0 benchmarks | Images, Texts

PLABA (Plain Language Adaptation of Biomedical Abstracts)

Plain Language Adaptation of Biomedical Abstracts (PLABA) is a document- and sentence-aligned dataset designed for automatic plain language adaptation. The dataset contains 750 adapted abstracts, totaling 7,643 sentence pairs.

6 papers | 0 benchmarks | Biomedical, Texts

Naamapadam

Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.

6 papers | 0 benchmarks | Texts

arXiv-10

arXiv-10 is a benchmark dataset of abstracts and titles from 100,000 arXiv scientific papers. The dataset contains 10 classes and is balanced (exactly 10,000 papers per class). The classes include subcategories of computer science, physics, and math.

6 papers | 2 benchmarks | Texts

WDC Products

WDC Products is an entity matching benchmark that provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are

6 papers | 2 benchmarks | Tabular, Texts

CAIS (Chinese Artificial Intelligence Speakers)

We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS) and annotate them with slot tags and intent labels. The training, validation, and test sets are split by the distribution of intents; detailed statistics are provided in the supplementary material. Since the utterances are collected from real-world speaker systems, the intent labels are skewed toward the PlayMusic option. We adopt the BIOES tagging scheme for slots instead of the BIO2 scheme used in ATIS, since previous studies in sequence labeling have reported meaningful improvements with this scheme (Ratinov and Roth, 2009).

6 papers | 2 benchmarks | Texts
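
The CAIS entry above contrasts the BIOES tagging scheme with BIO2. As a rough illustration of the difference, the following sketch converts a BIO2 tag sequence to BIOES. It is not taken from the CAIS release: the function name `bio_to_bioes` and the slot labels `song` and `artist` are hypothetical, and the input is assumed to be well-formed BIO2.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO2 tag sequence to BIOES.

    B/I/O are kept where a span continues; the last token of a
    multi-token span becomes E-, and a single-token span becomes S-.
    """
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The span continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


# Hypothetical slot labels, for illustration only.
example = ["O", "B-song", "I-song", "I-song", "O", "B-artist"]
print(bio_to_bioes(example))
# ['O', 'B-song', 'I-song', 'E-song', 'O', 'S-artist']
```

The extra E- and S- prefixes mark span boundaries explicitly, which is the property usually credited for the improvements the description mentions.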

HRS-Bench (Holistic, Reliable, and Scalable Benchmark)

HRS-Bench is a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes.

6 papers | 0 benchmarks | Images, Texts

LSSED

LSSED is a challenging large-scale English dataset for speech emotion recognition. It contains 147,025 sentences (206 hours and 25 minutes in total) spoken by 820 people. Each segment is annotated for the presence of 11 emotions (angry, neutral, fear, happy, sad, disappointed, bored, disgusted, excited, surprised, and other).

6 papers | 2 benchmarks | Audio, Texts

DaLAJ

DaLAJ 1.0 is a dataset for linguistic acceptability judgments in Swedish, comprising 9,596 sentences in its first version. It is presented together with an initial experiment that uses it for binary classification. DaLAJ is based on the SweLL second language learner data, which consists of essays at different levels of proficiency.

6 papers | 2 benchmarks | Texts

AMI Meeting Corpus

The AMI Meeting Corpus is a multi-modal data set comprising 100 hours of meeting recordings. It has been meticulously curated for research purposes and includes various modes of data capture.

6 papers | 1 benchmark | Texts

SMART-101 (Simple Multimodal Algorithmic Reasoning Task Dataset)

Recent times have witnessed an increasing number of applications of deep neural networks to tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task (and the associated SMART-101 dataset) for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and its solution needs a mix of several elementary skills, including pattern recognition, algebra, and spatial reasoning, among others. To train deep neural networks, we programmatically augment each puzzle to 2,000 new instances; each instance varied in appea

6 papers | 0 benchmarks | Images, Texts

Rad-ReStruct

Rad-ReStruct is a fine-grained structured reporting dataset for chest X-ray images. The structured reporting process is modeled as a hierarchical VQA task that involves recognizing findings in different body regions and predicting their attributes.

6 papers | 0 benchmarks | Images, Texts

SuperCLUE

SuperCLUE is a Chinese language model evaluation benchmark named after another popular Chinese LLM benchmark, CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE).

6 papers | 0 benchmarks | Texts

UMVM

We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on our proposed dataset, MMEA-UMVM.

6 papers | 0 benchmarks | Graphs, Images, Texts

RealCQA

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

6 papers | 2 benchmarks | Images, Texts

WaterBench

Multi-level Benchmark of Watermarks for Large Language Models

6 papers | 0 benchmarks | Texts

MusicBench

The MusicBench dataset is a music audio-text pair dataset designed for text-to-music generation and released along with the Mustango text-to-music model. MusicBench is based on the MusicCaps dataset, which it expands from 5,521 samples to 52,768 training and 400 test samples.

6 papers | 1 benchmark | Audio, Music, Texts

ShapeTalk (The ShapeTalk Dataset)

ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. The dataset provides discriminative utterances for a total of 36,391 shapes across 30 object classes. Overall, ShapeTalk contains 73,799 distinct contexts and a total of 536,596 utterances.

6 papers | 0 benchmarks | Images, Texts

WebLINX (Real-World Website Navigation with Multi-Turn)

WebLINX is a large-scale benchmark of 100K interactions across 2,300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.

6 papers | 4 benchmarks | Actions, Images, RGB Video, Ranking, Texts, Videos
