3,148 machine learning datasets
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
Wild-Time is a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.
Plain Language Adaptation of Biomedical Abstracts (PLABA) is a dataset designed for automatic adaptation that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs.
Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization). The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.
Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.
WDC Products is an entity matching benchmark that provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are
We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS) and annotate them with slot tags and intent labels. The training, validation, and test sets are split by the distribution of intents; detailed statistics are provided in the supplementary material. Since the utterances are collected from speaker systems in the real world, intent labels are skewed toward the PlayMusic option. We adopt the BIOES tagging scheme for slots instead of the BIO2 scheme used in ATIS, since previous studies have reported meaningful improvements with this scheme in sequence labeling (Ratinov and Roth, 2009).
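To make the difference between the two tagging schemes concrete, here is a minimal Python sketch that converts a BIO2 tag sequence to BIOES. It is an illustration only, not code from the CAIS authors; the function name `bio2_to_bioes` and the slot labels `song` and `artist` are hypothetical.

```python
from typing import List


def bio2_to_bioes(tags: List[str]) -> List[str]:
    """Convert a BIO2 tag sequence to BIOES (illustrative sketch)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- tag with no continuation is a single-token span (S-).
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # An I- tag with no continuation closes the span (E-).
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


# Hypothetical slot labels, chosen only for the example.
example = ["O", "B-song", "I-song", "I-song", "O", "B-artist"]
print(bio2_to_bioes(example))
# ['O', 'B-song', 'I-song', 'E-song', 'O', 'S-artist']
```

BIOES marks span ends (E-) and single-token spans (S-) explicitly, which BIO2 leaves implicit.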
HRS-Bench is a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes.
LSSED is a challenging large-scale English dataset for speech emotion recognition. It contains 147,025 sentences (206 hours and 25 minutes in total) spoken by 820 people. Each segment is annotated for the presence of 11 emotions (angry, neutral, fear, happy, sad, disappointed, bored, disgusted, excited, surprised, and other).
DaLAJ 1.0 is a dataset for Linguistic Acceptability Judgments for Swedish, comprising 9,596 sentences in its first version; it is accompanied by an initial experiment using it for a binary classification task. DaLAJ is based on the SweLL second language learner data, consisting of essays at different levels of proficiency.
The AMI Meeting Corpus is a multi-modal data set comprising 100 hours of meeting recordings. It has been meticulously curated for research purposes and includes various modes of data capture.
Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task (and the associated SMART-101 dataset) for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for younger children (ages 6-8). Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including pattern recognition, algebra, and spatial reasoning, among others. To train deep neural networks, we programmatically augment each puzzle to 2,000 new instances; each instance varied in appea
Rad-ReStruct is a fine-grained structured reporting dataset for Chest X-Ray images. The structured reporting process is modeled as a hierarchical VQA task and the task is recognizing different findings in different body regions and predicting their attributes.
SuperCLUE is a Chinese language model evaluation benchmark named after another popular Chinese LLM benchmark, CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE).
We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on our proposed dataset, MMEA-UMVM.
RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic
Multi-level Benchmark of Watermarks for Large Language Models
The MusicBench dataset is a music audio-text pair dataset that was designed for text-to-music generation and released along with the Mustango text-to-music model. MusicBench is based on the MusicCaps dataset, which it expands from 5,521 samples to 52,768 training and 400 test samples.
ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. The dataset provides discriminative utterances for a total of 36,391 shapes across 30 object classes. Overall, ShapeTalk contains 73,799 distinct contexts and a total of 536,596 utterances.
WebLINX is a large-scale benchmark of 100K interactions across 2,300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.