Datasets

19,997 machine learning datasets

19,997 dataset results

Naamapadam

Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.

6 papers0 benchmarksTexts

arXiv-10

Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.

6 papers2 benchmarksTexts

RxRx1

RxRx1 is a biological dataset designed specifically for the systematic study of batch effect correction methods. The dataset consists of 125,510 high-resolution fluorescence microscopy images of human cells under 1,138 genetic perturbations in 51 experimental batches across 4 cell types.

6 papers0 benchmarksBiology, Images

WDC Products

WDC Products is an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-word data. The three dimensions are

6 papers2 benchmarksTabular, Texts

PanopTOP31K

Starting from the Panoptic Dataset, we use the PanopTOP framework to generate the PanopTOP31K dataset, consisting of 31K images from 23 different subjects recorded from diverse and challenging viewpoints, also including the top-view.

6 papers0 benchmarks

Prophesee GEN4 Dataset (Prophesee 1 Megapixel Automotive Detection Dataset)

The dataset is split between train, test and val folders.

6 papers0 benchmarksPoint cloud, Videos

YouTube Driving

YouTube Driving Dataset contains a massive amount of real-world driving frames with various conditions, from different weather, different regions, to diverse scene types

6 papers2 benchmarks

VEDAI (Vehicle Detection in Aerial Imagery)

VEDAI is a dataset for Vehicle Detection in Aerial Imagery, provided as a tool to benchmark automatic target recognition algorithms in unconstrained environments. The vehicles contained in the database, in addition of being small, exhibit different variabilities such as multiple orientations, lighting/shadowing changes, specularities or occlusions. Furthermore, each image is available in several spectral bands and resolutions. A precise experimental protocol is also given, ensuring that the experimental results obtained by different people can be properly reproduced and compared. We also give the performance of some baseline algorithms on this dataset, for different settings of these algorithms, to illustrate the difficulties of the task and provide baseline comparisons.

6 papers5 benchmarksImages

MobileBrick

Generate high-quality 3D ground-truth shapes for reconstruction evaluation is extremely challenging because even 3D scanners can only generate pseudo ground-truth shapes with artefacts. We propose a novel data capturing and 3D annotation pipeline to obtain precise 3D ground-truth shapes without relying on expensive 3D scanners. The key to creating the precise 3D ground-truth shapes is using LEGO models, which are made of LEGO bricks with known geometry. The MobileBrick dataset provides a unique opportunity for future research on high-quality 3D reconstruction thanks to two distinctive features: 1) A large number of RGBD sequences with precise 3D ground-truth annotations. 2) The RGBD images were captured using mobile devices so algorithms can be tested in a realistic setup for mobile AR applications.

6 papers0 benchmarks3D, RGB-D

ViNLI (Vietnamese Natural Language Inference Dataset)

A large-scale and high-quality corpus is necessary for studies on NLI for Vietnamese, which can be considered a low-resource language. In this paper, we introduce ViNLI (Vietnamese Natural Language Inference), an open-domain and high-quality corpus for evaluating Vietnamese NLI models, which is created and evaluated with a strict process of quality control. ViNLI comprises over 30,000 human-annotated premise-hypothesis sentence pairs extracted from more than 800 online news articles on 13 distinct topics.

6 papers2 benchmarks

CAIS (Chinese Artificial Intelligence Speakers)

We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS), and annotate them with slot tags and intent labels. The training, validation and test sets are split by the distribution of intents, where detailed statistics are provided in the supplementary material. Since the utterances are collected from speaker systems in the real world, intent labels are partial to the PlayMusic option. We adopt the BIOES tagging scheme for slots instead of the BIO2 used in the ATIS, since previous studies have highlighted meaningful improvements with this scheme (Ratinov and Roth, 2009) in the sequence labeling field

6 papers2 benchmarksTexts

MoocRadar

MoocRadar is a fine-grained and multiaspect knowledge repository that consists of 2,513 exercises, 5,600 concepts, and 14,224 students’ 12,715,126 behavioral records for improving cognitive student modeling in MOOCs.

6 papers0 benchmarks

HRS-Bench (Holistic, Reliable, and Scalable Benchmark)

HRS-Bench is a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. It measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes.

6 papers0 benchmarksImages, Texts

LSSED

LSSED, a challenging large-scale english dataset for speech emotion recognition. It contains 147,025 sentences (206 hours and 25 minutes in total) spoken by 820 people. Each segment is annotated for the presence of 11 emotions (angry, neutral, fear, happy, sad, disappointed, bored, disgusted, excited, surprised, fear and other)

6 papers2 benchmarksAudio, Texts

LIS (low-light instance segmentation)

To reveal and systematically investigate the effectiveness of the proposed method in the real world, a real low-light image dataset for instance segmentation is necessary and urgently needed. Considering there is no suitable dataset, therefore, we collect and annotate a Low-light Instance Segmentation (LIS) dataset using a Canon EOS 5D Mark IV camera.

6 papers0 benchmarksImages

SimpleQuestionsWikiData

SimpleQuestionsWikidata maps SimpleQuestions to Wikidata.

6 papers1 benchmarks

DAVIS-DTA

Dataset Description: The interaction of 72 kinase inhibitors with 442 kinases covering >80% of the human catalytic protein kinome.

6 papers3 benchmarks

DaLAJ

DaLAJ 1.0, a dataset for Linguistic Acceptability Judgments for Swedish, comprising 9,596 sentences in its first version; and the initial experiment using it for the binary classification task. DaLAJ is based on the SweLL second language learner data, consisting of essays at different levels of proficiency.

6 papers2 benchmarksTexts

ObjectFolder Real

The ObjectFolder Real dataset contains multisensory data collected from 100 real-world household objects. The visual data for each object include three high-quality 3D meshes of different resolutions and an HD video recording of the object rotating in a lightbox; The acoustic data for each object include impact sound recordings recorded at 30–50 points of the object, each of which is 6s long and is accompanied by the coordinate of the striking location on the object mesh, ground-truth contact force profile, and the accompanying video for the impact. The tactile data for each object include tactile readings at the same 30–50 points of the object, with each tactile reading as a video of the tactile RGB images that record the entire gel deformation process and is accompanied by two videos of the contact process from an in-hand camera and a third-view camera.

6 papers0 benchmarks3d meshes, Audio, Videos

AMI Meeting Corpus

The AMI Meeting Corpus is a multi-modal data set comprising 100 hours of meeting recordings. It has been meticulously curated for research purposes and includes various modes of data capture. Let me provide you with more details:

6 papers1 benchmarksTexts

PreviousPage 204 of 1000Next