Datasets

19,997 machine learning datasets

19,997 dataset results

Terms of Service

The Terms of Service dataset is a law dataset corresponding to the task of identifying whether contractual terms are potentially unfair. This is a binary classification task, where positive examples are potentially unfair contractual terms (clauses) from the terms of service in consumer contracts. Article 3 of the Directive 93/13 on Unfair Terms in Consumer Contracts defines an unfair contractual term as follows. A contractual term is unfair if: (1) it has not been individually negotiated; and (2) contrary to the requirement of good faith, it causes a significant imbalance in the parties rights and obligations, to the detriment of the consumer. The Terms of Service dataset consists of 9,414 examples.

21 papers2 benchmarksTexts

KLUE (Korean Language Understanding Evaluation)

Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluations metrics for every task, thus establishing fair comparison across Korean language models.

21 papers0 benchmarksTexts

SICAPv2

SICAPv2 is a database containing prostate histology whole slide images with both annotations of global Gleason scores and path-level Gleason grades.

21 papers0 benchmarksBiomedical, Images, Medical

COVID-Fact

COVID-Fact is a FEVER-like dataset of claims concerning the COVID-19 pandemic. The dataset contains claims, evidence for the claims, and contradictory claims refuted by the evidence.

21 papers0 benchmarksTexts

German Credit Dataset

Two datasets are provided. the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".

21 papers0 benchmarksTabular

EPHOIE (phtnsantader@gmail.com)

EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper head with complex layouts and background, including a total of 15,771 Chinese handwritten or printed text instances.

21 papers2 benchmarksImages

JARVIS-DFT

JARVIS-DFT is a repository of density functional theory based calculation data for materials.

21 papers2 benchmarks

EigenWorms

Caenorhabditis elegans is a roundworm commonly used as a model organism in the study of genetics. The movement of these worms is known to be a useful indicator for understanding behavioural genetics. Brown {\em et al.}[1] describe a system for recording the motion of worms on an agar plate and measuring a range of human-defined features[2]. It has been shown that the space of shapes Caenorhabditis elegans adopts on an agar plate can be represented by combinations of six base shapes, or eigenworms. Once the worm outline is extracted, each frame of worm motion can be captured by six scalars representing the amplitudes along each dimension when the shape is projected onto the six eigenworms. Using data collected for the work described in[1], we address the problem of classifying individual worms as wild-type or mutant based on the time series. The data were extracted from the C. elegans behavioural database [3]. We have 259 cases, which we split 131 train and 128 test. We have truncated e

21 papers1 benchmarks

ICBHI Respiratory Sound Database (The Respiratory Sound database - ICBHI 2017 Challenge)

The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics - ICBHI 2017.

21 papers7 benchmarksAudio, Biomedical, Medical

DigiFace-1M

DigiFace-1M is a synthetic dataset for face recognition, obtained by rendering digital faces using a computer graphics pipeline. It contains 1.22M images of 110K unique identities. The dataset consists of two parts. The first part contains 720K images with 10K identities. For each identity, 4 different sets of accessories are sampled and 18 images are rendered for each set. The second part contains 500K images with 100K identities. For each identity, only one set of accessories is sampled and only 5 images are rendered. Following the format of the existing datasets, we provide the aligned crop around the face, resized into $112 \times 112$ resolution.

21 papers0 benchmarksImages

PLOS (Scientific Lay Summarization)

This dataset contains 27,525 full biomedical articles paired with non-technical lay summaries derived from various journals published by the Public Library of Science (PLOS).

21 papers8 benchmarks

ImageNet-1k vs Places

A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while Places is out-of-distribution.

21 papers2 benchmarksImages

SMID (Seeing motion in the dark)

This is the low-light image enhancement dataset collected by the CVPR 2018 paper "Seeing Motion in the Dark".

21 papers1 benchmarks

STREUSLE

STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web reviews portion of the English Web Treebank [9]. STREUSLE incorporates comprehensive annotations of multiword expressions (MWEs) [1] and semantic supersenses for lexical expressions. The supersense labels apply to single- and multiword noun and verb expressions, as described in [2], and prepositional/possessive expressions, as described in [3, 4, 5, 6, 7, 8]. Lexical expressions also feature a lexical category label indicating its holistic grammatical status; for verbal multiword expressions, these labels incorporate categories from the PARSEME 1.1 guidelines [15]. For each token, these pieces of information are concatenated together into a lextag: a sentence's words and their lextags are sufficient to recover lexical categories, supersenses, and multiword expressions [8].

21 papers4 benchmarksTexts

Covid-19 radiography database

A team of researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh along with their collaborators from Pakistan and Malaysia in collaboration with medical doctors have created a database of chest X-ray images for COVID-19 positive cases along with Normal and Viral Pneumonia images. This COVID-19, normal, and other lung infection dataset is released in stages. In the first release, we have released 219 COVID-19, 1341 normal, and 1345 viral pneumonia chest X-ray (CXR) images. In the first update, we have increased the COVID-19 class to 1200 CXR images. In the 2nd update, we have increased the database to 3616 COVID-19 positive cases along with 10,192 Normal, 6012 Lung Opacity (Non-COVID lung infection), and 1345 Viral Pneumonia images and corresponding lung masks. We will continue to update this database as soon as we have new x-ray images for COVID-19 pneumonia patients.

21 papers0 benchmarksImages

MMT-Bench

MMT-Bench is a comprehensive benchmark designed to evaluate Large Vision-Language Models (LVLMs) across a wide array of multimodal tasks that require expert knowledge as well as deliberate visual recognition, localization, reasoning, and planning¹. It includes 31,325 meticulously curated multi-choice visual questions from various scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding¹.

21 papers0 benchmarks

STARSS23 (STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events)

The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset is collected in two different countries, in Tampere, Finland by the Audio Researh Group (ARG) of Tampere University (TAU), and in Tokyo, Japan by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array one (MIC), and first-order Ambisonics one (FOA). These recordings serve as the development dataset for the DCASE 2023 Sound Event Localization and Detection Task of the DCASE 2023 Challenge.

21 papers0 benchmarksAudio, RGB Video

CDD-11 (Composite Degradation Dataset 11)

An image restoration dataset

21 papers4 benchmarksImages

iLIDS-VID

The iLIDS-VID dataset is a person re-identification dataset which involves 300 different pedestrians observed across two disjoint camera views in public open space. It comprises 600 image sequences of 300 distinct individuals, with one pair of image sequences from two camera views for each person. Each image sequence has variable length ranging from 23 to 192 image frames, with an average number of 73. The iLIDS-VID dataset is very challenging due to clothing similarities among people, lighting and viewpoint variations across camera views, cluttered background and random occlusions.

20 papers5 benchmarksImages, Videos

SUBJ (Subjectivity dataset)

Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

20 papers1 benchmarksTexts

PreviousPage 101 of 1000Next