Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3d meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • Midi (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • Cad (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)


Code Lingua

Code Lingua is a benchmark that compares the ability of language models to understand what code implements in a source language and to translate the same semantics into a target language. It comprises 1,700 code samples across five programming languages, over 10,000 tests, 43,000 translated code snippets, 1,748 manually labeled bugs, and 1,365 bug-fix pairs.

3 papers · 0 benchmarks
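
Execution-based scoring is how code-translation benchmarks of this kind are typically evaluated: a translation counts as semantically correct only if it passes the target-language unit tests. A minimal sketch of that idea follows; the `passes_tests` helper, the sample snippets, and their tests are invented for illustration and are not part of Code Lingua itself.

```python
# Hypothetical scoring sketch: run the target-language tests against each
# translated snippet and count the fraction that pass.

def passes_tests(translated_src, test_src):
    """Return True iff the translated snippet passes all its unit tests."""
    env = {}
    try:
        exec(translated_src, env)  # define the translated function
        exec(test_src, env)        # run the asserts against it
        return True
    except Exception:
        return False

samples = [
    # correct translation of an "absolute value" function
    {"code": "def absval(x):\n    return -x if x < 0 else x",
     "tests": "assert absval(-3) == 3\nassert absval(4) == 4"},
    # buggy translation: sign logic flipped
    {"code": "def absval(x):\n    return x if x < 0 else -x",
     "tests": "assert absval(-3) == 3\nassert absval(4) == 4"},
]

accuracy = sum(passes_tests(s["code"], s["tests"]) for s in samples) / len(samples)
print(accuracy)  # 0.5
```

Running untrusted model output with `exec` is only acceptable in a sandbox; real harnesses isolate execution in a subprocess or container.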

OOP

The OOP benchmark features 431 Python programs that encompass essential object-oriented programming (OOP) concepts and features, such as classes and encapsulation methods. The authors argue that current evaluation frameworks, such as HumanEval and MBPP, largely neglect OOP in favor of functional programming (FP); to address this, they introduced this OOP-focused benchmark.

3 papers · 0 benchmarks

Diffusion4D

Diffusion4D is a large-scale, high-quality dynamic 3D (4D) dataset sourced from the vast 3D data corpora of Objaverse-1.0 and Objaverse-XL. The authors apply a series of empirical rules to filter the dataset (more details are given in the paper) and release the selected 4D assets.

3 papers · 0 benchmarks

KIEval

KIEval provides a robust framework for dynamic, interactive evaluation of large language models, reducing the impact of data contamination and offering deeper insights into a model's true capabilities. It shifts the focus from static evaluation to a more comprehensive assessment of knowledge understanding and application.

3 papers · 0 benchmarks

CATT (CATT Arabic Diacritization Benchmark Dataset)

The CATT benchmark dataset comprises 742 sentences scraped from an internet news source in 2023. It covers multiple topics, including science and technology, economics, politics, sports, arts, and culture. The sentences were manually diacritized by two expert native Arabic speakers and then validated by a third expert. The dataset contains names of people and places in both Arabic and English; the English names are written in Arabic letters and diacritized based on their pronunciation. Also, numbers in the sentences are written in textual rather than numeric form, which helps in evaluating models without the need for a text normalizer (TN).

3 papers · 2 benchmarks · Texts

OA-Mine - annotations

The dataset contains Amazon products from 10 product categories with full human annotations. The dataset was collected in 2021. The products may have been taken down from Amazon since the collection of the dataset.

3 papers · 3 benchmarks · Texts

WFDD (Woven Fabric Defect Detection)

WFDD is a dataset for benchmarking anomaly detection methods with a focus on textile inspection. It includes 4101 woven fabric images categorized into 4 categories: grey cloth, grid cloth, yellow cloth, and pink flower. The first three classes are collected from the industrial production sites of WEIQIAO Textile, while the 'pink flower' class is gathered from the publicly available Cloth Flaw Dataset. Each category contains block-shape, point-like, and line-type defects with pixel-level annotations.

3 papers · 3 benchmarks · Images

TV-AD (Audio Description dataset for TV series)

TV-AD is a dataset that provides ground truth AD annotations that are aligned with TV series video, featuring episodes across multiple TV series including “The Big Bang Theory”, “Friends”, “Frasier”, “Seinfeld”, etc. The dataset is divided into training (TV-AD-Train, ∼31k ADs) and evaluation splits (TV-AD-Eval, ∼3k ADs), ensuring that the TV series do not overlap between the two splits. The evaluation split contains AD annotations for TV videos that are publicly available.

3 papers · 0 benchmarks

Paderborn University Bearing Fault Benchmark

This benchmark data set supports condition monitoring of rolling bearings and comes with an extensive description of the corresponding bearing damage, the experimental generation of the data set, and results of data-driven classification used as a diagnostic method. The diagnostic method uses the motor current signal of an electromechanical drive system for bearing diagnostics. The advantage of this approach is that no additional sensors are required, as current measurements can be performed in existing frequency inverters; this will help reduce the cost of future condition monitoring systems. A particular novelty of the approach is the monitoring of damage in external bearings that are installed in the drive system but outside the electric motor; nevertheless, the motor current signal is used as input for damage detection. Moreover, a wide distribution of bearing damage is considered for the benchmark data set.

3 papers · 0 benchmarks · Time series

ENST Drums (ENST-Drums: an extensive audio-visual database for drum signals processing)

ENST-Drums: an extensive audio-visual database for drum signals processing. Olivier Gillet and Gaël Richard, GET / ENST, CNRS LTCI, 37 rue Dareau, 75014 Paris, France.

3 papers · 0 benchmarks · Audio

UZLF (Leuven-Haifa High-Resolution Fundus Image Dataset for Retinal Blood Vessel Segmentation and Glaucoma Diagnosis)

The Leuven-Haifa dataset contains 240 disc-centered fundus images of 224 unique patients (75 patients with normal tension glaucoma, 63 with high tension glaucoma, 30 with other eye diseases, and 56 healthy controls) from the University Hospitals of Leuven. The arterioles and venules in these images were annotated by master's students in medicine and corrected by a senior annotator. All senior segmentation corrections are provided, as well as the junior segmentations of the test set. An open-source toolbox for the parametrization of segmentations was also developed. Diagnosis, age, sex, vascular parameters, and a quality score are provided as metadata. Envisioned reuse includes the development or external validation of blood vessel segmentation algorithms, the study of the vasculature in glaucoma, and the development of glaucoma diagnosis algorithms. The dataset is available on the KU Leuven Research Data Repository (RDR).

3 papers · 2 benchmarks · Images

ToolLens

The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out of a total of 464, designed to better mimic real-world user interactions.

3 papers · 1 benchmark · Texts

RLHF-V Dataset


3 papers · 0 benchmarks · Images, Texts

WS353 (WordSim-353)

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414.

3 papers · 1 benchmark
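
WordSim-353 is conventionally scored with the Spearman rank correlation between human similarity ratings and a model's predicted (e.g. cosine) similarities over the same word pairs. A self-contained sketch of that metric follows; the rating and similarity values below are invented for illustration, not taken from the dataset.

```python
# Spearman rank correlation: Pearson correlation computed on the ranks
# of the two score lists, with ties sharing their average rank.

def rankdata(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho between two equal-length score lists."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [9.8, 8.5, 7.4, 3.9, 1.3]       # gold similarity ratings (illustrative)
model = [0.91, 0.83, 0.60, 0.35, 0.10]  # model cosine similarities (illustrative)
print(round(spearman(human, model), 3))  # perfectly monotone -> 1.0
```

Rank correlation is used rather than Pearson because only the ordering of pairs matters, not the scale of the model's similarity scores.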

GarmentCodeData (GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns)

GarmentCodeData contains 115,000 data points covering a variety of designs in many common garment categories: tops, shirts, dresses, jumpsuits, skirts, pants, etc. The garments are fitted to a variety of body shapes sampled from a custom statistical body model based on CAESAR, as well as to a standard reference body shape, and are draped in three different textile materials.

3 papers · 0 benchmarks · 3D, 3d meshes

AgentCourt

550 cases

3 papers · 0 benchmarks

MG-ShopDial

The MG-ShopDial dataset contains English conversations that mix different conversational goals, including search, recommendation, and question answering, in the domain of e-commerce. The dataset includes 64 high-quality dialogues with a total of 2,196 utterances for scenarios of varying complexity. Intent and goal annotations are available at the utterance level. Alongside MG-ShopDial, the data collection tool Coached Conversation Collector is released; it supports the coached human-human data collection protocol used to create MG-ShopDial.

3 papers · 0 benchmarks

SensumSODF (Sensum Solid Oral Dosage Forms)

Given the unavailability of real-world pharmaceutical inspection-domain datasets, we have created the Sensum Solid Oral Dosage Forms (SensumSODF) dataset intended for research and evaluation purposes.

3 papers · 0 benchmarks · Images

Tiny ImageNetV2

Tiny ImageNetV2 is a subset of the ImageNetV2 (matched-frequency) dataset by Recht et al. ("Do ImageNet Classifiers Generalize to ImageNet?"), containing 2,000 images spanning all 200 classes of the Tiny ImageNet dataset. It is a test set built by collecting images from the classes shared by Tiny ImageNet and ImageNet. The 64×64 images were collected from Flickr after a decade of progress on the original ImageNet dataset, and the collection process was designed to resemble the original ImageNet distribution. For further information, see the original ImageNetV2 GitHub repository.

3 papers · 0 benchmarks · Images
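
The joint-class construction behind such a subset can be sketched as follows: keep only the ImageNetV2 images whose class is also one of Tiny ImageNet's 200 classes, taking a fixed number per class (200 × 10 = 2,000). The class IDs and per-class image lists below are toy placeholders, not the actual ImageNetV2 files.

```python
# Illustrative sketch (not the authors' code) of building a joint-class
# subset: intersect the class sets, then cap the images kept per class.

def build_joint_subset(imagenetv2, tiny_classes, per_class=10):
    """imagenetv2: mapping class_id -> list of image ids.
    Returns the same mapping restricted to shared classes, capped per class."""
    subset = {}
    for cls, images in imagenetv2.items():
        if cls in tiny_classes:
            subset[cls] = images[:per_class]
    return subset

# Toy data: 3 ImageNetV2 classes, 2 of which are also Tiny ImageNet classes.
imagenetv2 = {"n01443537": list(range(12)),
              "n01629819": list(range(12)),
              "n09999999": list(range(12))}
tiny_classes = {"n01443537", "n01629819"}

subset = build_joint_subset(imagenetv2, tiny_classes)
print(sum(len(v) for v in subset.values()))  # 2 classes * 10 = 20
```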

SONICS (Synthetic Or Not - Identifying Counterfeit Songs)

SONICS is a large-scale dataset comprising 97,164 songs (48,090 real songs from YouTube and 49,074 fake songs from Suno and Udio) designed for synthetic song detection (SSD), also known as fake song detection (FSD). It addresses several limitations of existing datasets, such as the lack of end-to-end fake songs, limited diversity in music and lyrics, and insufficient long-duration songs. The average song length in SONICS is 176 seconds, which enables the capture of long-context relationships. SONICS provides open access to the generated fake songs and is divided into 66,709 training, 26,015 test, and 4,440 validation songs. Additionally, the inclusion of song lyrics paves the way for future research in this field.

3 papers · 0 benchmarks · Audio
Page 292 of 1000