Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)


FLIP (Fitness Landscape Inference for Proteins)

FLIP includes several benchmark datasets that contain a variety of protein sequences, each with a real-valued label indicating its "fitness" (how well the protein performs some particular function). The goal is to predict the fitness of a given protein from the sequence itself. Different representations of protein sequences (e.g., learned embeddings from large language models) may prove helpful here.

11 papers · 0 benchmarks · Biology
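To illustrate the task setup, here is a minimal, hypothetical sketch of sequence-to-fitness prediction: a 1-nearest-neighbour baseline using Hamming distance between equal-length protein sequences. The sequences and fitness labels below are made up, and real FLIP baselines use far stronger models (e.g., regressors over learned embeddings); this only shows the input/output shape of the problem.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_fitness(query: str, train: list[tuple[str, float]]) -> float:
    """Return the fitness label of the nearest training sequence."""
    _, label = min(((hamming(query, seq), fit) for seq, fit in train),
                   key=lambda t: t[0])
    return label

# Toy training set: (sequence, fitness) pairs with invented values.
train = [("MKTAY", 0.9), ("MKTAF", 0.7), ("MQTAY", 0.2)]
print(predict_fitness("MKTAW", train))  # → 0.9
```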

MoisesDB

MoisesDB is a dataset for musical source separation. It consists of 240 tracks from 45 artists, covering twelve musical genres. Each song is provided as its individual audio sources, organized in a two-level hierarchical taxonomy of stems, which facilitates building and evaluating fine-grained source separation systems that go beyond the four-stem limitation (drums, bass, other, and vocals) imposed by a lack of data. An easy-to-use Python library to download, process, and use MoisesDB is published alongside the dataset. The accompanying paper documents and analyzes the dataset contents and provides baseline results for open-source separation models at varying separation granularities (four, five, and six stems).

11 papers · 0 benchmarks

LOL-v2-synthetic (From Fidelity to Perceptual Quality: A Semi-Supervised Approach for Low-Light Image Enhancement)

The synthetic subset of the LOL-v2 low-light image enhancement dataset, introduced in the paper "From Fidelity to Perceptual Quality: A Semi-Supervised Approach for Low-Light Image Enhancement".

11 papers · 3 benchmarks

Mini Crosswords

We scrape data from GooBix, which contains 156 games of 5 × 5 mini crosswords. The goal is not just to solve the task, since more general crosswords can readily be solved with specialized NLP pipelines that leverage large-scale retrieval instead of an LM. Rather, we aim to explore the limits of LMs as general problem solvers that explore their own thoughts and guide their own exploration with deliberate reasoning as a heuristic.

11 papers · 0 benchmarks · Texts

Casia V1+

Casia V1 is a dataset for forgery classification. Casia V1+ is a modification of Casia V1, proposed by Chen et al., that replaces authentic images also present in Casia V2 with images from the COREL dataset to avoid data contamination.

11 papers · 19 benchmarks · Images

COST (COCO Segmentation Text)


11 papers · 0 benchmarks · Images, Texts

ConceptARC

The ConceptARC dataset is a benchmark for evaluating understanding and generalization in the Abstraction and Reasoning Corpus (ARC) domain. It was developed by Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The ability to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems.

11 papers · 0 benchmarks

LOL-Blur (low-blur/high-sharp-scaled)

LOL-Blur contains 170 videos for training and 30 videos for testing, each with 60 frames, amounting to 12,000 paired frames. (Note that the first and last 30 frames of each video are NOT consecutive, and their darkness is simulated differently as well.)

11 papers · 12 benchmarks

CALLHOME American English Speech

The CALLHOME English Corpus is a collection of unscripted telephone conversations between native speakers of English.

11 papers · 0 benchmarks

NPHardEval

NPHardEval is a dynamic benchmark designed to assess the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of algorithmic questions.

11 papers · 0 benchmarks

AmbigNQ

The AmbigNQ dataset is a resource for exploring ambiguity in open-domain question answering.

11 papers · 0 benchmarks

Object HalBench

Object HalBench is a benchmark used to evaluate the performance of language models, particularly multimodal ones (i.e., those that can process and generate both text and images). It is designed to test how well these models avoid "hallucinations": generating text that is not factually grounded in the images they are processing.

11 papers · 2 benchmarks

EARS-WHAM

The EARS-WHAM dataset mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. Speech and noise files are mixed at signal-to-noise ratios (SNRs) randomly sampled in the range [−2.5, 17.5] dB. The SNR is computed using loudness K-weighted relative to full scale (LKFS), standardized in ITU-R BS.1770, to obtain a more perceptually meaningful scaling and to exclude silent regions from the SNR computation.

11 papers · 6 benchmarks · Speech
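The mixing procedure above can be sketched as follows. This is a simplified illustration on synthetic signals: it uses plain RMS level in dB as a stand-in for the BS.1770 LKFS loudness (which additionally applies K-weighting and silence gating), so it shows only the gain arithmetic, not the actual EARS-WHAM pipeline.

```python
import math
import random

def rms_db(x):
    """Signal level in dB (RMS power); a crude stand-in for LKFS loudness."""
    return 10 * math.log10(sum(s * s for s in x) / len(x))

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech level minus noise level equals
    `snr_db`, then add the two signals sample-wise."""
    gain_db = rms_db(speech) - rms_db(noise) - snr_db
    g = 10 ** (gain_db / 20)
    return [s + g * n for s, n in zip(speech, noise)]

random.seed(0)
speech = [random.uniform(-1, 1) for _ in range(16000)]
noise = [random.uniform(-0.5, 0.5) for _ in range(16000)]
snr = random.uniform(-2.5, 17.5)  # SNR range used by EARS-WHAM
mixture = mix_at_snr(speech, noise, snr)
```

Because the noise gain is applied in the amplitude domain (divide dB by 20), the resulting power ratio between speech and scaled noise lands exactly at the sampled SNR.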

VNBench

VNBench is a comprehensive benchmark suite for video generative models, which evaluates video generation quality across specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. The suite includes 16 dimensions for evaluating Text-to-Video (T2V) models, such as subject consistency, motion smoothness, and overall consistency. VNBench also supports evaluating Image-to-Video (I2V) models and has recently introduced VBench-Long for evaluating longer videos. It is designed to align with human perception and to provide insights for future developments in video generation.

11 papers · 2 benchmarks

IMDB (Heterogeneous Node Classification)

A popular dataset for node classification on heterogeneous graphs.

11 papers · 3 benchmarks

HME100K

HME100K is a large-scale dataset of real-world handwritten mathematical expression images.

11 papers · 1 benchmark

MMR-Benchmark (Multi-Modal Reading Benchmark)

The Multi-Modal Reading (MMR) Benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, spatial relations, and grounding, with carefully designed evaluation metrics.

11 papers · 1 benchmark · Images, Texts

Cityscapes-Seq

Cityscapes-Seq is a standard dataset for semantic urban scene understanding, featuring real-world videos from 50 cities in Germany and neighboring countries. It comprises 2,975 training video clips and 500 validation video clips, each containing 30 consecutive frames.

11 papers · 0 benchmarks

DF40

Forgery diversity: DF40 comprises 40 distinct deepfake techniques (both representative and SOTA methods are included), facilitating the detection of today's SOTA deepfakes and AIGC. It provides 10 face-swapping methods, 13 face-reenactment methods, 12 entire-face-synthesis methods, and 5 face-editing methods.

11 papers · 0 benchmarks

GMAI-MMBench


11 papers · 0 benchmarks
Page 149 of 1000