Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

Video Waterdrop Removal Dataset

Due to the lack of training data for video waterdrop removal, we propose a large-scale synthetic dataset with simulated waterdrops in complex driving scenes on rainy days.

5 papers · 2 benchmarks · Videos

Infinity-MM

We collect, organize, and open-source the large-scale multimodal instruction dataset Infinity-MM, consisting of tens of millions of samples. Through quality filtering and deduplication, the dataset achieves high quality and diversity. We also propose a synthetic data generation method based on open-source models and a labeling system, using detailed image annotations and diverse question generation.

5 papers · 0 benchmarks · Images, Texts, Videos

Multi-IF (Multi-turn and multilingual instruction following)

We introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which uses a hybrid framework combining LLMs and human annotators, expands upon IFEval by incorporating multi-turn sequences and translating the English prompts into seven additional languages, resulting in a dataset of 4,501 multilingual conversations, each with three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All tested models showed a higher rate of failure in executing instructions correctly with each additional turn: for example, o1-preview drops from 0.877 average accuracy (over all languages) at the first turn to 0.707 at the third turn. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.

5 papers · 0 benchmarks

ComplexCodeEval

ComplexCodeEval is an evaluation benchmark designed to accommodate multiple downstream tasks, accurately reflect different programming environments, and deliberately avoid data-leakage issues. The benchmark includes a diverse set of samples from real-world projects, aiming to closely mirror actual development scenarios.

5 papers · 0 benchmarks

FlenQA

A synthetically generated QA dataset for text-based reasoning. Each sample consists of a True/False question over two pieces of information required to answer it (the context); we create multiple versions of different lengths by embedding the context parts within longer, irrelevant texts. To ensure that models utilize their entire input, the dataset is composed of tasks in which both pieces of information must be reasoned over together in order to answer the question correctly. At the same time, the tasks are kept simple enough that models answer most of them correctly when the information pieces are presented on their own, with no additional padding.

5 papers · 0 benchmarks
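The FlenQA construction described above can be sketched roughly as follows. This is a hypothetical illustration only, not the authors' actual pipeline: the function name, `filler_sentences`, and the random placement scheme are assumptions.

```python
import random

def build_flenqa_style_sample(fact_a, fact_b, question, filler_sentences, target_len):
    """Sketch of FlenQA-style length padding: embed the two context facts
    required by the question at random positions inside irrelevant filler,
    yielding a longer version of the same True/False task."""
    padding = random.sample(filler_sentences, k=target_len)
    # Choose two distinct insertion points, then splice in both facts
    # so the model must locate and combine them across the padded text.
    i, j = sorted(random.sample(range(len(padding) + 1), 2))
    padded = padding[:i] + [fact_a] + padding[i:j] + [fact_b] + padding[j:]
    return {"context": " ".join(padded), "question": question}
```

Varying `target_len` while keeping the same two facts and question produces the multiple length versions of a single sample that the description mentions.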

MultiOFF

Introduced in the paper Multimodal Meme Dataset (MultiOFF) for Identifying Offensive Content in Image and Text.

5 papers · 2 benchmarks

Helvipad

The Helvipad dataset is a real-world stereo dataset designed for omnidirectional depth estimation. It comprises 39,553 paired equirectangular images captured using a top-bottom 360° camera setup and corresponding pixel-wise depth and disparity labels derived from LiDAR point clouds. The dataset spans diverse indoor and outdoor scenes under varying lighting conditions, including night-time environments.

5 papers · 24 benchmarks · Images

HDR-GS (HDR-GS: Efficient High Dynamic Range Novel View Synthesis at 1000x Speed via Gaussian Splatting)

This is a dataset for high dynamic range novel view synthesis. It was collected by HDR-NeRF and recalibrated by HDR-GS for research on 3DGS-based algorithms. The dataset contains 8 synthetic scenes and 4 real scenes.

5 papers · 3 benchmarks

HourVideo

We introduce HourVideo, a benchmark dataset for hour-long video-language understanding. HourVideo consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. We hope to establish HourVideo as a benchmark challenge to spur the development of advanced multimodal models capable of truly understanding endless streams of visual data.

5 papers · 0 benchmarks · Texts, Videos

LongVideoBench (A Benchmark for Long-context Interleaved Video-Language Understanding)

Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as accurately retrieving and reasoning over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, each question contains a referring query that references related video contexts, called the referred context; the model is then required to reason over relevant video details from that referred context.

5 papers · 2 benchmarks · Videos

DD100

A large-scale and diverse duet interactive dance dataset, recording about 117 minutes of professional dancers' performances.

5 papers · 0 benchmarks · 3D, Music

PECAN (Paratope-Epitope Complexes for Antibody Networks (PECAN))

The PECAN dataset provides structural data for antibody-antigen interactions, specifically curated for paratope and epitope binding site prediction. It includes a diverse set of antibody-antigen complexes, ensuring a well-balanced and representative dataset for training and evaluating deep learning models in protein-protein interaction (PPI) tasks.

5 papers · 4 benchmarks · Biology

TextAtlasEval

A dense-text image benchmark for evaluating large generative models' text generation ability.

5 papers · 15 benchmarks · Images, Texts

Open6DOR V2 (Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach)

We introduce a challenging and comprehensive benchmark for open-instruction 6-DoF object rearrangement tasks, termed Open6DOR.

5 papers · 6 benchmarks · Images, Texts

Marvel

A multidimensional AVR (abstract visual reasoning) benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.

5 papers · 0 benchmarks

StreamingBench

StreamingBench evaluates Multimodal Large Language Models (MLLMs) on real-time, streaming video understanding tasks.

5 papers · 0 benchmarks

RefRef (RefRef: A Synthetic Dataset and Benchmark for Reconstructing Refractive and Reflective Objects)

RefRef is a synthetic dataset and benchmark designed for the task of reconstructing scenes with complex refractive and reflective objects. Our dataset consists of 50 objects categorized based on their geometric and material complexity: single-material convex objects, single-material non-convex objects, and multi-material non-convex objects, where the materials have different colors, opacities, and refractive indices. Each object is placed in three distinct bounded environments and one unbounded environment, resulting in 150 unique scenes with diverse geometries, material properties, and backgrounds. Our dataset provides a controlled setting for evaluating and developing 3D reconstruction and novel view synthesis methods that handle complex optical effects.

5 papers · 1 benchmark · 3D, Images

RealMAN (A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization)

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated Microphone Array speech & Noise (RealMAN) dataset, which provides annotated multi-channel speech and noise recordings for dynamic speech enhancement and localization.

5 papers · 7 benchmarks · Audio, Speech

ImplicitQA

The ImplicitQA dataset was introduced in the paper ImplicitQA: Going beyond frames towards Implicit Video Reasoning.

5 papers · 0 benchmarks · Texts, Videos

PCBA

The PCBA dataset is a collection of high-quality dose-response data, formulated as a multitask learning benchmark from 128 high-throughput screening (HTS) assays.

4 papers · 3 benchmarks
Page 229 of 1000