Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

VRDS (Video Raindrop and Rain Streak Removal)

We generate a synthesized dataset, namely VRDS, with 102 rainy videos from diverse scenarios; each video frame has the corresponding rain streak map, raindrop mask, and the underlying rain-free clean image (ground truth). This dataset serves as a valuable resource for researchers developing and testing novel methods for removing rain streaks and raindrops from video data. To enable models to cope with various lighting conditions, we considered different weather scenarios, particularly cloudy conditions, given the close correlation between cloudy and rainy weather. All scenarios are present in both the training and test sets, allowing fair and accurate comparisons between different methods on our dataset. We captured a total of 102 videos, 72 of which were used for training and 30 for testing. The selected video resolution is 1280×720, and each video contains 100 frames.
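As a quick arithmetic check on the split described above (all counts are taken from the description; no file names or paths are assumed):

```python
# Sanity-check the VRDS split sizes quoted in the description.
train_videos, test_videos = 72, 30
frames_per_video = 100
width, height = 1280, 720  # per-frame resolution

total_videos = train_videos + test_videos   # 72 train + 30 test
total_frames = total_videos * frames_per_video

assert total_videos == 102      # matches the stated total
assert total_frames == 10_200   # 102 videos x 100 frames each
```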

8 papers · 2 benchmarks

ColonINST-v1

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, and 303,001 colonoscopy images, including 128,620 positive and 174,381 negative cases collected from 19 publicly available datasets. We enhanced 128,620 colonoscopy images with detailed captions using a pipeline that interacts with GPT-4V through custom prompts, enriching the dataset for AI model training. We finally restructured 450,724 visual dialogues to guide the AI model through four downstream tasks critical for multimodal medical AI applications.
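The image counts above are internally consistent; a one-line check (numbers copied from the description):

```python
# Positive and negative case counts from the ColonINST-v1 description.
positive_cases, negative_cases = 128_620, 174_381
total_images = positive_cases + negative_cases
assert total_images == 303_001  # matches the stated dataset size
```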

8 papers · 0 benchmarks

ColonINST-v1 (Seen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy images, including 128,620 positive and 174,381 negative cases collected from 19 publicly available datasets. We enhanced 128,620 colonoscopy images with detailed captions using a pipeline that interacts with GPT-4V through custom prompts, enriching the dataset for AI model training. We finally restructured 450,724 visual dialogues to guide the AI model through four downstream tasks critical for multimodal medical AI applications.

8 papers · 2 benchmarks

ColonINST-v1 (Unseen)

ColonINST is a large-scale instruction tuning dataset designed for multimodal analysis in colonoscopy. This dataset comprises 62 categories, 303,001 colonoscopy images, including 128,620 positive and 174,381 negative cases collected from 19 publicly available datasets. We enhanced 128,620 colonoscopy images with detailed captions using a pipeline that interacts with GPT-4V through custom prompts, enriching the dataset for AI model training. We finally restructured 450,724 visual dialogues to guide the AI model through four downstream tasks critical for multimodal medical AI applications.

8 papers · 2 benchmarks

Spider 2.0

Spider 2.0 is a comprehensive code generation agent task that includes 632 examples. The agent has to interactively explore various types of databases, such as BigQuery, Snowflake, Postgres, ClickHouse, DuckDB, and SQLite. It is required to engage with complex SQL workflows, process extensive contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines across multiple interactions.

8 papers · 2 benchmarks · Environment, Texts

ACI-Bench

ACI-Bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

8 papers · 3 benchmarks

TruckScenes (MAN TruckScenes)

Autonomous trucking is a promising technology that can greatly impact modern logistics and the environment. Ensuring its safety on public roads is one of the main duties, and it requires an accurate perception of the environment. To achieve this, machine learning methods rely on large datasets, but to this day no such datasets are available for autonomous trucks. In this work, we present MAN TruckScenes, the first multimodal dataset for autonomous trucking. MAN TruckScenes allows the research community to engage with truck-specific challenges, such as trailer occlusions, novel sensor perspectives, and terminal environments, for the first time. It comprises more than 740 scenes of 20 s each under a multitude of different environmental conditions. The sensor set includes 4 cameras, 6 lidar sensors, 6 radar sensors, 2 IMUs, and a high-precision GNSS. The dataset's 3D bounding boxes were manually annotated and carefully reviewed to achieve a high quality standard. Bounding boxes are available.

8 papers · 12 benchmarks · 3D, Images, Point cloud

ZEB (Zero-shot Evaluation Benchmark)

An evaluation benchmark, ZEB, for image matching, built by merging 8 real-world datasets and 4 simulated datasets with diverse image resolutions, scene conditions, and viewpoints.

8 papers · 1 benchmark · Images

SimplerEnv-Widow X (SimplerEnv: Simulated Manipulation Policy Evaluation Environments for Real Robot Setups)

Significant progress has been made in building generalist robot manipulation policies, yet their scalable and reproducible evaluation remains challenging, as real-world evaluation is operationally expensive and inefficient. We propose employing physical simulators as efficient, scalable, and informative complements to real-world evaluations. These simulation evaluations offer valuable quantitative metrics for checkpoint selection, insights into potential real-world policy behaviors or failure modes, and standardized setups to enhance reproducibility.

8 papers · 6 benchmarks

LUN

LUN is used for unreliable news source classification; the dataset includes 17,250 articles from satire, propaganda, and hoax sources.

8 papers · 3 benchmarks

MIA-Bench

A benchmark to evaluate complex instruction following.

8 papers · 0 benchmarks

VL-RewardBench

Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed: even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing.

8 papers · 0 benchmarks · Images

OpenS2V-Eval

OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, incorporating both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore, and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 14 representative S2V models, highlighting their strengths and weaknesses across different content.

8 papers · 32 benchmarks · Images, Texts, Videos

DTTD-Mobile

Are current 3D object tracking methods truly robust enough for low-fidelity depth sensors like the iPhone LiDAR? We introduce DTTD-Mobile (fully compatible with the YCB toolbox), a new benchmark built on real-world data captured from mobile devices: 18 objects observed in 100 videos with 47,668 sampled frames and 114,143 object annotations. We evaluate several popular methods, including BundleSDF, ES6D, MegaPose, and DenseFusion, and highlight their limitations in this challenging setting.

8 papers · 30 benchmarks · 3D, Images, Videos

FeTS2022 (Federated Tumor Segmentation Challenge 2022)


8 papers · 0 benchmarks · 3D, Images, Medical

TimeBank

Enriches the TimeML annotations of TimeBank by adding information about the Topic Time in terms of Klein (1994). The annotations are partly automatic, partly inferential and partly manual. The corpus was converted into the native format of the annotation software GraphAnno and POS-tagged using the Stanford bidirectional dependency network tagger.

7 papers · 4 benchmarks

IconArt

This dataset contains 5,955 painting images (from WikiCommons): a train set of 2,978 images and a test set of 2,977 images (for the classification task). 1,480 of the 2,977 test images are annotated with bounding boxes for 7 iconographic classes: ‘angel’, ‘Child_Jesus’, ‘crucifixion_of_Jesus’, ‘Mary’, ‘nudity’, ‘ruins’, ‘Saint_Sebastien’.

7 papers · 5 benchmarks · Images

Ego2Top

Contains annotated egocentric and top-view videos.

7 papers · 3 benchmarks

PROBA-V (PROBA-V Super-Resolution dataset)

The PROBA-V Super-Resolution dataset is the official dataset of ESA's Kelvins competition for "PROBA-V Super Resolution". It contains satellite data from 74 hand-selected regions around the globe at different points in time. The data is composed of radiometrically and geometrically corrected Top-Of-Atmosphere (TOA) reflectances for the RED and NIR spectral bands at 300m and 100m resolution in Plate Carrée projection. The 300m resolution data is delivered as 128x128 grey-scale pixel images, the 100m resolution data as 384x384 grey-scale pixel images. Additionally, a quality map is provided for each pixel, indicating whether the pixels are concealed (e.g. by clouds, ice, water, missing information, etc.).
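Note that the pixel dimensions and ground-sampling distances quoted above imply the same 3× super-resolution factor; a small consistency check (values taken from the description):

```python
# Consistency check: pixel scale vs. ground-sampling-distance (GSD) scale.
lr_pixels, hr_pixels = 128, 384   # 300 m and 100 m resolution image sizes
lr_gsd_m, hr_gsd_m = 300, 100     # metres per pixel at each resolution

pixel_scale = hr_pixels / lr_pixels  # how much larger the HR image is
gsd_scale = lr_gsd_m / hr_gsd_m      # how much finer the HR sampling is

assert pixel_scale == gsd_scale == 3.0  # both imply a 3x SR factor
```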

7 papers · 4 benchmarks · Images

WeChat

The WeChat dataset for fake news detection contains more than 20k news items, each labelled as fake news or not.

7 papers · 2 benchmarks · Texts

Page 179 of 1000