Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3D meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • MIDI: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • CAD: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

19,997 dataset results

ShapeNet Intrinsic Images v1.0

The synthetic ShapeNet intrinsic image decomposition dataset used for training the deep CNN models IntrinsicNet and RetiNet (CVPR 2018). See Section 4.1 of the paper for details.

2 papers · 0 benchmarks · Images

HalluEditBench

HalluEditBench is a comprehensive benchmark for evaluating knowledge editing methods' effectiveness in correcting real-world hallucinations. HalluEdit features a rigorously constructed dataset spanning nine domains and 26 topics. It evaluates methods across five dimensions: Efficacy, Generalization, Portability, Locality, and Robustness.

2 papers · 0 benchmarks · Texts

I2-2000FPS

I2-2000FPS is the first high-speed video dataset offering an unprecedented temporal resolution of 2000 frames per second (fps). Captured using the commercially available Chronos 1.4 high-speed CMOS camera, the dataset includes a diverse range of objects varying in size, shape, orientation, and motion, as well as various camera movements. This dataset is designed to enable research in areas such as motion analysis, object tracking, and scene understanding at extreme temporal resolutions. Potential applications span fields like sports analysis, robotics, autonomous navigation, and high-speed videography.

2 papers · 0 benchmarks · Images, Videos

UPenn-GBM (The University of Pennsylvania glioblastoma (UPenn-GBM) cohort)

This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format. Scans were initially skull-stripped and co-registered, before their tumor segmentation labels were produced by an automated computational method. These segmentation labels were revised and any label misclassifications were manually corrected/approved by expert board-certified neuroradiologists. The final labels were used to extract a rich panel of imaging features, including intensity, volumetric,

2 papers · 0 benchmarks · MRI, Tabular

Metadata for all 622 UCI datasets

This dataset contains a 2022 extraction of metadata for all 622 datasets that existed at the time in the UCI Machine Learning Repository. Each record contains the index; the name; the url; the instances (number of lines); the number of attributes (columns); the year it was created; the area, such as Life, Social, etc.; the web_hits at the time; the data folder url, where the data were on the internet; the dataset_file_url, the URL for the data; the dataset_file_format (format, such as data, txt, Z, etc.); the names_file_url, the URL of the files that describe the attributes; the names_file_format, the format of those files; the attribute_info, which describes all the attributes or columns in the dataset; the source; the data_set_information; the relevant_papers associated with the dataset; the papers_that_cite_this_data_set; and a final column with the number of papers that cite the dataset.

2 papers · 0 benchmarks · Tabular
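
The columns listed in the "Metadata for all 622 UCI datasets" entry can be pictured as one record per dataset. The following is only a sketch: the field types are guesses, several field names are normalized from the prose (for example `data_folder_url` and `attributes`), and `citing_paper_count` is a hypothetical name for the unnamed final column.

```python
from dataclasses import dataclass, fields


@dataclass
class UCIDatasetRecord:
    """One row of the metadata table (field types are assumptions)."""
    index: int
    name: str
    url: str
    instances: int                 # number of lines
    attributes: int                # number of columns
    year: int
    area: str                      # e.g. "Life", "Social"
    web_hits: int
    data_folder_url: str
    dataset_file_url: str
    dataset_file_format: str       # e.g. "data", "txt", "Z"
    names_file_url: str
    names_file_format: str
    attribute_info: str
    source: str
    data_set_information: str
    relevant_papers: str
    papers_that_cite_this_data_set: str
    citing_paper_count: int        # hypothetical name for the final column


# Column names in the order the description lists them.
COLUMN_NAMES = [f.name for f in fields(UCIDatasetRecord)]
```

The actual file may use different column names or types, so check its header before relying on this layout.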

gitbug-java

A set of Java bugs with executable test cases.

2 papers · 0 benchmarks

COVID-QU-Ex

The researchers of Qatar University have compiled the COVID-QU-Ex dataset, which consists of 33,920 chest X-ray (CXR) images including:

  • 11,956 COVID-19
  • 11,263 Non-COVID infections (Viral or Bacterial Pneumonia)
  • 10,701 Normal

Ground-truth lung segmentation masks are provided for the entire dataset. This is the largest lung mask dataset ever created.

2 papers · 0 benchmarks · Images
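
As a quick sanity check on the COVID-QU-Ex figures, the three class counts can be summed and turned into class shares. This is a minimal sketch using only the numbers quoted in the description; the variable names are illustrative and not part of the dataset.

```python
# Class counts as quoted in the COVID-QU-Ex description.
class_counts = {
    "COVID-19": 11_956,
    "Non-COVID infections": 11_263,
    "Normal": 10_701,
}

# The counts should add up to the stated 33,920 images.
total = sum(class_counts.values())

# Fraction of the dataset that falls in each class.
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print(f"{label}: {share:.1%}")
```

The classes come out roughly balanced, with each covering about a third of the images.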

MultiPref

2 papers · 0 benchmarks

latent-dna-diffusion

2 papers · 0 benchmarks

SynMirror

SynMirror consists of samples rendered from 3D assets of two widely used 3D object datasets, Objaverse and Amazon Berkeley Objects (ABO), placed in front of a mirror in a virtual Blender environment. The total number of rendered samples is 198,204. Each rendering contains colors, category_id_segmaps, depth, normals, and cam_states.

2 papers · 0 benchmarks · Images

PhysioNet Challenge 2019

Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019

2 papers · 0 benchmarks

InfiniteBench (∞Bench: Extending Long Context Evaluation Beyond 100K Tokens)

InfiniteBench is a benchmark for evaluating the ability of language models to process, understand, and reason over very long contexts (100k+ tokens). Long contexts are crucial for enhancing applications with LLMs and achieving high-level interaction. InfiniteBench is designed to push the boundaries of language models by testing them against a context length of 100k+ tokens, which is 10 times longer than traditional datasets.

2 papers · 0 benchmarks · Texts

Reddit Ideological and Extreme Bias Dataset

Articles originating from subreddits with explicitly stated ideologies are categorized into three groups: 72,488 articles in the Liberal class, 79,573 articles in the Conservative class, and 225,083 articles in the Restricted class.

2 papers · 2 benchmarks · Tables, Tabular, Texts

ThreatGram 101 - Extreme Telegram Data (ThreatGram 101 - Extreme Telegram Replies Data with Threat Levels)

Data 1: Raw and Unlabeled; 2 million unlabeled replies from 17 Telegram channels. Data 2: Raw and Labeled; 15,076 replies from 17 Telegram channels categorized as no threat, judicial threat, and non-judicial threat.

2 papers · 2 benchmarks · Texts

Situation Puzzle

LLMs' lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. The benchmark contains 975 graded situation puzzles across three difficulty levels.

2 papers · 0 benchmarks · Texts

LeukemiaAttri

The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological information. This dataset comprises a total of 28.9K images (2.4K × 2 × 3 × 2), which were captured using both low-cost and high-cost microscopes at three different resolutions: 10x, 40x, and 100x, utilizing various cameras. In addition to providing location annotations for each white blood cell (WBC), the dataset includes comprehensive morphological attributes for every WBC, enhancing its utility for research and analysis in the field.

2 papers · 6 benchmarks · Biology, Biomedical, Images, Medical

NASA Li-ion Dataset

Experiments on Li-ion batteries, charged and discharged at different temperatures, with impedance recorded as the damage criterion. The data set was provided by the NASA Prognostics Center of Excellence (PCoE).

2 papers · 1 benchmark

AIME (AI Music Evaluation Dataset)

The AIME dataset contains 6,000 audio tracks generated by 12 music generation models in addition to 500 tracks from MTG-Jamendo. The prompts used to generate music are combinations of representative and diverse tags from the MTG-Jamendo dataset.

2 papers · 0 benchmarks · Audio, Music

capsule vision challenge 2024

2 papers · 0 benchmarks

Decoder_Encoder_Dict

The dataset enables the mapping from text space to numerical space and vice versa.

2 papers · 0 benchmarks