Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

AV-Deepfake1M

The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most research efforts in this domain focus on detecting high-quality deepfake images and videos, only a few works address the localization of small segments of audio-visual manipulation embedded in real videos. The authors emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects, resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline, accompanied by a rigorous analysis of the quality of the generated data and a comprehensive benchmark of the dataset using state-of-the-art deepfake detection and localization methods.

9 papers · 0 benchmarks · Videos
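Because AV-Deepfake1M labels must mark which short segments of an otherwise real video were manipulated, and in which modality, a small sketch of one plausible label layout may help. The field names and structure below are hypothetical, not the dataset's actual annotation schema.

    from dataclasses import dataclass, field
    from typing import List, Literal

    @dataclass
    class FakeSegment:
        start: float  # segment start, seconds
        end: float    # segment end, seconds
        modality: Literal["video", "audio", "audio-visual"]  # what was manipulated

    @dataclass
    class VideoLabel:
        video_id: str
        is_fake: bool  # any manipulated segment present?
        segments: List[FakeSegment] = field(default_factory=list)  # empty for pristine videos

    # Example: one short audio-visual manipulation embedded in a real clip.
    label = VideoLabel(
        video_id="subj_0001_clip_03",
        is_fake=True,
        segments=[FakeSegment(start=2.4, end=3.1, modality="audio-visual")],
    )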

TellMeWhy

Answering questions about why characters perform certain actions is central to understanding and reasoning about narratives. The actions people perform are steps of plans to achieve their desired goals. When interpreting language, humans naturally understand the reasons behind described actions, even when the reasons are left unstated. Despite recent progress in question answering, it is not clear if existing models can answer "why" questions that may require commonsense knowledge external to the input narrative.

9 papers · 0 benchmarks

VDS dataset: Multi-exposure stack-based inverse tone mapping

Each static scene is captured as a stack of seven multi-exposure ground-truth images at EV 0, ±1, ±2, and ±3. The dataset comprises 96 scene stacks (672 images in total; outdoor: 504, indoor: 168).

9 papers · 12 benchmarks · Images
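As a quick illustration of what the EV 0, ±1, ±2, ±3 stack means photometrically, each +1 EV step doubles the captured exposure relative to the EV 0 reference. A minimal sketch:

    # Relative exposure of each image in a seven-shot stack, EV 0 as reference.
    evs = [-3, -2, -1, 0, 1, 2, 3]
    relative_exposure = {ev: 2.0 ** ev for ev in evs}
    print(relative_exposure)  # {-3: 0.125, ..., 0: 1.0, ..., 3: 8.0}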

KaMed

KaMed is a knowledge-aware medical dialogue dataset, which contains over 60,000 medical dialogue sessions with 5,682 entities (such as Asthma and Atropine).

9 papers · 0 benchmarks · Texts

AdvBench

AdvBench is a benchmark designed to systematically evaluate the effectiveness of adversarial attacks against aligned language models, based on two distinct settings (harmful strings and harmful behaviors).

9 papers · 0 benchmarks

ATEPP (Automatically Transcribed Expressive Piano Performances)

ATEPP is a dataset of expressive piano performances by virtuoso pianists. The dataset contains 11,677 performances (~1,000 hours) by 49 pianists and covers 1,580 movements by 25 composers. All of the MIDI files in the dataset come from piano transcription of existing audio recordings of piano performances. Scores in MusicXML format are also available for around half of the tracks. The dataset is organized and aligned by composition and movement for comparative studies.

9 papers · 0 benchmarks · MIDI, Music
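Since every ATEPP performance ships as a transcribed MIDI file, a short sketch of inspecting one with the pretty_midi library may be useful; the file name is a placeholder, not an actual path from the dataset.

    import pretty_midi

    # Load one transcribed performance (placeholder file name).
    pm = pretty_midi.PrettyMIDI("performance.mid")
    notes = [n for inst in pm.instruments for n in inst.notes]
    print(f"{len(notes)} notes over {pm.get_end_time():.1f} s")

    # Expressive timing and dynamics live in onsets, durations, and velocities.
    first = notes[0]
    print(first.start, first.end, first.pitch, first.velocity)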

Animal3D

Accurately estimating 3D pose and shape is an essential step towards understanding animal behavior and can benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of a comprehensive and diverse dataset with high-quality 3D pose and shape annotations. Animal3D is the first comprehensive dataset for mammalian 3D pose and shape estimation. It consists of 3,379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and, importantly, the pose and shape parameters of the SMAL model. All annotations were labeled and checked manually in a multi-stage process to ensure the highest-quality results. Based on the Animal3D dataset, the paper benchmarks representative shape and pose estimation models in three settings: (1) supervised learning from only the Animal3D data, (2) synthetic-to-real transfer from synthetically generated images, and (3) fine-tuning human pose and shape estimation models.

9 papers · 4 benchmarks
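A hypothetical sketch of what one Animal3D annotation might look like, assuming 2D keypoints plus SMAL pose/shape parameters; field names and array dimensions are illustrative, not the dataset's actual schema.

    import numpy as np

    annotation = {
        "image": "images/00042.jpg",        # placeholder file name
        "species": "tiger",                 # one of the 40 mammal species
        "keypoints_2d": np.zeros((26, 3)),  # (x, y, visibility) for the 26 keypoints
        "smal_pose": np.zeros((33, 3)),     # per-joint axis-angle rotations (joint count illustrative)
        "smal_shape": np.zeros(41),         # SMAL shape coefficients (dimension illustrative)
    }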

CC3D

The CC3D dataset [1] of 3D CAD models was collected from a free online service for sharing CAD designs [2]. In total, it contains 50k+ models, unrestricted to any category, with complexity varying from simple to highly detailed designs. These CAD models were converted to meshes, and each mesh was virtually scanned using a proprietary 3D scanning pipeline developed by Artec3D [3]. The typical size of the resulting scans is on the order of 100K points and faces, while the meshes converted from CAD models are usually more than an order of magnitude lighter. The availability of CAD-to-scan pairings, the high resolution of the meshes, and the variability of the models make the CC3D dataset stand out among the alternatives.

9 papers · 8 benchmarks
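To make the scan-versus-CAD size gap concrete, here is a rough sketch using the trimesh library; the file names are placeholders for one CC3D scan/CAD pair.

    import trimesh

    scan = trimesh.load_mesh("pair_0001_scan.ply")  # high-res virtual scan (~100K faces)
    cad = trimesh.load_mesh("pair_0001_cad.obj")    # mesh converted from the CAD model

    print(f"scan: {len(scan.vertices)} vertices, {len(scan.faces)} faces")
    print(f"cad:  {len(cad.vertices)} vertices, {len(cad.faces)} faces")
    # Per the description, this ratio is typically more than an order of magnitude.
    print(f"face ratio: {len(scan.faces) / len(cad.faces):.1f}x")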

RU-APC (Rutgers APC)

The RU-APC (Rutgers APC) dataset is a resource for researchers and developers working on robotic perception solutions for warehouse picking challenges.

9 papers · 0 benchmarks

PhenoBench (A Large Dataset and Benchmarks for Semantic Image Interpretation in the Agricultural Domain)

The PhenoBench dataset contains multiple image segmentation challenges from the agricultural domain.

9 papers · 0 benchmarks · Images

UHD-IQA

We introduce a novel Image Quality Assessment (IQA) dataset comprising 6,073 UHD-1 (4K) images, annotated at a fixed width of 3840 pixels. Contrary to existing No-Reference (NR) IQA datasets, ours focuses on highly aesthetic photos of high technical quality, filling a gap in the literature. The images, carefully curated to exclude synthetic content, are sufficiently diverse to train general NR-IQA models. Importantly, the dataset is annotated with perceptual quality ratings obtained through a crowdsourcing study. Ten expert raters, comprising photographers and graphic artists, assessed each image at least twice in multiple sessions spanning several days, resulting in highly reliable labels. Annotators were rigorously selected based on several metrics, including self-consistency, to ensure their reliability. The dataset also includes rich metadata with user- and machine-generated tags from over 5,000 categories, and popularity indicators such as favorites, likes, downloads, and views.

9 papers · 4 benchmarks · Images
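One way rater self-consistency can be checked in a repeated-rating study like this is to correlate a rater's two passes over the same images; the sketch below is an illustration under that assumption, not the paper's exact metric.

    import numpy as np

    def self_consistency(first_pass: np.ndarray, second_pass: np.ndarray) -> float:
        """Pearson correlation between two rating passes over the same images."""
        return float(np.corrcoef(first_pass, second_pass)[0, 1])

    # Simulated rater: noisy observations of a latent quality score in [1, 5].
    rng = np.random.default_rng(0)
    truth = rng.uniform(1, 5, size=100)
    pass1 = truth + rng.normal(0, 0.3, size=100)
    pass2 = truth + rng.normal(0, 0.3, size=100)
    print(f"self-consistency: {self_consistency(pass1, pass2):.2f}")  # near 1.0 for a reliable rater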

Freebase (Heterogeneous Node Classification)

A popular dataset for node classification on heterogeneous graphs.

9 papers · 4 benchmarks

OAG-Venue

A popular dataset for node classification on heterogeneous graphs.

9 papers · 2 benchmarks

LIAR-RAW

LIAR-RAW extends the public LIAR-PLUS dataset (Alhindi et al., 2018) with relevant raw reports containing fine-grained claims from PolitiFact. It is based on LIAR, whose gold labels come from PolitiFact. To reduce the dependency on fact-checked reports, each claim in the public LIAR dataset was paired with additional raw reports, and these reports were consolidated into a single file in the LIAR format.

9 papers · 0 benchmarks
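A hypothetical sketch of one consolidated LIAR-RAW record, assuming a LIAR-style claim paired with its retrieved raw reports; field names are illustrative, and the released file should be consulted for the actual schema.

    record = {
        "claim": "Example claim text.",
        "label": "half-true",  # one of LIAR's six PolitiFact labels
        "reports": [           # raw reports retrieved for this claim
            {"title": "Report headline", "content": "Report body text..."},
        ],
    }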

SALMon

The SALMon dataset and benchmark were introduced in the paper "A Suite for Acoustic Language Model Evaluation", with the goal of evaluating the modelling abilities of speech language models with regard to different kinds of acoustic elements.

9 papers · 8 benchmarks · Audio

SD-Eval

Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. The authors argue that this is due to the lack of principles for task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, they present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data.

9 papers · 0 benchmarks · Audio, Speech, Texts

iWildCam2020-WILDS

The iWildCam2020-WILDS dataset is a variant of the iWildCam 2020 dataset, designed as a benchmark for testing out-of-distribution (OOD) generalization on the task of species classification. The label space consists of 182 species. Each domain corresponds to a different camera-trap location, and the training and test images belong to disjoint sets of locations in the OOD setting.

9 papers · 1 benchmark · Images, Texts
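iWildCam2020-WILDS is distributed through the WILDS benchmark package, so loading it likely follows the standard wilds API; a minimal sketch, assuming pip install wilds plus PyTorch/torchvision (the resize dimensions are illustrative):

    from wilds import get_dataset
    from wilds.common.data_loaders import get_train_loader
    import torchvision.transforms as transforms

    # Download the iWildCam variant and take its training split.
    dataset = get_dataset(dataset="iwildcam", download=True)
    train_data = dataset.get_subset(
        "train",
        transform=transforms.Compose(
            [transforms.Resize((448, 448)), transforms.ToTensor()]
        ),
    )
    train_loader = get_train_loader("standard", train_data, batch_size=16)

    # Each batch carries metadata, including the camera-trap location (domain).
    for x, y, metadata in train_loader:
        break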

Cybench

Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, the authors introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. It includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, each task also comes with subtasks that break it down into intermediary steps.

9 papers · 0 benchmarks
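A hypothetical sketch of the information a Cybench-style task specification would need to carry (description, starter files, final flag, and graded subtasks); field names are illustrative, not Cybench's actual schema.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Subtask:
        question: str  # intermediary step, e.g. "Which port is the service listening on?"
        answer: str    # expected answer, for automated grading

    @dataclass
    class CTFTask:
        name: str
        competition: str          # one of the 4 source CTF competitions
        description: str          # task prompt shown to the agent
        starter_files: List[str]  # files placed in the agent's environment
        flag: str                 # final answer that solves the task
        subtasks: List[Subtask] = field(default_factory=list)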

VLM²-Bench

VLM²-Bench (Benchmarking Vision-Language Models on Visual Cue Matching) is the first comprehensive benchmark designed to evaluate vision-language models' (VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily. A key example is identifying the same person across different photos without prior knowledge of their identity.

9 papers · 10 benchmarks · Images, Texts, Videos

Kinetics-700-2020

We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results using the I3D network.

9 papers · 1 benchmark · Videos
Page 167 of 1000