TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images3,275
  • Texts3,148
  • Videos1,019
  • Audio486
  • Medical395
  • 3D383
  • Time series298
  • Graphs285
  • Tabular271
  • Speech199
  • RGB-D192
  • Environment148
  • Point cloud135
  • Biomedical123
  • LiDAR95
  • RGB Video87
  • Tracking78
  • Biology71
  • Actions68
  • 3d meshes65
  • Tables52
  • Music48
  • EEG45
  • Hyperspectral images45
  • Stereo44
  • MRI39
  • Physics32
  • Interactive29
  • Dialog25
  • Midi22
  • 6D17
  • Replay data11
  • Financial10
  • Ranking10
  • Cad9
  • fMRI7
  • Parallel6
  • Lyrics2
  • PSG2

19,997 dataset results

SF-XL (San Francisco eXtra Large)

Large scale dataset for visual geo-localization / visual place recognition. It provides images from the city of San Francisco, labeled with GPS coordinates and heading.

20 papers0 benchmarks

MSU NR VQA Database (MSU No-Reference Video Quality Assessment Database)

The dataset was created for video quality assessment problem. It was formed with 36 clips from Vimeo, which were selected from 18,000+ open-source clips with high bitrate (license CCBY or CC0).

20 papers15 benchmarksImages, Videos

HIV (Human Immunodeficiency Virus)

The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM).

20 papers0 benchmarks3D

MINTAKA

MINTAKA is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.

20 papers0 benchmarksTexts

Chameleon (48%/32%/20% fixed splits)

Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.

20 papers2 benchmarksGraphs

20NewsGroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

20 papers6 benchmarks

RSSCN7

he RSSCN7 dataset contains satellite images acquired from Google Earth, which is originally collected for remote sensing scene classification. We conduct image synthesis on RSSCN7 to make it capable of the image inpainting task. It has seven classes: grassland, farmland, industrial and commercial regions, river and lake, forest field, residential region, and parking lot. Each class has 400 images, so there are total 2,800 images in the RSSCN7 dataset.

20 papers1 benchmarksImages

AVSBench (Audio −Visual Segmentation)

AVSBench is a pixel-level audio-visual segmentation benchmark that provides ground truth labels for sounding objects. The dataset is divided into three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied:

20 papers0 benchmarksAudio, Videos

PhotoChat

PhotoChat, the first dataset that casts light on the photo sharing behavior in online messaging. PhotoChat contains 12k dialogues, each of which is paired with a user photo that is shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task that predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task that retrieves the most relevant photo according to the dialogue context.

20 papers10 benchmarksImages, Texts

FMB Dataset (Full-time Multi-modality Benchmark Dataset)

FMB contains 1500 well-registered infrared and visible image pairs with 14 annotated pixel-level categories. Also, it covers a wide range of pixel variations and various severe environments, e.g., dense fog, heavy rain, and low-light condition. The FMB dataset includes rich scenes under different illumination conditions, so that it enables fusion/segmentation model to improve the generalization ability greatly. We labeled 98.16% of all pixels into 14 different categories including Road, Sidewalk, Building, Traffic Light, Traffic Sign, Vegetation, Sky, Person, Car, Truck, Bus, Motorcycle, Bicycle and Pole, which often appear in real world automatic driving and semantic understanding tasks.

20 papers2 benchmarksImages

Social Chemistry 101

The Social Chemistry 101 dataset is a collection of over 290,000 rules of thumb (ROTs) for evaluating people’s behavior in everyday social situations.

20 papers0 benchmarks

Harm-P

Harm-P contains around 3,000 memes related to US politics.

20 papers2 benchmarks

MD22

Using conservation of energy -- a fundamental property of closed classical and quantum mechanical systems -- we develop an efficient gradient-domain machine learning (GDML) approach to construct accurate molecular force fields using a restricted number of samples from ab initio molecular dynamics (AIMD) trajectories. The GDML implementation is able to reproduce global potential energy surfaces of intermediate-sized molecules with an accuracy of 0.3 kcal $\text{mol}^{-1}$ for energies and 1 kcal $\text{mol}^{-1}$ $\text{\AA}^{-1}$ for atomic forces using only 1000 conformational geometries for training. We demonstrate this accuracy for AIMD trajectories of molecules, including benzene, toluene, naphthalene, ethanol, uracil, and aspirin. The challenge of constructing conservative force fields is accomplished in our work by learning in a Hilbert space of vector-valued functions that obey the law of energy conservation. The GDML approach enables quantitative molecular dynamics simulations

20 papers0 benchmarks3D

GEOM-DRUGS

GEOM-DRUGS is a dataset of 430,000 large organic molecules of up to 180 atoms from Axelrod and Gómez-Bombarelli, Nature Scientific Data, 2022.

20 papers3 benchmarks3D, Graphs

PRID2011 (Person RE-ID 2011)

PRID 2011 is a person reidentification dataset that provides multiple person trajectories recorded from two different static surveillance cameras, monitoring crosswalks and sidewalks. The dataset shows a clean background, and the people in the dataset are rarely occluded. In the dataset, 200 people appear in both views. Among the 200 people, 178 people have more than 20 appearances

19 papers5 benchmarks

FBMS-59 (Freiburg-Berkeley Motion Segmentation)

The Freiburg-Berkeley Motion Segmentation Dataset (FBMS-59) is a dataset for motion segmentation, which extends the BMS-26 dataset with 33 additional video sequences. A total of 720 frames is annotated. FBMS-59 comes with a split into a training set and a test set. Typical challenges appear in both sets.

19 papers36 benchmarksImages, Videos

CoNLL04

The CoNLL04 dataset is a benchmark dataset used for relation extraction tasks. It contains 1,437 sentences, each of which has at least one relation. The sentences are annotated with information about entities and their corresponding relation types.

19 papers10 benchmarks

University-1652

Contains data from three platforms, i.e., synthetic drones, satellites and ground cameras of 1,652 university buildings around the world. University-1652 is a drone-based geo-localization dataset and enables two new tasks, i.e., drone-view target localization and drone navigation.

19 papers4 benchmarksImages

FNC-1 (Fake News Challenge Stage 1)

FNC-1 was designed as a stance detection dataset and it contains 75,385 labeled headline and article pairs. The pairs are labelled as either agree, disagree, discuss, and unrelated. Each headline in the dataset is phrased as a statement

19 papers6 benchmarksTexts

SIDER

SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts. The available information include side effect frequency, drug and side effect classifications as well as links to further information, for example drug–target relations.

19 papers5 benchmarks
PreviousPage 104 of 1000Next