Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

285 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3d meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • Midi: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • Cad: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

285 dataset results

AIDS Antiviral Screen

The AIDS Antiviral Screen dataset contains screening results for tens of thousands of compounds tested for evidence of anti-HIV activity. The results are available as chemical graph-structured data for these compounds.

1 paper · 0 benchmarks · Graphs

KACC

The KACC benchmark consists of three subtasks that can be applied to knowledge graphs: knowledge abstraction, knowledge concretization and knowledge completion.

1 paper · 0 benchmarks · Graphs

HAM (Human-annotated Mappings)

HAM is a dataset for molecular graph partitioning. It contains coarse-grained (CG) mappings of 1206 organic molecules with fewer than 25 heavy atoms. Each molecule was downloaded from the PubChem database as a SMILES string and hand-mapped. Each molecule was assigned to two annotators so that human agreement between CG mappings could be compared. The completed annotations were reviewed by a third person to identify and remove unreasonable mappings (e.g., one-bead mappings) that did not follow the given guidelines. Hence, there are 1.68 annotations per molecule in the current database (two per molecule, with 16% removed).

1 paper · 0 benchmarks · Graphs

SidechainNet

SidechainNet is a protein structure prediction dataset that directly extends ProteinNet. Specifically, SidechainNet adds measurements for protein angles and coordinates that describe the complete, all-atom protein structure (backbone and sidechain, excluding hydrogens) instead of the protein backbone alone.

1 paper · 0 benchmarks · Graphs

TextWorld KG

TextWorld KG is a dynamic Knowledge Graph (KG) extraction dataset. It is based on a set of text-based games generated with a framework that allows extracting the underlying partial KG for every state, i.e., the subgraph that represents the agent’s partial knowledge of the world (what it has observed so far). All games share the same overarching theme: the agent finds itself hungry in a simple modern house with the goal of gathering ingredients and cooking a meal.

1 paper · 0 benchmarks · Graphs, Texts

FB1.5M

The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and contains 30 relations with fewer than 500 triplets, which are treated as low-resource relations.

1 paper · 0 benchmarks · Graphs

PART-OF

The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.

1 paper · 0 benchmarks · Graphs, Medical, Texts

l2d (Learning to Dance)

This dataset is composed of paired videos of people dancing to three different music styles: Ballet, Michael Jackson and Salsa. It contains multimodal data (visual data, temporal graphs and audio) carefully selected from publicly available videos of dancers performing representative movements of each music style, along with audio data from the respective styles.

1 paper · 0 benchmarks · Actions, Audio, Graphs

hERG

hERG is a large-scale biophysics federated molecular dataset related to cardiac toxicity. It consists of 10,572 compounds, with an average of 29.39 nodes and 94.09 edges in each graph.

1 paper · 0 benchmarks · Graphs

JoCAD

JoCAD is a dataset for anomaly detection in citation networks.

1 paper · 0 benchmarks · Graphs

HoaxItaly

HoaxItaly consists of over 1 million tweets shared during 2019 and containing links to thousands of news articles published on two classes of Italian outlets: (1) disinformation websites, i.e. outlets which have been repeatedly flagged by journalists and fact-checkers for producing low-credibility content such as false news, hoaxes, click-bait, misleading and hyper-partisan stories; (2) fact-checking websites which notably debunk and verify online news and claims. The dataset includes title and body for approximately 37k news articles.

1 paper · 0 benchmarks · Graphs, Texts

YoutubeGraph-Dyn

YoutubeGraph-Dyn is an evolving graph dataset generated from real-world YouTube interactions. It can be used to study temporal evolution on graphs. YoutubeGraph-Dyn provides intra-day time granularity (416 snapshots taken every 6 hours over a period of 104 days), multi-modal relationships that capture different aspects of the data, and multiple attribute types, including timestamped and non-timestamped attributes, word embeddings, and integers.

1 paper · 0 benchmarks · Graphs

ReviewRobot Dataset

This repository contains data for the paper "ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis".

1 paper · 0 benchmarks · Graphs, Texts

Classic ECN AQM Fall-Back

Clickable heat-map visualizations of the experiments run to quantify the Classic ECN AQM problem and to evaluate the success of the Classic AQM Detection and Fall-back algorithm.

1 paper · 0 benchmarks · Graphs, Images

CTFW

CTFW is a large annotated procedural text dataset in the cybersecurity domain (3154 documents). It is used to generate flow graphs from procedural texts.

1 paper · 0 benchmarks · Graphs, Texts

LSEC (Live Stream E-Commerce)

The LSEC (Live Stream E-Commerce) dataset has two subsets: LSEC-Small and LSEC-Large. It is a dataset for studying e-commerce transactions in the context of live streams, where streamers talk about products while interacting with their audience. The dataset consists of interaction information among streamers, users, and products.

1 paper · 0 benchmarks · Graphs

Color-connectivity

Synthetic graph classification datasets with the task of recognizing the connectivity of same-colored nodes in 4 graphs of varying topology.

1 paper · 0 benchmarks · Graphs

MovieGraphBenchmark

The dataset contains entities from IMDB, TheMovieDB and TheTVDB with gold-standard matches between the sources. Due to IMDB's licensing, a script is provided so that users can build the IMDB part of the dataset themselves.

1 paper · 0 benchmarks · Graphs

Building air quality and pandemic risk simulation

The original paper contains a high-level explanation of the dataset characteristics and potential use cases. ArchABM can help to quantify the impact of some of these building- and company policy-related measures.

1 paper · 0 benchmarks · Graphs, Time series

GO21

GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89,127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as the ancestor-descendant prediction task proposed in the paper.

1 paper · 4 benchmarks · Biology, Graphs