Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3d meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • Midi (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • Cad (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

Tc1 Mouse cerebellum atlas (Tc1 Mouse cerebellum atlas with Purkinje layer segmentation)

This mouse cerebellar atlas can be used for mouse cerebellar morphometry.

2 papers | 0 benchmarks | 3D, Biomedical, Images, MRI, Medical

H01

The H01 dataset is a 1.4 petabyte rendering of a small sample of human brain tissue, released by a collaboration between the Lichtman Laboratory at Harvard University and Google. The H01 sample was imaged at 4 nm resolution by serial section electron microscopy, reconstructed and annotated by automated computational techniques, and analyzed for preliminary insights into the structure of the human cortex.

2 papers | 0 benchmarks | Biology

iMet Collection

A dataset for fine-grained art attribute recognition introduced in the 6th FGVC Workshop at CVPR 2019. It is a high-quality artwork image dataset with professional photographs of artworks from The Metropolitan Museum of Art and attribute labels curated or verified by experts.

2 papers | 0 benchmarks

The 'Call me sexist but' Dataset (CMSB)

Tweets and items from psychological scales for sexism detection with counterfactual examples.

2 papers | 0 benchmarks | Texts

UIT-ViSFD (Vietnamese Aspect-Based Sentiment Analysis Dataset)

UIT-ViSFD (Vietnamese Smartphone Feedback Dataset) is a benchmark corpus built on strict annotation schemes for evaluating aspect-based sentiment analysis. It consists of 11,122 human-annotated comments for mobile e-commerce and is freely available for research purposes.

2 papers | 0 benchmarks

Stanford Schema2QA Dataset

Schema2QA is the first large question answering dataset over real-world Schema.org data. It covers 6 common domains: restaurants, hotels, people, movies, books, and music, based on crawled Schema.org metadata from 6 different websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and last.fm). In total, there are over 2,000,000 examples for training, consisting of both augmented human paraphrase data and high-quality synthetic data generated by Genie. All questions are annotated with ThingTalk, an executable virtual assistant programming language.

2 papers | 0 benchmarks | Texts

Herbarium 2021 Half–Earth

The Herbarium Half-Earth dataset is a large and diverse dataset of herbarium specimens for automatic taxon recognition. The Herbarium 2021: Half-Earth Challenge dataset includes more than 2.5M images representing nearly 65,000 species from the Americas and Oceania that have been aligned to a standardized plant list.

2 papers | 2 benchmarks | Images

notMNIST

2 papers | 0 benchmarks | Images

5k_presetation_slides (5000 presentation slide pairs)

We crawled 5000 paper-slide pairs from conference proceedings websites (e.g. acl.org and usenix.org).

2 papers | 0 benchmarks | Texts

CiteWorth

CiteWorth is a large, contextualized, rigorously cleaned, labelled dataset for cite-worthiness detection built from a massive corpus of extracted plain-text scientific documents.

2 papers | 0 benchmarks | Texts

IndiaPoliceEvents

IndiaPoliceEvents is a corpus of 21,391 sentences from 1,257 English-language Times of India articles about events in the state of Gujarat during March 2002. This dataset is used for automated event extraction.

2 papers | 0 benchmarks | Texts

MultiOpEd

MultiOpEd is a corpus of multi-perspective news editorials. It is an open-domain news editorial corpus that supports various tasks pertaining to the argumentation structure in news editorials, focusing on automatic perspective discovery. News editorial is a genre of persuasive text, where the argumentation structure is usually implicit. However, the arguments presented in an editorial typically center around a concise, focused thesis, which we refer to as their perspective. MultiOpEd aims at supporting the study of multiple tasks relevant to automatic perspective discovery, where a system is expected to produce a single-sentence thesis statement summarizing the arguments presented.

2 papers | 0 benchmarks | Texts

Rent3D++

Rent3D++ is an extension of the Rent3D floorplans + photos dataset. The floorplans are annotated with room outline polygons, doors/windows as line segments, object-icons as axis-aligned bounding boxes, room-door-room connectivity graphs, and photo-room assignments. We have extracted rectified surface crops from architectural surfaces in photos, and these can drive interior texturing/material modeling tasks. This dataset can be used with our paper Plan2Scene to generate textured 3D mesh models of houses using floorplans and photos.

2 papers | 15 benchmarks | Graphs, Images

TI1K Dataset (Thumb Index 1000 Hand & Fingertip Detection Dataset)

Thumb Index 1000 (TI1K) is a dataset of 1000 hand images with the hand bounding box, and thumb and index fingertip positions. The dataset includes the natural movement of the thumb and index fingers making it suitable for mixed reality (MR) applications.

2 papers | 0 benchmarks | Actions, Environment, Images, RGB Video

TCIA Brain-Tumor-Progression

This collection includes datasets from 20 subjects with primary newly diagnosed glioblastoma who were treated with surgery and standard concomitant chemo-radiation therapy (CRT) followed by adjuvant chemotherapy. Two MRI exams are included for each patient: within 90 days following CRT completion and at progression (determined clinically, and based on a combination of clinical performance and/or imaging findings, and punctuated by a change in treatment or intervention). All image sets are in DICOM format and contain T1w (pre and post-contrast agent), FLAIR, T2w, ADC, normalized cerebral blood flow, normalized relative cerebral blood volume, standardized relative cerebral blood volume, and binary tumor masks (generated using T1w images). The perfusion images were generated from dynamic susceptibility contrast (GRE-EPI DSC) imaging following a preload of contrast agent. All of the series are co-registered with the T1+C images. The intent of this dataset is for assessing deep learnin

2 papers | 0 benchmarks | MRI

Python Programming Puzzles (P3)

Python Programming Puzzles (P3) is an open-source dataset where each puzzle is defined by a short Python program, and the goal is to find an input which makes the program output "True". The puzzles are objective in that each one is specified entirely by the source code of its verifier, so running the verifier is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding.

2 papers | 0 benchmarks | Texts
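
As a rough illustration of the puzzle format described in the P3 entry, the sketch below defines a verifier and searches for an input that makes it return True. The puzzle itself, the function names `sat` and `brute_force`, and the search range are assumptions made up for this example; they are not taken from the P3 dataset.

```python
from typing import Callable, Iterable, Optional


def sat(x: int) -> bool:
    """Hypothetical puzzle (not from P3): find a positive integer whose square is 1764."""
    return x > 0 and x * x == 1764


def brute_force(verifier: Callable[[int], bool],
                candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate the verifier accepts, or None if none passes."""
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None


# Checking a candidate needs only the verifier's source: no answer key is involved.
answer = brute_force(sat, range(1, 1000))
print(answer, sat(answer))
```

Exhaustive search is used here only because it is the simplest solver to read; any method that produces a candidate can be checked the same way, by calling the verifier on it.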

Titanic (Titanic - Machine Learning from Disaster)

Titanic Dataset Description

Overview

The data is divided into two groups:

  • Training set (train.csv): Used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, allowing models to predict survival based on “features” like gender and class. Feature engineering can also be applied to create new features.
  • Test set (test.csv): Used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.

2 papers | 1 benchmark | Tabular

Learning to Autofocus

This dataset contains 510 focal stacks (49 different focal distances) from in-the-wild scenes with calculated depth from SFM. This dataset was designed for research on autofocus but can be used for any research involving focal stacks, defocus cues, or depth signals (particularly close depth).

2 papers | 0 benchmarks

RyanSpeech

RyanSpeech is a speech corpus for research on automated text-to-speech (TTS) systems. This dataset contains textual materials from real-world conversational settings. These materials contain over 10 hours of a professional male voice actor's speech recorded at 44.1 kHz.

2 papers | 0 benchmarks | Speech

DisKnE (Disease Knowledge Evaluation)

DisKnE is a benchmark for Disease Knowledge Evaluation built from MedNLI and MEDIQA-NLI. This benchmark is constructed to specifically test the medical reasoning capabilities of ML models, such as mapping symptoms to diseases.

2 papers | 0 benchmarks | Medical, Texts
Page 313 of 1000