Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3d meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • Midi (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • Cad (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

Hate Counter

This dataset is built from Twitter and contains 1290 pairs of hate tweets and counterspeech replies. After the annotation process, the dataset consists of 558 unique hate tweets from 548 users and 1290 counterspeech replies from 1239 users.

3 papers · 0 benchmarks · Texts

BB-MAS (Behavioural Biometrics Multi-device and multi-Activity data from Same users)

BB-MAS is a behavioural biometrics dataset. It consists of data collected from 117 subjects for typing (both fixed and free text), gait (walking, upstairs and downstairs) and touch on Desktop, Tablet and Phone. The dataset contains a total of about 3.5 million keystroke events, 57.1 million data points each for accelerometer and gyroscope, and 1.7 million data points for swipes. It enables future research to explore previously unexplored directions in inter-device and inter-modality biometrics.

3 papers · 0 benchmarks

LIDDI (LInked Drug-Drug Interactions)

LInked Drug-Drug Interactions (LIDDI) is a public nanopublication-based RDF dataset with trusty URIs that encompasses some of the most cited prediction methods and sources, giving researchers a resource for leveraging the work of others in their own prediction methods. Because one of the main obstacles to using external resources is mapping between the drug names and identifiers they use, the dataset also provides the set of mappings the authors curated to compare the multiple sources aggregated in the dataset.

3 papers · 0 benchmarks

Human Optical Flow (Human Optical Flow dataset)

A synthetic dataset of videos of human action sequences and the corresponding optical flow.

3 papers · 0 benchmarks · 3D

2devs

2devs is a publicly available dataset of fine-grained untangled code changes collected by recording the development sessions of two developers over the course of four months, and the corresponding manual clustering.

3 papers · 0 benchmarks

Software Heritage Graph Dataset

Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and allows researchers to explore a number of research questions on the long tail of public software development, instead of focusing solely on "most starred" repositories, as often happens.

3 papers · 0 benchmarks · Graphs

AIR-Act2Act

AIR-Act2Act is a human-human interaction dataset for teaching non-verbal social behaviors to robots. It differs from other datasets in that elderly people participated as performers. The authors recruited 100 elderly people and two college students to perform 10 interactions in an indoor environment. The entire dataset has 5,000 interaction samples, each of which contains depth maps, body indexes and 3D skeletal data captured with three Microsoft Kinect v2 cameras. In addition, the dataset contains the joint angles of a humanoid NAO robot, converted from the human behavior that robots need to learn.

3 papers · 0 benchmarks

RWCP-SSD-Onomatopoeia

RWCP-SSD-Onomatopoeia is a dataset consisting of 155,568 onomatopoeic words paired with audio samples for environmental sound synthesis.

3 papers · 0 benchmarks · Audio

putEMG

putEMG and putEMG-Force are databases of surface electromyographic (sEMG) activity recorded from the forearm. The datasets allow for the development of algorithms for gesture recognition and grasp force recognition. The experiment was conducted on 44 participants, with two repetitions separated by a minimum of one week. The dataset includes 7 active gestures (hand flexion, extension, etc.) plus idle, and a set of trials with isometric contractions. sEMG was recorded using a 24-electrode matrix.

3 papers · 0 benchmarks

COOLL (Controlled On/Off Loads Library)

Controlled On/Off Loads Library (COOLL) is a dataset of electrical current and voltage measurements, sampled at high frequency, representing the consumption of individual appliances. The measurements were taken in June 2016 in the PRISME laboratory of the University of Orléans, France. The appliances are mainly controllable appliances (i.e., their turn-on/off time instants can be precisely controlled). 42 appliances of 12 types were measured at a 100 kHz sampling frequency.

3 papers · 0 benchmarks

Natural Hazards Twitter Dataset

Natural Hazards is a natural disaster dataset with sentiment labels. It contains nearly 50,00 tweets about different natural disasters in the United States (e.g., a tornado in 2011, Hurricane Sandy in 2012, a series of floods in 2013, Hurricane Matthew in 2016, a blizzard in 2016, Hurricane Harvey in 2017, Hurricane Michael in 2018, a series of wildfires in 2018, and Hurricane Dorian in 2019).

3 papers · 0 benchmarks · Texts

DACT (Dataset of Annotated Car Trajectories)

DACT contains two subsets of annotated car trajectory data. The dataset contains 50 trajectories which cover about 13 hours of driving data. In DACT, we manually specified significant driving patterns by using an interactive framework. A significant driving pattern can be anything like a turn, speed-up, slow-down, etc. The annotation process consists of a crowd-sourcing task followed by comprehensive aggregation phases. The aggregation is done by two different strategies: Strict and Easy. The Strict strategy applies strict constraints when aggregating the crowd-sourcing results, while the Easy strategy applies flexible constraints to generate the second subset of DACT.

3 papers · 0 benchmarks

UPFD-GOS (User Preference-aware Fake News Detection)

The Gossipcop variant of the UPFD dataset for benchmarking.

3 papers · 2 benchmarks · Graphs, Texts

MRPB 1.0

MRPB 1.0 is a mobile robot local planning benchmark. The benchmark serves both motion planning researchers who want to compare the performance of a new local planner against many other state-of-the-art approaches and end users in the mobile robotics industry who want to select the local planner that performs best on their problems of interest.

3 papers · 0 benchmarks · Environment

Algonauts 2021 (How the Human Brain Makes Sense of a World in Motion)

The Algonauts dataset provides human brain responses to a set of 1,102 three-second video clips of everyday events. The brain responses are measured with functional magnetic resonance imaging (fMRI), a widely used brain imaging technique with high spatial resolution that measures blood flow changes associated with neural responses.

3 papers · 0 benchmarks · Videos, fMRI

Healthline

Healthline is a nutrition-related dataset for multi-document summarization, using scientific studies.

3 papers · 0 benchmarks · Texts

P3 (Psychophysical Patterns Dataset)

A set of patterns used in psychophysical research to evaluate the ability of saliency algorithms to find targets distinct from distractors in orientation, color and size. Each image is a 7x7 grid and contains a single target. All images are 1024x1024px and have corresponding ground truth masks for the target and distractors.

3 papers · 0 benchmarks · Images, Texts

KvasirCapsule-SEG

A video capsule endoscopy dataset for polyp segmentation.

3 papers · 2 benchmarks · Biomedical, Cad, Images, Medical

CollATe

The CollATe dataset is a large dataset consisting of two types of collusive entities on YouTube: videos submitted to gain collusive likes and comment requests, and channels submitted to gain collusive subscriptions.

3 papers · 0 benchmarks

OREBA (Objectively Recognizing Eating Behavior and Associated Intake)

The OREBA dataset aims to provide a comprehensive multi-sensor recording of communal intake occasions for researchers interested in automatic detection of intake gestures. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consists of synchronized frontal video and IMU with accelerometer and gyroscope for both hands.

3 papers · 0 benchmarks · Actions, Videos