Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

148 dataset results

lilGym

lilGym is a benchmark for language-conditioned reinforcement learning in a visual environment. It is based on 2,661 highly compositional, human-written natural language statements grounded in an interactive visual environment. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty.

1 paper · 0 benchmarks · Environment

pursuitMW (Multi-agent pursuit in matrix world)

Multi-agent pursuit in matrix world (pursuitMW) is a partially observable Markov game (POMG) between a swarm of pursuers and a swarm of evaders. Algorithms can be developed for the pursuers, the evaders, or both.

1 paper · 0 benchmarks · Environment
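Pursuit-evasion games of this kind are often prototyped with simple grid-world heuristics before learning is applied. The sketch below is an illustrative greedy pursuer step under partial observability; the function name, observation radius, and movement rules are assumptions for illustration, not the actual pursuitMW API.

```python
def greedy_step(pursuer, evaders, radius=2):
    """Move a pursuer one grid cell toward the nearest *visible* evader.

    Partial observability: evaders outside the Chebyshev `radius`
    are invisible, so the pursuer holds its position.
    (Illustrative dynamics only, not the pursuitMW rules.)
    """
    px, py = pursuer
    visible = [e for e in evaders
               if max(abs(e[0] - px), abs(e[1] - py)) <= radius]
    if not visible:
        return pursuer  # nothing observed, hold position
    # Chase the closest visible evader (Manhattan distance).
    tx, ty = min(visible, key=lambda e: abs(e[0] - px) + abs(e[1] - py))
    dx = (tx > px) - (tx < px)   # per-axis step of -1, 0, or +1
    dy = (ty > py) - (ty < py)
    return (px + dx, py + dy)
```

A swarm policy would apply this per pursuer each tick; an evader policy could mirror it with the sign of the step flipped.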

Harmonized US National Health and Nutrition Examination Survey (NHANES) 1988-2018

The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposures of the non-institutionalized US population. Such data have considerable potential for understanding how the environment and behaviors impact human health, and they are currently leveraged to answer public health questions such as the prevalence of disease. However, these data must first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross-examination and considerable effort, but it is required for accurately and reproducibly characterizing associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets, with accompanying code, by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018).

1 paper · 0 benchmarks · Biomedical, Environment, Tabular

Complete data from the Barro Colorado 50-ha plot: 423617 trees, 35 years

The 50-ha plot at Barro Colorado Island was initially demarcated and fully censused in 1982, and has been fully censused seven times since, every 5 years from 1985 through 2015. Every measurement of every stem over the 8 censuses is included in this archive. Most users will need only the 8 R Analytical Tables in the tree format, which come zipped together into a single archive (bci.tree.zip), plus the single R Species Table.

1 paper · 0 benchmarks · Environment

PushWorld

PushWorld is an environment with simplistic physics that requires manipulation planning with both movable obstacles and tools. It contains more than 200 PushWorld puzzles in PDDL and in an OpenAI Gym environment.

1 paper · 0 benchmarks · Environment
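Since the puzzles ship as an OpenAI Gym environment, interaction follows the standard reset/step loop. The sketch below runs that loop against a trivial stand-in environment, because the real PushWorld import path and puzzle names are not given in the blurb; `StubPuzzle` and its "solved after N pushes" dynamics are purely illustrative.

```python
import random

class StubPuzzle:
    """Minimal stand-in for a Gym-style PushWorld puzzle
    (the real package's import path isn't stated here)."""
    def __init__(self, goal_steps=3):
        self.goal_steps = goal_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                      # observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.goal_steps   # "solved" after N pushes
        reward = 1.0 if done else 0.0
        return self.t, reward, done, {}

def run_episode(env, policy, max_steps=50):
    """Classic Gym interaction loop: reset, then step until done."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

# Random agent over 4 hypothetical push directions.
ret = run_episode(StubPuzzle(), policy=lambda obs: random.choice(range(4)))
```

Swapping `StubPuzzle` for an actual PushWorld puzzle instance would leave `run_episode` unchanged, which is the point of the Gym interface.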

Sonicverse

Sonicverse is a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear. Sonicverse models realistic continuous audio rendering in 3D environments in real time. Together with a new audio-visual VR interface that allows humans to interact with agents via audio, Sonicverse enables a series of embodied AI tasks that require audio-visual perception.

1 paper · 0 benchmarks · Environment

RoseBlooming-Dataset

The RoseBlooming dataset is a stage-specific flower detection dataset of overhead images covering two rose cultivars (‘Samourai 08’ and ‘Blossom Pink’), filmed over a period of months. The dataset has 519 images, most of which contain several bounding boxes, for a total of over 7,000 bounding boxes. The developmental stages of flowering branches were visually classified and annotated into two stages: rose_small and rose_large. The dataset contains images under various weather conditions.

1 paper · 0 benchmarks · Environment, Images

UIUC Scooping Dataset (Granular Materials Manipulation Dataset with Scooping/Digging/Excavation Action)

Overview: This dataset comprises 6,700 executed scoops (excavations), mapped across a broad spectrum of materials, terrain topographies, and compositions.

1 paper · 0 benchmarks · Environment, Images, Point cloud, RGB-D, Time series

BurnMD (A Fire Projection and Mitigation Modeling Dataset)

A dataset of 308 medium-sized fires from the years 2018-2021, complete with both time-series airborne inference and ground-based operational estimation of fire extent, as well as operational mitigation data such as control-line construction.

1 paper · 0 benchmarks · Environment, Time series

SICKLE (Satellite Imagery for Cropping annotated with Keyparameter LabEls)

The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation data in agriculture, there is a scarcity of curated and labelled datasets, which limits their potential for training ML models for remote sensing (RS) in agriculture. To this end, we introduce a first-of-its-kind dataset called SICKLE, which constitutes a time series of multi-resolution imagery from 3 distinct satellites: Landsat-8, Sentinel-1, and Sentinel-2. Our dataset covers multi-spectral, thermal, and microwave sensors over the January 2018 - March 2021 period. We construct each temporal sequence by considering the cropping practices followed by farmers primarily engaged in paddy cultivation in the Cauvery Delta region of Tamil Nadu, India, and annotate the corresponding imagery with key cropping parameters at multiple resolutions (i.e. 3 m, 10 m, and 30 m). Our dataset comprises 2,370 season-wise samples from 388 unique plots.

1 paper · 1 benchmark · Environment, Images, Time series

Pedestrian Evacuation Optimization

Optimization of pedestrian evacuation in different environments

1 paper · 0 benchmarks · Environment

JAMBO (A Multi-Annotator Image Dataset for Benthic Habitat Classification)

The JAMBO dataset contains 3,290 underwater images of the seabed captured by an ROV in temperate waters in the Jammer Bay area off the north-west coast of Jutland, Denmark. All the images have been annotated by six annotators with one of three classes: sand, stone, or bad.

1 paper · 0 benchmarks · Environment, Images
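With six annotators per image and three classes, a common first step is to aggregate the labels per image by majority vote. The sketch below is one such aggregation; the tie-breaking rule (falling back to "bad" on ambiguous images) is an illustrative choice, not a rule stated for JAMBO.

```python
from collections import Counter

def majority_label(annotations):
    """Aggregate one image's labels from several annotators by
    majority vote. Ties fall back to 'bad' (an illustrative
    convention, not JAMBO's official protocol)."""
    counts = Counter(annotations)
    ranked = counts.most_common()
    best, n = ranked[0]
    # If more than one class reaches the top count, the image is ambiguous.
    if sum(1 for _, c in ranked if c == n) > 1:
        return "bad"
    return best

label = majority_label(["sand", "sand", "stone", "sand", "bad", "sand"])
```

Disagreement rates from the same `Counter` could also be kept as a soft-label or annotator-reliability signal rather than collapsed to a single class.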

Social-HM3D

The scenes derive from the photo-realistic HM3D dataset. Our dataset offers a wide variety of environments tailored to social navigation tasks, with carefully calibrated human density and realistic, natural human motion patterns. These features ensure balanced interaction dynamics across diverse scenes, facilitating the development of more effective social navigation algorithms.

1 paper · 0 benchmarks · Environment

Social-MP3D

The scenes derive from the photo-realistic MP3D dataset. Our dataset offers a wide variety of environments tailored to social navigation tasks, with carefully calibrated human density and realistic, natural human motion patterns. These features ensure balanced interaction dynamics across diverse scenes, facilitating the development of more effective social navigation algorithms.

1 paper · 0 benchmarks · Environment

MAX-60K (Masked Autoencoder for X-ray Fluorescence 60K Dataset)

The masked autoencoder for X-ray fluorescence (XRF) dataset is a follow-up to the dataset of Chao et al. (2022). Beyond the published pairs of XRF spectra and target measurements (CaCO3 and TOC), we additionally upload the XRF spectra from that project that lack aligned target measurements. As the first large XRF dataset compiled in an ML-friendly format, we expect it to kick off more ML studies in XRF and geology, especially DL studies.

1 paper · 0 benchmarks · Environment

WiFiCam

The WiFiCam dataset supports through-wall imaging based on WiFi channel state information. The corresponding source code repository is located at: https://github.com/StrohmayerJ/wificam

1 paper · 0 benchmarks · Environment, Images, RGB Video, Time series

EGO-CH-Gaze (Learning to Detect Attended Objects in Cultural Sites with Gaze Signals and Weak Object Supervision)

To study the problem of weakly supervised attended object detection in cultural sites, we collected and labeled a dataset of egocentric images acquired from subjects visiting a cultural site. The dataset has been designed to offer a snapshot of the subject’s visual experience while visiting a museum and contains labels for several artworks and details attended by the subjects.

1 paper · 0 benchmarks · Environment, Images, Videos

Plancraft

An evaluation dataset for planning with LLM agents.

1 paper · 0 benchmarks · Environment, Images, Texts

BIRDeep (BIRDeep_AudioAnnotations)

The BIRDeep Audio Annotations dataset is a collection of bird vocalizations from Doñana National Park, Spain. It was created as part of the BIRDeep project, which aims to optimize the detection and classification of bird species in audio recordings using deep learning techniques. The dataset is intended for use in training and evaluating models for bird vocalization detection and identification.

1 paper · 0 benchmarks · Audio, Biology, Environment, Images

DARai (Daily Activity Recordings for AI and ML applications)

Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors, including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3

1 paper · 0 benchmarks · Biomedical, Environment, Images, LiDAR, RGB-D, Time series, Videos
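The three-level L1/L2/L3 hierarchy maps naturally onto a nested record per labeled segment. The sketch below shows one plausible shape for such a record; the activity, action, and procedure names are invented for illustration and are not taken from DARai's actual label set or file format.

```python
# Illustrative three-level annotation record in the spirit of DARai's
# hierarchy (all names below are hypothetical, not DARai labels).
annotation = {
    "L1_activity": "prepare_drink",          # high-level, independent task
    "L2_actions": [
        {
            "L2_action": "pour",             # pattern shared across activities
            "L3_procedures": [               # fine-grained execution steps
                "grasp_cup",
                "tilt_bottle",
                "place_bottle",
            ],
        },
    ],
}

def count_procedures(record):
    """Count the fine-grained L3 steps under one L1 activity."""
    return sum(len(a["L3_procedures"]) for a in record["L2_actions"])
```

Because L2 actions are shared between L1 activities, a real loader would likely intern action and procedure names in a shared vocabulary rather than duplicating strings per record.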
Page 7 of 8