Datasets

19,997 machine learning datasets

USR-TopicalChat

This dataset was collected to assess dialog evaluation metrics. In the paper "USR: An Unsupervised and Reference Free Evaluation Metric for Dialog" (Mehri and Eskenazi, 2020), the authors use this data to measure the quality of several existing word-overlap and embedding-based metrics, as well as their newly proposed USR metric.

7 papers | 4 benchmarks | Texts

USR-PersonaChat

7 papers | 4 benchmarks | Texts

TransCG

TransCG is the first large-scale real-world dataset for transparent object depth completion and grasping. It contains 57,715 RGB-D images of 51 transparent objects and many opaque objects, captured from different perspectives (~240 viewpoints) across 130 scenes in real-world settings. The samples are captured by two types of cameras (RealSense D435 and L515).

7 papers | 18 benchmarks | RGB-D

EvoGym (Evolution Gym)

EvoGym is a large-scale benchmark for co-optimizing the design and control of soft robots.

7 papers | 0 benchmarks | Environment

RefSeer

A data set containing citations, citation contexts, and papers.

7 papers | 0 benchmarks | Texts

NOAA Atmospheric Temperature Dataset

This dataset contains meteorological observations (temperature) from land-based weather stations in the United States, collected from the Online Climate Data Directory of the National Oceanic and Atmospheric Administration (NOAA). The weather stations are sampled from the Western and Southeastern states that actively recorded meteorological observations during 2015. The one year of sequential hourly temperature records is divided into small sequences of 24 hours. The data are split sequentially by month: 8 months for training, 2 for validation, and 2 for testing.
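The segmentation and split described for this dataset can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the function names are hypothetical, the temperature values are placeholders, and it assumes that "sequential" means January to August for training, September to October for validation, and November to December for testing.

```python
from datetime import datetime, timedelta


def hourly_timestamps(year=2015):
    """Yield one timestamp per hour of the given year (hypothetical helper)."""
    t = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    while t < end:
        yield t
        t += timedelta(hours=1)


def split_daily_sequences(records):
    """Group (timestamp, value) records into 24-hour sequences, then assign
    each sequence to a split by the month of its first record.

    Assumed month boundaries: 1-8 train, 9-10 validation, 11-12 test.
    """
    splits = {"train": [], "val": [], "test": []}
    day = []
    for ts, value in records:
        day.append((ts, value))
        if len(day) == 24:  # one complete 24-hour sequence
            month = day[0][0].month
            if month <= 8:
                name = "train"
            elif month <= 10:
                name = "val"
            else:
                name = "test"
            splits[name].append(day)
            day = []
    return splits


# Placeholder readings (0.0) stand in for real station temperatures.
records = [(ts, 0.0) for ts in hourly_timestamps()]
splits = split_daily_sequences(records)
```

Because the records start at midnight on 1 January, each 24-hour sequence lines up with one calendar day, so the split sizes equal the number of days in each month range.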

7 papers | 2 benchmarks

PET (PET: A new Dataset for Process Extraction from Natural Language Text)

The dataset contains 45 documents, each a narrative description of a business process, together with their annotations. The documents are annotated with activities, gateways, actors, and flow information.

7 papers | 0 benchmarks | Texts

ChangeIt

ChangeIt is a dataset of more than 2,600 hours of video showing state-changing actions, published at CVPR 2022.

7 papers | 0 benchmarks

BigDetection

BigDetection is a new large-scale benchmark for building more general and powerful object detection systems. It combines the training data of existing datasets (LVIS, OpenImages and Objects365) according to carefully designed principles to curate a larger dataset for improved detector pre-training. The BigDetection dataset has 600 object categories and contains 3.4M training images with 36M object bounding boxes.

7 papers | 0 benchmarks | Images

New3

New3 is a set of 527 instances from AMR 3.0 whose original source was the DARPA LORELEI project. These instances are not included in the AMR 2.0 training set and consist of excerpts from newswire and online forums.

7 papers | 2 benchmarks | Graphs, Texts

SF-XL test v1 (San Francisco eXtra Large test v1)

Version 1 of the test set for the San Francisco eXtra Large dataset.

7 papers | 3 benchmarks

PDNC (Project Dialogism Novel Corpus)

An annotated dataset of quotations and within-quotation mentions in 22 full-length English novels.

7 papers | 0 benchmarks | Texts

FLAT

FLAT is a synthetic dataset of 2,000 time-of-flight (ToF) measurements that capture ToF nonidealities and can be used to simulate different hardware.

7 papers | 0 benchmarks

SUES-200

A multi-height, multi-scene cross-view image dataset spanning drone and satellite imagery.

7 papers | 0 benchmarks

Bamboo

The Bamboo dataset is a mega-scale, information-dense dataset for both classification and detection pre-training. It is built by integrating 24 public datasets (e.g. ImageNet, Places365, Objects365, OpenImages) and adding new annotations through active learning. Bamboo has 69M image classification annotations and 32M object bounding boxes.

7 papers | 0 benchmarks | Images

BS-RSC

BS-RSC is a real-world rolling shutter (RS) correction dataset, released with a corresponding model for correcting the RS frames in a distorted video. Real distorted videos and their corresponding ground truth are recorded simultaneously with a beam-splitter-based acquisition system. BS-RSC contains various motions of both camera and objects in dynamic scenes.

7 papers | 2 benchmarks | Videos

CVRPTW

Randomly sampled instances of the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) for 20, 50 and 100 customer nodes.

7 papers | 0 benchmarks | Environment, Graphs

SEN12MS-CR-TS

SEN12MS-CR-TS is a multi-modal and multi-temporal data set for cloud removal. It contains time series of paired, co-registered Sentinel-1 data and cloudy as well as cloud-free Sentinel-2 data from the European Space Agency's Copernicus mission. Each time series contains 30 cloudy and clear observations regularly sampled throughout the year 2018. The data set is pre-processed and backward-compatible with SEN12MS-CR.

7 papers | 8 benchmarks | Hyperspectral images, Images, Time series

PMData

The PMData dataset aims to combine traditional lifelogging with sports activity logging.

7 papers | 0 benchmarks

Flickr30k-CNA (Flickr30k-Chinese All)

The earlier Flickr30k-CN dataset translates the training and validation sets of Flickr30k by machine translation and the test set manually. We check the machine-translated results and find two kinds of problems: (1) some sentences have language problems and translation errors; (2) some sentences have poor semantics. In addition, the different translation methods used for the training and test sets prevent models from achieving accurate performance. We gather 6 professional English and Chinese linguists to meticulously re-translate all of the Flickr30k data and double-check each sentence.

7 papers | 0 benchmarks
