19,997 machine learning datasets
This dataset was collected with the goal of assessing dialog evaluation metrics. In the paper, USR: An Unsupervised and Reference Free Evaluation Metric for Dialog (Mehri and Eskenazi, 2020), the authors collect this data to measure the quality of several existing word-overlap and embedding-based metrics, as well as their newly proposed USR metric.
TransCG is the first large-scale real-world dataset for transparent object depth completion and grasping, which contains 57,715 RGB-D images of 51 transparent objects and many opaque objects captured from different perspectives (~240 viewpoints) of 130 scenes under real-world settings. The samples are captured by two different types of cameras (Realsense D435 & L515).
EvoGym is a large-scale benchmark for co-optimizing the design and control of soft robots.
A data set containing citations, citation contexts, and papers.
This dataset contains meteorological observations (temperature) from land-based weather stations in the United States, collected from the Online Climate Data Directory of the National Oceanic and Atmospheric Administration (NOAA). The weather stations are sampled from the Western and Southeastern states that actively recorded meteorological observations during 2015. The 1-year sequence of hourly temperature records is divided into small sequences of 24 hours. For training, validation, and testing, a sequential 8-2-2 (months) split is used.
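The windowing and split described for the NOAA temperature data can be sketched as follows. This is a minimal illustration, not the dataset authors' preprocessing code: the function name `split_and_window`, the synthetic temperature values, the assignment of January to August to training, September and October to validation, and November and December to test, and the use of non-overlapping 24-hour windows are all assumptions.

```python
from datetime import datetime, timedelta


def split_and_window(records, window=24):
    """Split hourly (timestamp, temperature) records 8-2-2 by month,
    then cut each split into non-overlapping windows of `window` hours.

    Assumption: months 1-8 are train, 9-10 validation, 11-12 test.
    """
    splits = {"train": [], "val": [], "test": []}
    for ts, temp in records:
        if ts.month <= 8:
            splits["train"].append(temp)
        elif ts.month <= 10:
            splits["val"].append(temp)
        else:
            splits["test"].append(temp)

    # Drop any incomplete window at the end of a split.
    return {
        name: [vals[i:i + window]
               for i in range(0, len(vals) - window + 1, window)]
        for name, vals in splits.items()
    }


# Synthetic hourly series for 2015 (placeholder values, not real NOAA data).
start = datetime(2015, 1, 1)
records = [(start + timedelta(hours=h), 10.0 + (h % 24) * 0.5)
           for h in range(365 * 24)]

sequences = split_and_window(records)
print({name: len(seqs) for name, seqs in sequences.items()})
```

Under these assumptions each split holds one 24-hour sequence per calendar day, so the counts follow directly from the number of days in each group of months.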
The dataset contains 45 documents with narrative descriptions of business processes and their annotations. The documents are annotated with activities, gateways, actors, and flow information.
ChangeIt is a dataset of more than 2600 hours of video with state-changing actions, published at CVPR 2022.
BigDetection is a new large-scale benchmark to build more general and powerful object detection systems. It leverages the training data from existing datasets (LVIS, OpenImages and Objects365) with carefully designed principles and curates a larger dataset for improved detector pre-training. The BigDetection dataset has 600 object categories and contains 3.4M training images with 36M object bounding boxes.
New3 is a set of 527 instances from AMR 3.0 whose original source was the LORELEI DARPA project. The instances are not included in the AMR 2.0 training set and consist of excerpts from newswire and online forums.
Test set version 1 for the San Francisco eXtra Large dataset
An annotated dataset of quotations and within-quotation mentions in 22 full-length English novels.
FLAT is a synthetic dataset of 2000 time-of-flight (ToF) measurements that capture all of these nonidealities and can be used to simulate different hardware.
Cross-view Image Dataset Across Drone and Satellite - multi-height - multi-scene
The Bamboo dataset is a mega-scale and information-dense dataset for both classification and detection pre-training. It is built by integrating 24 public datasets (e.g. ImageNet, Places365, Objects365, OpenImages) and adding new annotations through active learning. Bamboo has 69M image classification annotations and 32M object bounding boxes.
BS-RSC is a real-world rolling shutter (RS) correction dataset, released with a corresponding model that corrects the RS frames in a distorted video. Real distorted videos with corresponding ground truth are recorded simultaneously via a well-designed beam-splitter-based acquisition system. BS-RSC contains various motions of both camera and objects in dynamic scenes.
Randomly sampled instances of the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) with 20, 50 and 100 customer nodes.
SEN12MS-CR-TS is a multi-modal and multi-temporal data set for cloud removal. It contains time series of paired and co-registered Sentinel-1 data and cloudy as well as cloud-free Sentinel-2 data from the European Space Agency's Copernicus mission. Each time series contains 30 cloudy and clear observations regularly sampled throughout the year 2018. The multi-temporal data set is readily pre-processed and backward-compatible with SEN12MS-CR.
The PMData dataset aims to combine the traditional lifelogging with sports activity logging.
The former Flickr30k-CN translates the training and validation sets of Flickr30k using machine translation and manually translates the test set. We checked the machine-translated results and found two kinds of problems: (1) some sentences have language problems and translation errors; (2) some sentences have poor semantics. In addition, the different translation methods used for the training set and the test set prevent the model from achieving accurate performance. We gathered 6 professional English and Chinese linguists to meticulously re-translate all data of Flickr30k and double-check each sentence.