Datasets

19,997 machine learning datasets

PolyDensity (Polymer Density)

The PolyDensity dataset is collected from Polyinfo. It represents each polymer as a graph of its monomer and is used to predict polymer density.

2 papers | 0 benchmarks

Motion Blurred and Defocused Dataset (datacluster.ai)

This dataset consists of blurred, noisy and defocused images.

2 papers | 0 benchmarks | Images

The manifest and store data of 870,515 Android mobile applications

A crawler collected data from the Google Play store, including each application's metadata and APK files. The manifest files were extracted from the APK files and then processed to extract the features. The data set is composed of 870,515 records (apps), with 48 features produced for each app. The data set was used to build and test two bootstrap aggregating ensembles of multiple XGBoost machine learning classifiers. The data were collected between April 2017 and November 2018. The status of these applications was then checked on three occasions: December 2018, February 2019, and May-June 2019. (2022-06-03)

2 papers | 0 benchmarks

Summaries of genetic variation

The dataset contains data generated from a commonly used model in population genetics. It comprises a matrix of 1,000,000 rows and 9 columns, representing parameters and summaries generated by an infinite-sites coalescent model for genetic variation. The first two columns encode the scaled mutation rate (theta) and the scaled recombination rate (rho). The subsequent seven columns are data summaries: number of segregating sites (C1), standard uniform random noise acting as a distractor (C2), mean number of pairwise nucleotide differences (C3), mean $R^2$ across pairs separated by <10% of the simulated genomic regions (C4), number of distinct haplotypes (C5), frequency of the most common haplotype (C6), and number of singleton haplotypes (C7).

2 papers | 0 benchmarks | Biology, Tabular
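
The nine-column layout described for this dataset can be sketched as a small helper that separates the two model parameters from the seven summaries. This is a minimal illustration, not the dataset's official loader: the short column names (`theta`, `rho`, `C1` to `C7`) follow the description, while the function names `split_row` and `informative_summaries` are hypothetical.

```python
# Column order as stated in the dataset description.
# The helper names below are illustrative assumptions, not part of the dataset.
PARAMETER_COLUMNS = ["theta", "rho"]
SUMMARY_COLUMNS = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
DISTRACTOR_COLUMN = "C2"  # standard uniform noise, per the description


def split_row(row):
    """Split one 9-value row into (parameters, summaries) dictionaries."""
    expected = len(PARAMETER_COLUMNS) + len(SUMMARY_COLUMNS)
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    n_params = len(PARAMETER_COLUMNS)
    parameters = dict(zip(PARAMETER_COLUMNS, row[:n_params]))
    summaries = dict(zip(SUMMARY_COLUMNS, row[n_params:]))
    return parameters, summaries


def informative_summaries(summaries):
    """Return the summaries with the noise distractor column removed."""
    return {k: v for k, v in summaries.items() if k != DISTRACTOR_COLUMN}
```

Keeping the distractor column separate makes it easy to check whether an inference method correctly ignores it.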

DAST (Danish Stance)

This is an SDQC stance-annotated Reddit dataset for the Danish language generated within a thesis project. The dataset consists of over 5000 comments structured as comment trees and linked to 33 source posts.

2 papers | 0 benchmarks | Texts

OADAT (OADAT: Experimental and Synthetic Clinical Optoacoustic Data for Standardized Image Processing)

Experimental and synthetic (simulated) optoacoustic (OA) datasets in both the raw-signal and reconstructed-image domains, rendered with different experimental parameters and tomographic acquisition geometries.

2 papers | 0 benchmarks | Images, Medical

Hotel (Hospitality > Tourism > Hotel Demand/Sales)

The dataset contains hotel demand and revenue for 8 major tourist destinations in the US (e.g., Los Angeles and Orlando). It covers sales, daily occupancy, demand, and revenue of upper-middle-class hotels.

2 papers | 0 benchmarks | Tabular, Time series

5,011 Images – Human Frontal face Data (Male)

5,011 images of male human frontal faces. The data covers multiple scenes, ages, and races, and includes 2,004 Caucasians and 3,007 Asians. This dataset can be used for tasks such as face detection, race detection, age detection, and beard category classification.

2 papers | 0 benchmarks | Images

BN-HTRd (BN-HTRd: A Benchmark Dataset for Document Level Offline Bangla Handwritten Text Recognition (HTR))

We introduce a new dataset (BN-HTRd) for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations. The BN-HTRd dataset is based on the BBC Bangla News corpus, which acted as ground truth texts for the handwritings. Our dataset contains a total of 786 full-page images collected from 150 different writers. With 108,181 instances of handwritten words, distributed over 14,383 lines and 23,115 unique words, this is currently the 'largest and most comprehensive dataset' in this field. We also provide the bounding box annotations (YOLO format) for the segmentation of words/lines and the ground truth annotations for full-text, along with the segmented images and their positions. The contents of our dataset come from diverse news categories, and the annotators are of different ages, genders, and backgrounds, with varied writing styles. The BN-HTRd dataset can be adopted as a basis for various handwriting c

2 papers | 2 benchmarks

NovelCraft

Scene-focused, multi-modal, episodic data of the images and symbolic world-states seen by an agent completing a pogo-stick assembly task within a video game world. Classes consist of episodes with novel objects inserted. A subset of these novel objects can impact gameplay and agent behavior. Novel objects can vary in size, position, and occlusion within the images. Usable for novelty detection, generalized category discovery, and class-imbalanced classification.

2 papers | 0 benchmarks | Images, Texts

IBISCape

A Simulated Benchmark for multi-modal SLAM Systems Evaluation in Large-scale Dynamic Environments.

2 papers | 0 benchmarks | Environment, Images, Point cloud, RGB Video, RGB-D, Stereo, Videos

Active TLS Stack Fingerprinting Measurement Data

Measurement data related to the publication "Active TLS Stack Fingerprinting: Characterizing TLS Server Deployments at Scale". It contains weekly TLS and HTTP scan data and the TLS fingerprints for each target.

2 papers | 0 benchmarks | Tabular

NCI (New Corpus for Ireland)

Contains a wide range of texts in Irish, including fiction, news reports, informative texts and official documents.

2 papers | 0 benchmarks | Texts

AOM-CTC

This is the Current Video sequence set from the AOM-CTC.

2 papers | 0 benchmarks

DaNewsroom (DaNewsroom: A Large-scale Danish Summarisation Dataset)

The first large-scale non-English language dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in the Danish language.

2 papers | 0 benchmarks

GPI corpus (Government Privacy Instructions Corpus)

The GPI Corpus is a collection of 1,043 privacy laws, regulations, and guidelines ("GPIs") covering 182 jurisdictions around the world. These documents are provided in two file formats (i.e., PDF showing the original formatting on the source website and TXT containing just the text of the GPI) and, in some cases, in multiple languages (i.e., the original language(s) and an English translation).

2 papers | 0 benchmarks | Texts

VNDS (VNDS: A Vietnamese Dataset for Summarization)

A single-document Vietnamese summarization dataset.

2 papers | 0 benchmarks

SKIPP'D

Large-scale integration of photovoltaics (PV) into electricity grids is challenged by the intermittent nature of solar power. Sky-image-based solar forecasting using deep learning has been recognized as a promising approach to predicting the short-term fluctuations. However, there are few publicly available standardized benchmark datasets for image-based solar forecasting, which limits the comparison of different forecasting models and the exploration of forecasting methods. To fill these gaps, we introduce SKIPP'D -- a SKy Images and Photovoltaic Power Generation Dataset. The dataset contains three years (2017-2019) of quality-controlled down-sampled sky images and PV power generation data that is ready-to-use for short-term solar forecasting using deep learning. In addition, to support the flexibility in research, we provide the high resolution, high frequency sky images and PV power generation data as well as the concurrent sky video footage. We also include a code base containing d

2 papers | 0 benchmarks

UMLS-43

UMLS-43 is a variant of the UMLS knowledge graph that is robust to data leakage through inverse relations. It has been derived by removing three edge types that should be considered problematic by Dettmers' definition: 'degree_of', 'precedes', and 'derivative_of'. It is presented here as a .tsv edgelist, such that each line represents one edge in the (head, relation, tail) format.

2 papers | 0 benchmarks
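
A tab-separated edgelist in (head, relation, tail) order, as described for UMLS-43, could be read and checked for the three removed relations roughly as follows. This is a sketch under assumptions: the function names `read_edges` and `leaked_relations` are hypothetical, and the sample entities and the `related_to` relation are invented placeholders rather than actual dataset contents.

```python
import csv
import io

# The three edge types that the description says were removed.
REMOVED_RELATIONS = {"degree_of", "precedes", "derivative_of"}


def read_edges(handle):
    """Read (head, relation, tail) triples from a tab-separated text stream."""
    edges = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        if len(row) != 3:
            raise ValueError(f"expected 3 fields per line, got {len(row)}")
        edges.append(tuple(row))
    return edges


def leaked_relations(edges):
    """Return any removed relation types that still appear in the edges."""
    return {relation for _, relation, _ in edges} & REMOVED_RELATIONS


# Hypothetical two-line sample; entity and relation names are placeholders.
sample = io.StringIO("entity_a\trelated_to\tentity_b\nentity_c\tprecedes\tentity_d\n")
sample_edges = read_edges(sample)
sample_leaks = leaked_relations(sample_edges)
```

On a file that matches the description, `leaked_relations` would be expected to return an empty set; a non-empty result would suggest the file is a different variant of the graph.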

MOTFront

MOTFront provides photo-realistic RGB-D images with their corresponding instance segmentation masks, class labels, 2D & 3D bounding boxes, 3D geometry, 3D poses and camera parameters. The MOTFront dataset comprises 2,381 unique indoor sequences with a total of 60,000 images and is based on the 3D-FRONT dataset.

2 papers | 0 benchmarks