19,997 machine learning datasets
The MNIST Large Scale dataset is based on the classic MNIST dataset, but contains large scale variations up to a factor of 16. The motivation behind creating this dataset was to enable testing the ability of different algorithms to learn in the presence of large scale variability and specifically the ability to generalise to new scales not present in the training set over wide scale ranges.
VANiLLa is a dataset for Question Answering over Knowledge Graphs (KGQA) offering answers in natural language sentences. The answer sentences in this dataset are syntactically and semantically closer to the question than to the triple fact. The dataset consists of over 100k simple questions adapted from the CSQA and SimpleQuestionsWikidata datasets and generated using a semi-automatic framework.
JobStack is a new corpus for de-identification of personal data in job vacancies on Stack Overflow. De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data.
We create ARD-16 (Ati Realworld Dataset), a first-of-its-kind real-world paired correspondence dataset, by applying our dataset generation method to 16-beam VLP-16 Puck LiDAR scans captured on a slow-moving Unmanned Ground Vehicle. We obtain ground truth poses by using fine-resolution brute-force scan matching, similar to Google's Cartographer. The data was captured in an outdoor environment at the Robert Bosch Centre, IISc, with no moving objects during the static runs and several moving objects (1 car, 1 two-wheeler, a few pedestrians) during the dynamic runs. It consists of 1.5k scans per run, and we collected 10 dynamic and 5 static runs. This gives about 14k LiDAR scan pairs for training, validation and testing.
We create a 64-beam LiDAR dataset on the CARLA simulator, with settings similar to the Velodyne VLP-64 LiDAR. It contains no moving objects during the static runs and several moving objects (cars, 2-wheelers, pedestrians) during the dynamic runs. It consists of 16 dynamic runs and 8 static runs. This gives about 32k LiDAR scan pairs for training, validation and testing.
Relatedness judgments of ambiguous English words, in experimentally controlled sentential contexts.
CoDesc is a large parallel dataset of 4.2M Java source code samples paired with their descriptions, drawn from code search and code summarization studies.
FacetSum is a faceted summarization dataset for scientific documents. FacetSum has been built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, FacetSum provides multiple summaries, each targeted at specific sections of a long document, including the purpose, method, findings, and value.
CrackForest Dataset is an annotated road crack image database which can reflect urban road surface condition in general.
Topo-boundary is a new benchmark dataset for off-line topological road-boundary detection. The dataset contains 21,556 4-channel aerial images of size 1000 x 1000. Each image is provided with 8 training labels for different sub-tasks.
The CoNaLa Extended With Question Text dataset is an extension of the original CoNaLa Dataset, proposed in the NLP4Prog workshop paper "Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation". The key addition is that every example now has the full question body from its respective StackOverflow question.
Inspiratory and expiratory breath-hold CT image pairs acquired from the National Heart, Lung, and Blood Institute COPDgene study archive.
Children's Song Dataset is an open-source dataset for singing voice research. This dataset contains 50 Korean and 50 English songs sung by one Korean female professional pop singer. Each song is recorded in two separate keys, resulting in a total of 200 audio recordings. Each audio recording is paired with a MIDI transcription and lyrics annotations at both grapheme level and phoneme level.
The largest and most realistic dataset available for TCC. It consists of 600 real-world videos recorded with a high-resolution mobile phone camera shooting 1824 x 1368 sized pictures. The length of these videos ranges from 3 to 17 frames (7.3 on average, the median is 7.0 and mode is 8.5). Ground truth information is present only for the last frame in each video (i.e., the shot frame), and was collected using a gray surface calibration target.
CI-MNIST (Correlated and Imbalanced MNIST) is a variant of the MNIST dataset that introduces different types of correlations between attributes, dataset features, and an artificial eligibility criterion. For an input image $x$, the label $y \in \{1, 0\}$ indicates eligibility or ineligibility, respectively, depending on whether the digit in $x$ is even or odd. The dataset defines the background color as the protected or sensitive attribute $s \in \{0, 1\}$, where blue denotes the unprivileged group and red denotes the privileged group. The dataset was designed to evaluate bias-mitigation approaches in challenging setups and to allow control over different dataset configurations.
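The eligibility rule described for CI-MNIST can be illustrated with a minimal sketch. The function name `eligibility_label` and the mapping of $s = 0$ to blue and $s = 1$ to red are assumptions made for illustration; the description above does not state which value encodes which color, and this is not the dataset authors' code.

```python
def eligibility_label(digit: int) -> int:
    """Return 1 (eligible) for an even digit and 0 (ineligible) for an odd digit."""
    if not 0 <= digit <= 9:
        raise ValueError("MNIST digits lie in the range 0-9")
    return 1 if digit % 2 == 0 else 0


# Assumed encoding of the sensitive attribute s (hypothetical, not given in the text):
# 0 -> blue background (unprivileged group), 1 -> red background (privileged group).
BACKGROUND_GROUP = {0: ("blue", "unprivileged"), 1: ("red", "privileged")}


def describe_example(digit: int, s: int) -> dict:
    """Bundle the eligibility label with the assumed background-color group."""
    color, group = BACKGROUND_GROUP[s]
    return {"y": eligibility_label(digit), "color": color, "group": group}
```

For instance, under these assumptions `describe_example(4, 0)` would describe an eligible digit on a blue (unprivileged) background.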
NREL's Solar Power Data for Integration Studies are synthetic solar photovoltaic (PV) power plant data points for the United States representing the year 2006.
ZhihuRec dataset is collected from a knowledge-sharing platform (Zhihu), which is composed of around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query logs. There are also descriptions of users, answers, questions, authors, and topics, which are anonymous. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.
Goal is a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding.
The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone array format (MIC) and a first-order Ambisonics format (FOA). The sound events are spatialized as either stationary sound sources in the room, or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a single direction-of-arrival (DoA) if static, a trajectory of DoAs if moving, and a temporal onset and offset time. The isolated sound event recordings used for t
The database is written in Cyrillic and shares the same 33 characters. Besides these characters, the Kazakh alphabet also contains 9 additional specific characters. This dataset is a collection of forms. The source of every form was generated with LaTeX and subsequently filled out by hand. The database consists of more than 1400 filled forms. There are approximately 63000 sentences and more than 715699 symbols produced by approximately 200 different writers. We utilized three different datasets described as follows: