19,997 machine learning datasets
19,997 dataset results
Egocentric Dataset of the University of Barcelona – Segmentation (EDUB-Seg) is a dataset for egocentric event segmentation acquired by the Narrative Clip, which takes a picture every 30 seconds. The dataset contains a total of 18,735 images captured by 7 different users during overall 20 days. To ensure diversity, all users were wearing the camera in different contexts: while attending a conference, on holiday, during the weekend, and during the week.
48k question-answer pairs written in rich natural language.
The endoscopic SLAM dataset (EndoSLAM) is a dataset for depth estimation approach for endoscopic videos. It consists of both ex-vivo and synthetically generated data. The ex-vivo part of the dataset includes standard as well as capsule endoscopy recordings. The dataset is divided into 35 sub-datasets. Specifically, 18, 5 and 12 sub-datasets exist for colon, small intestine and stomach respectively.
ESAD is a large-scale dataset designed to tackle the problem of surgeon action detection in endoscopic minimally invasive surgery. ESAD aims at contributing to increase the effectiveness and reliability of surgical assistant robots by realistically testing their awareness of the actions performed by a surgeon. The dataset provides bounding box annotation for 21 action classes on real endoscopic video frames captured during prostatectomy, and was used as the basis of a recent MIDL 2020 challenge.
The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor data collected includes:
Builds upon the event-centric EventKG knowledge graph and language-specific information on user interactions with events, entities, and their relations derived from the Wikipedia clickstream.
Finnish News Corpus for Named Entity Recognition (Finer) is a corpus that consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event,and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source.
Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form training, validation, and test datasets, these pairs were sorted by time and the first 52,400 pairs are used as the training dataset, the proceeding 5,265 pairs as the validation dataset, and the remaining 5,770 pairs as the test dataset.
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets. The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. The annotations include toponyms from the user location field and tweet content and resolve them to geolocations such as country, state, or city level. In this case, 297 million tweets are annotated with geolocation using the user location field and 452 million tweets using tweet content.
German Guideline Program in Oncology NLP Corpus (GGPONC) is a German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions.
The data simulate simple consumer banking interactions, containing about 23 hours of audio from 1,446 human-human conversations between 59 unique speakers.
Models character profiles and gives dialogue agents the ability to learn characters' language styles through their HLAs.
Houses3K is a dataset of 3000 textured 3D house models. Houses3K is divided into twelve batches, each containing 50 unique house geometries. For each batch, five different textures were applied forming the sets (A, B, C, D, E).
Propose a dataset which adopts multi-channel visual input.
Public ECG dataset of continuous raw signals for representation learning containing 11 thousand patients and 2 billion labelled beats.
A new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.
A dataset for inter-personal relationship extraction which aims to facilitate information extraction and knowledge graph construction research. In total, IPRE has over 41,000 labeled sentences for 34 types of relations, including about 9,000 sentences annotated by workers.
This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 messages or IRC.
The IS-A dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are related by the “is a” relation. For example, ‘acute leukemia’ is a ‘leukemia’. The dataset has 294,693 nodes with 356,541 edges between them.
Kitchen Scenes is a multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled and objects in the scenes are annotated with bounding boxes and in the 3D point cloud.