19,997 machine learning datasets
A collection of over 2,500 novel English words published in the New York Times between November 2017 and March 2019, manually annotated for their class of novelty (such as lexical derivation, dialectal variation, blending, or compounding).
Occ-Traj120 is a trajectory dataset that contains occupancy representations of different local maps with associated trajectories. It contains 400 locally structured maps with occupancy representation and roughly 120K trajectories in total.
The OCR-VQA dataset is a valuable resource for research in the field of Visual Question Answering (VQA).
ODMS is a dataset for learning Object Depth via Motion and Segmentation. ODMS training data are configurable and extensible, with each training example consisting of a series of object segmentation masks, camera movement distances, and ground truth object depth. As a benchmark evaluation, the dataset provides four ODMS validation and test sets with 15,650 examples in multiple domains, including robotics and driving.
A realistic, diverse, and challenging dataset for object detection on images. The data was recorded at a beer tent in Germany and consists of 15 different categories of food and drink items.
(L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) - Object Recognition Dataset (OpenLORIS-Object) is designed to accelerate lifelong/continual/incremental learning research and applications, currently focusing on improving the continuous learning capability for common objects in the home scenario.
A new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated for their descriptions and gaze.
The Parallel Meaning Bank (PMB), developed at the University of Groningen and building upon the Groningen Meaning Bank, comprises sentences and texts in raw and tokenised format, syntactic analysis, word senses, thematic roles, reference resolution, and formal meaning representations. The main objective of the PMB is to provide fine-grained meaning representations for words, sentences and texts. Sentences are, in isolation, often ambiguous. The aim is to provide the most likely interpretation for a sentence, with a minimal use of underspecification.
The data includes all movement trajectories extracted from the videos of Parkinson's assessments using Convolutional Pose Machines (CPM) as well as the confidence values from CPM. The dataset also includes ground truth ratings of parkinsonism and dyskinesia severity using the UDysRS, UPDRS, and CAPSIT.
A new benchmark dataset of webcam images, Photi-LakeIce, from multiple cameras and two different winters, along with pixel-wise ground truth annotations.
The pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language. The automatically generated corpus is built from Wikipedia. The gold-standard set is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local and world news, and is comparable in size with the test sets of other languages.
A dataset containing the documents, source and fusion sentences, and human annotations of points of correspondence between sentences. The dataset bridges the gap between coreference resolution and summarization.
PoKi is a corpus of 61,330 poems written by children from grades 1 to 12. PoKi is especially useful in studying child language because it comes with information about the age of the child authors (their grade).
A collection of five open polarimetric SAR images of the San Francisco area. The five images come from different satellites at different times, which gives them great scientific research value.
The PubFig database is a large, real-world face dataset consisting of 58,797 images of 200 people collected from the internet. Unlike most other existing face datasets, these images are taken in completely uncontrolled situations with non-cooperative subjects. Thus, there is large variation in pose, lighting, expression, scene, camera, imaging conditions and parameters, etc. The PubFig dataset is similar in spirit to the Labeled Faces in the Wild (LFW) dataset.
The Pump and dump dataset is an annotated set of messages for detecting cryptocurrency market manipulations. It consists of a list of pump and dumps arranged by groups on Telegram. All the pump and dumps in the dataset are on the trading pair SYM/BTC.
The dataset is useful for query-adaptive video summarization and annotated with diversity and query-specific relevance labels.
RainNet is a real (non-simulated) large-scale spatial precipitation downscaling dataset that contains 62,424 pairs of low-resolution and high-resolution precipitation maps covering 17 years. In contrast to simulated data, this real dataset covers various types of real meteorological phenomena (e.g., Hurricane, Squall) and shows the physical characteristics - Temporal Misalignment, Temporal Sparse and Fluid Properties - that challenge downscaling algorithms.
Rendered Handpose Dataset contains 41,258 training and 2,728 testing samples. Each sample provides:
ReviewQA is a question-answering dataset based on hotel reviews. The questions of this dataset are linked to a set of relational understanding competencies that a model is expected to master. Indeed, each question comes with an associated type that characterizes the required competency.