19,997 machine learning datasets
Talk2Nav is a large-scale dataset with verbal navigation instructions.
The TaoDescribe dataset contains 2,129,187 product titles and descriptions in Chinese.
A new text effects dataset with 141,081 text effect/glyph pairs in total. The dataset consists of 152 professionally designed text effects rendered on glyphs, including English letters, Chinese characters, and Arabic numerals.
A movie ticketing dialog dataset with 23,789 annotated conversations. The movie ticketing conversations range from completely open-ended and unrestricted to more structured, in terms of their knowledge base, discourse features, and number of turns. In qualitative human evaluations, model-generated responses trained on just 10,000 TicketTalk dialogs were rated to "make sense" 86.5 percent of the time, almost the same as human responses in the same contexts.
Tilde MODEL Corpus is a multilingual corpus for European languages, with a particular focus on the smaller languages. The collected resources have been cleaned, aligned, and formatted into the standard TMX corpus format, usable for developing new language technology products and services.
TOP is a synthetic dataset for topology optimization generated using Topy. The generated dataset has 10,000 objects, each consisting of 100 iterations of the optimization process for the problem defined on a regular 40 x 40 grid.
The dataset has 10.5 hours from a single speaker.
Twitch-FIFA is a video-context, many-speaker dialogue dataset based on live-broadcast soccer game videos and chats from Twitch.tv. This dataset can be used to train visually-grounded dialogue models that generate relevant temporal and spatial event language from the live video, while also being relevant to the chat history.
This dataset is used for the task of conversational document prediction. The dataset includes conversations that occurred between users and customer care agents in 25 organizations on the Twitter platform. Each conversation ends with a customer care agent providing a URL to a document to resolve the issue the user is facing. The task is to predict the document given a dialog context. The train, dev, and test sets include 10,000, 525, and 500 conversations, respectively.
~6 million synthetic depth frames for pose estimation from multiple cameras.
40,764 images (11,659 protest images and hard negatives) with various annotations of visual attributes and sentiments.
This dataset contains 2,000 images taken from inside a warehouse of the Energy Company of Paraná (Copel), which directly serves more than 4 million consuming units in the Brazilian state of Paraná.
The Virtual Gallery dataset is a synthetic dataset that targets multiple challenges such as varying lighting conditions and different occlusion levels for various tasks such as depth estimation, instance segmentation and visual localization.
The Vistas-NP dataset is an out-of-distribution detection dataset based on the Mapillary Vistas dataset. The original Vistas dataset consists of 18,000 training images and 2,000 validation images with 66 classes. In Vistas-NP the human classes are used as outliers due to their dispersion across scenes and visual diversity from other objects. The dataset is created by moving all images that contain the person class or any of the three rider classes to the test subset. Consequently, the dataset has 8,003 training images and 830 validation images. The test set contains 11,167 images.
A dataset containing 5,000 images with 37,993 relationships. The dataset contains 100 object categories and 70 predicate categories connecting those objects together.
ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.
VizWiz-Priv includes 8,862 regions showing private content across 5,537 images taken by blind people. Of these, 1,403 are paired with questions and 62% of those directly ask about the private content.
A large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering.
Ward2ICU is a vital signs dataset of inpatients from the general ward. It contains vital signs with class labels indicating patient transitions from the ward to intensive care units.
A newly developed public dataset and the task of multiple property extraction. It uses the same data as WikiReading but does not inherit its predecessor's identified disadvantages.