19,997 machine learning datasets
Egoshots is a 2-month ego-vision dataset captured with an Autographer wearable camera and annotated "for free" with transfer learning, using three state-of-the-art pre-trained image captioning models. The dataset represents the life of 2 interns working at Philips Research (Netherlands) from May to July 2015, who generously donated their data.
An open corpus of scientific research papers with a representative sample from across scientific disciplines. The corpus includes not only the full text of each article but also the document metadata and the bibliographic information for each reference.
This dataset consists of 3,710 flood images, annotated by domain experts for their relevance to three tasks (determining the flooded area, inundation depth, and water pollution).
The Freebase Annotations of TREC KBA 2014 Stream Corpus with Timestamps (FAKBAT) is an extension of the FAKBA1 dataset that contains entity age and entity timestamp. It comprises roughly 1.2 billion timestamped documents from global public news wires, blogs, forums, and shortened links shared on social media. It spans 572 days (October 7, 2011–May 1, 2013).
A benchmark for detecting fallen people lying on the floor. It consists of 6982 images, with a total of 5023 falls and 2275 non-falls corresponding to people in conventional situations (standing up, sitting, lying on a sofa or bed, walking, etc.). Almost all the images were captured in indoor environments under widely varying conditions: different poses and sizes, occlusions, lighting changes, etc.
A 360-degree fisheye-like version of the popular FDDB face detection dataset.
This dataset enriches the benchmark Room-to-Room (R2R) dataset by dividing the instructions into sub-instructions and pairing each of those with its corresponding viewpoints in the path. The overall instruction and trajectory of each sample remain the same.
FinnSentiment introduces a 27,000-sentence dataset (in Finnish) annotated independently with sentiment polarity by three native annotators.
This is a dialog dataset collected in a Wizard-of-Oz fashion. Two humans talked to each other via a chat interface: one played the role of the user and the other played the role of the conversational agent. The latter is called a wizard, in reference to the Wizard of Oz, the man behind the curtain. The wizards had access to a database of 250+ packages, each composed of a hotel and round-trip flights. The users were asked to find the best deal. This resulted in complex dialogues in which a user would often consider different options, compare packages, and progressively build the description of their ideal trip.
A large-scale and accurate dataset for vision-based railway traffic light detection and recognition. The recordings were made on selected running trains in France and carry carefully hand-labeled annotations.
A dataset containing 2,221 questions from twelfth-grade matriculation exams in various subjects (history, biology, geography and philosophy), and 412 additional questions from online quizzes in history.
The Horne 2017 Fake News Data contains two independent news datasets:
An expert-verified repository of 3050 photographs of human rights violations, labelled with human rights semantic categories that together list the types of human rights abuses encountered at present.
The Human-Parts dataset is a dataset for human body, face and hand detection with ~15k images. It contains ~106k different annotations, with multiple annotations per image.
Icons-50 is a dataset for studying surface variation robustness.
The Image-MusicEmotion-Matching-Net (IMEMNet) dataset is a dataset for continuous emotion-based image and music matching. It has over 140K image-music pairs.
Includes two datasets published for the detection of fake and automated accounts.
A large-scale evaluation dataset for headlines of three different lengths composed by professional editors.
This dataset contains information about Japanese word similarity, including rare words. The dataset is constructed following the Stanford Rare Word Similarity Dataset. Ten annotators rated word pairs on 11 levels of similarity.
The Jejueo Interview Transcripts (JIT) dataset is a parallel corpus containing 170k+ Jejueo-Korean sentences.