3,148 machine learning datasets
The EVI dataset is a challenging, multilingual spoken-dialogue dataset with 5,506 dialogues in English, Polish, and French. The dataset can be used to develop and benchmark conversational systems for user authentication tasks, i.e. speaker enrolment (E), speaker verification (V), and speaker identification (I).
The Reddit Photo Critique Dataset (RPCD) contains tuples of images and photo critiques. RPCD consists of 74K images and 220K comments, collected from a Reddit community used by hobbyist and professional photographers to improve their photography skills through constructive community feedback.
This data set is being released to support the spam and context-specific spam detection tasks on Twitter data.
This dataset links all the entries describing named entities in the Petit Larousse illustré, a French dictionary published in 1905, to Wikidata identifiers. The dataset is available in JSON format as a list of entries, where each entry is a dictionary with two keys: the text of the entry (texte) and the list of Wikidata identifiers (qid). For example, for the entry AALI-PACHA: {'texte': "AALI-PACHA, homme d'Etat turc, né à Constantinople. Il a attaché son nom à la politique de réformes du Tanzimat (1815-1871).", 'qid': ['Q439237']}
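As a rough illustration of the entry structure described above, the following Python sketch parses a list of such entries and builds a lookup from headword to Wikidata identifiers. The sample entry is copied from the description. The helper name index_by_headword and the rule "the headword is the text before the first comma" are assumptions made for this example, not part of the dataset's documentation; in practice the JSON would be read from the dataset file rather than from an inline string.

```python
import json

# One entry copied from the dataset description. Here it is serialized
# inline so the example is self-contained.
raw = json.dumps(
    [
        {
            "texte": "AALI-PACHA, homme d'Etat turc, né à Constantinople. "
                     "Il a attaché son nom à la politique de réformes du "
                     "Tanzimat (1815-1871).",
            "qid": ["Q439237"],
        }
    ],
    ensure_ascii=False,
)


def index_by_headword(entries):
    """Map each entry's headword to its list of Wikidata identifiers.

    Assumption for illustration: the headword is whatever precedes the
    first comma in the 'texte' field.
    """
    index = {}
    for entry in entries:
        headword = entry["texte"].split(",", 1)[0].strip()
        index[headword] = entry["qid"]
    return index


entries = json.loads(raw)
index = index_by_headword(entries)
print(index["AALI-PACHA"])  # prints ['Q439237']
```

An entry may carry more than one identifier, which is why qid is kept as a list rather than flattened to a single string.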
This dataset contains the data used for all statistical analysis in our publication "Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering", summarised in a single .csv file.
Medical VQA dataset built from the IDRiD and eOphta datasets. The dataset contains both healthy and unhealthy fundus images. For each image, a set of pre-defined questions is generated, including questions about regions (e.g. are there hard exudates in this region?), for which an associated mask denotes the location of the region.
Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension 100 computed from lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/).
Text from Irish Wikipedia, an online encyclopedia.
This dataset accompanies the ICWSM 2022 paper "Mapping Topics in 100,000 Real-Life Moral Dilemmas".
In this Adjudicator Scores_Short Stories and Written Reflections folder: Four files from four student participants of the contest. Each file contains
In this Coding and Coding Scheme spreadsheet: Student answers to reflection questions from the pre-contest workshops; the coding scheme for student reflections; and the coding of student reflections
In this Pre-Contest Workshop Slidedeck.pdf: Instructional materials delivered for the seven pre-contest workshops
Includes co-referent name string pairs along with their similarities.
The CareerCoach 2022 gold standard is available for download in the NIF and JSON format, and draws upon documents from a corpus of over 99,000 education courses which have been retrieved from 488 different education providers.
This dataset contains dialogue lines from the games Knights of the Old Republic 1 & 2 and Neverwinter Nights 1. Some of the dialogue lines are marked as persuasive, meaning the player character is attempting a Persuade skill check.
PDDL dataset of Rearrangement tasks in large-scale 3D scene graphs.
MatriVasha is the largest dataset of handwritten Bangla compound characters for research on handwritten Bangla compound character recognition. The dataset covers 120 different types of compound characters and consists of 306,464 images, of which 152,950 were written by male and 153,514 by female writers. Because the samples were collected with district, age group, and an equal number of men and women recorded, the dataset can also be used for handwriting research based on gender, age, or district.
The Mafia Dataset was created to model the behavior of deceptive actors in the context of the Mafia game, as described in the paper “Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia”. We hope that this dataset will be of use to others studying the effects of deception on language use.
ANTILLES is a part-of-speech tagging corpus based on UD_French-GSD which was originally created in 2015 and is based on the universal dependency treebank v2.0.
Phrase in Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR) and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC-BY-NC 4.0.