Datasets

3,148 machine learning datasets

EVI

The EVI dataset is a challenging, multilingual spoken-dialogue dataset with 5,506 dialogues in English, Polish, and French. The dataset can be used to develop and benchmark conversational systems for user authentication tasks, i.e. speaker enrolment (E), speaker verification (V), and speaker identification (I).

1 paper | 0 benchmarks | Dialog, Speech, Tabular, Texts

RPCD (Reddit Photo Critique Dataset)

The Reddit Photo Critique Dataset (RPCD) contains tuples of images and photo critiques. RPCD consists of 74K images and 220K comments, collected from a Reddit community used by hobbyist and professional photographers to improve their photography skills through constructive community feedback.

1 paper | 0 benchmarks | Images, Texts

Traditional and Context-specific Spam Twitter

This data set is being released to support the spam and context-specific spam detection tasks on Twitter data.

1 paper | 1 benchmark | Texts

larousse_1905_wd

This dataset links all the entries describing named entities in the Petit Larousse illustré, a French dictionary published in 1905, to Wikidata identifiers. The dataset is available in JSON format as a list of entries, where each entry is a dictionary with two keys: the text of the entry and the list of Wikidata identifiers. For example, for the entry AALI-PACHA: {'texte': "AALI-PACHA, homme d'Etat turc, né à Constantinople. Il a attaché son nom à la politique de réformes du Tanzimat (1815-1871).", 'qid': ['Q439237']}

1 paper | 0 benchmarks | Texts
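
The larousse_1905_wd entry format described above can be read with a few lines of Python. This is a minimal sketch, not the dataset authors' code: the sample string stands in for the real file, the helper names `headword` and `index_by_qid` are hypothetical, and treating the text before the first comma as the headword is an assumption based only on the single example entry.

```python
import json

# Stand-in for the real file: one entry copied from the description above.
# Assumption: the distributed JSON uses double-quoted keys, as JSON requires.
SAMPLE = '''[
  {"texte": "AALI-PACHA, homme d'Etat turc, né à Constantinople. Il a attaché son nom à la politique de réformes du Tanzimat (1815-1871).",
   "qid": ["Q439237"]}
]'''


def headword(entry):
    """Return the text before the first comma (assumed to be the headword)."""
    return entry["texte"].split(",", 1)[0].strip()


def index_by_qid(entries):
    """Map each Wikidata identifier to the headwords of the entries linked to it."""
    index = {}
    for entry in entries:
        for qid in entry["qid"]:
            index.setdefault(qid, []).append(headword(entry))
    return index


entries = json.loads(SAMPLE)
qid_index = index_by_qid(entries)
```

Because `qid` is a list, one dictionary entry may map to several identifiers, which is why the index appends to a list rather than storing a single value.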

Replication Data for: Singapore Soundscape Site Selection Survey (S5) (Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering)

This dataset contains the data used for all statistical analysis in our publication "Singapore Soundscape Site Selection Survey (S5): Identification of Characteristic Soundscapes of Singapore via Weighted k-means Clustering", summarised in a single .csv file.

1 paper | 0 benchmarks | Texts

DME VQA dataset (Diabetic Macular Edema VQA dataset)

Medical VQA dataset built from the IDRiD and eOphta datasets. The dataset contains both healthy and unhealthy fundus images. For each image, a set of pre-defined questions is generated, including questions about regions (e.g. are there hard exudates in this region?), for which an associated mask denotes the location of the region.

1 paper | 0 benchmarks | Images, Texts

CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension 100 computed from lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/).

1 paper | 0 benchmarks | Texts

Irish Wikipedia

Text from Irish Wikipedia, an online encyclopedia.

1 paper | 0 benchmarks | Texts

Mapping Topics in 100,000 Real-Life Moral Dilemmas (Tuan Dung Nguyen)

This dataset accompanies the ICWSM 2022 paper "Mapping Topics in 100,000 Real-Life Moral Dilemmas".

1 paper | 0 benchmarks | Texts

Short Stories, Adjudicator Scores and Written Reflections

In this Adjudicator Scores_Short Stories and Written Reflections folder: Four files from four student participants of the contest. Each file contains

1 paper | 0 benchmarks | Images, Texts

Student Reflections, Coding and Coding Scheme

In this Coding and Coding Scheme spreadsheet: student answers to reflection questions from the pre-contest workshops; the coding scheme for student reflections; and the coding of student reflections

1 paper | 0 benchmarks | Texts

Pre-Contest Workshop Slidedeck

In this Pre-Contest Workshop Slidedeck.pdf: Instructional materials delivered for the seven pre-contest workshops

1 paper | 0 benchmarks | Images, Texts

Names pairs dataset

Includes co-referent name string pairs along with their similarities.

1 paper | 0 benchmarks | Texts

CareerCoach 2022

The CareerCoach 2022 gold standard is available for download in NIF and JSON formats. It draws on documents from a corpus of over 99,000 education courses retrieved from 488 different education providers.

1 paper | 0 benchmarks | Texts

Multilingual Persuasion Detection

This dataset contains dialogue lines from the games Knights of the Old Republic 1 & 2 and Neverwinter Nights 1. Some of the dialogue lines are marked as persuasive, meaning that the player character is attempting a Persuade skill check.

1 paper | 0 benchmarks | Texts

Taskography (PDDLGym Taskography)

PDDL dataset of rearrangement tasks in large-scale 3D scene graphs.

1 paper | 0 benchmarks | Texts

MatriVasha (MatriVasha: Compound Character Dataset)

MatriVasha is the largest dataset of handwritten Bangla compound characters, intended for research on handwritten Bangla compound character recognition. The dataset covers 120 different types of compound characters and consists of 306,464 images, of which 152,950 were written by male and 153,514 by female participants. It can also support handwriting research based on gender, age, and district, because the samples were collected with district and age group recorded and with a nearly equal number of men and women.

1 paper | 0 benchmarks | Images, Texts

The Mafia Dataset

The Mafia Dataset was created to model the behavior of deceptive actors in the context of the Mafia game, as described in the paper “Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia”. We hope that this dataset will be of use to others studying the effects of deception on language use.

1 paper | 0 benchmarks | Dialog, Interactive, Texts

ANTILLES (ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus)

ANTILLES is a part-of-speech tagging corpus based on UD_French-GSD which was originally created in 2015 and is based on the universal dependency treebank v2.0.

1 paper | 1 benchmark | Texts

Phrase-in-Context

Phrase-in-Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR) and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC-BY-NC 4.0.

1 paper | 0 benchmarks | Texts
