Datasets

3,148 machine learning datasets

Schiller (Shiller)

Schiller contains handwritten texts written in modern German. The training set consists of 244 lines, the validation set of 21 lines, and the test set of 63 lines.

3 papers | 0 benchmarks | Images, Texts

Ricordi

Ricordi contains handwritten texts written in Italian. The training set consists of 295 lines, the validation set of 19 lines, and the test set of 69 lines.

3 papers | 0 benchmarks | Images, Texts

Patzig

Patzig contains handwritten texts written in modern German. The training set consists of 485 lines, the validation set of 38 lines, and the test set of 118 lines.

3 papers | 0 benchmarks | Images, Texts

Schwerin

Schwerin contains handwritten texts written in medieval German. The training set consists of 793 lines, the validation set of 68 lines, and the test set of 196 lines.

3 papers | 0 benchmarks | Images, Texts
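
To compare the four handwriting datasets above, it can help to turn the stated line counts into totals and split shares. The following is a minimal sketch: the line counts are copied from the descriptions, while the names `SPLITS`, `total_lines`, and `split_shares` are hypothetical helpers introduced only for this illustration.

```python
# Line counts per split, copied from the dataset descriptions above.
SPLITS = {
    "Schiller": {"train": 244, "validation": 21, "test": 63},
    "Ricordi": {"train": 295, "validation": 19, "test": 69},
    "Patzig": {"train": 485, "validation": 38, "test": 118},
    "Schwerin": {"train": 793, "validation": 68, "test": 196},
}


def total_lines(counts):
    """Return the total number of lines across all splits."""
    return sum(counts.values())


def split_shares(counts):
    """Return each split's fraction of the total line count."""
    total = total_lines(counts)
    return {split: n / total for split, n in counts.items()}


if __name__ == "__main__":
    for name, counts in SPLITS.items():
        shares = split_shares(counts)
        summary = ", ".join(f"{split} {share:.1%}" for split, share in shares.items())
        print(f"{name}: {total_lines(counts)} lines ({summary})")
```

By this calculation each dataset puts roughly three quarters of its lines in the training split.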

CoWeSe (Corpus Web Salud Espanol)

CoWeSe is a Spanish biomedical corpus consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a large-scale crawl of 3000 Spanish domains executed in 2020.

3 papers | 0 benchmarks | Texts

FloDial (Flowchart Grounded Dialogs Dataset)

Flowchart Grounded Dialog Dataset (FloDial) is a corpus of troubleshooting dialogs between a user and an agent collected using Amazon Mechanical Turk. The dataset is accompanied by two knowledge sources on which the dialogs are grounded: (1) a set of troubleshooting flowcharts and (2) a set of FAQs that contain supplementary information about the domain not present in the flowcharts. FloDial consists of 2,738 dialogs grounded on 12 different troubleshooting flowcharts.

3 papers | 0 benchmarks | Texts

MFAQ

MFAQ is a publicly available multilingual FAQ dataset. It contains around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics.

3 papers | 0 benchmarks | Texts

CNewSum

CNewSum is a large-scale Chinese news summarization dataset consisting of 304,307 documents and human-written summaries for the news feed. It has long documents with highly abstractive summaries, which can encourage document-level understanding and generation in current summarization models. An additional distinguishing feature of CNewSum is that its test set contains adequacy and deducibility annotations for the summaries.

3 papers | 0 benchmarks | Texts

Coveo Data Challenge Dataset

The 2021 SIGIR workshop on eCommerce is hosting the Coveo Data Challenge for "In-session prediction for purchase intent and recommendations". The challenge addresses the growing need for reliable predictions within the boundaries of a shopping session, as customer intentions can be different depending on the occasion. The need for efficient procedures for personalization is even clearer if we consider the e-commerce landscape more broadly: outside of giant digital retailers, the constraints of the problem are stricter, due to smaller user bases and the realization that most users are not frequently returning customers. We release a new session-based dataset including more than 30M fine-grained browsing events (product detail, add, purchase), enriched by linguistic behavior (queries made by shoppers, with items clicked and items not clicked after the query) and catalog meta-data (images, text, pricing information). On this dataset, we ask participants to showcase innovative solutions fo

3 papers | 4 benchmarks | Environment, Images, Texts

ParsTwiner

An open, broad-coverage corpus for informal Persian named entity recognition was collected from Twitter.

3 papers | 0 benchmarks | Texts

Emotional Dialogue Acts

Emotional Dialogue Acts (EDA) data contains dialogue act labels for existing multimodal emotion conversational datasets. We chose two popular multimodal emotion datasets: Multimodal EmotionLines Dataset (MELD) and Interactive Emotional dyadic MOtion CAPture database (IEMOCAP). EDAs reveal associations between dialogue acts and emotional states in natural conversational language: for example, Accept/Agree dialogue acts often occur with the Joy emotion, Apology with Sadness, and Thanking with Joy.

3 papers | 0 benchmarks | Texts

Duolingo SLAM Shared Task

This repository contains gzipped files containing more than 2 million tokens (words) from answers submitted by more than 6,000 students over the course of their first 30 days of using Duolingo. It also contains baseline starter code written in Python. There are three data sets, corresponding to three different language courses. More details on the data set and task are available at: http://sharedtask.duolingo.com. (2018-01-10)

3 papers | 0 benchmarks | Texts

Duolingo Spaced Repetition Data

This is a gzipped CSV file containing the 13 million Duolingo student learning traces used in experiments by Settles & Meeder (2016). For more details and replication source code, visit: https://github.com/duolingo/halflife-regression (2016-06-07)

3 papers | 0 benchmarks | Texts
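
Because the Duolingo Spaced Repetition Data entry above is described as a gzipped CSV file, rows can be streamed without decompressing the whole file to disk. The sketch below uses only the Python standard library. The helper name `iter_gzipped_csv`, the file name, and the column names in the demo are hypothetical; the demo writes a tiny synthetic file and does not reflect the real file's schema.

```python
import csv
import gzip
import os
import tempfile


def iter_gzipped_csv(path, encoding="utf-8"):
    """Yield each row of a gzipped CSV file as a dict keyed by the header row."""
    # Text mode ("rt") decompresses and decodes on the fly, one chunk at a time.
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        yield from csv.DictReader(handle)


if __name__ == "__main__":
    # Synthetic stand-in file; the columns below are made up for illustration.
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "sample_traces.csv.gz")
        with gzip.open(sample_path, mode="wt", encoding="utf-8", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["example_id", "example_value"])
            writer.writerow(["a", "1"])
            writer.writerow(["b", "2"])

        for row in iter_gzipped_csv(sample_path):
            print(row)
```

Streaming rows through a generator keeps memory use flat, which matters for a file with millions of rows.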

CoVaxLies v1

CoVaxLies v1 includes 17 known Misinformation Targets (MisTs) found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each MisT. This collection is a first step in providing large-scale resources for misinformation detection and misinformation stance identification.

3 papers | 0 benchmarks | Texts

JUSTICE (JUSTICE: A Dataset for Supreme Court’s Judgment Prediction)

The dataset contains 3304 cases from the Supreme Court of the United States from 1955 to 2021. Each case includes its identifiers, the facts of the case, and the decision outcome. Related datasets rarely include the facts of the case, which could prove helpful in natural language processing applications. One potential use of this dataset is predicting the outcome of a case from its facts.

3 papers | 0 benchmarks | Texts

DISRPT2021 (DISRPT2021 shared task on Discourse Unit Segmentation, Connective Detection and Discourse Relation Classification)

The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.

3 papers | 0 benchmarks | Speech, Texts

KIND (Kessler Italian Named-entities Dataset)

KIND is an Italian dataset for Named-Entity Recognition. It contains more than one million tokens with the annotation covering three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourses.

3 papers | 0 benchmarks | Texts

FFHQ-Text

FFHQ-Text is a small-scale face image dataset with large-scale facial attributes, designed for text-to-face generation, text-guided facial image manipulation, and other vision-related tasks. This dataset is an extension of the NVIDIA Flickr-Faces-HQ Dataset (FFHQ), built from the top 760 selected female FFHQ images that each contain only one complete human face.

3 papers | 0 benchmarks | Images, Texts

COPA-SSE

Semi-Structured Explanations for COPA (COPA-SSE) is a new crowdsourced dataset of 9,747 semi-structured, English common sense explanations for COPA questions. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts. This semi-structured format strikes a balance between the high quality but low coverage of structured data and the lower quality but high coverage of free-form crowdsourcing. Each explanation also includes a set of human-given quality ratings. With their familiar format, the explanations are geared towards commonsense reasoners operating on knowledge graphs and serve as a starting point for ongoing work on improving such systems.

3 papers | 0 benchmarks | Texts

DanFEVER

We present a dataset, DANFEVER, intended for claim verification in Danish. The dataset builds upon the task framing of the FEVER fact extraction and verification challenge. DANFEVER can be used for creating models for detecting mis- & disinformation in Danish as well as for verification in multilingual settings.

3 papers | 1 benchmark | Texts
