3,148 machine learning datasets
The Couples Therapy corpus contains audio and video recordings, along with manual transcriptions, of conversations between 134 real-life couples attending marital therapy. In each session, one spouse selected a topic that was then discussed with the other for 10 minutes. At the end of the session, multiple annotators rated each speaker separately on 33 “behavior codes” based on the Couples Interaction and Social Support Rating Systems. Each behavior was rated on a Likert scale from 1, indicating absence, to 9, indicating strong presence. A session-level rating was obtained for each speaker by averaging the annotator ratings. The process was then repeated with the other spouse selecting the topic, resulting in two sessions per couple at a time. The total number of sessions per couple varied between 2 and 6.
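The session-level score described above is a simple mean over annotators. A minimal sketch of that aggregation step, assuming a plain list of 1–9 Likert ratings (illustrative only, not the corpus's actual file format):

```python
from statistics import mean

def session_rating(annotator_ratings):
    """Average several annotators' Likert ratings (1 = absence,
    9 = strong presence) for one behavior code of one speaker."""
    if not all(1 <= r <= 9 for r in annotator_ratings):
        raise ValueError("Likert ratings must lie in 1..9")
    return mean(annotator_ratings)

# Hypothetical ratings from three annotators for one behavior code:
print(session_rating([4, 5, 6]))  # -> 5
```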
The DUC 2005 dataset is a summarization dataset consisting of 50 document collections of 25 documents each. Each collection includes a human-written query and five human-written “reference” summaries (250 words each) that serve as the gold standard.
Affective Text (Test Corpus of SemEval 2007) by Carlo Strapparava & Rada Mihalcea.
A new dataset for sentiment analysis, scraped from Allociné.fr user reviews. It contains 100k positive and 100k negative reviews divided into 3 balanced splits: train (160k reviews), val (20k) and test (20k).
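The 160k/20k/20k split corresponds to an 80/10/10 partition of the 200k reviews. A minimal sketch of such a shuffled partition (a generic split, not Allociné's published split script):

```python
import random

def split_80_10_10(reviews, seed=0):
    """Shuffle and partition examples into train/val/test
    with 80/10/10 proportions."""
    rng = random.Random(seed)
    data = list(reviews)
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test = split_80_10_10(range(200_000))
print(len(train), len(val), len(test))  # -> 160000 20000 20000
```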
ASSIN (Avaliação de Similaridade Semântica e INferência textual, “Evaluation of Semantic Similarity and Textual Inference”) is a Portuguese dataset with semantic similarity scores and entailment annotations. It was used in a shared task at the PROPOR 2016 conference.
The dataset comprises court decisions from 2017 and 2018 that were published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: the Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG).
NERGRIT comprises machine-learning-based NLP tools and a corpus for Indonesian Named Entity Recognition, Statement Extraction, and Sentiment Analysis.
This is a movie review dataset in the Korean language. Reviews were scraped from Naver Movies.
This database contains about 208,000 jokes scraped from three sources.
The prachathai-67k dataset was scraped from the news site Prachathai, excluding articles with fewer than 500 characters of body text (mostly image and cartoon posts). It contains 67,889 articles with 51,797 tags, published between August 24, 2004 and November 15, 2018.
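The length filter described above can be sketched as a simple predicate over scraped articles (the `body_text` field name is an assumption for illustration):

```python
def keep_article(article, min_chars=500):
    """Keep an article only if its body text has at least `min_chars`
    characters; shorter articles are mostly image or cartoon posts."""
    return len(article.get("body_text", "")) >= min_chars

# Hypothetical scraped records: one long article, one short one.
articles = [{"body_text": "x" * 600}, {"body_text": "x" * 120}]
kept = [a for a in articles if keep_article(a)]
print(len(kept))  # -> 1
```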
A new kind of question-answering dataset that combines commonsense, text-based, and unanswerable questions, balanced across genres and reasoning types. Questions are annotated with nine reasoning types: temporal, causality, factoid, coreference, character properties, belief states, subsequent entity states, event durations, and unanswerable. Genres include CC-licensed fiction, Voice of America news, blogs, and user stories from Quora: 800 texts with 18 questions each (~14K questions).
TRACT is a small-scale, manually annotated corpus for the abuse classification problem.
Tropes in Movie Synopses (TiMoS) is a dataset of movie tropes collected from the Wikipedia-style website TVTropes, comprising 5,623 movie synopses associated with the 95 most frequent tropes. The movies are diverse in genre, filming year, length, and style, making the task challenging and preventing models from relying on patterns from a specific domain. The tropes involve character traits, role interactions, situations, and storylines that a non-expert human can sense but that remain challenging for machines, even those with more than 100 million parameters pre-trained on 11,000 books and the whole of Wikipedia (23.97 F1 score, whereas a human reaches 64.87).
This is a dataset for Arabic/English text detection and optical character recognition. All images are text slides extracted from PowerPoint files downloaded from the Internet through the Google API. All annotations are generated automatically, mainly through the WinCom32 Python API. Post-processing is applied to place more accurate text bounding boxes and to suppress false alarms, e.g. a text box containing only spaces. Finally, all annotation results are briefly reviewed by humans to reject extremely bad samples, e.g. a slide with a large portion of a table copied as an image. In summary, the dataset contains 10,692 images and roughly 100K line samples.
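The false-alarm suppression step can be sketched as a filter over annotation boxes (the box structure shown is hypothetical, not the dataset's actual annotation schema):

```python
def suppress_false_alarms(boxes):
    """Drop annotation boxes whose text is empty or whitespace-only,
    e.g. a text box containing nothing but spaces."""
    return [b for b in boxes if b.get("text", "").strip()]

# Hypothetical annotation boxes: one real line, two false alarms.
boxes = [{"text": "hello"}, {"text": "   "}, {"text": ""}]
print(len(suppress_false_alarms(boxes)))  # -> 1
```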
Dialog System Technology Challenge 8 (DSTC 8) Track 2 builds on the success of DSTC 7 Track 1 (NOESIS: Noetic End-to-End Response Selection Challenge). It extends the task with new elements that are vital for creating a deployed task-oriented dialogue system. Specifically, three new dimensions are added to the challenge:
A great number of situational comedies (sitcoms) are made regularly, and adding laughter tracks to them is a critical task; being able to predict whether something will be humorous to the audience is likewise crucial. This project aims to automate that task. Towards doing so, we annotate an existing sitcom ('Big Bang Theory') and use the laughter cues present to obtain a manual annotation for the show. We provide a detailed analysis of the dataset design and evaluate various state-of-the-art baselines on this task. We observe that existing LSTM- and BERT-based networks operating on text alone do not perform as well as joint text-and-video or video-only networks. Moreover, it is challenging to ascertain that the words attended to while predicting laughter are indeed humorous. The dataset and analysis provided through this paper are a valuable resource towards solving this interesting semantic and practical task.
In an active e-commerce environment, customers process a large number of reviews when deciding whether or not to buy a product. Abstractive multi-review summarization aims to help users efficiently consume the reviews most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset, which leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that higher-quality review summaries can be generated using a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset, together with the aspect-alignment model, can generate high-quality and trustworthy review summaries.
WMT 2021 Ge'ez-Amharic is a Ge'ez-Amharic dataset prepared for the NMT tasks of the 6th Workshop on NLP at Debre Berhan University, Ethiopia. The corpus has been collected from:
Eduge is a news classification dataset provided by Bolorsoft LLC and used to train the Eduge.mn production news classifier. It contains 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education) and байгал орчин (environment).
Datasets for Bangla Natural Language Processing tasks.