3,148 machine learning datasets
The Couples Therapy corpus contains audio and video recordings, along with manual transcriptions, of conversations between 134 real-life couples attending marital therapy. In each session, one spouse selected a topic that was then discussed with the other for 10 minutes. At the end of the session, multiple annotators rated each speaker separately on 33 “behavior codes” based on the Couples Interaction and Social Support Rating Systems. Each behavior was rated on a Likert scale from 1, indicating absence, to 9, indicating strong presence. A session-level rating was obtained for each speaker by averaging the annotator ratings. The process was then repeated with the other spouse selecting the topic, resulting in two sessions per couple at a time. The total number of sessions per couple varied between 2 and 6.
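The session-level score described above is a simple mean over annotators. A minimal sketch of that aggregation step, assuming a plain list of 1–9 Likert ratings (illustrative only, not the corpus's actual file format):

```python
from statistics import mean

def session_rating(annotator_ratings):
    """Average several annotators' Likert ratings (1 = absence,
    9 = strong presence) for one behavior code of one speaker."""
    if not all(1 <= r <= 9 for r in annotator_ratings):
        raise ValueError("Likert ratings must lie in 1..9")
    return mean(annotator_ratings)

# Hypothetical ratings from three annotators for one behavior code:
print(session_rating([4, 5, 6]))  # -> 5
```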
The DUC 2005 dataset is a summarization dataset consisting of 50 document collections of 25 documents each. Each collection includes a human-written query and five human-written “reference” summaries (250 words each) that serve as the gold standard.
Affective Text (Test Corpus of SemEval 2007) by Carlo Strapparava & Rada Mihalcea.
A new dataset for sentiment analysis, scraped from Allociné.fr user reviews. It contains 100k positive and 100k negative reviews divided into 3 balanced splits: train (160k reviews), val (20k) and test (20k).
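The 160k/20k/20k split corresponds to an 80/10/10 partition of the 200k reviews. A minimal sketch of such a shuffled partition (a generic split, not Allociné's published split script):

```python
import random

def split_80_10_10(reviews, seed=0):
    """Shuffle and partition examples into train/val/test
    with 80/10/10 proportions."""
    rng = random.Random(seed)
    data = list(reviews)
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test = split_80_10_10(range(200_000))
print(len(train), len(val), len(test))  # -> 160000 20000 20000
```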
ASSIN (Avaliação de Similaridade Semântica e INferência textual, “Evaluation of Semantic Similarity and Textual Inference”) is a Portuguese dataset with semantic similarity scores and entailment annotations. It was used in a shared task at the PROPOR 2016 conference.
The dataset comprises court decisions from 2017 and 2018 that were published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: the Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG).
NERGRIT comprises machine-learning-based NLP tools and a corpus for Indonesian Named Entity Recognition, Statement Extraction, and Sentiment Analysis.
This is a movie review dataset in the Korean language. Reviews were scraped from Naver Movies.
This database contains about 208,000 jokes scraped from three sources.
The prachathai-67k dataset was scraped from the news site Prachathai, excluding articles with fewer than 500 characters of body text (mostly image and cartoon posts). It contains 67,889 articles with 51,797 tags, published between August 24, 2004 and November 15, 2018.
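The length filter described above can be sketched as a simple predicate over scraped articles (the `body_text` field name is an assumption for illustration):

```python
def keep_article(article, min_chars=500):
    """Keep an article only if its body text has at least `min_chars`
    characters; shorter articles are mostly image or cartoon posts."""
    return len(article.get("body_text", "")) >= min_chars

# Hypothetical scraped records: one long article, one short one.
articles = [{"body_text": "x" * 600}, {"body_text": "x" * 120}]
kept = [a for a in articles if keep_article(a)]
print(len(kept))  # -> 1
```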
A new kind of question-answering dataset that combines commonsense, text-based, and unanswerable questions, balanced across genres and reasoning types. Questions are annotated with nine reasoning types: temporal, causality, factoid, coreference, character properties, belief states, subsequent entity states, event durations, and unanswerable. Genres include CC-licensed fiction, Voice of America news, blogs, and user stories from Quora: 800 texts with 18 questions each (~14K questions).
TRACT is a small-scale, manually annotated corpus for the abuse classification problem.
Tropes in Movie Synopses (TiMoS) is a dataset of movie tropes collected from the Wikipedia-style website TVTropes, comprising 5,623 movie synopses associated with the 95 most frequent tropes. The movies are diverse in genre, filming year, length, and style, making the task challenging and preventing models from relying on patterns from a specific domain. The tropes involve character traits, role interactions, situations, and storylines that a non-expert human can sense but that remain challenging for machines, even those with more than 100 million parameters pre-trained on 11,000 books and the whole of Wikipedia (23.97 F1 score, whereas a human reaches 64.87).
This is a dataset for Arabic/English text detection and optical character recognition. All images are text slides extracted from PowerPoint files downloaded from the Internet through the Google API. All annotations are generated automatically, mainly through the WinCom32 Python API. Post-processing is applied to place more accurate text bounding boxes and to suppress false alarms, e.g. a text box containing only spaces. Finally, all annotation results are briefly reviewed by humans to reject extremely bad samples, e.g. a slide with a large portion of a table copied as an image. In summary, the dataset contains 10,692 images and roughly 100K line samples.
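The false-alarm suppression step can be sketched as a filter over annotation boxes (the box structure shown is hypothetical, not the dataset's actual annotation schema):

```python
def suppress_false_alarms(boxes):
    """Drop annotation boxes whose text is empty or whitespace-only,
    e.g. a text box containing nothing but spaces."""
    return [b for b in boxes if b.get("text", "").strip()]

# Hypothetical annotation boxes: one real line, two false alarms.
boxes = [{"text": "hello"}, {"text": "   "}, {"text": ""}]
print(len(suppress_false_alarms(boxes)))  # -> 1
```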
Dialog System Technology Challenge 8 (DSTC 8) Track 2 builds on the success of DSTC 7 Track 1 (NOESIS: Noetic End-to-End Response Selection Challenge). It extends the task with new elements that are vital for creating a deployed task-oriented dialogue system. Specifically, three new dimensions are added to the challenge:
A great number of situational comedies (sitcoms) are made regularly, and adding laughter tracks to them is a critical task; being able to predict whether something will be humorous to the audience is likewise crucial. This project aims to automate that task. Towards doing so, we annotate an existing sitcom ('Big Bang Theory') and use the laughter cues present to obtain a manual annotation for the show. We provide a detailed analysis of the dataset design and evaluate various state-of-the-art baselines on this task. We observe that existing LSTM- and BERT-based networks operating on text alone do not perform as well as joint text-and-video or video-only networks. Moreover, it is challenging to ascertain that the words attended to while predicting laughter are indeed humorous. The dataset and analysis provided through this paper are a valuable resource towards solving this interesting semantic and practical task.
In an active e-commerce environment, customers process a large number of reviews when deciding whether or not to buy a product. Abstractive multi-review summarization aims to help users efficiently consume the reviews most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset, which leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that higher-quality review summaries can be generated using a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset, together with the aspect-alignment model, can generate high-quality and trustworthy review summaries.
WMT 2021 Ge'ez-Amharic is a Ge'ez-Amharic dataset prepared for the NMT tasks of the 6th Workshop on NLP at Debre Berhan University, Ethiopia. The corpus has been collected from:
Eduge is a news classification dataset provided by Bolorsoft LLC and used to train the Eduge.mn production news classifier. It contains 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education) and байгал орчин (environment).
Datasets for Bangla Natural Language Processing tasks.