ReviewQA is a question-answering dataset based on hotel reviews. The questions are linked to a set of relational-understanding competencies that a model is expected to master: each question comes with an associated type that characterizes the required competency.
RUSLAN is a Russian spoken-language corpus for the text-to-speech task. RUSLAN contains 22,200 audio samples with text annotations – more than 31 hours of high-quality speech from a single speaker – making it one of the largest annotated Russian corpora in terms of speech duration for a single speaker.
SART is a collection of three datasets for Similarity, Analogies and Relatedness for the Tatar language. The three subsets are:
* Similarity dataset – 202 pairs of words along with averaged human scores of the degree of similarity between the words (on a 0-to-10 scale). For example, "йорт, бина, 7.69".
* Relatedness dataset – 252 pairs of words along with averaged human scores of the degree of relatedness between the words. For example, "урам, балалар, 5.38".
* Analogies dataset – a set of analogy questions of the form A:B::C:D, meaning A is to B as C is to D, where D is to be predicted. For example, "Әнкара Төркия Париж Франция". It contains 34 categories and 30,144 questions in total.
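The A:B::C:D format above is the standard word-analogy evaluation setup, commonly solved with the 3CosAdd rule over word embeddings. A minimal sketch, using a toy vector table in place of real Tatar embeddings (the function name and vectors are illustrative, not part of SART):

```python
from math import sqrt

def cos(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def solve_analogy(a, b, c, vectors):
    # 3CosAdd: predict D as the vocabulary word whose vector is most
    # similar to b - a + c, excluding the three query words themselves.
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# Toy vectors for illustration only; real use would load trained embeddings.
toy = {
    "Әнкара":  [1.0, 0.0, 0.2],
    "Төркия":  [1.0, 1.0, 0.2],
    "Париж":   [0.0, 0.0, 1.0],
    "Франция": [0.0, 1.0, 1.0],
}
print(solve_analogy("Әнкара", "Төркия", "Париж", toy))  # → Франция
```

Accuracy on the 30,144 questions is then simply the fraction of questions for which the predicted D matches the gold answer.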
Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluations.
The Sentimental LIAR dataset is a modified and further extended version of the LIAR extension introduced by Kirilin et al. In this dataset, the multi-class labeling of LIAR is converted to a binary annotation by changing the half-true, false, barely-true and pants-fire labels to False, and the remaining labels to True. Furthermore, the speaker names are converted to numerical IDs in order to avoid bias with regard to the textual representation of names. The binary-label dataset is then extended by adding sentiments derived using the Google NLP API. Sentiment analysis determines the overall attitude of the text (i.e., whether it is positive or negative), quantified by a numerical score. If the sentiment score is positive, the sample is tagged as Positive for the sentiment attribute; otherwise it is tagged as Negative. A further extension adds emotion scores, extracted using the IBM NLP API for each claim, which give the detected level of six emotional states.
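The label binarization and sentiment tagging described above can be sketched as follows; the function names and the label strings' spelling are illustrative assumptions, not the authors' code:

```python
# LIAR labels that Sentimental LIAR collapses to the binary class "False".
FALSE_LABELS = {"half-true", "false", "barely-true", "pants-fire"}

def binarize_label(liar_label: str) -> str:
    """Map a six-way LIAR truthfulness label to a binary annotation."""
    return "False" if liar_label in FALSE_LABELS else "True"

def sentiment_tag(score: float) -> str:
    """Tag a claim from its numeric sentiment score: positive scores
    become Positive, everything else Negative."""
    return "Positive" if score > 0 else "Negative"

print(binarize_label("pants-fire"), binarize_label("mostly-true"))  # False True
print(sentiment_tag(0.4), sentiment_tag(-0.2))  # Positive Negative
```

In real use the score passed to `sentiment_tag` would come from the Google NLP API's sentiment analysis response rather than a hard-coded value.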
SubEdits is a human-annotated post-editing dataset of neural machine translation outputs, compiled from in-house NMT outputs and human post-edits of subtitles from Rakuten Viki. It is collected from English-German annotations and contains 160k triplets.
Talk2Nav is a large-scale dataset with verbal navigation instructions.
The TaoDescribe dataset contains 2,129,187 product titles and descriptions in Chinese.
TicketTalk is a movie ticketing dialog dataset with 23,789 annotated conversations. The movie ticketing conversations range from completely open-ended and unrestricted to more structured, in terms of their knowledge base, discourse features, and number of turns. In qualitative human evaluations, model-generated responses trained on just 10,000 TicketTalk dialogs were rated to "make sense" 86.5 percent of the time, almost the same as human responses in the same contexts.
Tilde MODEL Corpus is a multilingual corpus collection for European languages, particularly focused on the smaller languages. The collected resources have been cleaned, aligned, and formatted into the standard TMX corpus format, usable for developing new language technology products and services.
Twitch-FIFA is a video-context, many-speaker dialogue dataset based on live-broadcast soccer game videos and chats from Twitch.tv. This dataset can be used to train visually grounded dialogue models that generate relevant temporal and spatial event language from the live video, while also being relevant to the chat history.
This dataset is used for the task of conversational document prediction. The dataset includes conversations that occurred between users and customer care agents in 25 organizations on the Twitter platform. Each conversation ends with a customer care agent providing a URL to a document to resolve the issue the user is facing. The task is to predict the document given a dialog context. The train, dev, and test sets include 10,000, 525, and 500 conversations, respectively.
ViText2SQL is a dataset for the Vietnamese Text-to-SQL semantic parsing task, consisting of about 10K question and SQL query pairs.
WikiSRS is a novel dataset of similarity and relatedness judgments of paired Wikipedia entities (people, places, and organizations), as assigned by Amazon Mechanical Turk workers.
The XL-R2R dataset is built upon the R2R dataset and extends it with Chinese instructions. XL-R2R preserves the same splits as in R2R and thus consists of train, val-seen, and val-unseen splits with both English and Chinese instructions, and test split with English instructions only.
YASO is a crowd-sourced targeted sentiment analysis (TSA) evaluation dataset, collected using a new annotation scheme for labeling targets and their sentiments. The dataset contains 2,215 English sentences from movie, business and product reviews, and 7,415 terms and their corresponding sentiments annotated within these sentences.
The Wikidata-Disamb dataset is intended to allow a clean and scalable evaluation of named entity disambiguation (NED) with Wikidata entries, and to be used as a reference in future research.
Classifiers are function words used to express quantities in Chinese and are especially difficult for language learners. This dataset of Chinese classifiers can be used to predict a classifier from its context. It contains a large collection of example sentences of Chinese classifier usage derived from three language corpora (the Lancaster Corpus of Mandarin Chinese, the UCLA Corpus of Written Chinese, and the Leiden Weibo Corpus). The data was cleaned and processed for a context-based classifier prediction task.
The QTUNA dataset is the result of a series of elicitation experiments in which human speakers were asked to perform a linguistic task that invites the use of quantified expressions. Its aim is to inform Natural Language Generation algorithms that mimic humans' use of quantified expressions.
The Metaphorical Connections dataset is a poetry dataset containing annotations that link metaphorical prompts to short poems. Each poem is annotated with whether or not it successfully communicates the idea of the metaphorical prompt.