Datasets

3,148 machine learning datasets

SWSR (Sina Weibo Sexism Review)

The Sina Weibo Sexism Review (SWSR) dataset supports research on online sexism in Chinese. It provides labels at three levels of granularity: (i) sexism or non-sexism, (ii) sexism category, and (iii) target type. These labels can be used, among other things, to build computational methods that identify and investigate finer-grained gender-related abusive language.

6 papers | 0 benchmarks | Texts

Lyra

Lyra is a dataset for code generation that consists of Python code with embedded SQL. It contains 2,000 carefully annotated database manipulation programs from real-world projects. Each program is paired with both a Chinese comment and an English comment.

6 papers | 0 benchmarks | Texts

BSARD (Belgian Statutory Article Retrieval Dataset)

The Belgian Statutory Article Retrieval Dataset (BSARD) is a native French corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens, each labeled by experienced jurists with the relevant articles from the corpus.

6 papers | 3 benchmarks | Texts

GD-VCR

Geo-Diverse Visual Commonsense Reasoning (GD-VCR) is a new dataset to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense.

6 papers | 2 benchmarks | Images, Texts

FusedChat

FusedChat is an inter-mode dialogue dataset. It contains dialogue sessions fusing task-oriented dialogues (TOD) and open-domain dialogues (ODD). Based on MultiWOZ, FusedChat appends or prepends an ODD to every existing TOD. See more details in the paper.

6 papers | 44 benchmarks | Texts

MOLD (Marathi Offensive Language Dataset)

MOLD is a Marathi dataset for offensive language identification.

6 papers | 0 benchmarks | Texts

CoDa (The Color Dataset)

The Color Dataset (CoDa) is a probing dataset to evaluate the representation of visual properties in language models. CoDa consists of color distributions for 521 common objects, which are split into 3 groups: Single, Multi, and Any.

6 papers | 0 benchmarks | Texts

WikiNEuRal

WikiNEuRal is a high-quality, automatically generated dataset for multilingual named entity recognition.

6 papers | 0 benchmarks | Texts

SemEval-2020 Task-8

A multimodal dataset for sentiment analysis on internet memes.

6 papers | 0 benchmarks | Images, Texts

GLips (German Lips)

The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset, with each H264-compressed MPEG-4 video encoding one word of interest in a context of 1.16 seconds duration, which makes the two datasets compatible for transfer learning studies. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. Each of the 500 different spoken words, ranging from 4 to 18 characters in length, has 500 instances with separate MPEG-4 audio and text metadata files, originating from 1018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also

6 papers | 0 benchmarks | Audio, Texts, Videos

Reddit Conversation Corpus

Reddit Conversation Corpus (RCC) consists of conversations scraped from Reddit over a 20-month period from November 2016 until August 2018. To ensure the quality and diversity of topics, conversations are collected from 95 selected subreddits. In total, RCC contains 9.2 million 3-turn conversations.

6 papers | 0 benchmarks | Texts

NLU++ (NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue)

NLU++ is a dataset for natural language understanding (NLU) in task-oriented dialogue (ToD) systems. It aims to provide a much more challenging evaluation environment for dialogue NLU models, up to date with current application and industry requirements. NLU++ is divided into two domains (banking and hotels) and brings several crucial improvements over current commonly used NLU datasets. 1) NLU++ provides fine-grained domain ontologies with a large set of challenging multi-intent sentences, introducing and validating the idea of intent modules that can be combined into complex intents that convey complex user goals, combined with finer-grained and thus more challenging slot sets. 2) The ontology is divided into domain-specific and generic (i.e., domain-universal) intent modules that overlap across domains, promoting cross-domain reusability of annotated examples. 3) The dataset design has been inspired by the problems observed in industrial ToD systems, and 4) it has been coll

6 papers | 0 benchmarks | Texts

Kompetencer (Danish Job Postings Classification Dataset)

Kompetencer (en: competences) is a Danish job posting dataset annotated for nested spans of competences.

6 papers | 0 benchmarks | Texts

CLAMS (Cross-linguistic Analysis of Models on Syntax)

Targeted syntactic evaluation datasets in five languages: English, French, German, Russian, and Hebrew. The data are translated from the targeted syntactic evaluation data of Marvin & Linzen (2018): https://aclanthology.org/D18-1151/. All stimuli focus on subject-verb agreement.

6 papers | 0 benchmarks | Texts

French Timebank

French TimeBank is a corpus of French text annotated in ISO-TimeML.

6 papers | 3 benchmarks | Texts

FrenchMedMCQA (FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain)

This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers, and their manual correction(s). We also propose initial baseline models for this MCQA task in order to report on current performance and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models, and tools are available online.
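The instance layout described above (an identifier, a question, five possible answers, and one or more corrections) can be pictured with a small record type. This is only a sketch: the class name `McqaInstance`, its field names, the letter keys "a" to "e", the `exact_match` helper, and the sample values are all assumptions made for illustration, not the dataset's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class McqaInstance:
    """Hypothetical record for one multiple-choice question."""

    identifier: str
    question: str
    answers: Dict[str, str]   # five options; the letter keys are an assumption
    correct: FrozenSet[str]   # keys of the correct option(s), one or more

    def __post_init__(self) -> None:
        # The description states five possible answers per instance.
        if len(self.answers) != 5:
            raise ValueError("expected exactly five possible answers")
        # Every correction must point at one of the listed options.
        if not self.correct or not self.correct <= set(self.answers):
            raise ValueError("corrections must be a non-empty subset of the options")

    @property
    def is_single_answer(self) -> bool:
        """True when exactly one option is marked correct."""
        return len(self.correct) == 1


def exact_match(instance: McqaInstance, predicted: Iterable[str]) -> bool:
    """Strict scoring: the predicted set must equal the correct set."""
    return frozenset(predicted) == instance.correct


# Placeholder content, not taken from the dataset.
example = McqaInstance(
    identifier="example-0001",
    question="Placeholder question text?",
    answers={"a": "option 1", "b": "option 2", "c": "option 3",
             "d": "option 4", "e": "option 5"},
    correct=frozenset({"b", "d"}),
)
```

Storing the corrections as a set lets single-answer and multiple-answer questions share one representation, and a strict set comparison is one simple way to score a prediction on either kind.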

6 papers | 2 benchmarks | Biomedical, Medical, Texts

legal_NER

legal_NER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.

6 papers | 0 benchmarks | Texts

CELLS

CELLS is a large (63k pairs), broad-ranging (12 journals) parallel corpus for lay language generation. Each abstract and its corresponding lay language summary are written by domain experts, assuring the quality of the dataset.

6 papers | 0 benchmarks | Texts

KAMEL (Knowledge Analysis with Multitoken Entities in Language Models)

KAMEL comprises knowledge about 234 relations from Wikidata with large training, validation, and test sets. We make sure that all facts are also present in Wikipedia so that they have been seen during the pre-training of the LMs we are probing. Most importantly, we overcome the limitations of existing probing datasets by (1) covering a larger variety of knowledge graph relations, (2) including both single- and multi-token entities, (3) using relations with literals, and (4) providing alternative labels for entities. Furthermore, we (5) created an evaluation procedure for higher-cardinality relations, which was missing in previous work, and (6) made sure that the dataset can be used for causal LMs.

6 papers | 1 benchmark | Texts

ArmanEmo

ArmanEmo is a human-labeled emotion dataset of more than 7,000 Persian sentences labeled for seven categories. The dataset has been collected from different sources, including Twitter, Instagram, and Digikala (an Iranian e-commerce company) comments. Labels are based on Ekman's six basic emotions (Anger, Fear, Happiness, Hatred, Sadness, Wonder) plus a seventh category (Other) for any emotion not included in Ekman's model.

6 papers | 3 benchmarks | Texts
