The ESP (Evaluation for Styled Prompt) dataset is a benchmark for zero-shot domain-conditional caption generation, focusing on providing multiple styled text targets for the same image. It comprises 4.8k captions for 1k images from the COCO Captions test set, covering five everyday text domains: blog, social media, instruction, story, and news.
The Multi-Layer Materials Science corpus (MuLMS) consists of 50 documents (licensed CC BY) from the materials science domain, spanning the following seven subareas: "Electrolysis", "Graphene", "Polymer Electrolyte Fuel Cell (PEMFC)", "Solid Oxide Fuel Cell (SOFC)", "Polymers", "Semiconductors", and "Steel". It was exhaustively annotated by domain experts, with sentence-level and token-level annotations for the following NLP tasks: measurement frame detection, NER, relation extraction, and argumentative zone classification.
This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.
Collection of news websites in low-resource languages.
WEATHub is a dataset covering 24 languages. It contains words organized into tuples of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 might be insects and target2 flowers, and we might measure whether insects or flowers are associated with pleasant or unpleasant words. Word associations are quantified using the WEAT metric described in our paper, which computes an effect size (Cohen's d) together with a p-value to measure the statistical significance of the result. In our paper, we use word embeddings from language models to perform these tests and to understand biased associations in language models across different languages.
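The WEAT effect size described above can be sketched in a few lines. This is a minimal illustration, not the WEATHub implementation: the function names are ours, and random vectors stand in for real word embeddings of the two target sets and two attribute sets.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    # s(w, A, B): mean similarity of w to attribute set A minus attribute set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Cohen's-d-style effect size over target sets X, Y and attribute sets A, B
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy example: random 50-d vectors in place of real word embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))   # e.g. insect words
Y = rng.normal(size=(8, 50))   # e.g. flower words
A = rng.normal(size=(8, 50))   # e.g. pleasant words
B = rng.normal(size=(8, 50))   # e.g. unpleasant words
d = weat_effect_size(X, Y, A, B)
print(float(d))
```

With real embeddings, a large positive d indicates that the first target set is more strongly associated with the first attribute set; a permutation test over target-set partitions then yields the p-value.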
The corpus contains review sentences, mostly about products in the electronics domain, annotated and segregated into four comparison categories. Each comparison sentence is annotated with the names of the products (PROD1 and PROD2), the aspect (ASP), and the predicate (PRED). The dataset contains sentences auto-labeled on the SNAP dataset and manually labeled sentences from the following corpora:
RVL-CDIP_MP is our first contribution: retrieving the original documents of the IIT-CDIP test collection that were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which is why we have around 500 fewer instances. By leveraging metadata from OCR-IDL, we matched the original identifiers from IIT-CDIP and retrieved the documents from IDL using a conversion.
RVL-CDIP_MP-N can serve its original purpose as a covariate-shift test set, now for multi-page document classification. We were able to retrieve the original full documents from DocumentCloud and Web Search.
We introduce a large, semi-automatically generated dataset of ~400,000 descriptive sentences about commonsense knowledge that can be true or false. Negation is present, in various forms, in about two thirds of the corpus, which we use to evaluate LLMs.
BioFuelQR is a dataset of complex reasoning questions related to catalyst discovery in biofuels. It is aimed at benchmarking scientific question answering methods, particularly search-based text generation.
We present the SourceData-NLP dataset, produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubmedBERT, two transformer-based models fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement.
Reader eye tracking and engagement scores for two short stories, aggregated by sentence.
Test set of sentences in Hindi with complex coreference involving two entities, inspired by the WinoBias format of English sentences. It includes grammatical gender cues of Hindi to test gender bias in Hindi-English NMT systems.
Test set of sentences in Hindi with simple gender-specific context used to measure gender bias in NMT systems for Hindi-English.
Dataset Card for "tamil-alpaca"
This repository includes Tamil-translated versions of the Alpaca dataset and a subset of the OpenOrca dataset.
A Brazilian Portuguese TTS dataset featuring a female voice recorded at high quality in a controlled environment, with neutral emotion and more than 20 hours of recordings. Our dataset aims to facilitate transfer learning for researchers and developers working on TTS applications: a highly professional neutral female voice can serve as a good warm-up stage for learning language-specific structures, pronunciation, and other non-individual characteristics of speech, leaving further training procedures to learn only the specific adaptations needed (e.g. timbre, emotion, and prosody). This can help enable the accommodation of a more diverse range of female voices in Brazilian Portuguese. By doing so, we also hope to contribute to the development of accessible, high-quality TTS systems for use cases such as virtual assistants, audiobooks, language learning tools, and accessibility solutions.
A database containing high-sampling-rate recordings of a single speaker reading sentences in Brazilian Portuguese with a neutral voice, along with the corresponding text corpus. Intended for speech synthesis and automatic speech recognition applications, the dataset contains text extracted from a popular Brazilian news TV program, totalling roughly 20 h of audio spoken by a trained individual in a controlled environment. The text was normalized during the recording process, and special textual occurrences (e.g. acronyms, numbers, foreign names) were replaced by a readable phonetic transcription in Portuguese. There are no noticeable accidental sounds, and background noise has been kept to a minimum in all audio samples.
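The normalization step described above (replacing digits and acronyms with readable Portuguese text) can be sketched as follows. This is a toy illustration under our own assumptions: the mappings cover only single digits and one made-up acronym entry, whereas a real TTS front end needs full number, date, and abbreviation expansion.

```python
import re

# Hypothetical, illustrative mappings -- not the dataset's actual rules.
DIGITS = {"0": "zero", "1": "um", "2": "dois", "3": "três", "4": "quatro",
          "5": "cinco", "6": "seis", "7": "sete", "8": "oito", "9": "nove"}
ACRONYMS = {"EUA": "ê u á"}  # hypothetical letter-by-letter spelling

def normalize(text):
    # Spell out acronyms first, then replace each remaining digit with its word.
    # Note: multi-digit numbers would need proper expansion, not per-digit lookup.
    for acro, spoken in ACRONYMS.items():
        text = re.sub(rf"\b{acro}\b", spoken, text)
    return re.sub(r"\d", lambda m: DIGITS[m.group()], text)

print(normalize("Canal 4 dos EUA"))  # → "Canal quatro dos ê u á"
```

Applying such rules before recording ensures the spoken audio matches the written prompt exactly, which simplifies alignment for both TTS and ASR training.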