Datasets

3,148 machine learning datasets

Colors

A large dataset of color names and their respective RGB values, stored in CSV format.

UNER v1 (Universal NER v1)

UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 genealogically and typologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.

1 paper | 0 benchmarks | Texts

Conic10K

Conic10K is an open-ended math problem dataset on conic sections in Chinese senior high school education. It contains 10,861 carefully annotated problems, each with a formal representation, the corresponding text spans, the answer, and natural language rationales. The questions require long reasoning steps, while the topic is limited to conic sections. The dataset can be used to evaluate models on two tasks: semantic parsing and mathematical question answering (mathQA).

1 paper | 0 benchmarks | Texts

A Dataset for Relation Extraction of Natural-Products (A curated evaluation dataset for end-to-end Relation Extraction of relationships between organisms and natural-products)

A curated evaluation dataset for end-to-end Relation Extraction of relationships between organisms and natural-products.

1 paper | 0 benchmarks | Biomedical, Texts

Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval

Multi-EuP is a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament, spanning 24 languages. It is designed to investigate fairness in multilingual information retrieval (IR) by supporting analysis of both language and demographic bias in ranking. It provides an authentic multilingual corpus, with topics translated into all 24 languages, as well as cross-lingual relevance judgments. It also offers rich demographic information associated with its documents, facilitating the study of demographic bias.

1 paper | 0 benchmarks | Texts

EyeInfo

The EyeInfo Dataset is an open-source eye-tracking dataset created by Fabricio Batista Narcizo, a research scientist at the IT University of Copenhagen (ITU) and GN Audio A/S (Jabra), Denmark. It was introduced in the paper "High-Accuracy Gaze Estimation for Interpolation-Based Eye-Tracking Methods" (DOI: 10.3390/vision5030041). The dataset contains high-speed monocular eye-tracking data from an off-the-shelf remote eye tracker using active illumination. Each user's data includes a text file with annotations of eye features, environment, viewed targets, and facial features. The dataset follows the principles of the General Data Protection Regulation (GDPR).

1 paper | 0 benchmarks | Tabular, Texts, Tracking, Videos

Concerns and Value Judgments of Stakeholders in the Non-Fungible Tokens (NFTs) Market (Replication Data for: "Centralized or Decentralized?")

1 paper | 0 benchmarks | Tabular, Texts, Time series

FedNLP (FOMC Docs and Speeches)

A collection of the various forms of Federal Reserve communications.

1 paper | 0 benchmarks | Texts

UHGEvalDataset

UHGEvalDataset contains over 5,000 news items. It can be used for hallucination evaluation or detection tasks.

1 paper | 0 benchmarks | Texts

ITCPR dataset (Image-Text Composed Person Retrieval dataset)

The ITCPR dataset is a comprehensive collection specifically designed for the Zero-Shot Composed Person Retrieval (ZS-CPR) task. It consists of a total of 2,225 annotated triplets, derived from three distinct datasets: Celeb-reID, PRCC, and LAST.

1 paper | 6 benchmarks | Images, Texts

First HAREM (Primeiro HAREM)

HAREM is an evaluation initiative by Linguateca. Its Golden Collection is a carefully curated set of annotated Portuguese texts that serves as a benchmark for evaluating systems that recognize named entities in documents, and it supports ongoing research in Portuguese language processing.

1 paper | 0 benchmarks | Texts

Mini HAREM

MiniHAREM repeated the 2005 evaluation using the same methodology and platform. Held from April 3rd to 5th, 2006, it gave participants a 48-hour window to annotate, verify, and submit text collections. The results, the collection used, participant lists, submitted outputs, and updated guidelines are available. A HAREM format checker verifies compliance with the MiniHAREM directives. Information on the HAREM Meeting (open for registration until June 15th, after the Linguateca Summer School at the University of Porto) is also available.

1 paper | 0 benchmarks | Texts

HpVaxFrames

HpVaxFrames includes 64 Vaccine Hesitancy Framings found on Twitter about the HPV vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.

1 paper | 0 benchmarks | Texts

VaccineFrames

VaccineFrames combines CoVaxFrames and HpVaxFrames into a unified dataset of 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines and 64 Vaccine Hesitancy Framings found on Twitter about the HPV vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.

1 paper | 0 benchmarks | Texts

MapReader Data (in GeoHumanities workshop, SIGSPATIAL 2022)

MapReader in GeoHumanities workshop (SIGSPATIAL 2022): Gold standards and outputs

1 paper | 0 benchmarks | Images, Texts

Urdu News Headlines Dataset

Urdu News Headlines Dataset with VOA and BBC. An Urdu news headlines dataset is a collection of news headlines in the Urdu language, typically scraped from news websites and social media platforms. Such datasets can be valuable for researchers and developers working on a variety of tasks.

1 paper | 1 benchmark | Texts

GuardRails Dataset (GuardRails Dataset of Problems with Known Ambiguities)

For each problem, we provide 4 variants of prompts:

1 paper | 0 benchmarks | Texts

SAIL 2017 (Sentiment Analysis for Indian Languages)

India is a linguistic area with one of the longest histories of contact, influence, use, teaching and learning of English-in-diaspora in the world (Kachru and Nelson, 2006). As a result, a large number of Indians active on the internet can communicate in English to some degree. India is also highly diverse linguistically: apart from Hindi, it has several regional languages that are the primary tongue of people native to each region. Consequently, posts on social media such as Facebook, WhatsApp, and Twitter often contain more than one language, phenomena known as code-mixing and code-switching. At the same time, the evolution of sentiments from such social media texts has created many new opportunities for information access and language technology, but also many new challenges, making it one of the prime present-day research areas. Sentiment analysis in code-mixed data has several real-life applications in opinion mining from social media campaign to feedback analys

1 paper | 3 benchmarks | Texts

SpanEX

Reasoning over spans of tokens from different parts of the input is essential for natural language understanding (NLU) tasks such as fact-checking (FC), machine reading comprehension (MRC) or natural language inference (NLI). We introduce SpanEx, a multi-annotator dataset of human-annotated span interaction explanations for two NLU tasks: NLI and FC.

1 paper | 0 benchmarks | Texts

2D-ATOMS

Official dataset for "Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models" (Ziqiao Ma, Jacob Sansom, Run Peng, Joyce Chai; EMNLP Findings, 2023).

1 paper | 0 benchmarks | Texts
