Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

Multi-CrossRE

Multi-CrossRE is the broadest multilingual dataset for Relation Extraction (RE), covering 26 languages in addition to English and six text domains. It is a machine-translated version of CrossRE, with a sub-portion of more than 200 sentences in seven diverse languages checked by native speakers.

1 paper · 0 benchmarks · Texts

ChatGPT Advice Responses

Taking Advice from ChatGPT is a laboratory study of how student participants incorporate advice generated by ChatGPT. In a survey conducted through the Experimental Social Science Laboratory, 118 students answered 2,828 questions on topics from the MMLU benchmark. The rich dataset includes questions/choices, advice characteristics, participant answers, and participant background. It can be used to explore algorithm aversion, advice-taking, ChatGPT usage, and more.

1 paper · 0 benchmarks · Texts

InstructOpenWiki

InstructOpenWiki is a substantial instruction-tuning dataset for open-world information extraction (IE), enriched with a comprehensive corpus, extensive annotations, and diverse instructions.

1 paper · 0 benchmarks · Texts

WhenAct (Temporal Human Action Localization in Lifestyle Vlogs)

We consider the task of temporal human action localization in lifestyle vlogs. We introduce a novel dataset consisting of manual annotations of temporal localization for 13,000 narrated actions in 1,200 video clips. We present an extensive analysis of this data, which allows us to better understand how the language and visual modalities interact throughout the videos. We propose a simple yet effective method to localize the narrated actions based on their expected duration. Through several experiments and analyses, we show that our method brings complementary information with respect to previous methods and leads to improvements over previous work for the task of temporal action localization.

1 paper · 0 benchmarks · Texts, Videos
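The duration-based localization method described above is simple enough to sketch. A minimal illustration, assuming each narrated action comes with the timestamp of its narration and an expected duration, and that the predicted window is centered on that timestamp (the interval placement is an assumption of this sketch, not necessarily the authors' exact formulation):

```python
def localize_by_duration(mention_time: float, expected_duration: float) -> tuple[float, float]:
    """Predict a temporal window for a narrated action by centering an
    interval of the expected duration on the narration timestamp."""
    half = expected_duration / 2.0
    return max(0.0, mention_time - half), mention_time + half


def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Temporal intersection-over-union, the usual localization score."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0


# Example: an action narrated at t=42s with an expected duration of 8s.
pred = localize_by_duration(42.0, 8.0)   # (38.0, 46.0)
print(temporal_iou(pred, (39.0, 47.0)))  # 7/9 ≈ 0.78
```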

NaSGEC

NaSGEC is a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays.

1 paper · 0 benchmarks · Texts

Legal Advice Reddit

New dataset introduced in Parameter-Efficient Legal Domain Adaptation (Li et al., 2022) from the Legal Advice Reddit community (known as "/r/legaladvice"), sourcing the Reddit posts from the Pushshift Reddit dataset. The dataset maps the text and title of each legal question posted into one of eleven classes, based on the original Reddit post's "flair" (i.e., tag). Questions are typically informal and use non-legal-specific language. Per the Legal Advice Reddit rules, posts must be about actual personal circumstances or situations. We limit the number of labels to the top eleven classes and remove the other samples from the dataset.

1 paper · 0 benchmarks · Texts
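As a concrete illustration of the eleven-way flair classification task this dataset defines, here is a minimal baseline sketch with scikit-learn. The record fields (`title`, `text`, `flair`) and the TF-IDF + logistic-regression model are assumptions for the sketch, not the parameter-efficient adaptation method of Li et al. (2022):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical records: each post has a title, a body, and one of eleven flair labels.
posts = [
    {"title": "Landlord kept my deposit", "text": "My lease ended and ...", "flair": "housing"},
    {"title": "Rear-ended on the highway", "text": "The other driver ...", "flair": "driving"},
    # ... remaining labeled posts ...
]

# Concatenate title and body, since the dataset maps both into the flair class.
X = [p["title"] + " " + p["text"] for p in posts]
y = [p["flair"] for p in posts]

# TF-IDF + logistic regression: a standard baseline for informal, non-legal-specific language.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict(["My landlord is refusing to return my security deposit"]))
```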

SOTU_QA_2023

SOTU_QA_2023 is a curated QA benchmark on the 2023 State of the Union Address (delivered in February 2023). It contains curated questions and answers based on knowledge presented in the address, and is especially useful for examining the ability of tool-augmented language models (ALMs) to answer questions over a private document.

1 paper · 0 benchmarks · Texts

ELITR Minuting Corpus

ELITR Minuting Corpus in JSON format.

1 paper · 0 benchmarks · Texts

Switchboard Dialog Act Corpus

A corpus of telephone conversations from the Switchboard corpus, annotated with per-utterance dialog act labels.

1 paper · 1 benchmark · Texts

MultiSum

MultiSum is a dataset for multimodal summarization with multimodal output (MSMO). It consists of 17 categories and 170 subcategories that encapsulate a diverse array of real-world scenarios.

1 paper · 0 benchmarks · Texts, Videos

probability_words_nli

This dataset tests the ability of language models to correctly capture the meaning of words denoting probabilities (words of estimative probability, WEP; also called verbal probabilities), e.g. words like "probably", "maybe", "surely", and "impossible".

1 paper · 0 benchmarks · Texts
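To make the task concrete, here is a minimal sketch that grounds each probability word in a numeric range and checks whether a stated probability supports the verbal claim. The word list and thresholds are illustrative assumptions, not the dataset's actual annotation scheme:

```python
# Illustrative probability ranges for words of estimative probability (WEP).
# These thresholds are assumptions for the sketch, not the dataset's definitions.
WEP_RANGES = {
    "impossible": (0.0, 0.0),
    "unlikely":   (0.0, 0.35),
    "maybe":      (0.3, 0.7),
    "probably":   (0.6, 0.9),
    "surely":     (0.95, 1.0),
}

def entails(stated_probability: float, word: str) -> bool:
    """Does a numeric probability support the verbal-probability claim?"""
    lo, hi = WEP_RANGES[word]
    return lo <= stated_probability <= hi

# Premise: "There is an 80% chance of rain."  Hypothesis: "It will probably rain."
print(entails(0.8, "probably"))    # True
print(entails(0.8, "impossible"))  # False
```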

Famous Keyword Twitter Replies

The "Famous Keyword Twitter Replies Dataset" is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. This dataset contains five essential columns that provide valuable insights into the Twitter conversation dynamics:

1 paper · 0 benchmarks · Texts

ChinaOpen-1k

ChinaOpen is a new video dataset targeted at open-world multimodal learning, with raw data gathered from Bilibili, a popular Chinese video-sharing website. The dataset has a large webly-annotated training set of videos (associated with user-generated titles and tags) and a smaller manually annotated test set (with manually checked user titles/tags, manually written captions, and manual labels describing the visual objects, actions, and scenes shown in the visual content).

1 paper · 3 benchmarks · Texts, Videos

FICLE (Factual Inconsistency CLassification with Explanation)

The FICLE dataset is a derivative of the FEVER dataset, which is a collection of 185,445 claims generated by modifying sentences obtained from Wikipedia. These claims were then verified without knowledge of the original sentences they were derived from. Each sample in the FEVER dataset consists of a claim sentence, a context sentence extracted from a Wikipedia URL as evidence, and a type label indicating whether the claim is supported, refuted, or lacks sufficient information.

1 paper · 0 benchmarks · Texts

Belfort (The Belfort dataset: Handwritten Text Recognition from Crowdsourced Annotations)

This dataset includes minutes of the Belfort municipal council drawn up between 1790 and 1946. Documents include deliberations, lists of councillors, convocations, and agendas. It contains 24,105 text-line images that were automatically detected from pages. Up to four transcriptions are available for each line image: two from humans and two from automatic models.

1 paper · 4 benchmarks · Images, Texts
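With up to four transcriptions per line image, a natural use of the corpus is scoring automatic transcriptions against human ones. A minimal character error rate (CER) sketch; the two transcription strings below are placeholder assumptions:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Classic dynamic-programming edit distance.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, 1)

# Placeholder strings: a human transcription vs. a model output for the same line.
human = "Séance du conseil municipal de Belfort"
model = "Seance du consei1 municipal de Belfort"
print(round(character_error_rate(human, model), 3))  # 2 character edits -> ≈0.053
```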

MSVD-Indonesian

MSVD-Indonesian is derived from the MSVD dataset with the help of a machine translation service. It can be used for multimodal video-text tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Like the original English dataset, MSVD-Indonesian contains about 80k video-text pairs.

1 paper · 34 benchmarks · Texts, Videos
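For the retrieval tasks this dataset supports, results are typically reported as Recall@K over a text-video similarity matrix. A minimal sketch, assuming the ground-truth video for text query i sits at index i:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth item (assumed at index i
    for query i) ranks in the top-k by similarity."""
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = sum(i in top_k[i] for i in range(similarity.shape[0]))
    return hits / similarity.shape[0]

# Toy 3x3 text-to-video similarity matrix (rows: text queries, cols: videos).
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.1, 0.7, 0.6]])
print(recall_at_k(sim, 1))  # 1/3: only query 0 ranks its own video first
```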

GenPlot (GenPlot: 500k pre-generated plots)

This dataset contains the pre-generated plots referenced in the GenPlot paper.

1 paper · 0 benchmarks · Images, Texts

COVID-19 Vaccine Stance Dataset (COVID-19 Vaccination Stance with (De)Motivation Classification)

The data contains CSV files with anonymized user names, tweet texts, vaccine stance, cumulative score for the vaccine stance, location, and topic information. The file named all_predicted_cumulative_stance.csv contains all the tweets, scores, and classifications. We have broken this file into two separate files named demotivate_cumulative_stance.csv and motivate_cumulative_stance.csv, containing the demotivating and motivating tweets, respectively. We used these two files in the visualization tool presented at: https://ashiqur-rony.github.io/visualize-covid-stance/

1 paper · 0 benchmarks · Texts
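Given the file layout described above, a short pandas sketch reproducing the split of the combined file into the motivating and demotivating subsets; the column name `stance_class` and its label values are assumptions, since the exact schema is not specified here:

```python
import pandas as pd

# Load the combined file of tweets, scores, and classifications.
df = pd.read_csv("all_predicted_cumulative_stance.csv")

# Column name and label values are assumptions for this sketch.
motivate = df[df["stance_class"] == "motivate"]
demotivate = df[df["stance_class"] == "demotivate"]

motivate.to_csv("motivate_cumulative_stance.csv", index=False)
demotivate.to_csv("demotivate_cumulative_stance.csv", index=False)
print(len(motivate), len(demotivate))
```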

Regex101 Regular expressions

This is a dataset of regular expressions collected from regex101.com. It is not made directly available, but can be crawled from regex101.

1 paper · 0 benchmarks · Texts

L3Cube-MahaCorpus

L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We also present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, fastText word embeddings, all trained on the full Marathi corpus of 752M tokens.

1 paper · 0 benchmarks · Texts
Page 127 of 158