Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

Autoregressive Paraphrase Dataset (ARPD)

For more details see https://huggingface.co/datasets/jpwahle/autoregressive-paraphrase-dataset

1 paper · 0 benchmarks · Texts

MELON (Melodic Design)

A multimodal dataset of creative, designed documents containing images with corresponding captions, paired with music based on around 50 moods/themes.

1 paper · 0 benchmarks · Images, Texts

Statcan Dialogue Dataset

The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

1 paper · 1 benchmark · Tables, Texts

AQL-22 (Archive Query Log)

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

1 paper · 0 benchmarks · Ranking, Texts

A Large Scale Fish Dataset (A Large-Scale Dataset for Fish Segmentation and Classification)

This dataset contains 9 different seafood types collected from a supermarket in Izmir, Turkey, for a university-industry collaboration project at Izmir University of Economics; the work was published at ASYU 2020. The dataset includes image samples of gilt-head bream, red sea bream, sea bass, red mullet, horse mackerel, black sea sprat, striped red mullet, trout, and shrimp.

1 paper · 0 benchmarks · Images, Texts

FewDR

FewDR is a dataset for few-shot dense retrieval (DR). It aims to test how effectively a retriever can generalize to novel search scenarios from only a few samples. Specifically, FewDR employs class-wise sampling to establish a standardized "few-shot" setting with finely defined classes, reducing variability across multiple sampling rounds.

1 paper · 0 benchmarks · Texts

PGDataset (Profile Generation Dataset)

PGDataset (Profile Generation Dataset) is a dataset created for the PGTask (Profile Generation Task), where the goal is to extract/generate a profile sentence given a dialogue utterance.

1 paper · 8 benchmarks · Dialog, Texts

XWikiRef

XWikiRef is a new dataset for the task of cross-lingual multi-document summarization. The task is to generate Wikipedia-style text in low-resource languages by taking reference text as input. The dataset covers 8 languages: Bengali (bn), English (en), Hindi (hi), Marathi (mr), Malayalam (ml), Odia (or), Punjabi (pa), and Tamil (ta). It also covers 5 domains: books, films, politicians, sportsmen, and writers.

1 paper · 0 benchmarks · Texts

CKBP v2

CKBP v2 is a new commonsense knowledge base (CSKB) population benchmark. It improves on its predecessor by using expert rather than crowd-sourced annotation, and by adding diversified adversarial samples to make the evaluation set more representative.

1 paper · 0 benchmarks · Texts

MIMIC-IV ICD-9

MIMIC-IV ICD-9 contains 209,326 discharge summaries—free-text medical documents—annotated with ICD-9 diagnosis and procedure codes. It covers patients admitted to the Beth Israel Deaconess Medical Center emergency department or ICU between 2008 and 2019. All codes with fewer than ten examples have been removed, and the train-val-test split was created using multi-label stratified sampling. The dataset is described further in Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study, which is accompanied by code for using the dataset.

1 paper · 18 benchmarks · Texts

KnowledJe

We introduce KnowledJe, an English-language knowledge graph of antisemitic history and language from the 20th century to the present. Structured as a JSON file, KnowledJe currently contains 618 entries, which consist of 210 event names, 137 place names, 95 person names, 80 dates (years), 38 publication names, 27 organization names, and 1 product name. Each entry is associated with its own dictionary, which contains descriptions, locations, authors, and dates as applicable. We obtain the entries from four Wikipedia articles: "Timeline of antisemitism in the 20th century," "Timeline of antisemitism in the 21st century," the "Jews" section of "List of religious slurs," and "Timeline of the Holocaust." To obtain descriptions for each applicable key, we used the following general rules: 1. If the concept associated with the key is a slur, the description is the entry in the "Meaning, origin, and notes" column of the "List of religious slurs" article. 2. Otherwise, if the concept associate

1 paper · 0 benchmarks · Texts
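A minimal sketch of how a JSON knowledge graph like this might be loaded and tallied by entry category. The `type` key and both sample entries are invented for illustration; the actual schema is defined in the dataset's repository.

```python
import json
from collections import Counter

# Hypothetical fragment mimicking the described layout: a JSON object
# mapping entry names to per-entry dictionaries (fields are assumptions).
sample_kg = json.loads("""
{
  "example-event":  {"type": "event", "date": "1938", "description": "..."},
  "example-place":  {"type": "place", "description": "..."},
  "example-person": {"type": "person", "description": "..."}
}
""")

# Count entries by category, mirroring the tallies in the description
# (events, places, persons, dates, publications, organizations, products).
type_counts = Counter(entry.get("type", "unknown") for entry in sample_kg.values())
print(type_counts)
```
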

EchoKG

The Echo Corpus (Arviv et al., 2021) infused with information from KnowledJe (Halevy, 2023). The infusion algorithm is detailed in Algorithm 1, Section 3.2 of Halevy (2023) (https://arxiv.org/pdf/2304.11223.pdf). The dataset contains 4,630 tweet samples in total, 380 of which are labeled as antisemitic hate speech. Files are included in https://github.com/enscma2/knowledje and described in its README.md. Potential use case: detection of antisemitic hate speech.

1 paper · 0 benchmarks · Texts

Echo Corpus

A large dataset of over 18,000,000 English tweets posted by ∼7K echo users, constructed as follows: 1. Base corpus: we obtained access to a random sample of 10% of all public tweets posted in May and June 2016, the peak use of the echo. 2. Raw echo corpus: searching the base corpus, we extracted all tweets containing the echo symbol, resulting in 803,539 tweets posted by 418,624 users. Filtering out non-English tweets and users who used the echo fewer than three times left ∼7K users. 3. Echo corpus: we used the Twitter API to obtain the most recent tweets (up to 3.2K) of each of the users remaining in the English list. This process resulted in ∼18M tweets posted by 7,073 users. Some of the accounts we found using the echo were already suspended or deleted at the time of collection, so their tweets were not retrievable. Relevant footnote: the echo is found in tweets written in multiple languages, particularly in East Asian languages of which the user base

1 paper · 0 benchmarks · Texts

LibriS2S

LibriS2S is a speech-to-speech translation (S2ST) dataset built on top of existing resources. It provides English-German speech and text quadruplets totaling just over 50 hours for each language.

1 paper · 0 benchmarks · Speech, Texts

bSDD (buildingSMART Data Dictionary)

The buildingSMART Data Dictionary (bSDD) is an online service that hosts classifications and their properties, allowed values, units and translations. The bSDD allows linking between all the content inside the database. It provides a standardized workflow to guarantee data quality and information consistency.

1 paper · 0 benchmarks · CAD, Tables, Texts

Multimedia Goal-oriented Generative Script Learning Dataset

A dataset of multimedia steps for two categories, gardening and crafts, comprising 79,089 multimedia steps across 5,652 tasks in total.

1 paper · 0 benchmarks · Images, Texts

WikiWeb2M (Wikipedia Webpage 2M)

Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open-source dataset consisting of over 2 million English Wikipedia articles. It is created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title; section titles, text, and indices; and images with their captions.

1 paper · 0 benchmarks · Images, Texts

ParsVQA-Caps

Despite recent advances in vision-and-language tasks, most progress is still focused on resource-rich languages such as English. Furthermore, widespread vision-and-language datasets directly adopt images representative of American or European cultures, resulting in bias. Hence, we introduce ParsVQA-Caps, the first Persian benchmark for visual question answering and image captioning. We collect data in two ways for each task: human-based and template-based for VQA, and human-based and web-based for image captioning. The image captioning dataset consists of over 7.5k images and about 9k captions. The VQA dataset consists of almost 11k images and 28.5k question-answer pairs with short and long answers, usable for both classification-style and generative VQA.

1 paper · 0 benchmarks · Images, Texts

Webis-TLDR-17 Corpus

This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The format is a JSON Lines file, where each line is a JSON object representing a post with the following schema:

  • author: string (nullable)
  • body: string (nullable)
  • normalizedBody: string (nullable)
  • content: string (nullable)
  • content_len: long (nullable)
  • summary: string (nullable)
  • summary_len: long (nullable)
  • id: string (nullable)
  • subreddit: string (nullable)
  • subreddit_id: string (nullable)
  • title: string (nullable)

1 paper · 0 benchmarks · Texts
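A JSON Lines corpus with this schema can be consumed line by line. A minimal sketch, with invented sample values, of extracting (content, summary) pairs for summarization; field names follow the schema described above, and since any field may be null, incomplete posts are skipped:

```python
import json

# Hypothetical sample line in the described JSONL format (values invented).
sample_lines = [
    json.dumps({
        "author": "user_a",
        "body": "Long post body ... TL;DR: too long, did not read.",
        "normalizedBody": "Long post body ...",
        "content": "Long post body ...",
        "content_len": 4,
        "summary": "too long, did not read.",
        "summary_len": 5,
        "id": "t3_abc",
        "subreddit": "AskReddit",
        "subreddit_id": "t5_xyz",
        "title": "An example post",
    }),
]

def load_posts(lines):
    """Parse JSONL lines into (content, summary) pairs for summarization."""
    pairs = []
    for line in lines:
        post = json.loads(line)
        # Any field may be null, so skip posts missing either side of the pair.
        if post.get("content") and post.get("summary"):
            pairs.append((post["content"], post["summary"]))
    return pairs

pairs = load_posts(sample_lines)
print(pairs[0][1])  # -> too long, did not read.
```
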

SWS (Smart Word Suggestions Benchmark)

Smart Word Suggestions (SWS) is a task and benchmark. The task involves identifying words or phrases that require improvement and providing substitution suggestions. The benchmark includes human-labeled data for testing, a large distantly supervised dataset for training, and a framework for evaluation. The test data includes 1,000 sentences written by English learners, accompanied by over 16,000 substitution suggestions annotated by 10 native speakers. The training dataset comprises over 3.7 million sentences and 12.7 million suggestions generated through rules.

1 paper · 0 benchmarks · Texts
Page 126 of 158