Datasets

3,148 machine learning datasets

3,148 dataset results

TextBox 2.0

TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.

3 papers0 benchmarksTexts

PubMedCite

PubMedCite is a domain-specific dataset with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from full texts of references, and the weighted correlation between the salient.

3 papers0 benchmarksBiomedical, Texts

CMU Book Summary Dataset

This dataset contains plot summaries for 16,559 books extracted from Wikipedia, along with aligned metadata from Freebase, including book author, title, and genre.

3 papers0 benchmarksTexts

ACL-Fig

ACL-Fig is a large-scale automatically annotated corpus consisting of 112,052 scientific figures extracted from 56K research papers in the ACL Anthology. The ACL-Fig-pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories.

3 papers0 benchmarksImages, Texts

Tasksource

Huggingface Datasets is a great library, but it lacks standardization, and datasets require preprocessing work to be used interchangeably. tasksource automates this and facilitates reproducible multi-task learning scaling.

3 papers0 benchmarksTexts

Video Localized Narratives

Video Localized Narratives is a new form of multimodal video annotations connecting vision and language. The annotations are created from videos with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. It contains annotations of 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words.

3 papers0 benchmarksTexts

MultiQ

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.

3 papers1 benchmarksTexts

YTD-18M

YTD-18M is a large-scale corpus of 18M video-based dialogues, constructed from web videos: crucial to the data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning.

3 papers0 benchmarksTexts, Videos

LayoutBench

LayoutBench is a diagnostic benchmark that examines 4 spatial control skills (number, position, size, shape), where each skill consists of 2 OOD layout splits, i.e., in total 8 tasks = 4 skills x 2 splits. To disentangle spatial control from other aspects of image generation, such as generating diverse objects, LayoutBench keeps the object configurations of CLEVR, and changes the spatial layouts.

3 papers1 benchmarksImages, Texts

MIMIC-IV ICD-10

MIMIC-IV ICD-10 contains 122,279 discharge summaries—free-text medical documents—annotated with ICD-10 diagnosis and procedure codes. It contains data for patients admitted to the Beth Israel Deaconess Medical Center emergency department or ICU between 2008-2019. All codes with fewer than ten examples have been removed, and the train-val-test split was created using multi-label stratified sampling. The dataset is described further in Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study, and the code to use the dataset is found here.

3 papers18 benchmarksTexts

ViMQ

ViMQ is a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. It contains Vietnamese medical questions crawled from the consultation section online between patients and doctors from www.vinmec.com, a website of a Vietnamese general hospital. Each consultation consists of a question regarding a specific health issue of a patient and a detailed respond provided by a clinical expert. The dataset contains health issues that fall into a wide range of categories including common illness, cardiology, hematology, cancer, pediatrics, etc. We removed sections where users ask about information of the hospital and selected 9,000 questions for the dataset.

3 papers0 benchmarksTexts

CoScript

CoScript is a constrained language planning dataset, which consists of 55,000 scripts.

3 papers0 benchmarksTexts

GeoGLUE (GeoGraphic Language Understanding Evaluation Benchmark)

GeoGLUE is a GeoGraphic Language Understanding Evaluation benchmark, which consists of six geographic text-related tasks, including geographic textual similarity on recall, geotagged geographic elements tagging, geographic composition analysis, geographic where what cut, and geographic entity alignment. All tasks' datasets are collected from open-released resources.

3 papers0 benchmarksTexts

AfriQA

AfriQA is a cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages, where relevant passages are retrieved in a high-resource language spoken in the corresponding region and answers are translated into the source language. The dataset enables the development of more equitable QA technology.

3 papers0 benchmarksTexts

SYNTH-PEDES

SYNTH-PEDES is a large-scale person dataset with image-text pairs by far, which contains 312,321 identities, 4,791,711 images, and 12,138,157 textual descriptions.

3 papers0 benchmarksImages, Texts

JDsearch

JDsearch is a personalized product search dataset comprised of real user queries and diverse user-product interaction types (clicking, adding to cart, following, and purchasing) collected from JD.com, a popular Chinese online shopping platform. More specifically, the authors sample about 170,000 active users on a specific date, then record all their interacted products and issued queries in one year, without removing any tail users and products. This finally results in roughly 12,000,000 products, 9,400,000 real searches, and 26,000,000 user-product interactions.

3 papers0 benchmarksTexts

ICSI Meeting Corpus

ICSI Meeting Corpus in JSON format.

3 papers1 benchmarksTexts

Zambezi Voice

This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk shows programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset is created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by finetuning the Wav2Vec2.0 large-scale multilingual pre-trained model to build end-to-end (E2E) speech recognition models for our baseline models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.

3 papers0 benchmarksAudio, Images, Texts

RESD (Russian Emotional Speech Dialogs with annotated text)

Russian dataset of emotional speech dialogues. This dataset was assembled from ~3.5 hours of live speech by actors who voiced pre-distributed emotions in the dialogue for ~3 minutes each. <br> Each sample of dataset contains name of part from the original dataset studio source, speech file (16000 or 44100Hz) of human voice, 1 of 7 labeled emotions and the speech-to-texted part of voice speech. <br>

3 papers6 benchmarksAudio, Speech, Texts

MMConv

The main goal of the data collection is to acquire highly natural conversations that cover a wide variety of styles and scenarios. In total, the presented corpus consists of five domains: Food, Hotel, Nightlife, Shopping mall and Sightseeing. Controlled by our various task settings, the collected dialogues cover between one to four domains per dialogue, and are thus of greatly varying length and complexity. There are 808 single-task dialogues that contains a single venue target and 4, 298 multi-task dialogues consisting of at least two to four venue targets. These different venues vary in domains most of the times.

3 papers7 benchmarksImages, Texts

PreviousPage 81 of 158Next