3,148 machine learning datasets
3,148 dataset results
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
PubMedCite is a domain-specific dataset with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from full texts of references, and the weighted correlation between the salient.
This dataset contains plot summaries for 16,559 books extracted from Wikipedia, along with aligned metadata from Freebase, including book author, title, and genre.
ACL-Fig is a large-scale automatically annotated corpus consisting of 112,052 scientific figures extracted from 56K research papers in the ACL Anthology. The ACL-Fig-pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories.
Huggingface Datasets is a great library, but it lacks standardization, and datasets require preprocessing work to be used interchangeably. tasksource automates this and facilitates reproducible multi-task learning scaling.
Video Localized Narratives is a new form of multimodal video annotations connecting vision and language. The annotations are created from videos with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. It contains annotations of 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words.
MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.
YTD-18M is a large-scale corpus of 18M video-based dialogues, constructed from web videos: crucial to the data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning.
LayoutBench is a diagnostic benchmark that examines 4 spatial control skills (number, position, size, shape), where each skill consists of 2 OOD layout splits, i.e., in total 8 tasks = 4 skills x 2 splits. To disentangle spatial control from other aspects of image generation, such as generating diverse objects, LayoutBench keeps the object configurations of CLEVR, and changes the spatial layouts.
MIMIC-IV ICD-10 contains 122,279 discharge summaries—free-text medical documents—annotated with ICD-10 diagnosis and procedure codes. It contains data for patients admitted to the Beth Israel Deaconess Medical Center emergency department or ICU between 2008-2019. All codes with fewer than ten examples have been removed, and the train-val-test split was created using multi-label stratified sampling. The dataset is described further in Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study, and the code to use the dataset is found here.
ViMQ is a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. It contains Vietnamese medical questions crawled from the consultation section online between patients and doctors from www.vinmec.com, a website of a Vietnamese general hospital. Each consultation consists of a question regarding a specific health issue of a patient and a detailed respond provided by a clinical expert. The dataset contains health issues that fall into a wide range of categories including common illness, cardiology, hematology, cancer, pediatrics, etc. We removed sections where users ask about information of the hospital and selected 9,000 questions for the dataset.
CoScript is a constrained language planning dataset, which consists of 55,000 scripts.
GeoGLUE is a GeoGraphic Language Understanding Evaluation benchmark, which consists of six geographic text-related tasks, including geographic textual similarity on recall, geotagged geographic elements tagging, geographic composition analysis, geographic where what cut, and geographic entity alignment. All tasks' datasets are collected from open-released resources.
AfriQA is a cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages, where relevant passages are retrieved in a high-resource language spoken in the corresponding region and answers are translated into the source language. The dataset enables the development of more equitable QA technology.
SYNTH-PEDES is a large-scale person dataset with image-text pairs by far, which contains 312,321 identities, 4,791,711 images, and 12,138,157 textual descriptions.
JDsearch is a personalized product search dataset comprised of real user queries and diverse user-product interaction types (clicking, adding to cart, following, and purchasing) collected from JD.com, a popular Chinese online shopping platform. More specifically, the authors sample about 170,000 active users on a specific date, then record all their interacted products and issued queries in one year, without removing any tail users and products. This finally results in roughly 12,000,000 products, 9,400,000 real searches, and 26,000,000 user-product interactions.
ICSI Meeting Corpus in JSON format.
This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk shows programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset is created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by finetuning the Wav2Vec2.0 large-scale multilingual pre-trained model to build end-to-end (E2E) speech recognition models for our baseline models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.
Russian dataset of emotional speech dialogues. This dataset was assembled from ~3.5 hours of live speech by actors who voiced pre-distributed emotions in the dialogue for ~3 minutes each. <br> Each sample of dataset contains name of part from the original dataset studio source, speech file (16000 or 44100Hz) of human voice, 1 of 7 labeled emotions and the speech-to-texted part of voice speech. <br>
The main goal of the data collection is to acquire highly natural conversations that cover a wide variety of styles and scenarios. In total, the presented corpus consists of five domains: Food, Hotel, Nightlife, Shopping mall and Sightseeing. Controlled by our various task settings, the collected dialogues cover between one to four domains per dialogue, and are thus of greatly varying length and complexity. There are 808 single-task dialogues that contains a single venue target and 4, 298 multi-task dialogues consisting of at least two to four venue targets. These different venues vary in domains most of the times.