GQA-OOD is a dataset and benchmark for evaluating VQA models in out-of-distribution (OOD) settings.
A multimodal empathetic dialogue dataset.
Dataset constructed from YouTube comments (72,098 comments posted by 43,859 users on 623 videos relevant to the crisis).
OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and MultiNews summarization datasets.
Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves touch-vision-language alignment (+29% classification accuracy) over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL
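The description does not specify the training recipe, but one common way to align a new (tactile) encoder with an existing vision-language embedding space is a CLIP-style contrastive objective. Below is a minimal sketch under that assumption; the function name and temperature are illustrative, not the paper's confirmed training code.

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(tactile_emb, text_emb, temperature=0.07):
        """CLIP-style InfoNCE loss over matched (tactile, text) pairs.

        tactile_emb, text_emb: (batch, dim) tensors from the two encoders,
        where row i of each tensor comes from the same vision-touch sample.
        """
        tactile_emb = F.normalize(tactile_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = tactile_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: pull matched pairs together, push the rest apart.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2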
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions. The dataset comprises 19,299 videos from 393 channels, amounting to 6,533 content hours. The topics are wide-ranging, covering domains such as science, lifestyle, politics, health, economy, and technology. The videos span various content formats, such as podcasts, lectures, news, corporate events & promotional content, and, more broadly, videos from individual content creators. We refer to the paper for further information.
The dataset consists of source code and LLVM IR pairs generated from accepted and de-duplicated programming contest solutions. The dataset is divided into language configs and mode splits. The language can be one of C, C++, D, Fortran, Go, Haskell, Nim, Objective-C, Python, Rust, and Swift, indicating the source files' language. The mode split indicates the compilation mode, which can be either Size_Optimized or Perf_Optimized.
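As a sketch of how a config/split layout like this is typically consumed with the Hugging Face datasets library; the repository id and field names below are placeholders, not the dataset's confirmed schema.

    from datasets import load_dataset

    # Hypothetical repository id and field names, for illustration only.
    pairs = load_dataset("org/src-to-llvm-ir", name="C++", split="Perf_Optimized")

    example = pairs[0]
    source_code = example["source"]   # contest solution in the config's language
    llvm_ir = example["llvm_ir"]      # IR emitted under the split's optimization mode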
DISL. The full dataset report is available at: https://arxiv.org/abs/2403.16861
In Visual Query Detection (VQD), a system is given a natural language query (prompt) and an image, and it must produce 0 to N boxes that satisfy that query. VQD is related to several other tasks in computer vision, but it captures abilities those tasks ignore. Unlike object detection, VQD can deal with attributes and relations among objects in the scene. In VQA, algorithms often produce the right answers due to dataset bias without 'looking' at relevant image regions. Referring Expression Recognition (RER) datasets have short and often ambiguous prompts, and by having only a single box as output, they make it easier to exploit dataset biases. VQD requires goal-directed object detection and outputting a variable number of boxes that answer a query.
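To make the input/output contract concrete, here is a minimal sketch of a VQD-style interface; the type alias and box convention are illustrative assumptions, not tied to a specific VQD release.

    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

    def visual_query_detection(image, query: str) -> List[Box]:
        """Return zero or more boxes satisfying the natural-language query.

        Unlike referring expression recognition, the result may be empty
        (no object matches) or contain several boxes (many objects match).
        """
        raise NotImplementedError  # model-specific

    # e.g. visual_query_detection(img, "red cars to the left of the bus")
    # might return [] or [(12.0, 40.5, 98.2, 130.0), ...]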
The E-MASAC Dataset is a collection of code-mixed conversations sourced from an Indian TV series, focusing on Hindi-English interactions. It was derived from the MASAC dataset and specifically annotated for Emotion Recognition in Conversations (ERC) tasks. The dataset comprises 8,607 dialogues with 11,440 utterances, containing instances of sarcasm and humor. Emotions such as anger, fear, joy, sadness, surprise, contempt, and neutral are annotated for each utterance by three fluent English and Hindi-speaking linguists, ensuring a high inter-annotator agreement of 0.85.
Kickstarter is a community of more than 10 million people, comprising creatives and tech enthusiasts who help bring creative projects to life. To date, members have contributed more than $3 billion to fund creative projects. The projects can be almost anything: a device, a game, an app, a film, etc.
Sentiment detection remains a pivotal task in natural language processing, yet its development in Arabic lags due to a scarcity of training materials compared to English. Addressing this gap, we present ArSen-20, a benchmark dataset tailored to propel Arabic sentiment detection forward. ArSen-20 comprises 20,000 professionally labeled tweets sourced from Twitter, focusing on the theme of COVID-19 and spanning the period from 2020 to 2023. Beyond tweet content, the dataset incorporates metadata associated with the user, enriching the contextual understanding. ArSen-20 offers a comprehensive resource to foster advancements in Arabic sentiment analysis and facilitate research in this critical domain.
The process by which sections in a document are demarcated and labeled is known as section identification. Such sections help the reader when searching for information and contextualizing specific topics. The goal of this work is to segment the sections of clinical medical domain documentation. The primary contribution of this work is MedSecId, a publicly available set of 2,002 fully annotated medical notes from MIMIC-III. We include several baselines, source code, a pretrained model, and an analysis of the data showing a relationship between medical concepts across sections using principal component analysis.
https://huggingface.co/datasets/jdustinwind/Polite
Audio-alpaca: a preference dataset for aligning text-to-audio models. Audio-alpaca is a pairwise preference dataset containing about 15k (prompt, chosen, rejected) triplets where, given a textual prompt, chosen is the preferred generated audio and rejected is the undesirable audio.
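A sketch of what one such triplet might look like and how it plugs into preference-based fine-tuning; the field names are assumptions for illustration, not the dataset's confirmed schema.

    # Hypothetical record layout for one preference triplet.
    triplet = {
        "prompt": "A dog barking in the distance with light rain",
        "chosen": "chosen_000123.wav",     # preferred generated audio
        "rejected": "rejected_000123.wav"  # undesirable generated audio
    }

    # In DPO-style alignment, the text-to-audio model is trained so that the
    # (prompt, chosen) pair is scored above the (prompt, rejected) pair.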
Given a question and passage in an Indic language, generate a short answer span from the passage as the answer.
Given a sentence in the source language, generate a translation in the target language. The data contains translation pairs in two directions, English → TargetLanguage and TargetLanguage → English.
Given an English article, generate a short summary in the target language.
Given a question in an Indic language and a passage in English, generate a short answer span. We provide both an English and target language answer span in the annotations.
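The four task formats above reduce to simple input/output records; a hedged sketch of what instances might look like (the field names are illustrative, not the datasets' confirmed schema).

    # Span-based QA (Indic question and passage, or Indic question + English passage).
    qa_example = {
        "question": "...",      # question text
        "passage": "...",       # passage to extract from
        "answer_span": "...",   # short span copied from the passage
    }

    # Translation in either direction.
    translation_example = {
        "source": "...",        # English or target-language sentence
        "target": "...",        # translation in the other direction
    }

    # Cross-lingual summarization.
    summarization_example = {
        "article": "...",       # English article
        "summary": "...",       # short summary in the target language
    }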
The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database by using the provided import script. They are intended both for scientific use by corpus linguists and for applications such as knowledge extraction programs. The corpora are identical in format and similar in size and content. They contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences; non-sentences and foreign-language material were removed. Because word co-occurrence information is useful for many applications, these data are precomputed and included as well. For each word, the most significant words appearing as immediate left or right neighbor, or appearing anywhere within the same sentence, are given.
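A minimal sketch of the kind of neighbor co-occurrence counting this implies, over a plain-text file with one sentence per line; raw counts stand in for the collection's actual significance measure, and the filename is a placeholder.

    from collections import Counter

    left_neighbors = Counter()
    right_neighbors = Counter()

    with open("sentences.txt", encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            for a, b in zip(tokens, tokens[1:]):
                right_neighbors[(a, b)] += 1  # b appears immediately right of a
                left_neighbors[(b, a)] += 1   # a appears immediately left of b

    # Ten most frequent right neighbors of the word "data":
    top = [(pair, n) for pair, n in right_neighbors.most_common()
           if pair[0] == "data"][:10]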