Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

MINT (a Multi-modal Image and Narrative Text Dubbing Dataset)

Foley audio, critical for an immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advances in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets such as AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing. In addition, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audi…

1 paper · 0 benchmarks · Audio, Images, Texts, Videos

CoSQA+ (CoSQA_Plus)


1 paper · 0 benchmarks · Texts

PolyNews

PolyNews is a multilingual dataset containing news titles in 77 languages and 19 scripts.

1 paper · 0 benchmarks · Texts

PolyNewsParallel

PolyNewsParallel is a multilingual parallel dataset containing news titles in 833 language pairs, spanning 64 languages and 17 scripts.

1 paper · 0 benchmarks · Texts

BenBench

BenBench is designed to benchmark the potential for data leakage in benchmark datasets, which can lead to biased and inequitable comparisons. The accompanying work pursues no technical contributions in system development; instead, it aims to encourage the healthy development of this field, particularly through the lens of mathematical reasoning tasks.

1 paper · 0 benchmarks · Texts

VietMed-NER

Spoken Named Entity Recognition (NER) aims to extract named entities from speech and categorize them into types such as person, location, and organization. In this work, we present VietMed-NER, the first spoken NER dataset in the medical domain. To the best of our knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models, both encoder-only and sequence-to-sequence. We found that the pre-trained multilingual model XLM-R outperformed all monolingual models on both reference text and ASR output, and that, in general, encoders perform better than sequence-to-sequence models for the NER task. By simple translation, the transcripts are applicable not just to Vietnamese but to other languages as well. All code, data, and models are publicly available here: https://github.com/leduckhai/MultiMed

1 paper · 0 benchmarks · Texts

Twitter-SMNER

This task aims to extract named entities and entity types while further predicting segmentation masks of visual objects.

1 paper · 1 benchmark · Images, Texts

TADAC (Text Annotated Distortion, Appearance and Content Dataset)

We have developed a systematic method for constructing large text-annotated image databases designed to exploit vision-language modeling for image quality assessment, and present the Text Annotated Distortion, Appearance and Content (TADAC) database, containing over 1.6 million images annotated with texts about their semantic contents, distortion characteristics, and appearance properties. We used existing labels or automatic image captioning to annotate the semantic content, designed a list of suitable textual phrases for describing the distortion characteristics, and developed automatic algorithms for computing the appearance properties, annotating them with suitable textual descriptions. The TADAC database is the first of its kind annotated with all three types of quality-relevant texts, enabling the learning of high-level knowledge about all possible factors affecting image quality. TADAC has enabled the development of the first BIQA model (SLIQUE) that joint…

1 paper · 0 benchmarks · Images, Texts

SAD-Instruct (Situational Awareness Database for Instruct-Tuning)

The Situational Awareness Database for Instruct-Tuning (SAD-Instruct) is a dataset for dynamic task guidance. It contains situationally aware instructions for performing everyday tasks or completing scenarios in 3D environments. The dataset provides step-by-step instructions for these scenarios grounded in the situation's context. This context is defined through a scenario-specific scene graph that captures the objects, attributes, and environmental relations. The dataset is designed to enable research in grounded language learning, instruction following, and situated dialogue.

1 paper · 0 benchmarks · Texts

Human-ChatGPT texts

A dataset of texts written by humans (labeled 0) and their ChatGPT-rephrased versions (labeled 1), created to train models for machine-generated text detection.

1 paper · 0 benchmarks · Texts

RES-Q (RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale)

RES-Q is a natural language instruction-based benchmark for evaluating Repository Editing Systems, which consists of 100 handcrafted repository editing tasks derived from real GitHub commits. Given an edit instruction and a code repository, RES-Q evaluates an LLM system's ability to interpret edit instructions, gather information, and construct appropriate edits to the repository.

1 paper · 1 benchmark · Texts

AGB-DE

AGB-DE is a legal NLP corpus for the automated detection of potentially void clauses in German standard form consumer contracts. It consists of 3,764 clauses that have been legally assessed by experts and annotated as potentially void (1) or valid (0). Additionally, each clause is annotated with a topic label.

1 paper · 1 benchmark · Texts

CAsT-answerability

The CAsT-answerability dataset contains binary answerability labels at three levels: sentence, passage, and ranking. It contains around 1.8k answerable and 1.9k unanswerable question-passage pairs. Sentence- and passage-level answerability is divided into train (90%) and test (10%) portions; the split is done at the question level to avoid information leakage. Ranking-level answerability has only a test set.
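The question-level split mentioned above can be sketched in Python. The field names (`question`, `passage`, `answerable`) and the hash-based bucketing are illustrative assumptions, not the dataset's actual schema or splitting code; the point is only that all pairs sharing a question land in the same portion.

```python
import hashlib

def question_level_split(pairs, test_frac=0.1):
    """Split question-passage pairs so that every pair sharing a question
    lands in the same portion, avoiding leakage across train and test."""
    train, test = [], []
    for pair in pairs:
        # Hash the question text into a stable pseudo-random bucket 0..99.
        digest = hashlib.sha256(pair["question"].encode()).hexdigest()
        bucket = int(digest, 16) % 100
        (test if bucket < test_frac * 100 else train).append(pair)
    return train, test

# Hypothetical sample records for illustration.
pairs = [
    {"question": "q1", "passage": "p1", "answerable": 1},
    {"question": "q1", "passage": "p2", "answerable": 0},
    {"question": "q2", "passage": "p3", "answerable": 1},
]
train, test = question_level_split(pairs)
```

Because the bucket depends only on the question, the two pairs for `q1` can never be separated across the split.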

1 paper · 0 benchmarks · Texts

GeoQuestions1089

GeoQuestions1089 is a crowdsourced geospatial question-answering dataset targeting the knowledge graph YAGO2geo. It contains 1,089 triples of geospatial questions, their answers, and the corresponding SPARQL/GeoSPARQL queries. It has been used to benchmark two state-of-the-art question answering engines, GeoQA2 and the engine of Hamzei et al.

1 paper · 1 benchmark · Texts

Predictive Model for Assessing Knee Muscle Injury Risk in Athletes and Non-Athletes Using sEMG

This dataset includes electromyographic (EMG) signals captured using the BiTalino device. EMG signals were recorded under three conditions: rest, additional-weight exercise, and squat activity. Data were collected from four young subjects aged between 20 and 24 years, including both athletes and non-athletes.

1 paper · 0 benchmarks · Texts

VietMed-Sum

In doctor-patient conversations, identifying medically relevant information is crucial, motivating the need for conversation summarization. In this work, we propose the first deployable real-time speech summarization system for real-world industry applications, which generates a local summary after every N speech utterances within a conversation and a global summary after the end of a conversation. Our system can enhance user experience from a business standpoint while also reducing computational costs from a technical perspective. Secondly, we present VietMed-Sum, which, to our knowledge, is the first speech summarization dataset for medical conversations. Thirdly, we are the first to use LLMs and human annotators collaboratively to create gold-standard and synthetic summaries for medical conversation summarization.
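The local/global summary schedule described above can be sketched as follows. This is a minimal control-flow illustration, not the paper's system: `summarize` is a placeholder stand-in for whatever model produces the actual summaries.

```python
def summarize(utterances):
    """Placeholder summarizer: joins utterances. A real system would
    call an LLM or a fine-tuned summarization model here."""
    return " | ".join(utterances)

def stream_summaries(conversation, n=3):
    """Emit a local summary after every n utterances, plus a global
    summary once the conversation ends."""
    local_summaries, buffer = [], []
    for utt in conversation:
        buffer.append(utt)
        if len(buffer) == n:           # every n utterances -> local summary
            local_summaries.append(summarize(buffer))
            buffer = []
    if buffer:                         # summarize any trailing partial chunk
        local_summaries.append(summarize(buffer))
    global_summary = summarize(conversation)
    return local_summaries, global_summary

convo = ["u1", "u2", "u3", "u4", "u5"]
local_sums, global_sum = stream_summaries(convo, n=3)
```

Emitting local summaries incrementally is what makes the system real-time: only the last partial chunk, not the whole conversation, has to be reprocessed as new utterances arrive.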

1 paper · 0 benchmarks · Audio, Medical, Texts

Motion Tracking of Bionic Tendon-Driven Robot

This is a dataset of robot motions based on physics simulations.

1 paper · 0 benchmarks · Texts, Tracking

COVID-19 Tweets with Motivation and Topics

The dataset contains Tweet IDs along with the location and tweet timestamp. The tweets are labeled based on motivating/demotivating status, stance towards the COVID-19 vaccine, and topic in the tweet text. To comply with Twitter guidelines, we removed the tweet texts and author information.

1 paper · 0 benchmarks · Texts

COVID-19-TweetIDs (Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set)

Since the inception of our collection, we have actively maintained and updated our GitHub repository on a weekly basis. We have published over 123 million tweets, with over 60% of the tweets in English. This paper also presents basic statistics that show that Twitter activity responds and reacts to COVID-19-related events.

1 paper · 0 benchmarks · Texts

Vulnerability Java Dataset

The dataset consists of two versions: $X_1$ with $P_3$ and $X_1$ without $P_3$, where $P_3$ represents a set of random unchanged functions from vulnerability fixing commits. This dataset is designed for finetuning large language models to detect vulnerabilities in code. It can be used for training and evaluating models in automated vulnerability detection tasks.
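The two variants described above can be sketched as a simple partition over functions drawn from vulnerability-fixing commits. The record schema (`changed` flag) is an assumption for illustration, and a real pipeline would *sample* the $P_3$ functions randomly rather than keep all of them.

```python
def build_variants(functions):
    """Partition functions from vulnerability-fixing commits into the
    two dataset variants described above."""
    # X_1: functions actually changed by the fixing commits.
    changed = [f for f in functions if f["changed"]]
    # P_3: unchanged functions from the same commits (a real pipeline
    # would draw a random sample; here we keep all of them).
    p3 = [f for f in functions if not f["changed"]]
    return changed + p3, changed  # (X_1 with P_3, X_1 without P_3)

# Hypothetical sample records for illustration.
functions = [
    {"name": "parseInput", "changed": True},
    {"name": "logError", "changed": False},
    {"name": "sanitize", "changed": True},
]
with_p3, without_p3 = build_variants(functions)
```

Including the unchanged $P_3$ functions gives the fine-tuned model negative examples from the same commit context, which is what distinguishes the two variants during evaluation.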

1 paper · 2 benchmarks · Texts
Page 137 of 158