Datasets

3,148 machine learning datasets

RealVul (RealVul-Vulnerability Dataset following realistic settings)

This is a C++ vulnerability detection dataset that follows realistic settings. For details, see our study, "Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets" (Partha et al., 2024).

1 paper · 0 benchmarks · Texts

OpenDebateEvidence

We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: https://huggingface.co/datasets/Yusuf5/OpenCaselist

1 paper · 0 benchmarks · Texts

VSTaR-1M

VSTaR-1M is a 1M instruction-tuning dataset created using Video-STaR from three source datasets: Kinetics700, STAR-benchmark, and FineDiving.

1 paper · 0 benchmarks · Texts, Videos

MuseChat Dataset (MuseChat: A Conversational Music Recommendation System for Videos (CVPR 2024 Highlight Paper))

Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring users' preferences. Their inability to interact with users for further refinements or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video, along with optional information including previously suggested music and the user's preferences, as input and retrieves appropriate music matching the context. The reasoning module, built on a Large Language Model (Vicuna-7B) and extended to multi-modal inputs, provides a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build

1 paper · 0 benchmarks · Audio, Texts, Videos

PQAref (Pubmed Question Answering with references)

PQAref is a dataset for fine-tuning large language models on referenced question answering in the biomedical domain.

1 paper · 0 benchmarks · Texts

MAVE - Attribute: Black Tea Variety (MAVE - Attribute: Black Tea Variety: A Product Dataset for Multi-source Attribute Value Extraction)

The dataset contains 3 million attribute-value annotations across 1,257 unique categories, created from 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for studying product attribute extraction.

1 paper · 0 benchmarks · Texts

AnnoMI dataset

Official repository for the AnnoMI dataset, the first public collection of expert-annotated motivational interviewing (MI) transcripts.

1 paper · 0 benchmarks · Texts

VibraVox (rigid in-ear microphone)

This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (soft in-ear microphone)

This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (throat microphone)

This is the throat microphone (laryngophone) variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (forehead accelerometer)

This is the forehead accelerometer variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (temple vibration pickup)

This is the temple vibration pickup variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (headset microphone)

This is the reference headset microphone variant of the VibraVox dataset.

1 paper · 4 benchmarks · Audio, Speech, Texts

uBench (MicroBench)

Microscopy is a cornerstone of biomedical research, enabling detailed study of biological structures at multiple scales. Advances in cryo-electron microscopy, high-throughput fluorescence microscopy, and whole-slide imaging allow the rapid generation of terabytes of image data, which are essential for fields such as cell biology, biomedical research, and pathology. These data span multiple scales, allowing researchers to examine atomic/molecular, subcellular/cellular, and cell/tissue-level structures with high precision. A crucial first step in microscopy analysis is interpreting and reasoning about the significance of image findings. This requires domain expertise and comprehensive knowledge of biology, normal/abnormal states, and the capabilities and limitations of microscopy techniques. Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis, enhancing researchers’ efficiency, identifying new image biomarkers, and accelerating hypothesis ge

1 paper · 0 benchmarks · Biology, Biomedical, Images, Texts

ROAST (Review level Opinion Aspect Sentiment Target Joint Detection for ABSA)

This repository contains a review-level, multidomain, multilingual dataset for Aspect-based Sentiment Analysis (ABSA), released with the paper "ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection".

1 paper · 0 benchmarks · Texts

DART-Math-Uniform

🎯 DART-Math

1 paper · 0 benchmarks · Texts

Visual Haystacks (VHs)

Visual Haystacks (VHs) is a "visual-centric" Needle-In-A-Haystack (NIAH) benchmark specifically designed to evaluate the capabilities of Large Multimodal Models (LMMs) in visual retrieval and reasoning over sets of unrelated images. Unlike conventional NIAH challenges that center on text-related retrieval and understanding with limited anecdotal examples, VHs contains a much larger number of examples and focuses on "simple visual tasks", providing a more accurate reflection of LMMs' capabilities when dealing with extensive visual context.

1 paper · 0 benchmarks · Images, Texts

Sieve & Swap - HowTo100M (Cooking)

Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than

1 paper · 0 benchmarks · Texts, Videos

Spanish Corpus XIX (19th Century Spanish Corpus)

1 paper · 0 benchmarks · Texts

blbooks (The British Library Books)

This dataset consists of books digitised by the British Library in partnership with Microsoft. It includes ~25 million pages of out-of-copyright texts. The majority of the texts were published in the 18th and 19th centuries, but the collection also includes a smaller number of books from earlier periods. Items in this collection cover a wide range of subject areas, including geography, philosophy, history, poetry and literature, and are published in various languages.

1 paper · 0 benchmarks · Texts
