3,148 machine learning datasets
3,148 dataset results
This is a C++ vulnerability detection dataset following realistic settings. For details, please check our study Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets (Partha et al., 2024)
We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, pro- viding valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance com- putational argumentation and support practical applications for debaters, edu- cators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: https://huggingface.co/datasets/Yusuf5/OpenCaselist
VSTaR-1M is a 1M instruction tuning dataset, created using Video-STaR, with the source datasets: * Kinetics700 * STAR-benchmark * FineDiving
Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring the users’ preferences. Their inability to interact with users for further refinements or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video along with optional information including previous suggested music and user’s preference as inputs and retrieves an appropriate music matching the context. The reasoning module, equipped with the power of Large Language Model (Vicuna-7B) and extended to multi-modal inputs, is able to provide reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build
The PQAref dataset is a dataset for fine-tuning large language models for referenced question-answering in biomedical domain.
The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for product attribute extraction study.
Official repository for the AnnoMI dataset: the first public collection of expert-annotated MI transcripts.
This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset.
This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset.
This is the throat microphone (laryngophone) variant of the VibraVox dataset.
This is the forehead accelerometer variant of the VibraVox dataset.
This is the temple vibration pickup variant of the VibraVox dataset.
This is the reference headset microphone variant of the VibraVox dataset.
Microscopy is a cornerstone of biomedical research, enabling detailed study of biological structures at multiple scales. Advances in cryo-electron microscopy, high-throughput fluorescence microscopy, and whole-slide imaging allow the rapid generation of terabytes of image data, which are essential for fields such as cell biology, biomedical research, and pathology. These data span multiple scales, allowing researchers to examine atomic/molecular, subcellular/cellular, and cell/tissue-level structures with high precision. A crucial first step in microscopy analysis is interpreting and reasoning about the significance of image findings. This requires domain expertise and comprehensive knowledge of biology, normal/abnormal states, and the capabilities and limitations of microscopy techniques. Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis, enhancing researchers’ efficiency, identifying new image biomarkers, and accelerating hypothesis ge
This repository has a review-level multidomain multilingual dataset for Aspect-based Sentiment Analysis(ABSA) for the paper ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection.
🎯 DART-Math
Visual Haystacks (VHs) is a "visual-centric" Needle-In-A-Haystack (NIAH) benchmark specifically designed to evaluate the capabilities of Large Multimodal Models (LMMs) in visual retrieval and reasoning over sets of unrelated images. Unlike conventional NIAH challenges that center on text-related retrieval and understanding with limited anecdotal examples, VHs contains a much larger number of examples and focuses on "simple visual tasks", providing a more accurate reflection of LMMs' capabilities when dealing with extensive visual context.
Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
This dataset consists of books digitised by the British Library in partnership with Microsoft. The dataset includes ~25 million pages of out of copyright texts. The majority of the texts were published in the 18th and 19th Century, but the collection also consists of a smaller number of books from earlier periods. Items within this collection cover a wide range of subject areas, including geography, philosophy, history, poetry and literature and are published in various languages.