3,148 machine learning datasets
VedantaNY-10M is a curated dataset of over 750 hours of transcripts from public discourses on the Indian philosophy of Advaita Vedanta. Sourced from 612 YouTube lectures by Swami Sarvapriyananda of the Vedanta Society of New York (VSNY), the dataset contains ~10 million tokens. These lectures offer a comprehensive exposition of Advaita Vedanta, making the dataset an invaluable resource for philosophy and linguistics research.
CFEVER is a Chinese Fact Extraction and VERification dataset published at AAAI 2024. CFEVER comprises 30,012 manually created claims based on content in Chinese Wikipedia. Inspired by the FEVER dataset (Thorne et al., 2018), we provide class labels (Supports, Refutes, or Not Enough Information) and evidence for each claim in the CFEVER dataset. Download the dataset: https://github.com/IKMLab/CFEVER-data
For testing refusal behavior in a language-specific setting, we introduce HiXSTest — a set of manually curated prompts in the Hindi language designed to measure exaggerated safety. It comprises 25 safe-unsafe pairs of prompts, carefully phrased to challenge the LLMs’ safety boundaries.
The Spectral Detection and Analysis Based Paper (SDAAP) dataset is the first open-source textual knowledge dataset for spectral analysis and detection. It contains annotated literature data along with corresponding knowledge-instruction data, drawn from a total of 4,461 theses accessible in full-text format from reputable publishers such as Nature, Springer, Elsevier, and MDPI, among others.
Poker Hand Histories: a collection of poker hand histories covering 11 poker variants, in the poker hand history (PHH) format.
BPersona-chat is an evaluation dataset based on the English multiturn chat corpus Persona-chat and the Japanese multiturn chat corpus JPersona-chat.
The JPersonaChat dataset is built by NTT CS LAB for Japanese dialogue transformer models. Please refer to https://github.com/nttcslab/japanese-dialog-transformers/tree/main?tab=readme-ov-file for detailed information.
The OnlySports Dataset is a comprehensive collection of sports-related text data, comprising approximately 600 billion tokens. This massive corpus was carefully curated from the FineWeb dataset, a cleaned subset of CommonCrawl spanning from 2013 to present. The dataset creation involved a two-step process:
This dataset is made of 6,366 threads collected from the r/AmITheAsshole community on Reddit, containing a total of 6,372,251 comments. The collected threads are the “top” submissions — those with the highest score, measured as the difference between upvotes and downvotes of a post. We downloaded them using PRAW, running 10 different queries across various temporal scopes, and then cleaned the resulting dataset by removing duplicated threads. Please refer to the paper, specifically to Table 3, for more details about the dataset.
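The collection pipeline above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the helper name and the mock submissions are invented, and with PRAW one would obtain the batches via calls such as reddit.subreddit("AmItheAsshole").top(time_filter=..., limit=...) for each temporal scope before deduplicating.

```python
def dedupe_by_id(threads):
    """Drop duplicate submissions returned by overlapping 'top' queries,
    keeping the first occurrence of each Reddit submission id."""
    seen = set()
    unique = []
    for t in threads:
        if t["id"] not in seen:
            seen.add(t["id"])
            unique.append(t)
    return unique

# Overlapping temporal-scope queries can return the same thread twice
# (mock data; ids and scores are placeholders):
batch = [
    {"id": "abc1", "score": 4120},
    {"id": "abc1", "score": 4120},  # duplicate from a second query
    {"id": "def2", "score": 980},
]
print(len(dedupe_by_id(batch)))  # → 2
```

Deduplicating by submission id (rather than by title or score) is the natural choice here, since the same post can legitimately share a title or score with another.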
A bilingual (English and Chinese) dataset of natural-language queries annotated with their corresponding GQL queries (i.e., nGQL). Each train sample contains 4 fields: prompt (the natural-language query), content (a standard nGQL query), reason (the inference text the reranker must output), and schema (the code-structure schema for the query). Each test sample contains 6 fields: prompt (the natural-language query), content (the gold nGQL), text_schema (used for the vanilla experiment), schema (the code-structure schema for the query), class (the graph-database space the query targets), and result (the results obtained by executing the gold nGQL).
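The field layout described above can be made concrete with one mock train sample and one mock test sample. All values here are invented placeholders for illustration, not real dataset entries, and the example nGQL is only schematic.

```python
# One illustrative train sample (4 fields) and test sample (6 fields),
# following the field descriptions in the dataset card. All values are
# hypothetical placeholders.

train_sample = {
    "prompt": "List all players who serve team 'Warriors'.",  # natural-language query
    "content": "MATCH (p:player)-[:serve]->(t:team) "
               "WHERE t.team.name == 'Warriors' RETURN p;",   # standard nGQL
    "reason": "...",   # inference text the reranker must output
    "schema": "...",   # code-structure schema for this query
}

test_sample = {
    "prompt": "...",        # natural-language query
    "content": "...",       # gold nGQL
    "text_schema": "...",   # used for the vanilla experiment
    "schema": "...",        # code-structure schema for this query
    "class": "...",         # which graph-database space the query targets
    "result": "...",        # results of executing the gold nGQL
}

print(len(train_sample), len(test_sample))  # → 4 6
```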
We introduce WikiOFGraph, a novel large-scale, domain-diverse dataset synthesized by LLMs, ensuring superior graph-text consistency to advance general-domain graph-to-text generation.
DailyMoth-70h is a fully self-contained ASL-to-English sign language dataset containing over 70h of video (48K clips) with aligned English captions of a single native ASL signer (white, male, and early middle-aged) from the ASL news channel TheDailyMoth. The primary purpose of the dataset is to be used as a benchmark and analysis dataset for (gloss-free) sign language translation.
This dataset contains 7,984 user comments from an Austrian online newspaper. Each comment has been annotated by 4 or more of 11 annotators as to how strongly sexism/misogyny is present in it. The dataset was used in GermEval 2024 Shared Task 1: GerMS-Detect to evaluate data-driven approaches to automatically detecting sexism in user comments.
Large multimodal models (LMMs) excel at adhering to human instructions. However, self-contradictory instructions may arise with the increasing trend toward multimodal interaction and longer contexts, which is challenging for language beginners and vulnerable populations. We introduce the Self-Contradictory Instructions benchmark to evaluate the capability of LMMs to recognize conflicting commands. It comprises 20,000 conflicts, evenly distributed between language and vision paradigms. It is constructed with a novel automatic dataset-creation framework, which expedites the process and enables us to encompass a wide range of instruction forms. Our comprehensive evaluation reveals that current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. Hence, we propose Cognitive Awakening Prompting, which injects cognition from an external source, greatly enhancing dissonance detection.
Dataset Card for SemTabNet
This dataset accompanies the following paper:
A collection of natural-language prompt-completion pairs for multiple-choice Q&A on benchmark tasks derived from US census products. The benchmark tasks are made available through a Python package dubbed folktexts. The main goal is to serve as a basis for evaluating LLMs' ability to quantify uncertainty on inherently uncertain outcomes, i.e., their quantification of aleatoric uncertainty. This is essentially a natural-language version of the popular folktables tabular-data package.
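A minimal sketch of the kind of prompt-completion pair the description implies, rendering a census-style tabular row as a multiple-choice question. The field names, wording, and respondent attributes below are illustrative assumptions, not the folktexts package's actual schema or prompt template.

```python
# Hypothetical example of one prompt-completion pair: a census respondent's
# attributes verbalized into a multiple-choice question about an uncertain
# outcome. All content is an invented placeholder.

pair = {
    "prompt": (
        "The following data describes a US census respondent.\n"
        "Age: 42. Occupation: teacher. Hours worked per week: 40.\n"
        "Question: Is this person's yearly income above $50,000?\n"
        "A. Yes\n"
        "B. No\n"
        "Answer:"
    ),
    "completion": " A",
}

print(pair["prompt"].splitlines()[-1])  # → Answer:
```

Because the outcome is genuinely uncertain given the listed attributes, a well-calibrated model's probability mass over the answer tokens "A"/"B" — rather than its single top choice — is what such a benchmark would score.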
A Chinese sign language dataset that includes dialogue information.