19,997 machine learning datasets
PELD is a text-based emotional dialog dataset with personality traits for speakers.
MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. Some text fragments within the subtitles (visually salient nouns in this release) are supplemented with web images, where the word sense of each fragment has been disambiguated using a cross-lingual approach. The authors introduce a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset; the accompanying paper gives a more detailed description of the dataset and tasks. MultiSubs is intended to benefit research on the visual grounding of words, especially in the context of free-form sentences.
XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.
Kosp2e (read as "kospi") is a corpus that allows Korean speech to be translated into English text in an end-to-end manner.
MTASS is an open-source dataset in which mixtures contain three types of audio signals.
WikiGraphs is a dataset of Wikipedia articles, each paired with a knowledge graph, to facilitate research in conditional text generation, graph generation and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (one or a few sentences), which limits the capabilities of the models that can be learned from the data.
QC-Science contains 47832 question-answer pairs belonging to the science domain tagged with labels of the form subject - chapter - topic. The dataset was collected with the help of a leading e-learning platform. The dataset consists of 40895 samples for training, 2153 samples for validation and 4784 samples for testing.
The OLGA dataset contains artist similarities from AllMusic, together with content features from AcousticBrainz. With 17,673 artists, this is the largest academic artist similarity dataset that includes content-based features to date.
The World Mortality Dataset contains weekly, monthly, or quarterly all-cause mortality data from 103 countries and territories. It contains country-level data on all-cause mortality in 2015–2021 collected from various sources.
COMPARE is a taxonomy and a dataset of comparison discussions in peer reviews of research papers in the domain of experimental deep learning.
With complex scenes and rich annotations, the PADv2 dataset can be used as a test bed to benchmark affordance detection methods and may also facilitate downstream vision tasks, such as scene understanding, action recognition, and robot manipulation.
The KTH-TIPS (Textures under varying Illumination, Pose and Scale) image database was created to extend the CUReT database in two directions, by providing variations in scale as well as pose and illumination, and by imaging other samples of a subset of its materials in different settings.
InferWiki is a Knowledge Graph Completion (KGC) dataset that improves upon existing benchmarks in inferential ability, assumptions, and patterns. First, each testing sample is predictable with supportive data in the training set. Second, InferWiki initiates the evaluation following the open-world assumption and improves the inferential difficulty of the closed-world assumption, by providing manually annotated negative and unknown triples. Third, the dataset includes various inference patterns (e.g., reasoning path length and types) for comprehensive evaluation.
AutoChart is a dataset for chart-to-text generation, a task that consists of generating analytical descriptions of visual plots.
This is an entity-level Twitter Sentiment Analysis dataset. For each message, the task is to judge the sentiment of the entire sentence towards a given entity. For example, "A outperforms B" is positive for entity A but negative for entity B. The dataset contains ~70K labeled training messages and 1K labeled validation messages. It is available online for free on Kaggle.
ConvRef is a conversational QA benchmark with reformulations. It consists of around 11k natural conversations with about 205k reformulations. ConvRef builds upon the conversational KG-QA benchmark ConvQuestions. Questions come from five different domains (books, movies, music, TV series and soccer), and answers are Wikidata entities. Conversation sessions in ConvQuestions were used as input to a user study. Study participants interacted with a baseline QA system that was trained using the available paraphrases in ConvQuestions as proxies for reformulations. Users were shown follow-up questions in a given conversation interactively, one after the other, along with the answer from the baseline QA system. For wrong answers, the user was prompted to reformulate the question, up to four times if needed. In this way, users were able to pose reformulations based on previous wrong answers and the conversation history.
WikiNLDB is a novel dataset for training Natural Language Databases (NLDBs) which is generated by transforming structured data from Wikidata into natural language facts and queries.
Phy-Q is a benchmark that requires an agent to reason about physical scenarios and take an action accordingly. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, the authors identify 15 essential physical scenarios. For each scenario, a wide variety of distinct task templates are created, and all the task templates within the same scenario can be solved by using one specific physical rule.
This is a subset of the TREC 2005 enterprise track data, and consists of 48 topics and 200 candidates per topic, with each candidate labeled as an expert or non-expert for the topic. The task is to rank the candidates based on their expertise on a topic, using a corpus of mailing lists from the World Wide Web Consortium (W3C). This is an application where the unconstrained algorithm does better for the minority protected group.
The Photometrically Distorted Synthetic COCO (PDS-COCO) dataset is a synthetically created dataset for learning homography estimation. The idea is the same as in the Synthetic COCO (S-COCO) dataset, with SSD-like image distortion added at the beginning of the procedure. The first step adjusts the brightness of the image using a randomly picked value $\delta_b \in \mathcal{U}(-32, 32)$. Next, contrast, saturation and hue noise are applied with the following values: $\delta_c \in \mathcal{U}(0.5, 1.5)$, $\delta_s \in \mathcal{U}(0.5, 1.5)$ and $\delta_h \in \mathcal{U}(-18, 18)$. Finally, the color channels of the image are randomly swapped with a probability of $0.5$. This photometric distortion procedure is applied to the original image independently to create the source and target candidates.
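The distortion steps described for PDS-COCO can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code: the function name `photometric_distort` is hypothetical, and it assumes 8-bit RGB input, additive brightness, and multiplicative contrast. The saturation and hue steps are approximated by scaling and rotating the chroma plane in YIQ space, which stands in for the HSV round trip that SSD-style pipelines typically use.

```python
import numpy as np

# NTSC RGB -> YIQ matrix; YIQ is used here as a stand-in for HSV (an assumption).
RGB_TO_YIQ = np.array([[0.299, 0.587, 0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523, 0.312]])
YIQ_TO_RGB = np.linalg.inv(RGB_TO_YIQ)


def photometric_distort(image, rng):
    """Return a distorted copy of an HxWx3 uint8 RGB image (hypothetical helper)."""
    img = image.astype(np.float64)

    # Brightness: add delta_b drawn from U(-32, 32).
    img += rng.uniform(-32.0, 32.0)

    # Contrast: multiply by delta_c drawn from U(0.5, 1.5).
    img *= rng.uniform(0.5, 1.5)

    # Saturation (delta_s) scales the chroma, hue (delta_h, in degrees) rotates it.
    delta_s = rng.uniform(0.5, 1.5)
    theta = np.deg2rad(rng.uniform(-18.0, 18.0))
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    chroma = np.array([[1.0, 0.0, 0.0],
                       [0.0, delta_s * cos_t, -delta_s * sin_t],
                       [0.0, delta_s * sin_t, delta_s * cos_t]])
    transform = YIQ_TO_RGB @ chroma @ RGB_TO_YIQ
    img = img @ transform.T  # apply the 3x3 transform to every pixel

    # Channel swap: with probability 0.5, permute the color channels
    # (the random permutation may occasionally be the identity).
    if rng.random() < 0.5:
        img = img[..., rng.permutation(3)]

    return np.clip(img, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
# Two independent draws from the same original give source and target candidates.
source = photometric_distort(original, rng)
target = photometric_distort(original, rng)
```

Calling the function twice on the same original mirrors the description: each call draws its own distortion parameters, so the source and target candidates differ photometrically before any homography is applied.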