3,148 machine learning datasets
The dataset contains 30 million cryptocurrency-related tweets from 10.10.2020 to 3.3.2021. See https://github.com/meakbiyik/ask-who-not-what for more details.
The Room environment - v1
Fallout New Vegas Dialog is a multilingual, sentiment-annotated dialog dataset from Fallout New Vegas. The game developers pre-annotated every line of dialog in the game with one of 8 sentiments: anger, disgust, fear, happy, neutral, pained, sad and surprised. The lines have been translated into 5 languages: English, Spanish, German, French and Italian.
Scan Entities in 3D (ScanEnts3D) is a large-scale dataset which provides explicit correspondences between 369k objects across 84k natural referential sentences, covering 705 real-world scenes.
MiST (Modals In Scientific Text) is a dataset containing 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
FreCDo is a corpus for French dialect identification comprising 413,522 French text samples collected from public news websites in Belgium, Canada, France and Switzerland.
Robust Summarization Evaluation Benchmark is a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets.
FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. The FETA Car-Manuals dataset consists of a total of 349 PDF documents from 5 car manufacturers, namely Nissan, Toyota, Mazda, Renault, Chevrolet.
FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. The FETA IKEA dataset contains 26 documents with 7366 pages total, approximately 9574 images and 23927 texts automatically extracted from those pages.
Verifee is a dataset of news articles with fine-grained trustworthiness annotations. It contains over 10,000 unique articles from almost 60 Czech online news sources. These are categorized into one of the 4 classes across the credibility spectrum we propose, ranging from entirely trustworthy articles all the way to manipulative ones.
SimpEvalASSET is a dataset for training learnable metrics using modern language models. It comprises 12K human ratings on 2.4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications, including generations from GPT-3.5.
A dataset of games played in the card game "Cards Against Humanity" (CAH), by human players, derived from the online CAH labs. Each round includes the cards presented to users - a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, and which punchline was picked by a player each round, along with text and metadata.
This dataset for intent classification from human speech covers 14 coarse-grained intents from the banking domain. This work is inspired by a similar release in the Minds-14 dataset; here, we restrict ourselves to Indian English but with a much larger training set. The data was generated by 11 (Indian English) speakers and recorded over a telephony line. We also provide access to anonymized speaker information, such as gender, languages spoken, and native language, to allow more structured discussions around robustness and bias in the models you train.
MENYO-20k is the first multi-domain parallel corpus with a special focus on clean orthography for Yorùbá-English with standardized train-test splits for benchmarking.
This dataset tests the capabilities of language models to correctly capture the meaning of words denoting probabilities (WEP), e.g. words like "probably", "maybe", "surely", "impossible".
The dataset used to train and evaluate TunesFormer is collected from two sources: The Session and ABCnotation.com. The Session is a community website focused on Irish traditional music, while ABCnotation.com is a website that provides a standard for folk and traditional music notation in the form of ASCII text files. The combined dataset consists of 285,449 ABC tunes, with 99% (282,595) of the tunes used as the training set and the remaining 1% (2,854) used as the evaluation set.
AviationQA is introduced in the paper titled "There is No Big Brother or Small Brother: Knowledge Infusion in Language Models for Link Prediction and Question Answering".
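The split sizes quoted for the TunesFormer data can be checked with a few lines of arithmetic. This is only a sketch: the rounding rule (round the 99% share to the nearest whole tune) is an assumption, since the description does not say how the split was drawn.

```python
# Totals taken from the description above.
total_tunes = 285_449

# Assumption: the 99% training share is rounded to the nearest integer.
# Integer arithmetic avoids floating-point rounding surprises.
train_size = (total_tunes * 99 + 50) // 100

# The evaluation set is whatever remains.
eval_size = total_tunes - train_size

print(train_size, eval_size)
```

Under that assumption the two counts match the figures given in the description and add up to the full dataset.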
We have prepared a dataset, ParagraphOrdreing, which consists of around 300,000 paragraph pairs. We collected our data from Project Gutenberg. We have written an API for gathering and pre-processing the data into the format required by the task. Each example contains two paragraphs and a label that indicates whether the second paragraph actually follows the first (true order, label 1) or the order has been reversed.
Validity and Novelty are determined in a comparative setting between two conclusions at a time. For both Validity and Novelty, the possible labels are "Conclusion 1 is better", "tie" and "Conclusion 2 is better".
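A minimal sketch of how paragraph pairs of this kind could be built from consecutive paragraphs of a text. It is not the authors' API: the function name `make_pairs`, the 50/50 reversal rate, and the label value 0 for reversed pairs are all assumptions made for illustration; only label 1 for the true order comes from the description.

```python
import random


def make_pairs(paragraphs, seed=0):
    """Build (first, second, label) examples from consecutive paragraphs.

    label 1: the second paragraph follows the first in the source text.
    label 0: the pair has been reversed (the value 0 is an assumption).
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    examples = []
    for earlier, later in zip(paragraphs, paragraphs[1:]):
        if rng.random() < 0.5:
            examples.append((earlier, later, 1))  # true order
        else:
            examples.append((later, earlier, 0))  # reversed order
    return examples


# Hypothetical toy input standing in for paragraphs of a Project Gutenberg book.
paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]
pairs = make_pairs(paragraphs)
```

Each of the `len(paragraphs) - 1` adjacent pairs yields exactly one example, so a book with n paragraphs contributes n - 1 labelled pairs.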
A dataset for image editing containing >450k samples of: