3,148 machine learning datasets
The McQueen dataset contains 15k visual conversations and over 80k queries, each associated with a fully specified rewrite. In addition, image box annotations are provided for the entities appearing in the rewrites.
CoreSearch is a dataset for Cross-Document Event Coreference Search. It consists of two separate passage collections: (1) a collection of passages containing manually annotated coreferring event mentions, and (2) an annotated collection of distractor passages.
VD-Ref is a dataset with ground-truth mappings from both noun phrases and pronouns to image regions. It contains 10k complete sets from the VisDialog dataset, with sentences tokenized by the StanfordCoreNLP tool to prepare them for the subsequent human annotation.
Korean Multi-label Hate Speech Dataset
DiscoSense is a benchmark for commonsense reasoning via understanding a wide variety of discourse connectives. It is sourced from two peer-reviewed academic datasets, DISCOVERY and DISCOFUSE, both of which contain pairs of sentences connected through a discourse connective.
YouwikiHow is a dataset for Weakly-Supervised temporal Article Grounding (WSAG). It contains 47K videos and an average of 20.8 query sentences for each video.
Reddit Engagement Dataset (RED) is a distant-supervision set with 80k single-turn conversations. RED is sourced from Reddit, sampling from 43 popular subreddits, and processed from a total of 5 million posts, filtering out data that was non-conversational or toxic, as well as posts whose popularity could not be ascertained.
MultiRefKGC is a dataset created from conversations from Reddit designed for Knowledge-Grounded Dialogue Generation tasks.
We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset presents a challenging fine-grained classification problem: The best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.
MUSIED is a large-scale Chinese event detection dataset based on user reviews, text conversations, and phone conversations in a leading e-commerce platform for food service, designed for event detection tasks.
KD-EmoR is a socio-behavioral emotion dataset for emotion recognition in realistic conversation scenarios. It consists of a total of 12289 sentences from 1513 scenes of a Korean TV show named 'Three Brothers'. The dataset is split into training and testing sets. Each sample consists of sentence_id, person (speaker), sentence, scene_ID, and context (scene description), and is labeled with one of the following complex emotion labels: euphoria, dysphoria, or neutral. This dataset can be used to study emotion recognition in Korean conversations.
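The sample layout described above can be sketched as a small record type. This is only an illustration: the field names come from the description, but the field types, the class name `KDEmoRSample`, and the validation step are assumptions, not part of the released dataset.

```python
from dataclasses import dataclass

# Label set as listed in the description above.
EMOTION_LABELS = ("euphoria", "dysphoria", "neutral")


@dataclass(frozen=True)
class KDEmoRSample:
    """Hypothetical container for one KD-EmoR sample (types are assumed)."""
    sentence_id: str
    person: str    # speaker
    sentence: str
    scene_ID: str
    context: str   # scene description
    label: str

    def __post_init__(self):
        # Reject labels outside the three documented emotion classes.
        if self.label not in EMOTION_LABELS:
            raise ValueError(f"unknown emotion label: {self.label!r}")


# Made-up example values, for illustration only.
sample = KDEmoRSample(
    sentence_id="s1",
    person="speaker_a",
    sentence="(sentence text)",
    scene_ID="scene_1",
    context="(scene description)",
    label="neutral",
)
```

Validating the label at construction time keeps malformed rows from propagating silently into training code.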
Gambling Address Dataset is a collection of 10,423 gambling addresses that have transactions with gambling contracts. Moreover, 51,004 non-gambling addresses are also selected (such as exchanges, wallet addresses, etc.), making the gambling address dataset more complete. In the dataset, accounts are used to refer to addresses (e.g. 0xd1ce...edec95), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.
Gambling Contract Dataset is a collection of 260 gambling smart contracts from decentralized gambling websites, such as Dicether and Degens. To construct the negative samples required for training, 1040 smart contracts that are not involved in gambling (e.g., erc20, erc721, mixer, etc.) are also selected. In the dataset, accounts are used to refer to contracts (e.g. 0x3fe2b...f8a33f), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.
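Both gambling datasets use the same integer label encoding (1, 0, -1). A minimal sketch of decoding it is shown here; the row layout of `(account, label)` pairs, the helper names, and the placeholder account strings are assumptions made for illustration and do not reflect the actual file format.

```python
from collections import Counter

# Label encoding as stated in the dataset descriptions.
LABEL_NAMES = {1: "gamble", 0: "non-gamble", -1: "other"}


def decode_label(value: int) -> str:
    """Map an integer label to its class name; raises KeyError if unknown."""
    return LABEL_NAMES[value]


def label_counts(rows):
    """Count class names over an iterable of (account, label) pairs."""
    return Counter(decode_label(label) for _account, label in rows)


# Placeholder rows with made-up account identifiers.
rows = [("account_a", 1), ("account_b", 0), ("account_c", 0), ("account_d", -1)]
counts = label_counts(rows)
```

Counting decoded labels this way is a quick check on class balance, which matters here because negatives outnumber positives in both collections.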
DeepParliament is a legal-domain benchmark dataset that gathers bill documents and metadata and supports various bill status classification tasks. The dataset text covers a broad range of bills from 1986 to the present and contains richer information on parliament bill content. There are 5329 documents in total: 4223 in the training set and 1106 in the test set. In both sets, each bill document contains many sentences, and document length varies greatly.
NEREL-BIO is an annotation scheme and corpus of PubMed abstracts in Russian and English. It contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. NEREL-BIO has two distinguishing features: it annotates nested named entities, and it can be used as a benchmark for cross-domain (NEREL -> NEREL-BIO) and cross-language (English -> Russian) transfer.
MTC is a financial-domain dataset for the multi-label topic classification task. It aims to identify the topics of a spoken dialogue.
PSM is a financial-domain dataset for the pairwise search matching task. It aims to identify the semantic similarity of a sentence pair in the search scenario.
IEE is a financial-domain dataset for the insurance-entity extraction task. Its goal is to locate named entities mentioned in the input sentence.
PIZZA is a dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.
CLEVR Mental Rotation Tests (CLEVR-MRT) is a new version of the CLEVR dataset. It contains 20 images generated for each scene, holding altitude constant and sampling over the azimuthal angle. It provides a controlled setting in which questions are posed about the properties of a scene as if that scene were observed from another viewpoint.
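The viewpoint setup described above (constant altitude, varying azimuth) can be sketched as follows. The description does not say how the azimuthal angles are sampled, so the even spacing used here is an assumption, as are the function name and the `radius` and `altitude` values.

```python
import math


def camera_positions(n_views=20, radius=10.0, altitude=5.0):
    """Return (x, y, z) camera positions on a circle at fixed height.

    Assumes evenly spaced azimuthal angles; the dataset may sample them
    differently. radius and altitude are illustrative values only.
    """
    positions = []
    for i in range(n_views):
        azimuth = 2.0 * math.pi * i / n_views
        x = radius * math.cos(azimuth)
        y = radius * math.sin(azimuth)
        positions.append((x, y, altitude))
    return positions


views = camera_positions()
```

Every position shares the same height and the same distance from the vertical axis, so only the viewing direction around the scene changes between the 20 images.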