Datasets

3,148 machine learning datasets

McQueen

The McQueen dataset contains 15k visual conversations and over 80k queries, each associated with a fully specified rewrite. In addition, for entities appearing in the rewrite, the corresponding image box annotation is provided.

1 paper; 0 benchmarks; Images, Texts

CoreSearch

CoreSearch is a dataset for Cross-Document Event Coreference Search. It consists of two separate passage collections: (1) a collection of passages containing manually annotated coreferring event mentions, and (2) an annotated collection of distractor passages.

1 paper; 0 benchmarks; Texts

VD-Ref

VD-Ref is a dataset with ground-truth mappings from both noun phrases and pronouns to image regions. It contains 10k complete sets from the VisDialog dataset; the sentences are tokenized with the StanfordCoreNLP tool to prepare them for the subsequent human annotation.

1 paper; 0 benchmarks; Images, Texts

K-MHaS: Korean Multi-label Hate Speech Dataset

Korean Multi-label Hate Speech Dataset

1 paper; 0 benchmarks; Texts

DiscoSense

DiscoSense is a benchmark for commonsense reasoning via understanding a wide variety of discourse connectives. It is sourced from two peer-reviewed academic datasets, DISCOVERY and DISCOFUSE, both of which contain pairs of sentences connected through a discourse connective.

1 paper; 0 benchmarks; Texts

YouwikiHow

YouwikiHow is a dataset for Weakly-Supervised temporal Article Grounding (WSAG). It contains 47K videos and an average of 20.8 query sentences for each video.

1 paper; 0 benchmarks; Texts, Videos

Reddit Engagement Dataset

The Reddit Engagement Dataset (RED) is a distant-supervision set with 80k single-turn conversations. RED is sourced from Reddit, sampling from 43 popular subreddits, and processed from a total of 5 million posts, filtering out posts that were non-conversational, toxic, or whose popularity could not be ascertained.

1 paper; 0 benchmarks; Texts

MultiRefKGC (multi-reference KGC)

MultiRefKGC is a dataset created from conversations from Reddit designed for Knowledge-Grounded Dialogue Generation tasks.

1 paper; 0 benchmarks; Texts

GLAMI-1M (A Multilingual Image-Text Fashion Dataset)

We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model, using both visual and textual features, achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text.

1 paper; 6 benchmarks; Images, Texts

MUSIED

MUSIED is a large-scale Chinese event detection dataset based on user reviews, text conversations, and phone conversations on a leading e-commerce platform for food service.

1 paper; 0 benchmarks; Texts

KD-EmoR (Korean Drama Scene Transcript Dataset for Emotion Recognition in Conversations)

KD-EmoR is a socio-behavioral emotion dataset for emotion recognition in realistic conversation scenarios. It consists of a total of 12,289 sentences from 1,513 scenes of a Korean TV show named 'Three Brothers'. The dataset is split into training and testing sets. Each sample consists of sentence_id, person (speaker), sentence, scene_ID, and context (scene description), and is labeled with one of the following complex emotion labels: euphoria, dysphoria, and neutral. This dataset can be used to study emotion recognition in Korean conversations.
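
The KD-EmoR sample layout described above could be modeled roughly as follows. This is only a sketch: the field names follow the description, but the Python types, the class name `KDEmoRSample`, and the validation step are assumptions and not part of the dataset's official tooling.

```python
from dataclasses import dataclass

# The three complex emotion labels named in the dataset description.
EMOTION_LABELS = ("euphoria", "dysphoria", "neutral")


@dataclass(frozen=True)
class KDEmoRSample:
    """Hypothetical container for one KD-EmoR sample (types are assumed)."""
    sentence_id: int   # assumed to be an integer identifier
    person: str        # speaker
    sentence: str      # the utterance text
    scene_ID: int      # assumed to be an integer identifier
    context: str       # scene description
    label: str         # one of EMOTION_LABELS

    def __post_init__(self):
        # Reject labels outside the documented label set.
        if self.label not in EMOTION_LABELS:
            raise ValueError(f"unknown emotion label: {self.label!r}")
```

A record with a label outside the three documented values raises `ValueError`, which makes malformed rows easy to spot when loading the data.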

1 paper; 1 benchmark; Texts

Gambling Address Dataset

Gambling Address Dataset is a collection of 10,423 gambling addresses that have transactions with gambling contracts. Moreover, 51,004 non-gambling addresses are also selected (such as exchanges, wallet addresses, etc.), making the gambling address dataset more complete. In the dataset, accounts are used to refer to addresses (e.g. 0xd1ce...edec95), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.
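
The 1 / 0 / -1 label encoding described for the Gambling Address Dataset can be decoded with a small lookup. The sketch below assumes the labels are available as plain integers; the names `LABEL_NAMES`, `decode_label`, and `label_counts` are hypothetical helpers, not part of any released loader.

```python
from collections import Counter

# Label encoding as stated in the dataset description.
LABEL_NAMES = {1: "gamble", 0: "non-gamble", -1: "other"}


def decode_label(value: int) -> str:
    """Map a numeric label to its class name, rejecting unknown values."""
    try:
        return LABEL_NAMES[value]
    except KeyError:
        raise ValueError(f"unexpected label: {value!r}") from None


def label_counts(labels) -> Counter:
    """Count how many addresses fall into each named class."""
    return Counter(decode_label(v) for v in labels)
```

For example, `label_counts([1, 0, 0, -1])` yields one "gamble", two "non-gamble", and one "other" entry.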

1 paper; 0 benchmarks; Texts

Gambling Contract Dataset

Gambling Contract Dataset is a collection of 260 gambling smart contracts from decentralized gambling websites such as Dicether and Degens. To construct the negative samples required for training, 1,040 smart contracts that are not involved in gambling (e.g., erc20, erc721, mixer, etc.) are also selected. In the dataset, accounts are used to refer to contracts (e.g. 0x3fe2b...f8a33f), where 1, 0, and -1 represent the gamble, non-gamble, and other types, respectively.

1 paper; 0 benchmarks; Texts

DeepParliament

DeepParliament is a legal-domain benchmark dataset that gathers bill documents and metadata and supports various bill status classification tasks. The dataset text covers a broad range of bills from 1986 to the present and contains richer information on parliament bill content. There are a total of 5,329 documents, of which 4,223 are in the training set and 1,106 are in the test set. In both sets, each bill document contains many sentences, and document length varies greatly.

1 paper; 0 benchmarks; Texts

NEREL-BIO

NEREL-BIO is an annotation scheme and corpus of PubMed abstracts in Russian and English. It contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. NEREL-BIO has two distinguishing features: it annotates nested named entities, and it can be used as a benchmark for cross-domain (NEREL -> NEREL-BIO) and cross-language (English -> Russian) transfer.

1 paper; 0 benchmarks; Texts

MTC

MTC is a financial-domain dataset for the multi-label topic classification task. It aims to identify the topics of a spoken dialogue.

1 paper; 0 benchmarks; Texts

PSM

PSM is a financial-domain dataset for the pairwise search matching task. It aims to identify the semantic similarity of a sentence pair in the search scenario.

1 paper; 0 benchmarks; Texts

IEE

IEE is a financial-domain dataset for the insurance-entity extraction task. Its goal is to locate named entities mentioned in the input sentence.

1 paper; 0 benchmarks; Texts

PIZZA

PIZZA is a dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

1 paper; 0 benchmarks; Texts

CLEVR-MRT (CLEVR: Mental Rotation Tests)

CLEVR Mental Rotation Tests (CLEVR-MRT) is a new version of the CLEVR dataset. For each scene it contains 20 images, generated by holding the camera altitude constant and sampling over the azimuthal angle. It provides a controlled setting in which questions are posed about the properties of a scene as it would appear if observed from another viewpoint.
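
To make the CLEVR-MRT viewpoint setup concrete, the sketch below places 20 cameras on a ring at constant altitude and varying azimuth. It is an illustration only: the description does not say whether azimuths are evenly spaced or randomly sampled, so the even spacing, the `radius` and `altitude_deg` values, and the function name `camera_positions` are all assumptions.

```python
import math


def camera_positions(n_views=20, radius=10.0, altitude_deg=30.0):
    """Return (x, y, z) camera positions around the scene origin.

    Altitude (elevation angle) is held constant; only the azimuth varies.
    Even azimuth spacing is an assumption made for this sketch.
    """
    altitude = math.radians(altitude_deg)
    z = radius * math.sin(altitude)          # constant height above the ground plane
    ring_radius = radius * math.cos(altitude)  # radius of the camera ring in the x-y plane
    positions = []
    for k in range(n_views):
        azimuth = 2.0 * math.pi * k / n_views
        positions.append((ring_radius * math.cos(azimuth),
                          ring_radius * math.sin(azimuth),
                          z))
    return positions
```

Every returned position lies at the same distance from the origin and at the same height, so the views differ only by rotation around the vertical axis.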

1 paper; 0 benchmarks; Images, Texts
