Datasets

3,148 machine learning datasets

Standardized Project Gutenberg Corpus

The Standardized Project Gutenberg Corpus (SPGC) is an open science approach to a curated version of the complete Project Gutenberg (PG) data containing more than 50,000 books and more than 3×10^9 word-tokens.

9 papers | 0 benchmarks | Texts

Earnings Call

The Earnings Call dataset consists of processed earnings conference call data (text and audio). It can be used to predict financial risk from both the textual and vocal features of the calls.

9 papers | 0 benchmarks | Financial, Texts

UDIVA

UDIVA is a new non-acted dataset of face-to-face dyadic interactions, where interlocutors perform competitive and collaborative tasks with different behavior elicitation and cognitive workload. The dataset consists of 90.5 hours of dyadic interactions among 147 participants distributed over 188 sessions, recorded using multiple audiovisual and physiological sensors. Currently, it includes sociodemographic, self- and peer-reported personality, internal state, and relationship profiling from participants.

9 papers | 0 benchmarks | Audio, Images, Texts, Videos

LEAF-QA

LEAF-QA is a comprehensive dataset of 250,000 densely annotated figures/charts, constructed from real-world open data sources, along with ~2 million question-answer (QA) pairs querying the structure and semantics of these charts. LEAF-QA highlights the problem of multimodal QA, which is notably different from conventional visual QA (VQA) and has recently gained interest in the community. Furthermore, LEAF-QA is significantly more complex than previous attempts at chart QA, viz. FigureQA and DVQA, which present only limited variations in chart data. Because LEAF-QA is constructed from real-world sources, it requires a novel architecture to enable question answering.

9 papers | 0 benchmarks | Texts

WiC-TSV (Words-in-Context: Target Sense Verification)

WiC-TSV is a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, it is a framework for Target Sense Verification of Words in Context. Its distinguishing features are its formulation as a binary classification task, which makes it independent of external sense inventories, and its coverage of various domains. This makes the dataset highly flexible for evaluating a diverse set of models and systems in and across domains. WiC-TSV provides three different evaluation settings, depending on the input signals provided to the model.

9 papers | 18 benchmarks | Texts

LDC2020T02 (Abstract Meaning Representation (AMR) Annotation Release 3.0)

Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

9 papers | 2 benchmarks | Graphs, Texts

Ascent KB

This dataset contains 8.9M commonsense assertions extracted by the Ascent pipeline developed at the Max Planck Institute for Informatics. The focus of this dataset is on everyday concepts such as elephant, car, and laptop. The current version of Ascent KB (v1.0.0) is approximately 19 times larger than ConceptNet (note that this comparison excludes non-commonsense knowledge in ConceptNet, such as lexical relations).

9 papers | 0 benchmarks | Texts

CaSiNo

CaSiNo is a dataset of 1,030 negotiation dialogues in English. To create the dataset, two participants take the roles of campsite neighbors and negotiate for Food, Water, and Firewood packages, based on their individual preferences and requirements. This design keeps the task tractable while still facilitating linguistically rich and personal conversations.

9 papers | 0 benchmarks | Texts

SPARTQA (SPAtial Reasoning on Textual Question Answering)

SpartQA is a textual question answering benchmark for spatial reasoning on natural language text. It contains more realistic spatial phenomena than prior datasets cover and is challenging for state-of-the-art language models (LMs).

9 papers | 0 benchmarks | Texts

NELA-GT-2018

NELA-GT-2018 is a dataset for the study of misinformation that consists of 713k articles collected between 02/2018 and 11/2018. These articles were collected directly from 194 news and media outlets, including mainstream, hyper-partisan, and conspiracy sources. The dataset includes ground truth ratings of the sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust.

9 papers | 0 benchmarks | Texts

MuSe-CaR (Multimodal Sentiment Analysis in Car Reviews)

The MuSe-CaR database is a large multimodal (video, audio, and text) dataset gathered in the wild to further the understanding of multimodal sentiment analysis in the wild, e.g., the emotional engagement that takes place during product reviews (here, automobile reviews), where a sentiment is linked to a topic or entity.

9 papers | 0 benchmarks | Audio, Texts, Videos

DL-HARD (Deep Learning Hard)

Deep Learning Hard (DL-HARD) is an annotated dataset designed to more effectively evaluate neural ranking models on complex topics. It builds on TREC Deep Learning (DL) questions extensively annotated with query intent categories, answer types, wikified entities, topic categories, and result type metadata from a leading web search engine.

9 papers | 0 benchmarks | Texts

OntoGUM

OntoGUM is an OntoNotes-like coreference dataset converted, using deterministic rules, from GUM, an English corpus covering 12 genres.

9 papers | 1 benchmark | Texts

BiToD

BiToD is a bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling. BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base. It serves as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning approaches.

9 papers | 0 benchmarks | Texts

UA-GEC (Grammatical Error Correction and Fluency Corpus for the Ukrainian Language)

UA-GEC is a grammatical error correction and fluency corpus for the Ukrainian language.

9 papers | 1 benchmark | Texts

28 GHz wireless channel dataset

Our dataset consists of multiple indoor and outdoor experiments for gNB-UE links of up to 30 m. In each experiment, we fixed the location of the gNB and moved the UE in increments of roughly one degree. The dataset's accompanying table specifies the direction of user movement with respect to the gNB-UE link, the distance resolution, and the number of user locations for which we conducted channel measurements. The outdoor 30 m data also contains blockage between 3.9 m and 4.8 m. At each location, we scanned the transmission beam and collected data for each beam. This yields the full OFDM channels for different locations along the moving trajectory with all the beam angles. Moreover, we used 240 kHz subcarrier spacing, which is consistent with the 5G NR numerology at FR2, so the collected data reflects what a 5G UE would see.

9 papers | 0 benchmarks | Environment, Images, Texts

MOD (Meme incorporated Open-domain Dialogue)

MOD is a large-scale open-domain multimodal dialogue dataset incorporating abundant Internet memes into utterances. The dataset consists of ∼45K Chinese conversations with ∼606K utterances. Each conversation contains about 13 utterances with about 4 Internet memes on average and each utterance equipped with an Internet meme is annotated with the corresponding emotion.

9 papers | 0 benchmarks | Images, Texts

VISUELLE

VISUELLE is a repository built upon the data of a real fast fashion company, Nunalie, and is composed of 5,577 new products and about 45M sales related to fashion seasons from 2016 to 2019. Each product in VISUELLE is equipped with multimodal information: its image, textual metadata, sales after the first release date, and three related Google Trends series describing category, color, and fabric popularity.

9 papers | 4 benchmarks | Images, Texts, Time series

unarXive

A scholarly dataset with publications' full text, annotated in-text citations, and links to metadata.

9 papers | 0 benchmarks | Texts

ValueNet

We present a new large-scale human value dataset called ValueNet, which contains human attitudes on 21,374 text scenarios. The dataset is organized in ten dimensions that conform to the basic human value theory in intercultural research.

9 papers | 0 benchmarks | Texts
