PyTorrent contains 218,814 Python package libraries from PyPI and the Anaconda environment. These sources were chosen because earlier studies have shown that much web-scraped code is redundant, whereas Python packages from these environments are of higher quality and well documented. PyTorrent enables users (such as data scientists and students) to build off-the-shelf machine learning models directly, without spending months of effort on large infrastructure.
7,672 human-written natural language navigation instructions for routes in OpenStreetMap, with a focus on visual landmarks. Validated in Street View.
|           | Train | Validation | Test    | Ranking Test |
| --------- | ----- | ---------- | ------- | ------------ |
| size      | 0.4M  | 50K        | 5K      | 800          |
| pos:neg   | 1:1   | 1:9        | 1.2:8.8 | -            |
| avg turns | 5.0   | 5.0        | 5.0     | 5.0          |
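The split statistics above give sizes and pos:neg ratios rather than raw class counts. As a minimal sketch (the helper name is illustrative, not part of any dataset API), the approximate counts can be derived like this:

```python
# Sketch: derive approximate class counts from a split's total size and
# its pos:neg ratio, e.g. train = 0.4M at 1:1, validation = 50K at 1:9.

def class_counts(total, pos, neg):
    """Split `total` examples according to a pos:neg ratio."""
    unit = total / (pos + neg)
    return round(unit * pos), round(unit * neg)

splits = {
    "train":      (400_000, 1.0, 1.0),
    "validation": (50_000,  1.0, 9.0),
    "test":       (5_000,   1.2, 8.8),
}

for name, (total, pos, neg) in splits.items():
    p, n = class_counts(total, pos, neg)
    print(f"{name}: ~{p} positive / ~{n} negative")
```

For example, the test split's 1.2:8.8 ratio over 5K examples works out to roughly 600 positives and 4,400 negatives.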
AMALGUM is a machine-annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated but small datasets and the larger but shallowly annotated corpora that are often scraped from the Web.
YACLC is a large-scale, multidimensionally annotated Chinese learner corpus. To construct it, the authors first obtained a large number of topic-rich texts written by Chinese as Foreign Language (CFL) learners, collecting and annotating 32,124 sentences from the lang-8 platform. Each sentence was annotated by 10 annotators. After post-processing, a total of 469,000 revised sentences were obtained.
The IUPUI-CSRC Pedestrian Situated Intent (PSI) benchmark dataset has two innovative labels besides comprehensive computer vision annotations. The first novel label is the dynamic intent change of pedestrians crossing in front of the ego-vehicle, obtained from 24 drivers with diverse backgrounds. The second is a text-based explanation of the driver's reasoning process when estimating pedestrian intents and predicting their behaviors during the interaction period.
The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both written in more complex language than those in Klexikon and significantly longer, which makes the pairing suitable for both summarization and simplification. Previous research has so far focused on only one of these tasks; they have not been comprehensively studied as a joint task.
Grep-BiasIR is a novel, thoroughly audited dataset that aims to facilitate the study of gender bias in the results retrieved by IR systems.
A corpus for offensive language and hate speech detection in Danish.
A dataset for evaluating text classification, domain adaptation, and active learning models. The dataset consists of 22,660 documents (tweets) collected in 2018 and 2019. It spans four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to one of 13 thousand fact-checked claims across dozens of topics, events, and domains, in 41 different languages, covering more than a decade.
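MuMiN's core structure is a heterogeneous graph linking tweets (and users, replies, etc.) to fact-checked claims. The sketch below illustrates that shape with plain Python dictionaries; the node kinds, edge relations, and attribute names are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative heterogeneous graph: a user posts a tweet, and the tweet
# is semantically linked to a fact-checked claim.
from collections import defaultdict

nodes = {
    "tweet:1": {"kind": "tweet", "lang": "en"},
    "claim:1": {"kind": "claim", "verdict": "misinformation"},
    "user:1":  {"kind": "user"},
}
edges = [
    ("user:1", "tweet:1", {"relation": "posted"}),
    ("tweet:1", "claim:1", {"relation": "discusses", "similarity": 0.87}),
]

# Build an undirected adjacency index so we can, e.g., gather all
# tweets linked to a given claim.
adj = defaultdict(list)
for u, v, attrs in edges:
    adj[u].append(v)
    adj[v].append(u)

linked_tweets = [n for n in adj["claim:1"] if nodes[n]["kind"] == "tweet"]
print(linked_tweets)  # ['tweet:1']
```

In practice a graph library (or the dataset's own tooling) would replace the dictionaries, but the tweet-to-claim linkage is the essential relation for misinformation classification.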
This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript, where the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:
Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).
A large-scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82K images). MMChat contains image-grounded dialogues collected from real conversations on social media. We manually annotate 100K dialogues from MMChat with the dialogue quality and whether the dialogues are related to the given image. We also provide the rule-filtered raw dialogues used to create MMChat (Rule Filtered Raw MMChat), containing 4.257M dialogue sessions and 4.874M images, as well as a version of MMChat filtered based on LCCC (LCCC Filtered MMChat), which contains much cleaner dialogues (492.6K dialogue sessions and 1.066M images).
HowMany-QA is an object counting dataset. It is taken from the counting-specific union of VQA 2.0 (Goyal et al., 2017) and Visual Genome QA (Krishna et al., 2016).
Persian Question Answering Dataset (PQuAD) is a crowdsourced reading comprehension dataset on Persian Wikipedia articles. It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable.
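PQuAD follows the SQuAD 2.0 style of reading comprehension data, so a plausible record layout (an assumption here, not the dataset's documented schema) marks unanswerable questions with an `is_impossible` flag. This sketch computes the unanswerable fraction over such records:

```python
# Toy example: count the fraction of questions flagged as unanswerable
# in a SQuAD-2.0-style record list. The field name `is_impossible` is
# assumed from the SQuAD 2.0 convention.
import json

records = json.loads("""
[
  {"question": "...", "is_impossible": false},
  {"question": "...", "is_impossible": true},
  {"question": "...", "is_impossible": false},
  {"question": "...", "is_impossible": false}
]
""")

unanswerable = sum(r["is_impossible"] for r in records)
print(f"{unanswerable / len(records):.0%} unanswerable")  # 25% unanswerable
```

On the full dataset the same count over all 80,000 questions should recover the stated 25% adversarially unanswerable share.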
This dataset was collected via the WinoGAViL game, which elicits challenging vision-and-language associations. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.
The current OOD benchmark VQA-CP v2 considers only one type of shortcut (from question type to answer) and thus still cannot guarantee that the model relies on the intended solution rather than a solution specific to this shortcut. To overcome this limitation, VQA-VS proposes a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets. In addition, VQA-VS overcomes three troubling practices in the use of VQA-CP v2 (e.g., selecting models using OOD test sets) and further standardizes the OOD evaluation procedure. VQA-VS provides a more rigorous and comprehensive testbed for shortcut learning in VQA.