PyTorrent contains 218,814 Python package libraries from PyPI and the Anaconda environment. These sources were chosen because earlier studies have shown that much web-scraped code is redundant, whereas Python packages from these environments are of higher quality and well documented. PyTorrent enables users (such as data scientists and students) to build off-the-shelf machine learning models directly, without spending months of effort on large infrastructure.
7,672 human-written natural language navigation instructions for routes in OpenStreetMap, with a focus on visual landmarks. Validated in Street View.
|           | Train | Validation | Test    | Ranking Test |
| --------- | ----- | ---------- | ------- | ------------ |
| size      | 0.4M  | 50K        | 5K      | 800          |
| pos:neg   | 1:1   | 1:9        | 1.2:8.8 | -            |
| avg turns | 5.0   | 5.0        | 5.0     | 5.0          |
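The split statistics above give sizes and pos:neg ratios rather than raw class counts. As a minimal sketch (the helper name is illustrative, not part of any dataset API), the approximate counts can be derived like this:

```python
# Sketch: derive approximate class counts from a split's total size and
# its pos:neg ratio, e.g. train = 0.4M at 1:1, validation = 50K at 1:9.

def class_counts(total, pos, neg):
    """Split `total` examples according to a pos:neg ratio."""
    unit = total / (pos + neg)
    return round(unit * pos), round(unit * neg)

splits = {
    "train":      (400_000, 1.0, 1.0),
    "validation": (50_000,  1.0, 9.0),
    "test":       (5_000,   1.2, 8.8),
}

for name, (total, pos, neg) in splits.items():
    p, n = class_counts(total, pos, neg)
    print(f"{name}: ~{p} positive / ~{n} negative")
```

For example, the test split's 1.2:8.8 ratio over 5K examples works out to roughly 600 positives and 4,400 negatives.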
AMALGUM is a machine-annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated but small datasets and the larger but shallowly annotated corpora that are often scraped from the Web.
YACLC is a large-scale, multidimensionally annotated Chinese learner corpus. To construct it, the authors first obtained a large number of topic-rich texts written by Chinese as Foreign Language (CFL) learners, collecting and annotating 32,124 sentences from the lang-8 platform. Each sentence was annotated by 10 annotators. After post-processing, a total of 469,000 revised sentences were obtained.
The IUPUI-CSRC Pedestrian Situated Intent (PSI) benchmark dataset has two innovative labels besides comprehensive computer vision annotations. The first novel label is the dynamic intent change of pedestrians crossing in front of the ego-vehicle, obtained from 24 drivers with diverse backgrounds. The second is a text-based explanation of the driver's reasoning process when estimating pedestrian intents and predicting their behaviors during the interaction period.
The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both written in more complex language than those in Klexikon and significantly longer, which makes the pairing suitable for both summarization and simplification. Previous research has so far focused on only one of these tasks; they have not been comprehensively studied as a joint task.
Grep-BiasIR is a novel, thoroughly audited dataset that aims to facilitate the study of gender bias in the results retrieved by IR systems.
A corpus for offensive language and hate speech detection in Danish.
A dataset for evaluating text classification, domain adaptation, and active learning models. The dataset consists of 22,660 documents (tweets) collected in 2018 and 2019. It spans four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to one of 13 thousand fact-checked claims across dozens of topics, events, and domains, in 41 different languages, covering more than a decade.
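MuMiN's core structure is a heterogeneous graph linking tweets (and users, replies, etc.) to fact-checked claims. The sketch below illustrates that shape with plain Python dictionaries; the node kinds, edge relations, and attribute names are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative heterogeneous graph: a user posts a tweet, and the tweet
# is semantically linked to a fact-checked claim.
from collections import defaultdict

nodes = {
    "tweet:1": {"kind": "tweet", "lang": "en"},
    "claim:1": {"kind": "claim", "verdict": "misinformation"},
    "user:1":  {"kind": "user"},
}
edges = [
    ("user:1", "tweet:1", {"relation": "posted"}),
    ("tweet:1", "claim:1", {"relation": "discusses", "similarity": 0.87}),
]

# Build an undirected adjacency index so we can, e.g., gather all
# tweets linked to a given claim.
adj = defaultdict(list)
for u, v, attrs in edges:
    adj[u].append(v)
    adj[v].append(u)

linked_tweets = [n for n in adj["claim:1"] if nodes[n]["kind"] == "tweet"]
print(linked_tweets)  # ['tweet:1']
```

In practice a graph library (or the dataset's own tooling) would replace the dictionaries, but the tweet-to-claim linkage is the essential relation for misinformation classification.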
This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript, where the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:
Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).
A large-scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82K images). MMChat contains image-grounded dialogues collected from real conversations on social media. We manually annotate 100K dialogues from MMChat with the dialogue quality and whether the dialogues are related to the given image. We also provide the rule-filtered raw dialogues used to create MMChat (Rule Filtered Raw MMChat), containing 4.257M dialogue sessions and 4.874M images, as well as a version of MMChat filtered based on LCCC (LCCC Filtered MMChat), which contains much cleaner dialogues (492.6K dialogue sessions and 1.066M images).
HowMany-QA is an object counting dataset. It is taken from the counting-specific union of VQA 2.0 (Goyal et al., 2017) and Visual Genome QA (Krishna et al., 2016).
Persian Question Answering Dataset (PQuAD) is a crowdsourced reading comprehension dataset on Persian Wikipedia articles. It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable.
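PQuAD follows the SQuAD 2.0 style of reading comprehension data, so a plausible record layout (an assumption here, not the dataset's documented schema) marks unanswerable questions with an `is_impossible` flag. This sketch computes the unanswerable fraction over such records:

```python
# Toy example: count the fraction of questions flagged as unanswerable
# in a SQuAD-2.0-style record list. The field name `is_impossible` is
# assumed from the SQuAD 2.0 convention.
import json

records = json.loads("""
[
  {"question": "...", "is_impossible": false},
  {"question": "...", "is_impossible": true},
  {"question": "...", "is_impossible": false},
  {"question": "...", "is_impossible": false}
]
""")

unanswerable = sum(r["is_impossible"] for r in records)
print(f"{unanswerable / len(records):.0%} unanswerable")  # 25% unanswerable
```

On the full dataset the same count over all 80,000 questions should recover the stated 25% adversarially unanswerable share.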
This dataset was collected via the WinoGAViL game, which elicits challenging vision-and-language associations. Inspired by the popular card game Codenames, a “spymaster” gives a textual cue related to several visual candidates, and another player has to identify them.
The current OOD benchmark VQA-CP v2 considers only one type of shortcut (from question type to answer) and thus still cannot guarantee that the model relies on the intended solution rather than a solution specific to this shortcut. To overcome this limitation, VQA-VS proposes a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets. In addition, VQA-VS overcomes three troubling practices in the use of VQA-CP v2 (e.g., selecting models using OOD test sets) and further standardizes the OOD evaluation procedure. VQA-VS provides a more rigorous and comprehensive testbed for shortcut learning in VQA.