Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

TAC 2010

TAC 2010 is a dataset for summarization that consists of 44 topics, each associated with a set of 10 documents. The test topics are divided into five categories: Accidents and Natural Disasters, Attacks, Health and Safety, Endangered Resources, and Investigations and Trials.

4 papers · 0 benchmarks · Texts

DukeMTMC-attribute

The images in the DukeMTMC-attribute dataset come from Duke University. The dataset contains 1,812 identities and 34,183 annotated bounding boxes, with 702 identities for training and 1,110 for testing, corresponding to 16,522 and 17,661 images respectively. Attributes are annotated at the identity level, and every image is annotated with 23 attributes.

4 papers · 2 benchmarks · Images, Texts, Videos

DSTC7 Task 2 (Dialog System Technology Challenges Task 2)

DSTC7 Task 2 is a dataset and task for end-to-end conversation modeling. The goal is to generate conversational responses that go beyond trivial chitchat by injecting informative responses grounded in external knowledge. The data consist of conversations from Reddit, paired with contextually relevant "facts" taken from the website that started the Reddit conversation. That is, the setup is grounded: each conversation in the data is about a specific web page that was linked at the start of the conversation.

4 papers · 0 benchmarks · Texts

AndroidHowTo

AndroidHowTo contains 32,436 data points from 9,893 unique How-To instructions, split into training (8K), validation (1K), and test (900) sets. All test examples have perfect agreement across all three annotators for the entire sequence. In total, 190K operation spans, 172K object spans, and 321 input spans are labeled. The instructions range from 19 to 85 tokens in length, with a median of 59, and describe sequences of one to 19 steps, with a median of 5.

4 papers · 0 benchmarks · Texts
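To make the span annotations concrete, here is a minimal sketch of what an operation/object/input span labeling over a tokenized instruction could look like; the example instruction, token indices, and field names are invented for illustration and are not an actual dataset record or its real schema.

```python
# Hypothetical span-annotated instruction, roughly in the spirit of
# AndroidHowTo's operation / object / input spans. The instruction,
# indices, and field names are invented for illustration only.
instruction = ["tap", "settings", "then", "type", "john", "in", "the", "name", "field"]

annotation = {
    "operation_spans": [(0, 1), (3, 4)],  # "tap", "type"
    "object_spans":    [(1, 2), (7, 9)],  # "settings", "name field"
    "input_spans":     [(4, 5)],          # "john"
}

for kind, spans in annotation.items():
    print(kind, [" ".join(instruction[start:end]) for start, end in spans])
```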

CODA-19

CODA-19 is a human-annotated dataset that labels the Background, Purpose, Method, Finding/Contribution, and Other segments of 10,966 English abstracts in the COVID-19 Open Research Dataset.

4 papers · 0 benchmarks · Medical, Texts

RWWD (Real World Worry Dataset)

The Real World Worry Dataset (RWWD) captures the emotional responses of UK residents to COVID-19 at a point in time when the situation affected the lives of all individuals in the UK. The data were collected on the 6th and 7th of April 2020, when the UK was under lockdown and death tolls were increasing. On April 6, 5,373 people in the UK had died of the virus and 51,608 had tested positive. On the day before data collection, the Queen addressed the nation via a television broadcast, and it was announced that Prime Minister Boris Johnson had been admitted to intensive care with COVID-19 symptoms.

4 papers · 0 benchmarks · Texts

Microsoft Research Multimodal Aligned Recipe Corpus

To construct the Microsoft Research Multimodal Aligned Recipe Corpus, the authors first extract a large number of text and video recipes from the web. The goal is to find joint alignments between multiple text recipes and multiple video recipes for the same dish. The task is challenging, as different recipes vary in their order of instructions and use of ingredients. Moreover, video instructions can be noisy, and text and video instructions include different levels of specificity in their descriptions.

4 papers · 0 benchmarks · Texts

MultiSense

MultiSense is a dataset of 9,504 images annotated with an English verb and its translation in Spanish and German.

4 papers · 0 benchmarks · Images, Texts

PASTEL

PASTEL is a parallelly annotated stylistic language dataset. The dataset consists of ~41K parallel sentences and 8.3K parallel stories annotated across different personas.

4 papers · 0 benchmarks · Texts

OLPBENCH

OLPBENCH is a large Open Link Prediction benchmark, which was derived from the state-of-the-art Open Information Extraction corpus OPIEC (Gashteovski et al., 2019). OLPBENCH contains 30M open triples, 1M distinct open relations and 2.5M distinct mentions of approximately 800K entities.

4 papers · 0 benchmarks · Texts
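For readers unfamiliar with the open setting, the sketch below illustrates the general shape of an open triple: surface-form subject and object mentions joined by an uncanonicalized relation phrase rather than knowledge-graph identifiers. The example triple and field names are illustrative assumptions, not records from OLPBENCH.

```python
from typing import NamedTuple

class OpenTriple(NamedTuple):
    """An open triple: surface-form text mentions instead of canonical KG ids."""
    subject_mention: str  # surface form of the subject entity
    open_relation: str    # uncanonicalized relation phrase
    object_mention: str   # surface form of the object entity

# Invented example; OLPBENCH triples follow this general shape, with
# ~30M triples, ~1M distinct relation phrases and ~2.5M entity mentions.
triple = OpenTriple("marie curie", "was awarded", "the nobel prize in physics")

# Open link prediction: rank candidate object mentions for
# (subject_mention, open_relation, ?) without entity disambiguation.
query = (triple.subject_mention, triple.open_relation, "?")
print(query)
```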

STAIR Captions

STAIR Captions is a large-scale dataset containing 820,310 Japanese captions. This dataset can be used for caption generation, multimodal retrieval, and image generation.

4 papers · 0 benchmarks · Images, Texts

VideoNavQA

The VideoNavQA dataset contains pairs of questions and videos generated in the House3D environment. The goal of this dataset is to assess question-answering performance from nearly-ideal navigation paths, while considering a much more complete variety of questions than current instantiations of the Embodied Question Answering (EQA) task.

4 papers · 0 benchmarks · Texts, Videos

ArCOV19-Rumors

ArCOV19-Rumors is an Arabic COVID-19 Twitter dataset for misinformation detection, composed of tweets containing claims posted from 27 January 2020 until the end of April 2020.

4 papers · 0 benchmarks · Texts

CSPubSum

CSPubSum is a dataset for summarisation of computer science publications, created by exploiting a large resource of author-provided summaries; the authors also show straightforward ways of extending it further.

4 papers · 0 benchmarks · Texts

Danbooru2020

Danbooru2020 is a large-scale anime image database with over 4.2 million images annotated with more than 130 million text tags describing image contents in detail; it can be useful for machine learning purposes such as image recognition and generation. It has been applied to a wide variety of applications, particularly generative modeling.

4 papers · 0 benchmarks · Images, Texts

EMU (Edited Media Understanding)

EMU is a dataset of 48K question-answer pairs written in rich natural language.

4 papers · 0 benchmarks · Images, Texts

Finer (Finnish News Corpus for Named Entity Recognition)

Finnish News Corpus for Named Entity Recognition (Finer) is a corpus that consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source.

4 papers · 0 benchmarks · Texts

Gazeta

Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form the training, validation, and test sets, these pairs were sorted by time; the first 52,400 pairs are used as the training set, the following 5,265 pairs as the validation set, and the remaining 5,770 pairs as the test set.

4 papers · 5 benchmarks · Texts
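A minimal sketch of the chronological split described above, using synthetic records in place of the real data; the (date, text, summary) layout is an assumption for illustration, not Gazeta's actual schema.

```python
from datetime import date, timedelta

# Synthetic stand-in for the 63,435 (date, text, summary) records;
# the field layout is illustrative, not Gazeta's actual schema.
pairs = [(date(2010, 1, 1) + timedelta(days=i // 10), f"text {i}", f"summary {i}")
         for i in range(63_435)]

# Chronological split as described above: sort by time, then slice.
pairs.sort(key=lambda p: p[0])
train = pairs[:52_400]        # earliest 52,400 pairs
val = pairs[52_400:57_665]    # next 5,265 pairs
test = pairs[57_665:]         # remaining 5,770 pairs

assert len(train) == 52_400 and len(val) == 5_265 and len(test) == 5_770
```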

GeoCoV19

GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets. The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. The annotations take toponyms from the user location field and the tweet content and resolve them to geolocations at the country, state, or city level; 297 million tweets are annotated with a geolocation using the user location field and 452 million tweets using the tweet content.

4 papers · 0 benchmarks · Texts

GGPONC (German Guideline Program in Oncology NLP Corpus)

German Guideline Program in Oncology NLP Corpus (GGPONC) is a German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions.

4 papers · 0 benchmarks · Texts
Page 66 of 158