Datasets

3,148 machine learning datasets

CUGE

CUGE is a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) Hierarchical benchmark framework, where datasets are principally selected and organized with a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy, where different levels of model performance are provided based on the hierarchical framework.

4 papers · 0 benchmarks · Texts

DAGW (Danish Gigaword)

It’s hard to develop good tools for processing Danish with computers when no large and wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. The general goals are to create a dataset that is:

4 papers · 0 benchmarks · Texts

JaQuAD

JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.

4 papers · 2 benchmarks · Texts

PCFG SET (Probabilistic Context Free Grammar String Edit Task)

The Probabilistic Context Free Grammar String Edit Task (PCFG SET) dataset consists of sequence-to-sequence problems specifically designed to test different aspects of compositional generalisation. In particular, the dataset contains splits to test for systematicity, productivity, substitutivity, localism and overgeneralisation.

4 papers · 0 benchmarks · Texts

VizWiz-VQA-Grounding

The VizWiz-VQA-Grounding dataset is a dataset that visually grounds answers to visual questions asked by people with visual impairments.

4 papers · 0 benchmarks · Images, Texts

RR (Review-Rebuttal)

The Review-Rebuttal (RR) dataset was introduced to facilitate the study of argument pair extraction in the peer review and rebuttal domain.

4 papers · 3 benchmarks · Texts

i2b2 De-identification Dataset (Informatics for Integrating Biology and the Bedside (i2b2) Project — De-identification Dataset)

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

4 papers · 2 benchmarks · Texts

VISUELLE2.0

Visuelle 2.0 is a dataset containing real data for 5355 clothing products from the Italian fast-fashion retailer Nuna Lie. Specifically, Visuelle 2.0 provides data from 6 fashion seasons (partitioned into Autumn-Winter and Spring-Summer) from 2017-2019, right before the Covid-19 pandemic. Each product is accompanied by an HD image, textual tags and more. The time series data are disaggregated at the shop level, and include the sales, inventory stock, max-normalized prices (for the sake of confidentiality) and discounts. Exogenous time series data is also provided, in the form of Google Trends based on the textual tags and multivariate weather conditions of the stores’ locations. Finally, we also provide purchase data for 667K customers whose identity has been anonymized, to capture personal preferences. With these data, Visuelle 2.0 makes it possible to address several problems that characterize the activity of a fast fashion company: new product demand forecasting, short-observation new pr

4 papers · 4 benchmarks · Images, Texts, Time series
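The "max-normalized prices" mentioned in the Visuelle 2.0 description can be illustrated with a minimal sketch. It assumes that normalization means dividing each price in a series by that series' maximum; the listing does not say whether the dataset normalizes per product, per shop, or globally, and the function name `max_normalize` is hypothetical.

```python
from typing import List


def max_normalize(prices: List[float]) -> List[float]:
    """Scale a price series so that its largest value becomes 1.0.

    Assumption: normalization is applied per series. The dataset's
    actual grouping (per product, per shop, global) is not stated here.
    """
    if not prices:
        return []
    peak = max(prices)
    if peak <= 0:
        raise ValueError("max-normalization needs a positive maximum price")
    # Relative price levels are preserved; absolute prices are hidden.
    return [p / peak for p in prices]


if __name__ == "__main__":
    print(max_normalize([10.0, 20.0, 40.0]))  # [0.25, 0.5, 1.0]
```

Normalizing this way keeps the shape of the price curve, including discounts relative to the peak, while withholding the confidential absolute values.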

SYMON (Synopses of Movie Narratives)

SyMoN contains 5,193 video summaries of popular movies and TV series. It captures naturalistic storytelling videos made by human creators for human audiences, and has higher story coverage and more frequent mental-state references than similar video-language story datasets.

4 papers · 0 benchmarks · Texts, Videos

TemporalWiki

TemporalWiki is a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark hence allows researchers to periodically track an LM's ability to retain previous knowledge and acquire updated/new knowledge at each point in time.

4 papers · 0 benchmarks · Texts
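The idea of using the difference between consecutive snapshots, as described for TemporalWiki, can be sketched as follows. This is an illustrative toy, not the benchmark's actual pipeline: snapshots are assumed to be plain dictionaries mapping an article title to its text, and the function name `snapshot_diff` is hypothetical.

```python
from typing import Dict, Set, Tuple


def snapshot_diff(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str], Set[str]]:
    """Split a new snapshot into added, changed, and unchanged entries.

    Assumption: each snapshot maps an article title to its full text.
    """
    # Entries that did not exist in the previous snapshot.
    added = {title: text for title, text in new.items() if title not in old}
    # Entries that existed before but whose text differs now.
    changed = {
        title: text
        for title, text in new.items()
        if title in old and old[title] != text
    }
    # Entries carried over verbatim, useful for checking retained knowledge.
    unchanged = {title for title, text in new.items() if old.get(title) == text}
    return added, changed, unchanged


if __name__ == "__main__":
    january = {"A": "x", "B": "y"}
    february = {"A": "x", "B": "z", "C": "w"}
    added, changed, unchanged = snapshot_diff(january, february)
    print(added, changed, unchanged)  # {'C': 'w'} {'B': 'z'} {'A'}
```

In this toy setup the added and changed entries stand in for updated or new knowledge, while the unchanged entries stand in for knowledge the model should retain.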

D3 (DBLP Discovery Dataset)

DBLP is the largest open-access repository of scientific articles on computer science and provides metadata associated with publications, authors, and venues. We retrieved more than 6 million publications from DBLP and extracted pertinent metadata (e.g., abstracts, author affiliations, citations) from the publication texts to create the DBLP Discovery Dataset (D3). D3 can be used to identify trends in research activity, productivity, focus, bias, accessibility, and impact of computer science research.

4 papers · 0 benchmarks · Texts

CoVERT (A Corpus of Fact-checked Biomedical COVID-19 Tweets)

CoVERT is a fact-checked corpus of tweets with a focus on the domain of biomedicine and COVID-19-related (mis)information. The corpus consists of 300 tweets, each annotated with medical named entities and relations. It employs a novel crowdsourcing methodology to annotate all tweets with fact-checking labels and supporting evidence, which crowdworkers search for online. This methodology results in moderate inter-annotator agreement.

4 papers · 0 benchmarks · Biomedical, Texts

CiteSum

CiteSum is a large-scale scientific extreme summarization benchmark.

4 papers · 3 benchmarks · Texts

FiNER-139

FiNER-139 comprises 1.1M sentences annotated with eXtensible Business Reporting Language (XBRL) tags extracted from annual and quarterly reports of publicly traded companies in the US. Unlike other entity extraction tasks, such as named entity recognition (NER) or contract element extraction, which typically require identifying entities of a small set of common types (e.g., persons, organizations), FiNER-139 uses a much larger label set of 139 entity types. Another important difference from typical entity extraction is that FiNER focuses on numeric tokens, with the correct tag depending mostly on context, not the token itself.

4 papers · 0 benchmarks · Texts

Jigsaw Toxic Comment Classification Dataset

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

4 papers · 2 benchmarks · Texts

Chilean Waiting List

The Chilean Waiting List corpus comprises de-identified referrals from the waiting list in Chilean public hospitals. A subset of 10,000 referrals (including medical and dental notes) was manually annotated with ten clinically relevant entity types, keeping 1,000 annotations for a future shared task. A trained medical doctor or dentist annotated these referrals and then, together with three other researchers, consolidated each of the annotations. More than 48% of the entities in the annotated corpus are embedded in, or contain, another entity. This corpus can be a useful resource for building new models for nested Named Entity Recognition (NER). This work constitutes the first annotated corpus using clinical narratives from Chile and one of the few in Spanish.

4 papers · 1 benchmark · Texts

SV-Ident (Survey Variable Identification)

SV-Ident comprises 4,248 sentences from social science publications in English and German. It is the official dataset of the shared task “Survey Variable Identification in Social Science Publications” (SV-Ident) 2022. Sentences are labeled with variables that are mentioned either explicitly or implicitly.

4 papers · 4 benchmarks · Texts

FewSOL (A Dataset for Few-Shot Object Learning in Robotic Environments)

The Few-Shot Object Learning (FewSOL) dataset can be used for object recognition with a few images per object. It contains 336 real-world objects with 9 RGB-D images per object from different views. Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. The FewSOL dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and attribute recognition.

4 papers · 0 benchmarks · 6D, Images, RGB-D, Texts

KRAUTS (Korpus of newspapeR Articles with Underlined Temporal expressionS)

KRAUTS (Korpus of newspapeR Articles with Underlined Temporal expressionS) is a German temporally annotated news corpus accompanied by TimeML annotation guidelines for German. It was developed at Fondazione Bruno Kessler, Trento, Italy and at the Max Planck Institute for Informatics, Saarbrücken, Germany. Our goal is to boost temporal tagging research for German.

4 papers · 3 benchmarks · Texts

TimeBankPT (Portuguese TimeBank)

TimeBankPT is a corpus of Portuguese text with annotations about time. The annotation scheme used is similar to TimeML. TimeBankPT is the result of adapting the English corpus used in the first TempEval challenge to the Portuguese language.

4 papers · 3 benchmarks · Texts
