Datasets

3,148 machine learning datasets

HumAID (Human-Annotated Disaster Incidents Data)

Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address such issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts to create labeled datasets. However, existing datasets are limited in several respects (e.g., small size, duplicate content) and are less suitable for supporting more advanced, data-hungry deep learning models.

8 papers · 0 benchmarks · Texts

RIMES (Reconnaissance & Indexation de données Manuscrites et de fac similÉS / Recognition & Indexing of handwritten documents & faxes)

The RIMES database (Reconnaissance et Indexation de données Manuscrites et de fac similÉS / Recognition and Indexing of handwritten documents and faxes) was created to evaluate automatic systems for the recognition and indexing of handwritten letters. Of particular interest are letters such as those sent by postal mail or fax by individuals to companies or administrations.

8 papers · 0 benchmarks · Images, Texts

ManyTypes4Py

ManyTypes4Py is a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation and test sets by files. To extract type information from abstract syntax trees (ASTs), a lightweight static analyzer pipeline was developed and is provided with the dataset. Using this pipeline, the collected Python projects were analyzed and the results of the AST analysis were stored in JSON-formatted files.
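As a rough illustration of the kind of AST analysis described above, the following minimal sketch collects type annotations from Python source and serializes them as JSON, using only the standard `ast` and `json` modules. It is not the ManyTypes4Py pipeline itself: the function name `extract_annotations`, the output layout, and the sample source are assumptions made for illustration, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
import json


def extract_annotations(source: str) -> dict:
    """Collect parameter, return, and variable type annotations from source code.

    Hypothetical helper for illustration; the output layout is an assumption.
    """
    tree = ast.parse(source)
    result = {"functions": {}, "variables": {}}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Gather positional-only, regular, and keyword-only parameters.
            args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            params = {
                a.arg: ast.unparse(a.annotation)
                for a in args
                if a.annotation is not None
            }
            returns = ast.unparse(node.returns) if node.returns is not None else None
            result["functions"][node.name] = {"params": params, "returns": returns}
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            # Annotated assignments such as `count: int = 0`.
            result["variables"][node.target.id] = ast.unparse(node.annotation)
    return result


# Made-up sample input, not taken from the dataset.
SAMPLE = '''
def scale(x: int, factor: float = 2.0) -> float:
    return x * factor

count: int = 0
'''

report = extract_annotations(SAMPLE)
print(json.dumps(report, indent=2))
```

Walking the whole tree with `ast.walk` keeps the sketch short, at the cost of ignoring scoping: same-named functions in different classes would overwrite each other, which a real analyzer would need to handle.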

8 papers · 0 benchmarks · Texts

GermanQuAD

GermanQuAD is a Question Answering (QA) dataset of 13,722 extractive question/answer pairs in German.

8 papers · 0 benchmarks · Texts

CHIP-STS (Semantic Textual Similarity Dataset)

CHIP Semantic Textual Similarity, a dataset for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting, is used for the CHIP-STS task. Specifically, the task targets transfer learning between disease types on Chinese disease question-and-answer data. Given question pairs related to 5 different diseases (the disease types in the training and test sets differ), the task is to determine whether the two sentences are semantically similar.

8 papers · 3 benchmarks · Texts

CHIP-CDN (Clinical Diagnosis Normalization Dataset)

CHIP Clinical Diagnosis Normalization, a dataset that aims to standardize the terms from the final diagnoses of Chinese electronic medical records, is used for the CHIP-CDN task. Given the original phrase, the task is to normalize it to standard terminology based on the International Classification of Diseases (ICD-10) standard for Beijing Clinical Edition v601.

8 papers · 0 benchmarks · Texts

DocNLI

DocNLI is a large-scale dataset for document-level NLI. DocNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay at document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. Additionally, DocNLI contains few of the artifacts that are unfortunately widespread in some popular sentence-level NLI datasets.

8 papers · 0 benchmarks · Texts

RUSSE (Russian Words in Context (based on RUSSE))

WiC: The Word-in-Context Dataset. A reliable benchmark for the evaluation of context-sensitive word embeddings.

8 papers · 1 benchmark · Texts

Who’s Waldo

Who's Waldo is a dataset of 270K image–caption pairs depicting interactions of people, automatically mined from Wikimedia Commons. It is a benchmark dataset for person-centric visual grounding, the problem of linking people named in a caption to people pictured in an image.

8 papers · 1 benchmark · Images, Texts

SemEval-2021 Task-11

NLPContributionGraph was introduced for the first time as Task 11 at SemEval 2021. The task is defined on a dataset of Natural Language Processing (NLP) scholarly articles with their contributions structured to be integrable within Knowledge Graph infrastructures such as the Open Research Knowledge Graph. The structured contribution annotations are provided as (1) Contribution sentences: a set of sentences about the contribution in the article; (2) Scientific terms and relations: a set of scientific terms and relational cue phrases extracted from the contribution sentences; and (3) Triples: semantic statements that pair scientific terms with a relation, modeled toward subject-predicate-object RDF statements for KG building. The Triples are organized under three (mandatory) or more of twelve total information units (viz., ResearchProblem, Approach, Model, Code, Dataset, ExperimentalSetup, Hyperparameters, Baselines, Results, Tasks, Experiments, and AblationAnalysis).

8 papers · 0 benchmarks · Texts

WildReceipt

WildReceipt is a collection of receipts. For each photo, it contains a list of OCR annotations, each with a bounding box, text, and class.

8 papers · 0 benchmarks · Images, Texts

CMU Movie Summary Corpus

Dataset [46 M] and readme: 42,306 movie plot summaries extracted from Wikipedia, plus aligned metadata extracted from Freebase, including (1) movie box office revenue, genre, release date, runtime, and language; and (2) character names and aligned information about the actors who portray them, including gender and estimated age at the time of the movie's release.

Supplement: Stanford CoreNLP-processed summaries [628 M]. All of the plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref).

8 papers · 0 benchmarks · Texts

RRS (Restoration-200k for Response Selection)

|           | Train | Validation | Test    | Ranking Test |
| --------- | ----- | ---------- | ------- | ------------ |
| size      | 0.4M  | 50K        | 5K      | 800          |
| pos:neg   | 1:1   | 1:9        | 1.2:8.8 | -            |
| avg turns | 5.0   | 5.0        | 5.0     | 5.0          |

8 papers · 6 benchmarks · Texts

GVFC (Gun Violence Frame Corpus)

This is a new dataset of news headlines and their frames related to the issue of gun violence in the United States. The Gun Violence Frame Corpus (GVFC) was curated and annotated by journalism and communication experts. The articles in this dataset are drawn from a sample of news articles from a list of 30 top U.S. news websites, ranked by traffic to the websites, and were collected from four time periods over the course of 2018 in order to capture a diversity of articles.

8 papers · 0 benchmarks · Images, Texts

SoundDescs

We introduce a new audio dataset called SoundDescs that can be used for tasks such as text-to-audio retrieval and audio captioning. This dataset contains 32,979 pairs of audio files and text descriptions. SoundDescs covers 23 categories, including nature, clocks, and fire.

8 papers · 2 benchmarks · Audio, Texts

QA2D (Question to Declarative Sentence (QA2D) Dataset)

The Question to Declarative Sentence (QA2D) Dataset contains 86k question-answer pairs and their manual transformation into declarative sentences. 95% of the question-answer pairs come from SQuAD (Rajpurkar et al., 2016) and the remaining 5% come from four other question answering datasets.

8 papers · 0 benchmarks · Texts

CLEVR-Math

CLEVR-Math is a multi-modal dataset of simple math word problems involving addition and subtraction, each represented partly by a textual description and partly by an image illustrating the scenario. These word problems require a combination of language, visual and mathematical reasoning.

8 papers · 0 benchmarks · Images, Texts

EUR-Lex-Sum

EUR-Lex-Sum is a dataset for cross-lingual summarization. It is based on manually curated document summaries of legal acts from the European Union law platform. Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. The dataset contains up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.

8 papers · 0 benchmarks · Texts

BB (Bacteria Biotope)

The Bacteria Biotope (BB) Task is part of the BioNLP Open Shared Tasks and meets the BioNLP-OST standards of quality, originality and data formats. Manually annotated data is provided for training, development and evaluation of information extraction methods. Tools for the detailed evaluation of system outputs are available. Support for linguistic processing is provided in the form of analyses created by various state-of-the-art tools on the dataset texts.

8 papers · 0 benchmarks · Texts

WiRe57

We manually performed the task of Open Information Extraction on 5 short documents, elaborating tentative guidelines for the task, and resulting in a ground truth reference of 347 tuples. [section 1]

8 papers · 1 benchmark · Texts
