Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

Phishing and Benign Websites

An annotated dataset of 38,800 phishing and benign websites.

2 papers · 0 benchmarks · Texts

ProSLU (Profile-based Spoken Language Understanding)

ProSLU (Profile-based Spoken Language Understanding) is a task and dataset in which a model must rely not only on the utterance text but also on supporting profile information. The Chinese, human-annotated dataset contains over 5K utterances annotated with intents and slots, together with corresponding supporting profile information of three types: (1) a Knowledge Graph (KG) consisting of entities with rich attributes, (2) a User Profile (UP) composed of user settings and information, and (3) Context Awareness (CA) capturing user state and environmental information. (An illustrative record sketch follows this entry.)

2 papers · 2 benchmarks · Texts
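To make the ProSLU setup above concrete, here is a minimal sketch of what one example might look like; the field names, utterance, and profile values are invented for illustration and are not taken from the actual (Chinese) dataset.

```python
# Hypothetical shape of a ProSLU example: an utterance annotated with an
# intent and slots, plus the three kinds of supporting profile information.
# All field names and values here are invented for illustration.
example = {
    "utterance": "Play my favourite song",
    "intent": "PlayMusic",
    "slots": {"music_item": "song"},
    "profile": {
        "KG": {"favourite_song": {"title": "...", "artist": "..."}},  # Knowledge Graph
        "UP": {"preferred_app": "music_player"},                      # User Profile
        "CA": {"location": "home", "movement_state": "still"},        # Context Awareness
    },
}
print(example["intent"])  # PlayMusic
```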

PerCQA

PerCQA is the first Persian dataset for Community Question Answering (CQA). It contains questions and answers crawled from the most well-known Persian forum.

2 papers · 0 benchmarks · Texts

TempQA-WD

TempQA-WD is a benchmark dataset for temporal reasoning, designed to encourage research on extending present approaches to a more challenging set of complex reasoning tasks. Specifically, it is a temporal question answering dataset with the following advantages: (a) it is based on Wikidata, the most frequently curated, openly available knowledge base; (b) it includes intermediate SPARQL queries to facilitate the evaluation of semantic-parsing-based approaches to KBQA; and (c) it generalizes to multiple knowledge bases: Freebase and Wikidata. (An example query sketch follows this entry.)

2 papers · 1 benchmark · Texts
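As a flavour of the intermediate queries TempQA-WD pairs with its questions, here is a sketch of a temporal SPARQL query run against the public Wikidata endpoint; the query is an invented example, not one taken from the dataset, and it assumes the endpoint at https://query.wikidata.org/sparql.

```python
# Sketch: a temporal question ("Who became US president in 1961?")
# expressed as a Wikidata SPARQL query. The query is invented for
# illustration and is not drawn from TempQA-WD itself.
import requests

query = """
SELECT ?president ?presidentLabel WHERE {
  ?president p:P39 ?stmt .
  ?stmt ps:P39 wd:Q11696 ;        # position held: President of the USA
        pq:P580 ?start .          # start-date qualifier
  FILTER(YEAR(?start) = 1961)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["presidentLabel"]["value"])  # expected: John F. Kennedy
```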

RuMedBench

RuMedBench is a benchmark dataset for Russian medical language understanding.

2 papers · 0 benchmarks · Texts

KazNERD

KazNERD is a dataset for Kazakh named entity recognition. It was built to address the clear need for publicly available annotated corpora in Kazakh, as well as for annotation guidelines containing straightforward but rigorous rules and examples. The annotation, based on the IOB2 scheme, was carried out on television news text by two native Kazakh speakers under the supervision of the first author. The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes. (A tagging sketch follows this entry.)

2 papers · 0 benchmarks · Texts
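To illustrate the IOB2 scheme used by KazNERD above: each token receives B-<class> if it begins an entity, I-<class> if it continues one, and O otherwise. The sentence and entity class below are invented for illustration and are not drawn from the corpus.

```python
# Minimal IOB2 illustration; tokens and the GPE class are invented,
# not taken from KazNERD.
tokens = ["Astana", "is", "the", "capital", "of", "Kazakhstan", "."]
tags   = ["B-GPE",  "O",  "O",   "O",       "O",  "B-GPE",      "O"]

def extract_entities(tokens, tags):
    """Collect (entity_text, entity_class) spans from IOB2 tags."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # a new entity begins
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)          # the entity continues
        else:                              # outside any entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tokens, tags))
# [('Astana', 'GPE'), ('Kazakhstan', 'GPE')]
```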

AWARE (AWARE: Aspect-Based Sentiment Analysis Dataset of Apps Reviews for Requirements Elicitation)

The peer-reviewed paper describing the AWARE dataset was published at ASEW 2021 and can be accessed at http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using the AWARE dataset.

2 papers · 3 benchmarks · Texts

NPSC (Norwegian Parliamentary Speech Corpus)

The Norwegian Parliamentary Speech Corpus (NPSC) is a speech corpus made by the Norwegian Language Bank at the National Library of Norway in 2019-2021. The NPSC consists of recordings of speech from Stortinget, the Norwegian parliament, with corresponding orthographic transcriptions in Norwegian Bokmål and Norwegian Nynorsk. All transcriptions are done manually by trained linguists or philologists, and the manual transcriptions are subsequently proofread to ensure consistency and accuracy. Entire days of parliamentary meetings are transcribed in the dataset.

2 papers · 0 benchmarks · Speech, Texts

CoVaxLies v2

CoVaxLies v2 includes 47 Misinformation Targets (MisTs) about COVID-19 vaccines found on Twitter. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with their stance towards each MisT. This collection is a first step in providing large-scale resources for misinformation detection and misinformation stance identification. (An annotation sketch follows this entry.)

2 papers · 0 benchmarks · Texts
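A minimal sketch of the two-stage CoVaxLies v2 annotation described above: relevance first, stance second. The field names and the stance label values shown here are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of a two-stage annotation record: a tweet is first labelled
# Relevant / Not Relevant to a Misinformation Target (MisT), and only
# relevant tweets receive a stance label. Field names and the stance
# values ("Accept" / "Reject") are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TweetAnnotation:
    tweet_id: str
    mist: str                      # the Misinformation Target text
    relevant: bool                 # stage 1: relevance to the MisT
    stance: Optional[str] = None   # stage 2: set only when relevant

ann = TweetAnnotation(
    tweet_id="123",
    mist="The vaccine alters human DNA.",
    relevant=True,
    stance="Reject",
)
print(ann)
```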

Study data

Study data for "Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study". File descriptions:

  • commit_categorizations.csv: Categorizations for the commits in our dataset.
  • commits.csv: Information for the commits in our dataset.
  • datasets.csv: Names and descriptions of our datasets.
  • issue_categorizations.csv: Categorizations for the chosen issues from our dataset.
  • issues.csv: Information for the issues in our dataset.
  • pipeline_stages.csv: DL pipeline stages and their respective descriptions.
  • problem_categories.csv: Problem categories and their respective descriptions.
  • problem_causes.csv: Problem causes and their respective descriptions.
  • problem_fixes.csv: Problem fixes and their respective descriptions.
  • problem_symptoms.csv: Problem symptoms and their respective descriptions.
  • studied_subjects_commits.csv: Project data for commits.
  • studied_subjects_issues.csv: Project data for issues.

(A loading sketch follows this entry.)

2 papers · 0 benchmarks · Texts
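As a rough sketch of how one might inspect the CSV files listed above with pandas, assuming they sit in the current working directory; their column layouts are not documented in this entry, so the code only reports what it finds.

```python
# Sketch: load each study CSV and report its shape and columns.
# Assumes the files listed above are in the current directory.
from pathlib import Path

import pandas as pd

for path in sorted(Path(".").glob("*.csv")):
    df = pd.read_csv(path)
    print(f"{path.name}: {len(df)} rows, columns = {list(df.columns)}")
```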

Iconary

The Iconary dataset is for testing multimodal communication with drawings and text.

2 papers · 0 benchmarks · Images, Texts

K-SportsSum

K-SportsSum is a sports game summarization dataset with two characteristics: (1) K-SportsSum collects a large amount of data from a large number of games, totalling 7,854 commentary-news pairs, and employs a manual cleaning process to improve quality; (2) unlike existing datasets, to narrow the knowledge gap, K-SportsSum additionally provides a large-scale knowledge corpus containing information on 523 sports teams and 14,724 sports players.

2 papers · 0 benchmarks · Texts

PET: A new Dataset for Process Extraction from Natural Language Text

The dataset contains 45 documents with narrative descriptions of business processes and their annotations, covering activities, gateways, actors, and flow information.

2 papers · 0 benchmarks · Texts

AWS Documentation

The AWS Documentation corpus is an open-book QA dataset containing 25,175 documents along with 100 matched questions and answers. The questions are inspired by the authors' interactions with real AWS customers and the questions they asked about AWS services; the data was anonymized and aggregated. All questions in the dataset have a valid, factual, and unambiguous answer within the accompanying documents; questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly requests for factual information were deliberately avoided. All questions, answers, and accompanying documents in the dataset were annotated by the authors. There are two types of answers: text and yes-no-none (YNN). Text answers range from a few words to a full paragraph, sourced either from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. YNN answers can be yes, no, or none, depending on the question. (A record-layout sketch follows this entry.)

2 papers · 0 benchmarks · Texts
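A sketch of one plausible record layout for an open-book QA pair with both answer types described above; the field names and the example content are hypothetical, not the dataset's actual schema.

```python
# Hypothetical record layout for an open-book QA pair with a text
# answer and a yes-no-none (YNN) answer. Field names and the example
# values are illustrative only, not the dataset's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAPair:
    question: str
    document_id: str            # which documentation page holds the answer
    text_answer: str            # every question has a matched text answer
    ynn_answer: Optional[str]   # "yes", "no", "none", or None if not applicable

example = QAPair(
    question="Does Amazon S3 replicate data across regions by default?",
    document_id="s3-faq",
    text_answer="S3 does not replicate objects across regions unless "
                "cross-region replication is configured.",
    ynn_answer="no",
)
print(example.ynn_answer)  # no
```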

SSD_PHONE (Sub-Slot Dialogue dataset phone domain)

SSD_PHONE is the phone-domain portion of the SSD (Sub-Slot Dialog) dataset, released with the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".

2 papers · 0 benchmarks · Texts

TUSC (Tweets from US and Canada)

Tweets from US and Canada (TUSC) is a large dataset of more than 45 million geo-located tweets posted from the US and Canada between 2015 and 2021, specially curated for natural language analysis.

2 papers · 0 benchmarks · Texts

SMC Text Corpus

Contents (as of March 4, 2019): the text corpus contains running text from various freely licensed sources.

  • The whole content of Malayalam Wikipedia, extracted on January 1, 2019
  • News/articles from various sources (source mentioned in the respective files): 251 MB; 860,159 lines; 9,815,533 words; 101,111,885 characters

2 papers · 0 benchmarks · Texts

PeerSum

PeerSum is a multi-document summarization (MDS) dataset built from peer reviews of scientific publications. It differs from existing MDS datasets in that the summaries (i.e., the meta-reviews) are highly abstractive and are real summaries of the source documents.

2 papers · 0 benchmarks · Texts

PETCI (PETCI: A Parallel English Translation Dataset of Chinese Idioms)

PETCI is a Parallel English Translation dataset of Chinese Idioms, collected from an idiom dictionary and from Google and DeepL translations. PETCI contains 4,310 Chinese idioms with 29,936 English translations, which capture diverse translation errors and paraphrase strategies.

2 papers · 0 benchmarks · Texts

Biographical (Biographical: A Semi-Supervised Relation Extraction Dataset)

Biographical is a semi-supervised dataset for relation extraction (RE). The dataset, aimed at digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.

2 papers · 0 benchmarks · Graphs, Texts
Page 92 of 158