3,148 machine learning datasets
A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct queries.
carecall is a Korean dialogue dataset for role-satisfying dialogue systems. It was built from a few samples of human-written dialogues via in-context few-shot learning with large-scale LMs: given a prompt consisting of a brief description of the chatbot’s properties and a few example dialogues, a large-scale LM can generate dialogues with a specific personality. The entire dataset was generated with this method.
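The few-shot generation method described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' actual pipeline: the persona text, example dialogues, and the `build_prompt` helper are all invented for demonstration, and the resulting prompt would be sent to a large LM to sample a new dialogue.

```python
# Hypothetical sketch of few-shot prompt construction for persona-conditioned
# dialogue generation: a brief description of the chatbot's properties plus a
# handful of example dialogues are concatenated into one prompt, and a large
# LM is asked to continue with a fresh dialogue in the same style.
# All names and texts here are illustrative assumptions.

def build_prompt(persona: str, example_dialogues: list[list[str]]) -> str:
    """Concatenate a persona description and example dialogues into a prompt."""
    parts = [persona, ""]
    for dialogue in example_dialogues:
        parts.extend(dialogue)
        parts.append("")  # blank line separating examples
    parts.append("User:")  # cue the LM to begin a new dialogue
    return "\n".join(parts)

persona = "The chatbot is a friendly care-call agent who checks in on elderly users."
examples = [
    ["User: Hello?", "Bot: Good morning! How are you feeling today?"],
    ["User: I slept badly.", "Bot: I'm sorry to hear that. Did something keep you up?"],
]

prompt = build_prompt(persona, examples)
# `prompt` would then be passed to a large LM's completion API (not shown),
# and the sampled continuation collected as a new synthetic dialogue.
```

The key design point is that only the persona description and the handful of seed dialogues are human-written; everything else in the dataset is model-generated from prompts of this shape.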
CAVES is the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets, each labelled with specific anti-vaccine concerns in a multi-label setting. It is also the first multi-label classification dataset to provide explanations for each of the labels, as well as class-wise summaries of all the tweets.
The dataset contains multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed from User-Generated Content collected on the Flickr social media platform in three global cities containing UNESCO World Heritage properties (Amsterdam, Suzhou, Venice). The data were collected to provide datasets that are both directly usable by the ML community as a test-bed and theoretically informative for heritage and urban scholars, supporting conclusions relevant to planning decision-making.
General Corpora for the Maltese Language.
This dataset was collected as part of the multidisciplinary project Femmes face aux défis de la transformation numérique : une étude de cas dans le secteur des assurances (Women Facing the Challenges of Digital Transformation: A Case Study in the Insurance Sector) at Université Laval, funded by the Future Skills Centre. It includes job offers, in French, from insurance companies between 2009 and 2020.
This is the replication data for the paper: "Crossing the Linguistic Causeway: A Binational Approach for Translating Soundscape Attributes to Bahasa Melayu".
This is an SDQC stance-annotated Reddit dataset for the Danish language, created as part of a thesis project. It consists of over 5,000 comments structured as comment trees and linked to 33 source posts.
Scene-focused, multi-modal, episodic data of the images and symbolic world-states seen by an agent completing a pogo-stick assembly task within a video game world. Classes consist of episodes with novel objects inserted. A subset of these novel objects can impact gameplay and agent behavior. Novelty objects can vary in size, position, and occlusion within the images. Usable for novelty detection, generalized category discovery, and class-imbalanced classification.
Contains a wide range of texts in Irish, including fiction, news reports, informative texts and official documents.
The GPI Corpus is a collection of 1,043 privacy laws, regulations, and guidelines ("GPIs") covering 182 jurisdictions around the world. These documents are provided in two file formats (i.e., PDF showing the original formatting on the source website and TXT containing just the text of the GPI) and, in some cases, in multiple languages (i.e., the original language(s) and an English translation).
This is a transactional data set containing all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. https://archive.ics.uci.edu/ml/datasets/online+retail
Full-text chemical identification and indexing in PubMed articles.
A set of Basque documents annotated with EusTimeML, a mark-up language for temporal information in Basque.
Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.
Mental health remains a significant public health challenge worldwide. With the increasing popularity of online platforms, many people use them to share their mental health conditions, express their feelings, and seek help from the community and counselors. Because posts vary in length, a short but informative summary allows counselors to process them quickly. To facilitate research on summarizing mental health posts, we introduce the Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from 43 mental health subreddits on Reddit, each paired with its short user-written English summary (called a TLDR).
The CodeQueries benchmark dataset consists of instances of semantic queries, code contexts, and code spans within those contexts corresponding to the queries. It can be used in experiments on semantic query comprehension via an extractive question-answering methodology over code. More details can be found in the paper.
InLegalNER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
The Gun Violence Corpus (GVC) consists of 241 unique incidents for which we have structured data on a) location, b) time, c) the name, gender, and age of the victims, and d) the status of the victims after the incident: killed or injured. For these data, 510 news articles were gathered following the 'data-to-text' approach. The structured data and articles report on a variety of gun violence incidents, such as drive-by shootings, murder-suicides, hunting accidents, and involuntary gun discharges. The documents have been manually annotated for all mentions that refer to the gun violence incident at hand.