3,148 machine learning datasets
A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct queries.
carecall is a Korean dialogue dataset for role-satisfying dialogue systems. It was built from a few samples of human-written dialogues via in-context few-shot learning with large-scale LMs: given a prompt consisting of a brief description of the chatbot’s properties and a few example dialogues, a large-scale LM can generate dialogues with a specific personality. The entire dataset was generated with this method.
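The few-shot generation method described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' actual pipeline: the persona text, example dialogues, and the `build_prompt` helper are all invented for demonstration, and the resulting prompt would be sent to a large LM to sample a new dialogue.

```python
# Hypothetical sketch of few-shot prompt construction for persona-conditioned
# dialogue generation: a brief description of the chatbot's properties plus a
# handful of example dialogues are concatenated into one prompt, and a large
# LM is asked to continue with a fresh dialogue in the same style.
# All names and texts here are illustrative assumptions.

def build_prompt(persona: str, example_dialogues: list[list[str]]) -> str:
    """Concatenate a persona description and example dialogues into a prompt."""
    parts = [persona, ""]
    for dialogue in example_dialogues:
        parts.extend(dialogue)
        parts.append("")  # blank line separating examples
    parts.append("User:")  # cue the LM to begin a new dialogue
    return "\n".join(parts)

persona = "The chatbot is a friendly care-call agent who checks in on elderly users."
examples = [
    ["User: Hello?", "Bot: Good morning! How are you feeling today?"],
    ["User: I slept badly.", "Bot: I'm sorry to hear that. Did something keep you up?"],
]

prompt = build_prompt(persona, examples)
# `prompt` would then be passed to a large LM's completion API (not shown),
# and the sampled continuation collected as a new synthetic dialogue.
```

The key design point is that only the persona description and the handful of seed dialogues are human-written; everything else in the dataset is model-generated from prompts of this shape.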
CAVES is the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets, each labelled with specific anti-vaccine concerns in a multi-label setting. It is also the first multi-label classification dataset to provide explanations for each of the labels, as well as class-wise summaries of all the tweets.
The dataset contains multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed from User-Generated Content collected on the Flickr social media platform in three global cities containing UNESCO World Heritage properties (Amsterdam, Suzhou, Venice). The data were collected to provide datasets that are both directly usable by the ML community as a test-bed and theoretically informative for heritage and urban scholars, supporting conclusions relevant to planning decision-making.
General Corpora for the Maltese Language.
This dataset was collected as part of the multidisciplinary project Femmes face aux défis de la transformation numérique : une étude de cas dans le secteur des assurances (Women Facing the Challenges of Digital Transformation: A Case Study in the Insurance Sector) at Université Laval, funded by the Future Skills Centre. It includes job offers, in French, from insurance companies between 2009 and 2020.
This is the replication data for the paper: "Crossing the Linguistic Causeway: A Binational Approach for Translating Soundscape Attributes to Bahasa Melayu".
This is an SDQC stance-annotated Reddit dataset for the Danish language, created as part of a thesis project. It consists of over 5,000 comments structured as comment trees and linked to 33 source posts.
Scene-focused, multi-modal, episodic data of the images and symbolic world-states seen by an agent completing a pogo-stick assembly task within a video game world. Classes consist of episodes with novel objects inserted. A subset of these novel objects can impact gameplay and agent behavior. Novelty objects can vary in size, position, and occlusion within the images. Usable for novelty detection, generalized category discovery, and class-imbalanced classification.
Contains a wide range of texts in Irish, including fiction, news reports, informative texts and official documents.
The GPI Corpus is a collection of 1,043 privacy laws, regulations, and guidelines ("GPIs") covering 182 jurisdictions around the world. These documents are provided in two file formats (i.e., PDF showing the original formatting on the source website and TXT containing just the text of the GPI) and, in some cases, in multiple languages (i.e., the original language(s) and an English translation).
This is a transactional data set containing all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. https://archive.ics.uci.edu/ml/datasets/online+retail
Full-text chemical identification and indexing in PubMed articles.
A set of Basque documents annotated with EusTimeML, a mark-up language for temporal information in Basque.
Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.
Mental health remains a significant public health challenge worldwide. With the increasing popularity of online platforms, many people use them to share their mental health conditions, express their feelings, and seek help from the community and counselors. Because posts vary in length, a short but informative summary allows counselors to process them quickly. To facilitate research on summarizing mental health posts, we introduce the Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from 43 mental health subreddits on Reddit, each paired with its short user-written English summary (called a TLDR).
The CodeQueries benchmark dataset consists of instances of semantic queries, code contexts, and code spans within those contexts corresponding to the queries. It can be used in experiments on semantic query comprehension via an extractive question-answering methodology over code. More details can be found in the paper.
InLegalNER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
The Gun Violence Corpus (GVC) consists of 241 unique incidents for which we have structured data on a) location, b) time, c) the name, gender, and age of the victims, and d) the status of the victims after the incident: killed or injured. For these data, 510 news articles were gathered following the 'data-to-text' approach. The structured data and articles report on a variety of gun violence incidents, such as drive-by shootings, murder-suicides, hunting accidents, and involuntary gun discharges. The documents have been manually annotated for all mentions that refer to the gun violence incident at hand.