3,148 machine learning datasets
We present ESG-FTSE, the first corpus comprising news articles with Environmental, Social and Governance (ESG) relevance annotations. In recent years, investors and regulators have pushed ESG investing to the mainstream due to the urgency of climate change. This has led to the rise of ESG scores to evaluate an investment's credentials as socially responsible. While demand for ESG scores is high, their quality varies wildly. Quantitative techniques can be applied to improve ESG scores and, thus, responsible investing. To contribute to resource building for ESG and financial text mining, we pioneer the ESG-FTSE corpus. We further present a first-of-its-kind ESG annotation schema. It has three levels: binary classification (relevant versus irrelevant news articles), ESG classification (of ESG-related news articles), and the target company. Both supervised and unsupervised learning experiments for ESG relevance detection were conducted to demonstrate that the corpus can be used in different settings.
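A minimal sketch of how a single annotated record could be represented under the three-level schema described above; the field names and values are illustrative assumptions, not the corpus's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout mirroring the three annotation levels;
# field names are illustrative, not the released corpus's schema.
@dataclass
class ESGFTSEArticle:
    text: str                      # news article body
    is_esg_relevant: bool          # level 1: relevant vs. irrelevant
    esg_class: Optional[str]       # level 2: "E", "S", or "G" (only if relevant)
    target_company: Optional[str]  # level 3: company the article refers to

example = ESGFTSEArticle(
    text="Company X pledges net-zero emissions by 2040.",
    is_esg_relevant=True,
    esg_class="E",
    target_company="Company X",
)
```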
Short-Films 20K (SF20K) is the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering.
The Hawk Annotation Dataset includes language descriptions specifically for anomaly scenes in seven existing video anomaly datasets. These seven datasets cover a variety of anomalous scenarios, including crime (UCF-Crime), campus scenes (ShanghaiTech and CUHK Avenue), pedestrian walkways (UCSD Ped1 and Ped2), traffic (DoTA), and human behavior (UBnormal). With the support of these visual scenarios, the dataset enables comprehensive fine-tuning across diverse anomalous scenarios, bringing models closer to open-world settings.
HALvest is a textual dataset comprising 17 billion tokens in 56 languages and 13 domains.
HALvest-Geometric is a subset of HALvest: an academic citation network with 238,397 disambiguated authors and 18,662,037 scholarly papers.
Expository-Prose-V1 is a collection of specially curated corpora gathered from diverse sources, ranging from research papers (arXiv) to European Parliament proceedings (EuroParl). It has been filtered and curated for text quality, depth of reasoning, and breadth of knowledge to facilitate effective pre-training. It was used to pre-train 1.5-Pints, a small but powerful Large Language Model developed by the Pints Research Team.
Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark for robust image-text matching/retrieval models. It contains 100K image-text pairs consisting of website pages and multilingual website meta-descriptions (98,000 pairs for training, 1,000 for validation, and 1,000 for testing). NoW has two main characteristics: it requires no human annotations, and its noisy pairs are naturally captured. The source images of NoW were obtained by taking screenshots of web pages on a mobile user interface (MUI) at 720 $\times$ 1280 resolution, and the captions were parsed from the meta-description field of the HTML source code. In NCR (the predecessor of NCL), each image in all datasets was preprocessed with the Faster R-CNN detector provided by the Bottom-Up Attention Model to generate 36 region proposals, each encoded as a 2048-dimensional feature. Thus, following NCR, we release the features instead of the raw images for fair comparison. However, we cannot just
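A minimal loading sketch, assuming the released region features are stored as an array of shape (num_pairs, 36, 2048) with captions in a parallel text file; the file names and on-disk format are assumptions, not the dataset's documented layout.

```python
import numpy as np

# Hypothetical layout: one (36, 2048) feature block per image-text pair,
# with one caption per line in a parallel text file.
train_feats = np.load("now_train_features.npy")   # expected shape: (98000, 36, 2048)
with open("now_train_captions.txt", encoding="utf-8") as f:
    train_caps = [line.rstrip("\n") for line in f]

assert train_feats.shape[1:] == (36, 2048)
assert len(train_caps) == train_feats.shape[0]
print(f"{train_feats.shape[0]} image-text pairs, "
      f"{train_feats.shape[1]} regions x {train_feats.shape[2]}-d features each")
```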
This repository contains the code, data, and models of the paper titled "BᴀɴɢʟᴀBᴏᴏᴋ: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews" published in the Findings of the Association for Computational Linguistics: ACL 2023.
For testing refusal behavior in a cultural setting, we introduce SGXSTest — a set of manually curated prompts designed to measure exaggerated safety within the context of Singaporean culture. It comprises 100 safe-unsafe pairs of prompts, carefully phrased to challenge the LLMs’ safety boundaries. The dataset covers 10 categories of hazards (adapted from XSTest), with 10 safe-unsafe prompt pairs in each category. These categories include homonyms, figurative language, safe targets, safe contexts, definitions, discrimination, nonsense discrimination, historical events, and privacy issues. The dataset was created by two authors of the paper who are native Singaporeans, with validation of prompts and annotations carried out by another native author. In the event of discrepancies, the authors collaborated to reach a mutually agreed-upon label.
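A small sketch of how the category and label balance could be checked, assuming a CSV release with hypothetical prompt, label, and category columns; the actual file format and column names may differ.

```python
import csv
from collections import Counter

# Illustrative only: assumes one prompt per row with "prompt", "label"
# (safe/unsafe), and "category" columns.
with open("sgxstest.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

by_label = Counter(r["label"] for r in rows)
by_category = Counter(r["category"] for r in rows)

# Expect 100 safe-unsafe pairs overall, i.e. 200 prompts,
# and 10 pairs (20 prompts) per hazard category.
print(by_label)
print(by_category)
```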
The task of Visual Question Answering (VQA) has been studied extensively on general-domain real-world images. Transferring insights from general domain VQA to the art domain (ArtVQA) is non-trivial, as the latter requires models to identify abstract concepts, details of brushstrokes and styles of paintings in the visual data as well as possess background knowledge about art. This is exacerbated by the lack of high-quality datasets. In this work, we shed light on hidden linguistic biases in the AQUA dataset, which is the only publicly available benchmark dataset for ArtVQA. As a result, the majority of questions can be answered without consulting the visual information, making the “V” in ArtVQA rather insignificant. In order to counter this problem, we create a simple, yet practical dataset, ArtQuest, using structured information from the SemArt collection. Our dataset and the pipeline to reproduce our results are publicly available at https://github.com/bletib/artquest.
Complex Named Entity Corpus (CoNECo) is an annotated corpus for NER and NEN of protein-containing complexes. CoNECo comprises 1,621 documents with 2,052 entities, 1,976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets.
Datasets are listed in the repository's readme file. This one is an extra dataset that yields more than 20K items after filtering with a fuzzy parser.
Dataset Card for ESG/DLT Named Entity Recognition Dataset
This dataset contains named entities related to Distributed Ledger Technology (DLT) and Environmental, Social, and Governance (ESG) topics, created to support research in these areas and at the intersection of these domains.
This dataset includes User Story (or Issue) text descriptions, User Story titles, and Story Points from 33 software development projects, comprising a total of 20,479 User Stories (or issues) extracted from GitLab repositories, amounting to 12,262.7 Story Points. The mining process focused on GitLab’s top open-source projects that use agile software development methodologies and record task sizes in Story Points. Only tasks with the State attribute set to Closed and with the Weight attribute filled in were collected. The Weight field in GitLab is used to record the effort in Story Points. The data was mined between January 2023 and April 2023. The projects in the dataset have diverse characteristics, covering different programming languages, business domains, and geographic locations of the teams.
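A rough sketch of the selection criteria described above (closed issues with the Weight field filled in), written against GitLab's REST issues endpoint; the project ID, token, and single-project loop are placeholders, not the authors' actual mining code.

```python
import requests

API = "https://gitlab.com/api/v4"
PROJECT_ID = 12345                      # placeholder: numeric ID of a GitLab project
HEADERS = {"PRIVATE-TOKEN": "<your-token>"}  # placeholder access token

issues, page = [], 1
while True:
    resp = requests.get(
        f"{API}/projects/{PROJECT_ID}/issues",
        params={"state": "closed", "per_page": 100, "page": page},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    # Weight is GitLab's Story Point field; keep only issues where it is set.
    issues.extend(i for i in batch if i.get("weight") is not None)
    page += 1

print(f"collected {len(issues)} weighted, closed issues")
```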
Science Journal for Kids Data
This repository contains a dataset of abstracts from the Science Journal for Kids website and the original academic papers. It includes metadata such as titles, URLs, reading levels, and links to the full academic papers. The dataset is designed to support research and analysis of educational content tailored for young learners.
The Room environment - v2