19,997 machine learning datasets
Human Activity Recognition (HAR) refers to the capacity of machines to perceive human actions. This dataset contains information on 18 different activities collected from 90 participants (75 male and 15 female) using smartphone sensors (Accelerometer and Gyroscope). It has 1945 raw activity samples collected directly from the participants, and 20750 subsamples extracted from them. The activities are:
Hate speech has become one of the most significant issues in modern society, with implications in both the online and offline worlds. However, most prior work has focused on text, with relatively little on images and even less on videos. Early-stage automated video moderation techniques are therefore needed to handle uploaded videos and keep the platform safe and healthy. To that end, we curated approximately 43 hours of videos from BitChute and manually annotated them as hate or non-hate, along with the frame spans that could explain the labeling decision.
The George B. Moody PhysioNet Challenges are annual competitions that invite participants to develop automated approaches for addressing important physiological and clinical problems. The 2024 Challenge invites teams to develop algorithms for digitizing and classifying electrocardiograms (ECGs) captured from images or paper printouts. Despite the recent advances in digital ECG devices, physical or paper ECGs remain common, especially in the Global South. These physical ECGs document the history and diversity of cardiovascular diseases (CVDs), and algorithms that can digitize and classify these images have the potential to improve our understanding and treatment of CVDs, especially for underrepresented and underserved populations.
KodCode is the largest fully-synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning various domains (from algorithmic to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for both supervised fine-tuning (SFT) and RL tuning.
LeetCode-Hard is a benchmark dataset for code generation, consisting of 40 challenging LeetCode "hard-level" questions across 19 programming languages. It is designed to evaluate the problem-solving and functional correctness capabilities of large language models (LLMs), particularly in handling complex algorithmic tasks. This dataset was used to assess the Reflexion framework, which leverages verbal reinforcement learning to improve LLM performance on difficult coding problems.
math 500
Images with paired ground-truth caption hierarchies
FEWS (Few-shot Examples of Word Senses) is a few-shot dataset for English Word Sense Disambiguation (WSD) gathered from Wiktionary, an online, crowd-sourced dictionary. FEWS contains over 121,000 labeled examples of ambiguous words, corresponding to more than 71,000 sense types. The evaluation for FEWS is split into few-shot and zero-shot settings, to better facilitate evaluating few-shot learning and performance on rare senses.
The Nordland dataset as used in SALAD and BoQ (2,760 queries, 27,592 reference images; threshold: 1 frame).
Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M$^3$-VOS) is a new benchmark for verifying the ability of models to understand object phases. It consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios. We collected 205,181 masks, with an average track duration of 14.27s. M$^3$-VOS covers 120+ categories of objects across 6 phases within 14 scenarios, encompassing 23 specific phase transitions.
ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks.
The dataset consists of 96 terrain-corrected (Level-1T) scenes from Landsat 8 OLI and TIRS, covering diverse biomes. This variety supports cloud detection and removal in complex environments. The dataset includes manually generated cloud masks with pixel-level annotations for cloud shadow, clear sky, thin clouds, and cloud areas. Each scene is cropped into 512×512 pixel patches and split into training, validation, and test sets (6:2:2 ratio). It is a valuable resource for training and evaluating fine-grained cloud segmentation models across various terrains.
The CUHK Face Sketch database (CUFS) is intended for research on face sketch synthesis and face sketch recognition. It includes 188 faces from the Chinese University of Hong Kong (CUHK) student database, 123 faces from the AR database, and 295 faces from the XM2VTS database, for 606 faces in total. For each face, there is a sketch drawn by an artist based on a photo taken in a frontal pose, under normal lighting conditions, and with a neutral expression.
Although promising results have been achieved in the areas of traffic-sign detection and classification, few works have provided simultaneous solutions to these two tasks for realistic real-world images. We make two contributions to this problem. Firstly, we have created a large traffic-sign benchmark from 100,000 Tencent Street View panoramas, going beyond previous benchmarks. We call this benchmark Tsinghua-Tencent 100K. It provides 100,000 images containing 30,000 traffic-sign instances. These images cover large variations in illuminance and weather conditions. Each traffic sign in the benchmark is annotated with a class label, its bounding box, and a pixel mask. Secondly, we demonstrate how a robust end-to-end convolutional neural network (CNN) can simultaneously detect and classify traffic signs. Most previous CNN image processing solutions target objects that occupy a large proportion of an image, and such networks do not work well for target objects occupying only a small fraction of
This dataset gathers 10,874 title and abstract pairs from the ACL Anthology Network (until 2016).
The OQMD is a database of DFT-calculated thermodynamic and structural properties of one million materials, created in Chris Wolverton's group at Northwestern University.
Dataset for one-shot segmentation.
Arabic handwriting dataset.
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian legal text.