19,997 machine learning datasets
The ArtiFact dataset is a large-scale image dataset that aims to include a diverse collection of real and synthetic images from multiple categories, including Human/Human Faces, Animal/Animal Faces, Places, Vehicles, Art, and many other real-life objects. The dataset comprises 8 sources that were carefully chosen to ensure diversity and includes images synthesized from 25 distinct methods, including 13 GANs, 7 Diffusion, and 5 other miscellaneous generators. The dataset contains 2,496,738 images, comprising 964,989 real images and 1,531,749 fake images.
3DOH50K is the first real 3D human dataset for the problem of human reconstruction and pose estimation in occlusion scenarios. It contains 51,600 images with accurate 2D and 3D poses, SMPL parameters, and binary masks.
Dataset for document shadow removal
CHAD (Charlotte Anomaly Dataset) is a high-resolution, multi-camera dataset for surveillance video anomaly detection. It includes bounding box, Re-ID, and pose annotations, as well as frame-level anomaly labels that divide all frames into two groups, anomalous or normal. The paper "CHAD: Charlotte Anomaly Dataset" gives the full details; refer to the dataset page for more information.
The HiREST (HIerarchical REtrieval and STep-captioning) dataset is a benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. It consists of 3.4K text-video pairs from a video dataset, where 1.1K videos are annotated with the moment spans relevant to the text query and a breakdown of each moment into key instruction steps with captions and timestamps (8.6K step captions in total). The benchmark comprises four tasks: video retrieval, moment retrieval, and two novel tasks, moment segmentation and step captioning.
Human-Art is a versatile human-centric dataset to bridge the gap between natural and artificial scenes. It covers twenty high-quality human scenes, spanning natural and artificial humans in both 2D and 3D representations. It contains 50,000 images with more than 123,000 human figures across these 20 scenarios, annotated with human bounding boxes, 21 2D human keypoints, human self-contact keypoints, and description text.
Large Scale Composed Image Retrieval (LaSCo) is a new dataset for Composed Image Retrieval (CoIR), ten times larger than existing ones.
A dataset of 1M crowd-sourced annotations covering 100k talk page diffs (10 judgements per diff) for personal attacks, aggression, and toxicity.
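With 10 judgements per diff, a common first step is to collapse them into a single label per diff. The following is a minimal sketch, assuming a hypothetical record layout of (diff_id, is_attack) pairs; the field names, the toy values, and the strict-majority rule are illustrative assumptions, not the dataset's documented schema or its authors' aggregation method.

```python
from collections import defaultdict

def majority_labels(judgements):
    """Collapse per-annotator judgements into one label per diff.

    `judgements` is an iterable of (diff_id, is_attack) pairs, a hypothetical
    layout chosen for illustration. A diff is labeled True only when strictly
    more than half of its judgements are True.
    """
    votes = defaultdict(list)
    for diff_id, is_attack in judgements:
        votes[diff_id].append(bool(is_attack))
    # sum(v) counts the True votes; comparing 2 * count with len avoids floats.
    return {d: sum(v) * 2 > len(v) for d, v in votes.items()}

# Toy data (made up): two diffs with 10 judgements each.
# Diff "a" gets 7 positive votes, diff "b" gets 5.
toy = [("a", i < 7) for i in range(10)] + [("b", i < 5) for i in range(10)]
labels = majority_labels(toy)
```

A strict majority treats a 5-5 tie as negative; a different threshold, or keeping the fraction of positive votes as a soft label, may suit some uses better.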
Trajectories of 3 dynamical systems: pendulum, Lotka-Volterra, and 3-body system.
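To illustrate what a trajectory of one of these systems looks like, here is a minimal sketch that integrates the Lotka-Volterra predator-prey equations with a classical fourth-order Runge-Kutta step. The parameter values, initial state, step size, and function names are all assumptions made for illustration; the listing does not say how the dataset's trajectories were generated.

```python
def lotka_volterra(state, alpha=1.0, beta=0.5, delta=0.5, gamma=1.0):
    """Right-hand side of the Lotka-Volterra equations.

    x is the prey population, y the predator population.
    The parameter values are illustrative assumptions.
    """
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)

def rk4_step(f, state, dt):
    """Advance `state` by one classical Runge-Kutta (RK4) step of size dt."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(f, state, dt, steps):
    """Return the list of states visited, including the initial one."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(f, state, dt)
        trajectory.append(state)
    return trajectory

# 1000 steps of size 0.01 from an assumed initial state (prey=2, predator=1).
trajectory = simulate(lotka_volterra, (2.0, 1.0), dt=0.01, steps=1000)
```

With these assumed parameters the fixed point is at (2, 2), so a trajectory started there stays put, while other positive starting states orbit around it.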
In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an online reference for bibliographic information on major computer science publications that indexes over 4.4 million publications by more than 2.2 million authors. Our dataset consists of 10,000 question-answer pairs with the corresponding SPARQL queries, which can be executed over the DBLP KG to fetch the correct answer. To the best of our knowledge, this is the first QA dataset for scholarly KGs.
Aspect-based sentiment analysis (ABSA) aims to detect the targets (which are composed of contiguous words), aspects and sentiment polarities in text. Published datasets from SemEval-2015 and SemEval-2016 reveal that a sentiment polarity depends on both the target and the aspect. However, most existing methods predict sentiment polarities from either targets or aspects but not from both, and thus easily make wrong predictions. In particular, where the target is implicit, i.e., it does not appear in the given text, methods that predict sentiment polarities from targets do not work. To tackle these limitations in ABSA, this paper proposes a novel method for target-aspect-sentiment joint detection. It relies on a pre-trained language model and can capture the dependence on both targets and aspects for sentiment prediction. Experimental results on the SemEval-2015 and SemEval-2016 restaurant datasets show that the proposed method achieves a high
The Human Related version of UBnormal ("UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection," Acsintoae et al.) was introduced by Flaborea et al. in the paper "Contracting Skeletal Kinematics for Human-Related Video Anomaly Detection".
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, t
The data is provided by 10x Genomics under "Single Cell 3' Paper: Zheng et al. 2017 (v1 Chemistry)" and consists of data from the following 9 cell types: CD4+/CD45RA+/CD25- naïve T cells, CD4+ helper T cells, CD4+/CD25+ regulatory T cells, CD4+/CD45RO+ memory T cells, CD8+/CD45RA+ naïve cytotoxic T cells, CD8+ cytotoxic T cells, CD56+ natural killer cells, CD34+ cells, and CD19+ B cells. The data contains 32738 genes and 92043 cells.
Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms used as diagnosis support systems promise considerable relief for medical personnel, if only because of the number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, neither aspect is covered satisfactorily by existing freely accessible ECG datasets.
Over the past few years, different Computer-Aided Diagnosis (CAD) systems have been proposed to tackle skin lesion analysis. Most of these systems work only for dermoscopy images, since there is a severe lack of public clinical image archives available to evaluate them. To fill this gap, we release a skin lesion benchmark composed of clinical images collected from smartphone devices and a set of patient clinical data containing up to 21 features. The dataset consists of 1373 patients, 1641 skin lesions, and 2298 images for six different diagnoses: three skin diseases and three skin cancers. In total, 58.4% of the skin lesions are biopsy-proven, including 100% of the skin cancers. By releasing this benchmark, we aim to support future research and the development of new tools to assist clinicians in detecting skin cancer.
We propose the MusicQA dataset for training music-enabled question-answering models; it is used to train and evaluate our MU-LLaMA model. The dataset is generated from the MusicCaps and MagnaTagATune datasets. We use the descriptions and tags from these existing datasets to prompt the MPT-7B Chat model to generate question-answer pairs through inference, reasoning, and paraphrasing. The dataset contains 12,542 music files for training, making up 76.15 hours of music with 112,878 question-answer pairs.
This dataset contains simulations of a complex, large-scale chemical plant proposed by Downs and Vogel (1993). As described by Reinartz, Kulahci and Ravn (2021):
The challenge of accurately segmenting individual trees from laser scanning data hinders the assessment of crucial tree parameters necessary for effective forest management, impacting many downstream applications. While dense laser scanning offers detailed 3D representations, automating the segmentation of trees and their structures from point clouds remains difficult. The lack of suitable benchmark datasets and reliance on small datasets have limited method development. The emergence of deep learning models exacerbates the need for standardized benchmarks. Addressing these gaps, the FOR-instance data represent a novel benchmarking dataset to enhance forest measurement using dense airborne laser scanning data, aiding researchers in advancing segmentation methods for forested 3D scenes.
SODA-A is a large-scale benchmark for small object detection in aerial scenes, with 800,203 instances annotated with oriented rectangular boxes across 9 classes. It contains 2,510 high-resolution images extracted from Google Earth.