19,997 machine learning datasets
UK Biobank participants have generously provided a very wide range of information about their health and well-being since recruitment began in 2006. This has been added to in the following ways:
A real-world dataset, with hyper-accurate digital counterpart & comprehensive ground-truth annotation.
ValueConsistency is a dataset of both controversial and uncontroversial questions in English, Chinese, German, and Japanese on topics from the U.S., China, Germany, and Japan. It was generated by prompting GPT-4 and validated manually.
An accurate dataset describing the trajectories of all 442 taxis running in the city of Porto, Portugal.
A benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character's identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement.
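The deduction described above can be illustrated with a brute-force check over every knight/knave assignment. This is only a minimal sketch and is not taken from the benchmark itself: the `solve` helper, the statement encoding, and the two-character puzzle are assumptions made for illustration.

```python
from itertools import product

def solve(names, statements):
    """Return every assignment consistent with all statements.

    names: list of character names.
    statements: list of (speaker, claim) pairs, where claim is a function
    that takes an assignment {name: is_knight} and returns True or False.
    A knight's claim must be true and a knave's claim must be false, so the
    speaker's identity must equal the truth value of the claim.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if all(world[speaker] == claim(world) for speaker, claim in statements):
            solutions.append(world)
    return solutions

# Hypothetical puzzle: A says "We are both knaves."
statements = [("A", lambda w: (not w["A"]) and (not w["B"]))]
solutions = solve(["A", "B"], statements)
print(solutions)
```

For this puzzle the only consistent assignment is that A is a knave and B is a knight: a knight could not truthfully claim to be a knave, so A lies, which means the two cannot both be knaves.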
RUFF is a large-scale dataset to measure pronoun fidelity in English.
Context: Malicious URLs, or malicious websites, are a very serious threat to cybersecurity. They host unsolicited content (spam, phishing, drive-by downloads, etc.), lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. We collected this dataset to include a large number of examples of malicious URLs, so that a machine learning-based model can be developed to identify them in advance, before they infect computer systems or spread through the internet.
It is desirable for detection and classification algorithms to generalize to unfamiliar environments, but suitable benchmarks for quantitatively studying this phenomenon are not yet available. We present a dataset designed to measure recognition generalization to novel environments. The images in our dataset are harvested from twenty camera traps deployed to monitor animal populations. Camera traps are fixed at one location, hence the background changes little across images; capture is triggered automatically, hence there is no human bias. The challenge is learning recognition in a handful of locations, and generalizing animal detection and classification to new locations where no training data is available. In our experiments state-of-the-art algorithms show excellent performance when tested at the same location where they were trained. However, we find that generalization to new locations is poor, especially for classification systems.
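The evaluation protocol described above, training on some camera locations and testing on locations never seen in training, can be sketched as a location-based split. This is a minimal sketch under stated assumptions: the record fields (`image`, `location`, `label`) and the `split_by_location` helper are hypothetical and do not reflect the dataset's actual annotation format.

```python
def split_by_location(records, held_out_locations):
    """Split records so that held-out locations appear only in the test set."""
    held_out = set(held_out_locations)
    train = [r for r in records if r["location"] not in held_out]
    test = [r for r in records if r["location"] in held_out]
    return train, test

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"image": "a.jpg", "location": 1, "label": "deer"},
    {"image": "b.jpg", "location": 2, "label": "fox"},
    {"image": "c.jpg", "location": 3, "label": "deer"},
]
train, test = split_by_location(records, held_out_locations=[3])
```

Splitting by location rather than by image keeps near-identical backgrounds from appearing on both sides of the split, which is the condition the generalization experiment depends on.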
The DAPlankton dataset consists of over 110k expert-labeled plankton images. The data is divided into two subsets: DAPlankton_LAB and DAPlankton_SEA. DAPlankton_LAB consists of images captured from multiple mono-specific phytoplankton cultures, which were analysed using three different imaging instruments: Imaging FlowCytoBot (IFCB), CytoSense (CS) flow cytometer, and FlowCam (FC) imaging microscope, each producing cropped images with one plankton particle per image. An expert further verified the class of each image, ensuring that there was no cross-contamination between different cultures. This process resulted in a balanced dataset with negligible label uncertainty. DAPlankton_SEA consists of images captured from water samples collected from the Baltic Sea using two different imaging instruments: IFCB and CS. Each image was manually labeled by an expert. DAPlankton_SEA provides a realistic and more challenging dataset with a large class imbalance and natural intra-class variance.
TwinViews-13k is a dataset of 13,855 pairs of left-leaning and right-leaning political statements, each pair matched by topic. It was created to study political bias in reward and language models, with a focus on understanding the interaction between model alignment to truthfulness and the emergence of political bias. The dataset was generated using GPT-3.5 Turbo, with extensive auditing to ensure ideological balance and topical relevance. This dataset can be used for various tasks related to political bias, natural language processing, and model alignment, particularly in studies examining how political orientation impacts model outputs.
K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith, “Part-of-speech tagging for Twitter: Annotation, features, and experiments”, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011, pp. 42–47.
C2A: Combination to Application Dataset. This repository contains the code and information for the paper "UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios" by Ragib Amin Nihal, Benjamin Yen, Katsutoshi Itoyama, and Kazuhiro Nakadai.
COMPASS-XP is a dataset of matched photographic and X-ray images of single objects, made available for use in Machine Learning & Computer Vision research, in particular in the context of transport security. Objects are imaged in multiple poses, and accompanied by metadata including labels for whether we consider the object to be dangerous in the context of aviation. Object classes overlap with those in the popular ImageNet Large Scale Visual Recognition Challenge class set and the WordNet lexical database, and identifiers for shared classes in both schemes are also provided.
The dataset is a .h5 file composed of entries with keys of the form (n,m), denoting the dimensions of the system matrix on which the simulations were performed. The value of each key is a pair of arrays: one stores the number of iterations needed to terminate the process in each simulation, and the other stores the number of elements present at the terminal state of each simulation. Thus, given the maximum number of elements n*m in each system, the estimated percolation threshold can be computed by averaging the ratios between the number of elements at each terminal state and the system size. Overall, 207,950,010 simulations were performed. This dataset was used to perform a complexity analysis in: https://arxiv.org/abs/2410.11874
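The threshold estimate described above can be sketched as follows. This is a minimal sketch, assuming the array of terminal-state element counts for one (n,m) key has already been read into a Python sequence. The internal layout of the .h5 file is not specified here, so loading is left out, and the function name and toy values are hypothetical.

```python
from statistics import fmean

def estimate_percolation_threshold(n, m, terminal_elements):
    """Average the ratio of terminal-state element counts to system size n*m."""
    size = n * m
    if size <= 0:
        raise ValueError("n and m must be positive")
    if len(terminal_elements) == 0:
        raise ValueError("at least one simulation is required")
    return fmean([count / size for count in terminal_elements])

# Toy values for illustration only; these are not taken from the dataset.
toy_terminal_elements = [3, 4, 2]  # elements at the terminal state, per simulation
threshold = estimate_percolation_threshold(2, 3, toy_terminal_elements)
print(threshold)
```

With the toy values, the ratios are 3/6, 4/6, and 2/6, so the estimate is their mean, 0.5.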
A significant challenge in removing shadows from indoor scenes is obtaining shadow-free images. To overcome this challenge, we propose a novel rendering pipeline for generating shadowed and shadow-free images under direct and indirect illumination, and create a comprehensive synthetic dataset that contains over 30,000 image pairs, covering various object types and lighting conditions.
A high-quality synthetic dataset for object relighting, covering a wide range of geometries and materials.
A high-quality captured dataset for object relighting, covering a wide range of geometries and materials.
We introduce the Chinese Image Implication Understanding Benchmark CII-Bench, a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images. These images, including abstract artworks, comics and posters, possess visual implications that require an understanding of visual details and reasoning ability. CII-Bench reveals whether current MLLMs, leveraging their inherent comprehension abilities, can accurately decode the metaphors embedded within the complex and abstract information presented in these images.
The COFAR (COmmonsense and FActual Reasoning) dataset is a collection of images and text queries specifically designed to challenge and evaluate image search models that aim to go beyond simple visual matching. It focuses on the ability of these models to perform commonsense and factual reasoning, a capability currently lacking in most existing image search technology.