19,997 machine learning datasets
The synthetic ShapeNet intrinsic image decomposition dataset used to train the deep CNN models IntrinsicNet and RetiNet (CVPR 2018). See Section 4.1 of the paper for details.
HalluEditBench is a comprehensive benchmark for evaluating knowledge editing methods' effectiveness in correcting real-world hallucinations. HalluEdit features a rigorously constructed dataset spanning nine domains and 26 topics. It evaluates methods across five dimensions: Efficacy, Generalization, Portability, Locality, and Robustness.
I2-2000FPS is the first high-speed video dataset offering an unprecedented temporal resolution of 2000 frames per second (fps). Captured using the commercially available Chronos 1.4 high-speed CMOS camera, the dataset includes a diverse range of objects varying in size, shape, orientation, and motion, as well as various camera movements. This dataset is designed to enable research in areas such as motion analysis, object tracking, and scene understanding at extreme temporal resolutions. Potential applications span fields like sports analysis, robotics, autonomous navigation, and high-speed videography.
This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format. Scans were initially skull-stripped and co-registered, before their tumor segmentation labels were produced by an automated computational method. These segmentation labels were revised and any label misclassifications were manually corrected/approved by expert board-certified neuroradiologists. The final labels were used to extract a rich panel of imaging features, including intensity, volumetric,
This dataset contains a 2022 extraction of all 622 datasets then listed in the UCI Machine Learning Repository. For each dataset it records: the index; the name; the url; the instances (number of rows); the number of attributes (columns); the year it was created; the area (such as Life, Social, etc.); the web_hits at the time; the data folder url, where the data were hosted; the dataset_file_url, the URL of the data file; the dataset_file_format (such as data, txt, Z, etc.); the names_file_url, the URL of the file describing the attributes; the names_file_format, the format of that file; the attribute_info, describing all attributes (columns) in the dataset; the source; the data_set_information; the relevant_papers associated with the dataset; the papers_that_cite_this_data_set; and a final column with the number of papers that cite the dataset.
A set of Java bugs with executable test cases.
Researchers at Qatar University have compiled the COVID-QU-Ex dataset, which consists of 33,920 chest X-ray (CXR) images: 11,956 COVID-19, 11,263 non-COVID infections (viral or bacterial pneumonia), and 10,701 normal. Ground-truth lung segmentation masks are provided for the entire dataset. This is the largest lung mask dataset ever created.
SynMirror consists of samples rendered from 3D assets of two widely used 3D object datasets, Objaverse and Amazon Berkeley Objects (ABO), placed in front of a mirror in a virtual Blender environment. The total number of rendered samples is $198,204$. Each rendering contains colors, category_id_segmaps, depth, normals and cam_states.
Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019
InfiniteBench is a benchmark tailored for evaluating the capabilities of language models to process, understand, and reason over super long contexts (100k+ tokens). Long contexts are crucial for enhancing applications with LLMs and achieving high-level interaction. InfiniteBench is designed to push the boundaries of language models by testing them against a context length of 100k+, which is 10 times longer than traditional datasets.
Articles originating from subreddits with explicitly stated ideologies are categorized into three groups: 72,488 articles in the Liberal class, 79,573 articles in the Conservative class, and 225,083 articles in the Restricted class.
Data 1: Raw and Unlabeled; 2 million unlabeled replies from 17 Telegram channels. Data 2: Raw and Labeled; 15,076 replies from 17 Telegram channels categorized as no threat, judicial threat, and non-judicial threat.
LLMs' lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. The benchmark contains 975 graded situation puzzles across three difficulty levels.
The LeukemiaAttri dataset is a large-scale, multi-domain collection of microscopy images derived from leukemia patient samples, enriched with detailed morphological information. This dataset comprises a total of 28.9K images (2.4K × 2 × 3 × 2), which were captured using both low-cost and high-cost microscopes at three different resolutions: 10x, 40x, and 100x, utilizing various cameras. In addition to providing location annotations for each white blood cell (WBC), the dataset includes comprehensive morphological attributes for every WBC, enhancing its utility for research and analysis in the field.
Experiments on Li-ion batteries charged and discharged at different temperatures, with impedance recorded as the damage criterion. The data set was provided by the NASA Prognostics Center of Excellence (PCoE).
The AIME dataset contains 6,000 audio tracks generated by 12 music generation models in addition to 500 tracks from MTG-Jamendo. The prompts used to generate music are combinations of representative and diverse tags from the MTG-Jamendo dataset.
The dataset enables the mapping from text space to numerical space and vice versa.