3,275 machine learning datasets
A dataset containing four sets of playing card images. Each set contains 10,000 images and has a series of attributes. Cards are randomly rotated, flipped and scaled (within limits).
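For readers who want comparable augmentations, the sketch below reproduces the rotate/flip/scale pipeline with torchvision. The dataset's exact rotation and scale limits are not stated, so the ranges and file name are illustrative assumptions.

```python
# Illustrative reproduction of the card augmentations with torchvision.
# The dataset's exact rotation/scale limits are not published here, so
# the ranges below are placeholders.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                 # assumed rotation limit
    transforms.RandomHorizontalFlip(p=0.5),                # random flip
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),  # assumed scale limits
])

card = Image.open("card_0001.png")  # hypothetical file name
augmented = augment(card)
```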
LLaVA-Rad MIMIC-CXR features more accurate section extractions from MIMIC-CXR free-text radiology reports. Traditionally, rule-based methods were used to extract sections such as the reason for exam, findings, and impression, but these approaches often fail due to inconsistencies in report structure and clinical language. In this work, we leverage GPT-4 to extract these sections more reliably, adding 237,073 image-text pairs to the training split and 1,952 pairs to the validation split. This enhancement enabled the development and fine-tuning of LLaVA-Rad, a multimodal large language model (LLM) tailored for radiology applications, which achieves improved performance on report generation tasks.
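As a rough illustration of the GPT-4 extraction step (the actual prompt, model version, and output schema are not part of this description), a call might look like the following sketch using the OpenAI Python SDK.

```python
# Hedged sketch of GPT-4-based section extraction from a free-text
# radiology report. The real prompt and output schema used for
# LLaVA-Rad MIMIC-CXR are assumptions here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_sections(report_text: str) -> str:
    prompt = (
        "From the radiology report below, extract the 'reason for exam', "
        "'findings', and 'impression' sections as JSON.\n\n" + report_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```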
The “Fused Image dataset for convolutional neural Network-based crack Detection” (FIND) is a large-scale image dataset with pixel-level ground truth crack data for deep learning-based crack segmentation analysis. It features four types of image data: raw intensity images, raw range (i.e., elevation) images, filtered range images, and fused raw images. The FIND dataset consists of 2,500 image patches (256 × 256 pixels) and their ground-truth crack maps for each of the four data types.
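A minimal loading sketch for the four data types is shown below; the directory layout and file naming are assumptions, so adapt them to the actual release.

```python
# Sketch of iterating FIND patch/ground-truth pairs for one data type.
# Directory and file naming conventions are assumed, not documented here.
from pathlib import Path

import numpy as np
from PIL import Image

ROOT = Path("FIND")  # hypothetical dataset root
DATA_TYPES = ["raw_intensity", "raw_range", "filtered_range", "fused_raw"]

def load_pairs(data_type: str):
    """Yield (256x256 image patch, binary crack map) pairs."""
    for img_path in sorted((ROOT / data_type).glob("*.png")):
        gt_path = ROOT / "ground_truth" / img_path.name  # assumed mirrored names
        patch = np.asarray(Image.open(img_path))
        crack_map = np.asarray(Image.open(gt_path)) > 0  # binarize the mask
        yield patch, crack_map
```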
The dataset contains 537 RGB JPG images of cracks and corresponding PNG binary segmentation masks: a training set of 300 images and a testing set of 237 images. All images have a fixed size of 544 × 384 pixels. Image diversity: (1) asphalt 22%, concrete 78%; (2) dirty 22.4%, rough 40%, bare 37.6%; (3) crack widths range from 1 to 180 pixels.
We compiled a new dataset (the PERO layout dataset) that contains 683 images from various sources and historical periods with complete manual annotations of text blocks, text line polygons, and baselines. The included documents range from handwritten letters to historic printed books and newspapers and cover various languages, including Arabic and Russian. Part of the PERO dataset was collected from existing datasets (cBAD, IMPACT, and BADAM) and extended with additional layout annotations. The dataset is split into 456 training and 227 testing images.
We propose the first standardized benchmark in multimodal continual learning for video data, defining protocols for training and metrics for evaluation. This standardized framework allows researchers to effectively compare models, driving advancements in AI systems that can continuously learn from diverse data sources.
Molecules represent tokens of the language of chemistry, which underlies not only chemistry itself but also scientific fields that use chemical information, such as pharmacy, materials science, and molecular biology. Existing molecular information is distributed across textbooks, publications, and patents. To convey structural information (the spatial arrangement of atoms), molecules are commonly drawn as 2D images in such documents, which makes Optical Chemical Structure Understanding (OCSU) important for molecule-centric scientific discovery.
A Ball-Collision Dataset (ABCD) serves as a comprehensive benchmark for investigating the interaction dynamics of moving objects in 3D environments. It includes multimodal recordings of ball trajectories captured under varying conditions, including different elevation angles, flight lengths, and speeds. The dataset contains raw event, RGB, and IMU data collected from an FPGA-based drone, along with 3D motion-capture data of the (static) drone and a moving ball.
We construct Gaze-CIFAR-10, a gaze-augmented image dataset based on the standard CIFAR-10 benchmark, enhanced with human eye-tracking annotations collected using the HTC VIVE Pro Eye headset. The original CIFAR-10 dataset consists of 60,000 color images across 10 categories, each with a resolution of $32 \times 32$ pixels. To enable reliable human gaze tracking, all images are upsampled to $1024 \times 1024$ using the Real-ESRGAN model.
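The preprocessing step is easy to sketch. The authors use Real-ESRGAN for the upsampling; the snippet below substitutes plain bicubic resizing as a stand-in, since the Real-ESRGAN invocation is not shown in the description.

```python
# Sketch of the 32x32 -> 1024x1024 upsampling step. Bicubic resizing is
# a stand-in: the dataset itself was produced with Real-ESRGAN.
from PIL import Image
from torchvision.datasets import CIFAR10

cifar = CIFAR10(root="./data", train=True, download=True)
img, label = cifar[0]                                # PIL image, 32x32
upsampled = img.resize((1024, 1024), Image.BICUBIC)  # stand-in for Real-ESRGAN
```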
This dataset contains pre-processed versions of datasets introduced in prior works, along with new data pertinent to the paper.
CropCOCO is a validation-only dataset of COCO val2017 images cropped so that some keypoint annotations fall outside the image. It can be used for keypoint detection, out-of-image keypoint detection and localization, person detection, and amodal person detection.
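To make the "out-of-image keypoint" notion concrete, the sketch below splits COCO-style keypoints into inside and outside sets for a given crop box; the crop format and keypoint layout follow the usual COCO conventions and are not an API from the dataset.

```python
# Illustrative split of COCO keypoints (N x 3: x, y, visibility) into
# inside/outside sets after cropping, in the spirit of CropCOCO.
import numpy as np

def split_keypoints(keypoints, crop):
    """crop = (x0, y0, x1, y1) in original image coordinates."""
    x0, y0, x1, y1 = crop
    kp = np.asarray(keypoints, dtype=float).copy()
    kp[:, 0] -= x0                      # shift into crop coordinates
    kp[:, 1] -= y0
    inside = ((kp[:, 0] >= 0) & (kp[:, 0] < x1 - x0) &
              (kp[:, 1] >= 0) & (kp[:, 1] < y1 - y0))
    return kp[inside], kp[~inside]      # outside keypoints stay annotated
```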
We establish the first large benchmark, IRBFD, to facilitate research on nonuniformity correction and infrared UAV target detection. It consists of 50,000 manually labeled infrared images with various nonuniformity levels, multi-scale UAV targets, and rich backgrounds with target annotations.
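As context for the nonuniformity-correction task, a common single-image baseline is column-offset destriping, sketched below. This is a generic baseline, not the method associated with IRBFD.

```python
# Generic column-offset destriping, a simple baseline for the
# fixed-pattern nonuniformity that IRBFD benchmarks. Not the
# dataset's reference method.
import numpy as np

def destripe(ir_frame: np.ndarray) -> np.ndarray:
    """Subtract per-column offsets relative to the global mean."""
    col_means = ir_frame.mean(axis=0, keepdims=True)
    return ir_frame - (col_means - col_means.mean())
```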
Current autonomous driving algorithms rely heavily on the visible spectrum, which is prone to performance degradation in adverse conditions such as fog, rain, snow, glare, and high contrast. Although other spectral bands like near-infrared (NIR) and long-wave infrared (LWIR) can enhance vision perception in such situations, they have limitations and lack large-scale datasets and benchmarks. Short-wave infrared (SWIR) imaging offers several advantages over NIR and LWIR, yet no publicly available large-scale dataset currently incorporates SWIR data for autonomous driving. To address this gap, we introduce the RGB and SWIR Multispectral Driving (RASMD) dataset, which comprises 100,000 synchronized and spatially aligned RGB-SWIR image pairs collected across diverse locations, lighting, and weather conditions. In addition, we provide a subset for RGB-SWIR translation, as well as object detection annotations for a subset of challenging traffic scenarios, to demonstrate the utility of SWIR imaging.
GraspClutter6D is a large-scale real-world dataset for robust object perception and robotic grasping in cluttered environments. It features 1,000 highly cluttered scenes with dense arrangements (average 14.1 objects/scene with 62.6% occlusion), 200 household, industrial, and warehouse objects captured in 75 diverse environment configurations (bins, shelves, and tables), multi-view data from 4 RGB-D cameras (RealSense D415, D435, Azure Kinect, and Zivid One+), and comprehensive annotations including 736K 6D object poses and 9.3 billion feasible robotic grasps for 52K RGB-D images. The dataset provides a challenging testbed for segmentation, 6D pose estimation, and grasp detection algorithms in realistic cluttered scenarios.
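For readers unfamiliar with the 6D pose annotation format, the sketch below applies a rotation-translation pose to an object model point cloud, which is the usual convention for such labels; the arrays are placeholders, not the dataset's API.

```python
# Applying an annotated 6D object pose (rotation R, translation t) to a
# model point cloud, the usual convention for pose labels. Values are
# placeholders, not GraspClutter6D's API.
import numpy as np

R = np.eye(3)                           # 3x3 rotation (placeholder)
t = np.array([0.1, 0.0, 0.5])           # translation in meters (placeholder)

model_points = np.random.rand(1000, 3)  # stand-in object model (N x 3)
camera_points = model_points @ R.T + t  # model frame -> camera frame
```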
DivShift North American West Coast
A comprehensive object-instance ReID dataset with multiple indoor object instances under varying lighting conditions.
A real world dataset for benchmarking global localization in complex indoor environments.
A synthetic dataset created with ProcTHOR for benchmarking global localization in complex indoor environments.
Predicting stock market prices following corporate earnings calls remains a significant challenge for investors and researchers alike, requiring innovative approaches that can process diverse information sources. This study investigates the impact of corporate earnings calls on stock prices by introducing a multi-modal predictive model. We leverage textual data from earnings call transcripts, along with images and tables from accompanying presentations, to forecast stock price movements on the trading day immediately following these calls. To facilitate this research, we developed the MiMIC (Multi-Modal Indian Earnings Calls) dataset, covering companies in the Nifty 50, Nifty MidCap 50, and Nifty Small 50 indices. The dataset includes earnings call transcripts, presentations, fundamentals, technical indicators, and subsequent stock prices. We present a multimodal analytical framework that integrates quantitative variables with predictive signals derived from textual and visual modalities.