UV6K is a high-resolution remote sensing urban vehicle segmentation dataset.
The paper presents a study of Clickbait PDFs: PDF documents that lead to various attacks on the Web. Unlike the well-known "MalPDFs" usually found in phishing emails, Clickbait PDFs do not contain malware.
EPIC-STATES builds upon the raw data in the EPIC-KITCHENS dataset and covers 10 object state categories: open, close, in-hand, out-of-hand, whole, cut, raw, cooked, peeled, unpeeled. It comprises 14,346 object bounding boxes from the EPIC-KITCHENS dataset (2018 version), each labeled with 10 binary labels, one per state class.
EPIC-ROI builds on top of the EPIC-KITCHENS dataset and consists of 103 diverse images with pixel-level annotations for regions that human hands frequently touch in everyday interaction. Specifically, image regions that afford any of the seven most frequent actions (take, open, close, press, dry, turn, peel) are labeled as positive. We manually watched videos from multiple participants to define a) object categories, and b) the specific regions within each category where participants interacted while performing any of the 7 selected actions. The 103 images were sampled from 9 different kitchens (7 to 15 images per kitchen, with minimal overlap). EPIC-ROI is used only for evaluation and contains 32 val images and 71 test images; images from the same kitchen are in the same split. The Regions-of-Interaction task is to score each pixel in the image with the probability of a hand interacting with it. Performance is measured using average precision.
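The Regions-of-Interaction evaluation reduces to ranking pixels by their predicted score and computing average precision against the binary annotations. The helper below is a minimal sketch of that computation, not the official evaluation code; the function name and array layout are assumptions.

```python
import numpy as np

def average_precision(scores, labels):
    """Sketch of the EPIC-ROI metric: rank pixels by predicted score
    and compute average precision against binary positive labels.

    scores: 1D float array of per-pixel predictions.
    labels: 1D binary array (1 = region-of-interaction pixel).
    """
    order = np.argsort(-scores)          # sort pixels by descending score
    labels = labels[order]
    cum_tp = np.cumsum(labels)           # true positives at each rank cutoff
    precision = cum_tp / np.arange(1, len(labels) + 1)
    # AP averages precision at the ranks of the positive pixels.
    return precision[labels == 1].mean()
```

For flattened score/label maps of a full image, the same function applies after `ravel()`.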
To facilitate research on asynchrony in collaborative perception, we simulate the first collaborative perception dataset with varying temporal asynchronies based on CARLA, named IRregular V2V (IRV2V). We set 100 ms as the ideal sampling interval and simulate the asynchronies that arise in real-world scenarios from two main aspects: i) since agents are not synchronized with a unified global clock, we uniformly sample a time shift $\delta_s\sim \mathcal{U}(-50,50)\text{ms}$ for each agent in the same scene, and ii) to model the trigger noise of the sensors, we uniformly sample a time turbulence $\delta_d\sim \mathcal{U}(-10,10)\text{ms}$ for each sampling timestamp. The final asynchronous time interval between adjacent timestamps is the sum of the time shift and the time turbulence. In experiments, we also sample the frame intervals to achieve large-scale and diverse asynchrony. Each scene includes 2 to 5 collaborative agents. Each agent is equipped with
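The asynchrony model above can be sketched as follows: each agent draws one clock shift $\delta_s$, and every frame additionally draws an independent trigger turbulence $\delta_d$. This is a hypothetical illustration of that sampling scheme (the function name and timestamp composition are assumptions, not the IRV2V simulation code).

```python
import numpy as np

def sample_async_timestamps(n_agents, n_frames, interval_ms=100.0, seed=0):
    """Sketch of IRV2V-style asynchrony: per-agent clock shift plus
    per-frame sensor trigger turbulence on top of ideal 100 ms sampling."""
    rng = np.random.default_rng(seed)
    # Per-agent clock shift: delta_s ~ U(-50, 50) ms, one draw per agent.
    shift = rng.uniform(-50.0, 50.0, size=(n_agents, 1))
    # Per-timestamp trigger turbulence: delta_d ~ U(-10, 10) ms.
    turbulence = rng.uniform(-10.0, 10.0, size=(n_agents, n_frames))
    # Ideal timestamps at the nominal sampling interval.
    ideal = interval_ms * np.arange(n_frames)
    return ideal + shift + turbulence  # shape: (n_agents, n_frames)
```

Each agent's timestamps then deviate from the ideal clock by at most 60 ms (50 ms shift plus 10 ms turbulence).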
The Chess Recognition Dataset (ChessReD) comprises a diverse collection of images of chess formations captured using smartphone cameras, a sensor choice made to ensure real-world applicability. The dataset is accompanied by detailed annotations describing the chess piece formation in each image; the number of annotations per image therefore depends on the number of chess pieces depicted in it. There are 12 category ids in total (i.e., 6 piece types per colour), and the chessboard coordinates are given as algebraic notation strings (e.g., "a8").
The Chess Recognition Dataset 2K (ChessReD2K) comprises a diverse collection of images of chess formations captured using smartphone cameras, a sensor choice made to ensure real-world applicability. The dataset is accompanied by detailed annotations describing the chess piece formation in each image, along with bounding boxes and chessboard corner annotations. The number of annotations per image depends on the number of chess pieces depicted in it. There are 12 category ids in total (i.e., 6 piece types per colour), and the chessboard coordinates are given as algebraic notation strings (e.g., "a8"). The corners are annotated by their location on the chessboard (e.g., "bottom-left") with respect to the white player's view. Distinguishing between these corner types conveys the orientation of the chessboard, which can be leveraged to determine the image's perspective and viewing angle.
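Since both ChessReD and ChessReD2K encode square locations as algebraic notation strings, consumers of the annotations typically need to map them to board indices. A minimal sketch of such a conversion, assuming a (file, rank) indexing with (0, 0) at a1 from the white player's view (the helper is hypothetical, not part of the dataset tooling):

```python
def algebraic_to_indices(square: str):
    """Convert an algebraic square like "a8" to (file, rank) indices,
    zero-based, with (0, 0) at a1 from the white player's view."""
    file = ord(square[0]) - ord('a')   # 'a'..'h' -> 0..7
    rank = int(square[1]) - 1          # '1'..'8' -> 0..7
    return file, rank

print(algebraic_to_indices("a8"))  # (0, 7)
```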
The Creative Visual Storytelling Anthology is a collection of 100 author responses to an improved creative visual storytelling exercise over a sequence of three images. Each item contains four facet entries, corresponding to Entity, Scene, Narrative, and Title.
75k photos of windows + 21k synthetic renders of building windows.
The SPKL dataset contains 1203 images of parking lots divided into 11 categories based on visibility conditions (including a 'winter' category absent from other datasets at the time of publication).
The Products for OCR and Information Extraction (POIE) dataset derives from camera images of various real-world products. The images are carefully selected and manually annotated by a labeling team of 8 experienced labelers. We first crop the nutrition tables from product images and adopt multiple commercial OCR engines (Azure and Baidu OCR) for pre-labeling. Then we use LabelMe to manually check the location and transcription annotations of every text box, as well as the entity values for all text in the images, repairing any OCR errors found. After discarding low-quality and blurred images, we obtain 3,000 images with 111,155 text instances.
The ESP dataset (Evaluation for Styled Prompt dataset) is a benchmark for zero-shot domain-conditional caption generation, focusing on providing multiple styled text targets for the same image. It comprises 4.8k captions for 1k images from the COCO Captions test set. We collect captions in five everyday text domains: blog, social media, instruction, story, and news.
RVL-CDIP_MP is our first contribution: retrieving the original documents of the IIT-CDIP test collection that were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which is why we have around 500 fewer instances. By leveraging metadata from OCR-IDL, we matched the original identifiers from IIT-CDIP and retrieved them from IDL using a conversion.
RVL-CDIP_MP-N can serve its original goal as a covariate shift test set, now for multi-page document classification. We were able to retrieve the original full documents from DocumentCloud and Web Search.
The CiNAT Birds 2021 (Cross-View iNaturalist-2021 Birds) dataset contains ground-level images of bird species along with satellite images associated with the geolocation of the ground-level images. In total, there are 413,959 pairs for training and 14,831 pairs for validation and testing. The ground-level images are of varying sizes, while the satellite images are of size 256x256. Additionally, the dataset comes with rich metadata for each image: geolocation, date, observer id, and taxonomy.
This is the Infrared Elephant Images Dataset (named the 'EleThermal' dataset), collected from here and annotated by our project, released under GPLv3. If you use the annotated 'EleThermal' dataset for any research or other product by any means, please acknowledge the following two works by citing them.
This dataset was built with data acquired at the Hospital Clinic of Barcelona, Spain. It is composed of a total of 1126 HD polyp images. There are a total of 473 unique polyps, with a variable number of different shots per polyp (minimum: 2, maximum: 24, median: 10). Special attention was paid to ensure that images from the same polyp show different conditions. An external frame-grabber and a white light endoscope were used to capture raw images. The dataset contains images with two different resolutions: 1920 x 1080 and 1350 x 1080.
Faces Through Time (FTT) features 26,247 images of notable people from the 19th to 21st centuries, with roughly 1,900 images per decade on average. It is sourced from Wikimedia Commons, a crowdsourced and open-licensed collection of 50M images.
The Cifar10Mnist dataset is created from the CIFAR-10 and MNIST data sources. Since the CIFAR-10 training set consists of 50000 images and the MNIST training set contains 60000 digits, the first 50000 MNIST digits are overlaid on the CIFAR-10 images after being made slightly translucent, yielding a first training dataset of 50000 images. The remaining 10000 MNIST digits are then overlaid on 10000 random CIFAR-10 images (with a fixed seed), allowing a second training dataset of 60000 images. For the test set, the 10000 CIFAR-10 test images are combined with the 10000 MNIST test digits in the same way.
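The overlay step described above can be sketched as alpha-blending a grayscale digit onto an RGB image. The translucency factor, placement, and function name below are assumptions for illustration, not the dataset's actual construction code.

```python
import numpy as np

def overlay_digit(cifar_img, mnist_digit, alpha=0.5):
    """Sketch of the Cifar10Mnist construction: blend a translucent
    28x28 MNIST digit onto the top-left of a 32x32 RGB CIFAR-10 image.
    alpha (translucency) and the placement are assumed values."""
    out = cifar_img.astype(np.float32).copy()
    digit_rgb = np.repeat(mnist_digit[:, :, None], 3, axis=2).astype(np.float32)
    # Blend only where the digit has ink, so the background stays visible.
    mask = (mnist_digit > 0)[:, :, None] * alpha
    out[:28, :28] = (1 - mask) * out[:28, :28] + mask * digit_rgb
    return out.astype(np.uint8)

cifar = np.zeros((32, 32, 3), dtype=np.uint8)
digit = np.full((28, 28), 255, dtype=np.uint8)
blended = overlay_digit(cifar, digit)  # digit pixels land at half intensity
```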
TAMPAR is a real-world dataset of parcel photos for tampering detection, with annotations in COCO format. For details, see the paper; for visual samples, see the project page. Features are: