3,275 machine learning datasets
Dataset Introduction
This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It spans multiple categories, from finance and legal documents to software UI elements and handwritten notes, ensuring a broad representation of real-world text appearances. Each video is annotated with frame indexes to facilitate consistent and reproducible OCR benchmarks. Currently, the dataset includes over 25 curated videos, yielding thousands of extracted frames that present a variety of text-related challenges.
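A minimal sketch of how the per-video frame annotations might be consumed, assuming a simple JSON layout (a video path plus a list of annotated frame indexes); this layout is hypothetical, not the dataset's published schema. OpenCV is used to seek to each indexed frame for OCR evaluation.

```python
import json
import cv2

# Hypothetical annotation layout: {"video": "finance_01.mp4", "frames": [12, 88, 240]}
with open("annotations.json") as f:
    ann = json.load(f)

cap = cv2.VideoCapture(ann["video"])
for idx in ann["frames"]:
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the annotated frame index
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(f"frame_{idx:06d}.png", frame)  # save for the OCR benchmark
cap.release()
```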
Tables of blendshapes computed from a group of images in the FER2013 dataset, generated with the MediaPipe library and based on the ARKit face blendshapes. The class of each image is given in a separate column, covering the categories Happy, Unknown, and Sad.
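A sketch of how such a table row could be produced with MediaPipe's FaceLandmarker task; the model asset path and the FER2013 file name are assumptions, and this is not necessarily the pipeline the dataset authors used.

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,  # request ARKit-style blendshape scores
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("fer2013_happy_0001.png")  # hypothetical file name
result = landmarker.detect(image)

# One table row: blendshape name -> score, plus the image class in its own column
row = {c.category_name: c.score for c in result.face_blendshapes[0]}
row["class"] = "Happy"
```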
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in information retrieval and natural language processing; however, improvements in multi-modal techniques and their performance evaluation require large-scale multi-modal data with sufficient diversity. Multi-lingual modeling for tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present the currently largest multi-lingual multi-modal summarization dataset (M3LS), which consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade, spans 20 languages, and targets diversity.
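As a rough illustration of what one M3LS instance contains, here is a hypothetical record structure; the field names are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class M3LSInstance:
    """One multi-modal summarization instance (hypothetical field names)."""
    language: str                                          # one of the 20 BBC languages
    document: str                                          # full article text
    image_paths: list[str] = field(default_factory=list)  # images paired with the article
    summary_text: str = ""                                 # professionally annotated summary
    summary_image: str = ""                                # image part of the multi-modal summary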
Keyword extraction is an integral task for many downstream problems like clustering, recommendation, search, and classification. Development and evaluation of keyword extraction techniques require an exhaustive dataset; however, the community currently lacks large-scale multi-lingual datasets. In this paper, we present MAKED, a large-scale multi-lingual keyword extraction dataset comprising 540K+ news articles from British Broadcasting Corporation News (BBC News) spanning 20 languages. It is the first keyword extraction dataset for 11 of these 20 languages. The quality of the dataset is examined through experiments with several baselines. We believe the proposed dataset will help advance the field of automatic keyword extraction given its size, its diversity in languages, topics, and time periods, and its focus on under-studied languages.
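A minimal TF-IDF keyword extractor of the kind such a dataset can be used to evaluate; this is a generic sketch with scikit-learn, not one of the paper's actual baselines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_keywords(articles, top_k=5):
    """Score words by TF-IDF and return the top-k keywords per article."""
    vec = TfidfVectorizer(max_features=50_000)
    scores = vec.fit_transform(articles)
    vocab = vec.get_feature_names_out()
    results = []
    for row in scores:
        dense = row.toarray().ravel()
        top = dense.argsort()[::-1][:top_k]
        results.append([vocab[i] for i in top if dense[i] > 0])
    return results

print(tfidf_keywords(["The central bank raised interest rates again this quarter."]))
```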
Dataset Description
The dataset used in this study comprises bug reports extracted from the Visual Studio Code GitHub repository, specifically those labeled with the english-please tag. This label indicates that the original submission was written in a language other than English, providing a clear signal for multilingual content. The dataset spans a five-year period (March 2019 to June 2024), ensuring a diverse representation of bug types, user environments, and technical contexts.
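A sketch of how such reports can be retrieved through GitHub's REST search API; the query mirrors the label and date range described above, while pagination and authentication are omitted for brevity.

```python
import requests

query = "repo:microsoft/vscode label:english-please created:2019-03-01..2024-06-30"
resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": query, "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()
for issue in resp.json()["items"]:
    print(issue["number"], issue["title"])
```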
The R1-Onevision dataset is a meticulously crafted resource designed to empower models with advanced multimodal reasoning capabilities. Aimed at bridging the gap between visual and textual understanding, this dataset provides rich, context-aware reasoning tasks across diverse domains, including natural scenes, science, mathematical problems, OCR-based content, and complex charts.
VisCon-100K is a dataset specially designed to facilitate fine-tuning of vision-language models (VLMs) by leveraging interleaved image-text web documents. Derived from 45K web documents of the OBELICS dataset, this release contains 100K image conversation samples. GPT-4V is used to generate image-contextual captions, while OpenChat 3.5 converts these captions into diverse free-form and multiple-choice Q&A pairs. This approach not only focuses on fine-grained visual content but also incorporates the accompanying web context to yield superior performance. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.
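Schematically, the two-stage pipeline looks like the following; call_captioner and call_qa_model are hypothetical stand-ins for GPT-4V (or the trained contextual captioner in VisCon-1M) and OpenChat 3.5.

```python
def call_captioner(image_path, web_context):
    """Stand-in for GPT-4V: caption the image using the surrounding page text."""
    return f"Caption for {image_path}, grounded in context: {web_context[:40]}..."

def call_qa_model(caption):
    """Stand-in for OpenChat 3.5: turn a caption into free-form/MCQ Q&A pairs."""
    return [{"question": "What does the image show?", "answer": caption}]

def build_sample(image_path, web_context):
    caption = call_captioner(image_path, web_context)  # stage 1: contextual caption
    qa_pairs = call_qa_model(caption)                  # stage 2: conversation data
    return {"image": image_path, "caption": caption, "conversations": qa_pairs}

print(build_sample("doc_00042_img_0.jpg", "An article about coastal erosion in ..."))
```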
The TwinSynths dataset is a novel benchmark designed to overcome common limitations found in earlier synthetic image datasets, such as low image quality, inadequate content preservation, and limited class diversity. TwinSynths generates pairs of images where each synthetic image is visually identical to its real counterpart, ensuring that the essential content remains intact while showcasing the unique architectural features of the generative models used. TwinSynths comprises two subsets:
This dataset is a collection of memes drawn from existing datasets, online forums, and freshly scraped content. It contains both global-context and Singapore-context memes in different splits. Each meme has a textual description and a label stating whether it is offensive by the standards of Singaporean society. It can be used to train content moderation models for a culturally complex society.
The HRI Dataset comprises a total of 3,200 image pairs. Each pair comprises a clean background image, a depth image, a rain layer mask image, and a rainy image. It covers three scenes (lane, citystreet, and japanesestreet) at a resolution of 2048×1024. The lane scene contains 1,600 image pairs: 4 camera viewpoints, each with 100 images of different moments, and each moment rendered at 4 rain intensities. The citystreet scene contains 600 image pairs: 6 viewpoints, each with 25 moments, each at 4 intensities. The japanesestreet scene contains 1,000 image pairs: 10 viewpoints, each with 25 moments, each at 4 intensities.
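The scene/viewpoint/moment/intensity breakdown above can be checked with a short enumeration; the counts come straight from the description, while any directory naming would be an assumption.

```python
# (viewpoints, moments per viewpoint); each moment has 4 rain intensities
SCENES = {"lane": (4, 100), "citystreet": (6, 25), "japanesestreet": (10, 25)}

total = 0
for scene, (views, moments) in SCENES.items():
    pairs = views * moments * 4
    print(f"{scene}: {pairs} image pairs")
    total += pairs
assert total == 3200  # matches the stated dataset size
```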
To construct such a dataset, a straightforward approach was to scrape images from the web. The main source for our dataset is ShipSpotting, a repository of user-uploaded images hosting a vast collection of approximately 3 million ship images. Furthermore, each image comes with valuable supplementary information, such as the ship's type and its present and past names. We then collected as many images as possible, since in deep learning the quantity of training data directly influences the quality of the results.
The WORC database consists of 930 patients in total, composed of six datasets gathered at the Erasmus MC, consisting of patients with: 1) well-differentiated liposarcoma or lipoma (115 patients); 2) desmoid-type fibromatosis or extremity soft-tissue sarcomas (203 patients); 3) primary solid liver tumors, either malignant (hepatocellular carcinoma or intrahepatic cholangiocarcinoma) or benign (hepatocellular adenoma or focal nodular hyperplasia) (186 patients); 4) gastrointestinal stromal tumors (GISTs) and intra-abdominal gastrointestinal tumors radiologically resembling GISTs (246 patients); 5) colorectal liver metastases (77 patients); and 6) lung metastases of metastatic melanoma (103 patients). For each patient, a magnetic resonance imaging (MRI) or computed tomography (CT) scan collected from routine clinical care, one or multiple (semi-)automatic lesion segmentations, and ground-truth labels from a gold standard (e.g., pathologically proven) are available. All datasets are
This dataset comprises over 9,000 images captured in the AI2-THOR simulation environment, featuring 69 distinct object classes. It includes variations of certain objects, such as raw bread and cooked bread, to enhance diversity and realism. The images were collected using AI2-THOR's built-in tools and subsequently preprocessed and formatted for compatibility with YOLOv5, making it suitable for object detection tasks in simulated environments.
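A generic sketch of the YOLO-format conversion step, turning pixel-space bounding boxes into the normalized "class x_center y_center width height" label lines YOLOv5 expects; the box source and class mapping here are assumptions, not the dataset's actual preprocessing code.

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space box (x1, y1, x2, y2) to a normalized YOLO label line."""
    cx = (x1 + x2) / 2 / img_w   # normalized box center x
    cy = (y1 + y2) / 2 / img_h   # normalized box center y
    w = (x2 - x1) / img_w        # normalized box width
    h = (y2 - y1) / img_h        # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# e.g. a hypothetical "bread" box detected in a 300x300 AI2-THOR frame
print(to_yolo_line(12, 50, 80, 120, 160, 300, 300))
```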
These datasets, ComCo and SimCo, are designed for evaluating multi-object representation in vision-language models (VLMs). They provide controlled environments for analyzing model biases, object recognition, and compositionality in multi-object scenarios.
Street-view images captured at different timestamps often undergo geometric transformations. To make the VL-CMU-CD dataset more challenging and closer to real-world applications, we generate Unaligned VL-CMU-CD by pairing images with their adjacent neighbors within the same sequence. In this dataset, the adjacent-neighbor distance is set to 2 to ensure a distinct difference from the original VL-CMU-CD.
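The unaligned pairing rule is simple to state in code; the following sketch shows the distance-2 pairing over one sequence, with illustrative file names.

```python
def make_unaligned_pairs(sequence, distance=2):
    """Pair each image with its neighbor `distance` steps ahead in the sequence."""
    return [(sequence[i], sequence[i + distance])
            for i in range(len(sequence) - distance)]

frames = ["t0.png", "t1.png", "t2.png", "t3.png", "t4.png"]
print(make_unaligned_pairs(frames))  # [('t0.png', 't2.png'), ('t1.png', 't3.png'), ...]
```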
SOTIF-PCOD is a dataset generated using the CARLA simulator, specifically designed for Safety of the Intended Functionality (SOTIF) research. It consists of 547 frames of LiDAR point cloud data formatted in the KITTI standard, representing a single SOTIF-related use case.
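KITTI-format LiDAR frames are flat float32 binaries of (x, y, z, intensity) tuples, so a frame can be loaded as shown below; the file name is illustrative.

```python
import numpy as np

# Each KITTI .bin file is a flat array of float32 values, 4 per point
points = np.fromfile("000000.bin", dtype=np.float32).reshape(-1, 4)
xyz, intensity = points[:, :3], points[:, 3]
print(points.shape)  # (N, 4): x, y, z, reflectance
```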
The deepfake face detection task takes a facial image of unknown authenticity as the test input. While most deepfake detection methods take only the image as input, the literature demonstrates that conditioning the deepfake detector on identity (i.e., knowing whose deepfake face the picture might be) can enhance detection performance. Existing deepfake detection datasets, such as FaceForensics++ and DFDC, do not include identity information for authentic and deepfake faces. This dataset contains facial images of 45 specific individuals, divided into train and test sets, with a total of 23k authentic and 22k deepfake images. Having a specific individual's images in both the train and test sets allows detection performance to be assessed per individual. The dataset is curated so that the train and test sets come from two independent sources: the train images are curated from the CelebDFv2 dataset, and the test images from the CACD dataset. Deepfake faces are generated using
It is a large-scale multimodal patent dataset with detailed captions for design patent figures.
This dataset consists of annotated images and videos of smoke from prescribed burning events in Finnish boreal forests. It was created to train and validate learning-based methods for wildfire detection and smoke segmentation, and its effectiveness for this purpose was demonstrated in the linked studies.