19,997 machine learning datasets
Given a question in an Indic language and a passage in English, generate a short answer span. We provide both an English and target language answer span in the annotations.
BlendMimic3D is a pioneering synthetic dataset developed using Blender, designed to enhance Human Pose Estimation (HPE) research. This dataset features diverse scenarios including self-occlusions, object-based occlusions, and out-of-frame occlusions, tailored for the development and testing of advanced HPE models.
This dataset contains pre- and post-destruction images, along with segmentation labels for the test images.
This data set comes from breast cancer tissue samples deposited to The Cancer Genome Atlas (TCGA) project. TCGA contains data on tumour samples that were assayed on several platforms; this data set compiles results obtained using Agilent mRNA expression microarrays.
Two datasets featuring binary and multi-class classification. The datasets were first introduced by K. Lang [1]. They can, for instance, be accessed at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
From the official description:
Sakuga-42M is a large-scale hand-drawn cartoon video dataset for academic research purposes. It comprises 42 million cartoon keyframes covering various artistic styles, regions, and years, with comprehensive semantic annotations including video-text description pairs, anime tags, content taxonomies, etc. The dataset is intended to support researchers in their exploration of more effective and practical solutions for creating cartoons.
The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database by using the provided import script. They are intended both for scientific use by corpus linguists and for applications such as knowledge extraction programs. The corpora are identical in format and similar in size and content. They contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign-language material were removed. Because word co-occurrence information is useful for many applications, these data are precomputed and included as well. For each word, the most significant words appearing as immediate left or right neighbor or appearing anywhere within the
The dataset used in the experiments in the paper "Modeling citation worthiness by using attention‑based bidirectional long short‑term memory networks and interpretable models".
OpenStreetView-5M establishes a new open benchmark for geolocation by providing a large, open, and clean dataset. As detailed below, OpenStreetView-5M improves upon several limitations of current geolocation datasets.
The Natural Scenes Dataset (NSD) is a large-scale fMRI dataset acquired at ultra-high-field (7T) strength at the Center for Magnetic Resonance Research (CMRR) at the University of Minnesota. The dataset consists of whole-brain, high-resolution (1.8-mm isotropic, 1.6-s sampling rate) fMRI measurements of 8 healthy adult subjects while they viewed thousands of color natural scenes over the course of 30–40 scan sessions. While viewing these images, subjects were engaged in a continuous recognition task in which they reported whether they had seen each given image at any point in the experiment. These data constitute a massive benchmark dataset for computational models of visual representation and cognition, and can support a wide range of scientific inquiry.
Low-light images and video footage often exhibit issues due to the interplay of various parameters such as aperture, shutter speed, and ISO settings. These interactions can lead to distortions, especially in extreme lighting conditions. This distortion is primarily caused by the inverse relationship between decreasing light intensity and increasing photon noise, which gets amplified with higher sensor gain. Additionally, secondary characteristics like white balance and color effects can also be adversely affected and may require post-processing correction. These distortions not only impact the perceived quality of the images but also pose significant challenges for machine learning tasks, including classification and object detection. This is particularly evident when considering the susceptibility of deep learning networks to adversarial examples.
Appearance-based gaze estimation systems have shown great progress recently, yet the performance of these techniques depends on the datasets used for training. Most of the existing gaze estimation datasets set up in interactive settings were recorded in laboratory conditions, and those recorded in in-the-wild conditions display limited head pose and illumination variations. Further, we observed little attention so far towards precision evaluations of existing gaze estimation approaches. In this work, we present a large gaze estimation dataset, PARKS-Gaze, with wider head pose and illumination variation and with multiple samples for a single Point of Gaze (PoG). The dataset contains 974 minutes of data from 28 participants with a head pose range of ±60° in both yaw and pitch directions. Our within-dataset and cross-dataset evaluations and precision evaluations indicate that the proposed dataset is more challenging and enables models to generalize on unseen participants better than the existing
Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset
The DARK FACE dataset provides 6,000 real-world low-light images captured during the nighttime at teaching buildings, streets, bridges, overpasses, parks, etc., all labeled with bounding boxes of human faces, as the main training and/or validation sets. We also provide 9,000 unlabeled low-light images collected from the same setting. Additionally, we provide a unique set of 789 paired low-light/normal-light images captured in controllable real lighting conditions (but not necessarily containing faces), which can be used as part of the training data at the participants' discretion. There will be a hold-out testing set of 4,000 low-light images, with human face bounding boxes annotated.
To support the machine learning (ML) community in developing a time-cost-effective diagnostic assistant, we collaborate with ED clinicians to curate a benchmark, called MIMIC-ED-Assist, that is derived from MIMIC-IV and MIMIC-ED. MIMIC-ED-Assist is designed to test the ability of AI systems to provide both accurate and time-cost-saving laboratory recommendations. Our benchmark consists of two prediction targets identified by our clinical collaborators to reflect patient risk: critical outcomes, which include patient death and ICU transfer, and lengthened ED stay, defined as ED LOS exceeding 24 hours. Accurately identifying patients at high risk of these outcomes reduces time-cost by allowing clinicians to perform timely interventions and efficiently allocate resources. MIMIC-ED-Assist mirrors real-world ED practices by grouping individual laboratory tests into commonly performed groups, e.g., complete blood count (CBC). MIMIC-ED-Assist then tests AI systems on their ability to recomme
A general multi-turn dialogue evaluation dataset with nine topics. Each topic has five representative cases, resulting in a comprehensive evaluation dataset of 45 cases.
The training subset consists of 15 robotic nephrectomy procedures captured on the da Vinci X or Xi system. There are 149 frames per video sequence, and the dimension of each frame is 1280x1024. Segmentation annotations are provided with 10 different classes, including instruments, kidneys, and other objects in the surgical scenario. The main differences from the 2017 instrument segmentation dataset are the annotation of kidney parenchyma, surgical objects such as suturing needles, suturing thread clips, and additional instruments. We annotated the graphical representation of the interaction between the surgical instruments and the defective tissue in the surgical scene with the help of our clinical expertise with the da Vinci Xi robotic system. We also delineate the bounding box to identify all the surgical objects. Kidney and instruments are represented as nodes and active edges annotated as the interaction class in the graph. In total, 12 kinds of interactions were identified to generat
Im4Sketch is a large-scale dataset with a shape-oriented set of classes for image-to-sketch generalization. It consists of a collection of natural images from 874 categories for training and validation, and sketches from 393 categories (a subset of the natural image categories) for testing.