1,019 machine learning datasets
MuscleMap136 is a dataset for video-based Activated Muscle Group Estimation (AMGE), which aims to identify the muscular regions activated while a human performs a specific activity. Video-based AMGE is an important yet overlooked problem. To this end, the MuscleMap136 dataset features 15K video clips covering 136 different activities and 20 labeled muscle groups.
In this dataset, a UR5 robot used 6 tools (metal-scissor, metal-whisk, plastic-knife, plastic-spoon, wooden-chopstick, and wooden-fork) to perform 6 behaviors (look, stirring-slow, stirring-fast, stirring-twist, whisk, and poke). The robot explored 15 objects kept in cylindrical containers: cane-sugar, chia-seed, chickpea, detergent, empty, glass-bead, kidney-bean, metal-nut-bolt, plastic-bead, salt, split-green-pea, styrofoam-bead, water, wheat, and wooden-button. The robot performed 10 trials on each object with each tool-behavior combination, resulting in 5,400 interactions (6 tools x 6 behaviors x 15 objects x 10 trials). While interacting with the objects, the robot recorded multiple sensory modalities: audio, RGB images, depth images, haptics, and touch images.
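As a quick check of the interaction count, here is a minimal Python sketch that enumerates the tool x behavior x object x trial grid; the dictionary layout is an illustrative assumption, not the dataset's actual on-disk format.

```python
from itertools import product

# Tools, behaviors, and objects as listed above.
tools = ["metal-scissor", "metal-whisk", "plastic-knife",
         "plastic-spoon", "wooden-chopstick", "wooden-fork"]
behaviors = ["look", "stirring-slow", "stirring-fast",
             "stirring-twist", "whisk", "poke"]
objects = ["cane-sugar", "chia-seed", "chickpea", "detergent", "empty",
           "glass-bead", "kidney-bean", "metal-nut-bolt", "plastic-bead",
           "salt", "split-green-pea", "styrofoam-bead", "water", "wheat",
           "wooden-button"]
trials_per_combination = 10

# Every (tool, behavior, object, trial) combination is one recorded interaction.
interactions = [
    {"tool": t, "behavior": b, "object": o, "trial": k}
    for t, b, o in product(tools, behaviors, objects)
    for k in range(trials_per_combination)
]
assert len(interactions) == 6 * 6 * 15 * 10 == 5400
```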
A real-world stereo video dataset containing 1,200 frame pairs with real-world color and sharpness mismatches caused by a beam splitter.
The laparoscopic surgery dataset is associated with our International Journal of Computer Assisted Radiology and Surgery (IJCARS) publication titled “DeSmoke-LAP: Improved Unpaired Image-to-Image Translation for Desmoking in Laparoscopic Surgery”. The training model of the proposed method is available as open source on GitHub. We propose DeSmoke-LAP, a new method for removing smoke from real robotic laparoscopic hysterectomy videos. The proposed method is based on an unpaired image-to-image cycle-consistent generative adversarial network into which two novel loss functions, namely inter-channel discrepancies and dark channel prior, are incorporated.
BEAR (Benchmark on video Action Recognition) is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications.
VR-Folding contains garment meshes of 4 categories from the CLOTH3D dataset, namely Shirt, Pants, Top, and Skirt. For the flattening task, there are 5,871 videos containing 585K frames in total. For the folding task, there are 3,896 videos containing 204K frames in total. The data for each frame include multi-view RGB-D images, object masks, full garment meshes, and hand poses.
ARKitTrack is a new RGB-D tracking dataset for both static and dynamic scenes, captured with the consumer-grade LiDAR scanners built into Apple's iPhone and iPad. ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total. The dataset provides 123.9K pixel-level target masks along with bounding box annotations and frame-level attributes.
The IAW dataset contains 420 IKEA furniture pieces from 14 common categories, e.g., sofa, bed, wardrobe, table, etc. Each piece of furniture comes with one or more user instruction manuals, which are first divided into pages and then further divided into independent steps cropped from each page (some pages contain more than one step and some pages do not contain instructions). There are 8,568 pages and 8,263 steps overall, on average 20.4 pages and 19.7 steps per piece of furniture. We crawled YouTube to find videos corresponding to these instruction manuals, so the conditions in the videos are diverse in many aspects, e.g., duration, resolution, first- or third-person view, camera pose, background environment, number of assemblers, etc. The IAW dataset contains 1,005 raw videos with a total length of around 183 hours. Among them, approximately 114 hours of content are labeled with 15,649 actions matched to the corresponding step in the corresponding manual.
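The stated per-piece averages follow directly from the totals; a back-of-the-envelope check (variable names are ours):

```python
furniture_pieces = 420
pages, steps = 8568, 8263

print(round(pages / furniture_pieces, 1))  # 20.4 pages per piece
print(round(steps / furniture_pieces, 1))  # 19.7 steps per piece
```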
Please see our website and code repository for a detailed description.
Existing benchmarks for real-world distribution shifts are generally generated synthetically, via augmentations that simulate shifts such as weather and camera rotation. The UCF101-DS dataset instead consists of real-world distribution shifts from user-generated videos without synthetic augmentation. It contains videos for 47 UCF-101 classes with 63 different distribution shifts grouped into 15 categories: 536 unique videos in total, split into 4,708 clips, each 7 to 10 seconds long.
We introduce Slovo, a large-scale video dataset for the Russian Sign Language recognition task. The Slovo dataset is about 16 GB in size and contains 20,400 RGB videos of 1,000 sign-language gestures from 194 signers. Each class has 20 samples. The dataset is divided into a training set and a test set by subject user_id: the training set includes 15,300 videos and the test set includes 5,100 videos. The total video recording time is ~9.2 hours. About 35% of the videos are recorded in HD format and 65% in FullHD resolution. The average length of a video with a gesture is 50 frames.
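A minimal sketch of the subject-wise split described above, assuming the annotations are available as a table with video path, gesture class, signer user_id, and split columns (the file name and column names are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical annotation table; column names are assumptions for illustration.
annotations = pd.read_csv("slovo_annotations.csv")

train = annotations[annotations["split"] == "train"]
test = annotations[annotations["split"] == "test"]

# The split is by subject, so no signer should appear in both sets.
assert set(train["user_id"]).isdisjoint(set(test["user_id"]))

print(len(train), len(test))  # expected 15300 and 5100 per the description
```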
We consider the task of temporal human action localization in lifestyle vlogs. We introduce a novel dataset consisting of manual annotations of temporal localization for 13,000 narrated actions in 1,200 video clips. We present an extensive analysis of this data, which allows us to better understand how the language and visual modalities interact throughout the videos. We propose a simple yet effective method to localize the narrated actions based on their expected duration. Through several experiments and analyses, we show that our method brings complementary information with respect to previous methods and leads to improvements over previous work for the task of temporal action localization.
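As an illustration of the duration-based idea, here is a minimal sketch under our own assumptions (not the authors' exact method): given the time at which an action is narrated and an expected duration for that action, predict a temporal window centered on the narration time.

```python
def localize_by_expected_duration(narration_time, expected_duration, clip_length):
    """Predict a (start, end) window for a narrated action, centered on the
    second at which the action is mentioned in the narration."""
    start = max(0.0, narration_time - expected_duration / 2)
    end = min(clip_length, narration_time + expected_duration / 2)
    return start, end

# Example: an action narrated at t=42s with an expected duration of 8s
print(localize_by_expected_duration(42.0, 8.0, clip_length=60.0))  # (38.0, 46.0)
```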
CN-Celeb-AV is a multi-genre audio-visual person recognition (AVPR) dataset collected 'in the wild'. The dataset contains more than 420k video segments of 1,136 persons, gathered from public media.
MultiSum is a dataset for multimodal summarization with multimodal output (MSMO). It consists of 17 categories and 170 subcategories that encapsulate a diverse array of real-world scenarios.
ChinaOpen is a new video dataset targeted at open-world multimodal learning, with raw data gathered from Bilibili, a popular Chinese video-sharing website. The dataset has a large webly-annotated training set of videos (associated with user-generated titles and tags) and a smaller manually annotated test set of videos (with manually checked user titles/tags, manually written captions, and manual labels describing the visual objects, actions, and scenes shown in the visual content).
MSVD-Indonesian is derived from the MSVD dataset with the help of a machine translation service. The dataset can be used for multimodal video-text tasks, including text-to-video retrieval, video-to-text retrieval, and video captioning. Like the original English dataset, MSVD-Indonesian contains about 80k video-text pairs.
PTVD is a plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training.
VFD-2000 is a video fight detection dataset containing more than 2,000 videos collected from YouTube. Specific scenarios are searched using “fight” as a keyword, for example, “street fight”, “beach fight”, and “violence in the restaurant”. 200 videos under 20 different scenes are collected.
Replay is a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. The full Replay dataset consists of 68 scenes of social interactions between people, such as playing board games, exercising, or unwrapping presents. Each scene is about 5 minutes long and filmed with 12 cameras, static and dynamic. Audio is captured separately by 12 binaural microphones and additional near-range microphones for each actor and for each egocentric video. All sensors are temporally synchronized, undistorted, geometrically calibrated, and color calibrated.
The "Microbundle Time-lapse Dataset" contains 24 experimental time-lapse images of cardiac microbundles using three distinct types of experimental testbed of beating lab grown hiPSC-based cardiac microbundles. Of the 24 experimental time-lapse images, 23 examples are brightfield videos, and a single example is a phase contrast video. We categorize the different experimental testbeds into 3 types, where "Type 1" includes movies obtained from standard experimental microbundle platforms termed microbundle strain gauges [1,2,3]. We refer to data collected from non-standard platforms termed FibroTUGs [4] as "Type 2" data, and "Type 3" data represents a highly versatile and diverse nanofabricated experimental platform [5,6].