1,019 machine learning datasets
Metaphorics is a newly introduced non-contextual skeleton action dataset. All skeleton-based human action recognition datasets introduced so far have categories based only on verb-based actions.
Consists of manually annotated dangerous and non-dangerous Kiki challenge videos.
A large, annotated video dataset of mice performing a sequence of actions. The dataset was collected and labeled by experts for the purpose of neuroscience research.
Contains about 1,000 videos from 10 queries and their video tags, manual annotations, and associated web images.
A dataset collected in a set of experiments that involves human participants and a robot.
A large-scale Japanese video caption dataset consisting of 79,822 videos and 399,233 captions. Each caption in the dataset describes a video in the form of "who does what and where."
The dataset is collected from YouTube videos containing fight instances; some non-fight sequences from regular surveillance camera videos are also included. There are 300 videos in total (150 fight + 150 non-fight). Each video is 2 seconds long, and only the fight-related parts are included in the samples.
The dataset is composed of 100 video sequences densely annotated with 60K bounding boxes, 17 sequence attributes, 13 action verb attributes and 29 target object attributes.
Contains three difficult real-world scenarios: uncontrolled videos taken by UAVs and manned gliders, as well as controlled videos taken on the ground. Over 160,000 annotated frames for hundreds of ImageNet classes are available and are used for baseline experiments that assess the impact of known and unknown image artifacts and other conditions on common deep learning-based object classification approaches.
The dataset consists of 90,000 grayscale videos showing two objects of equal shape and size, in which one object approaches the other. The speed of the approaching object is modelled by a proportional-derivative (PD) controller. Three different shapes (rectangle, triangle, and circle) are provided. The initial configuration of the objects, such as position and color, was randomly sampled. Unlike the Moving MNIST dataset, the samples comprise a goal-oriented task: one object has to fully cover the other rather than move randomly, making the dataset better suited for testing the prediction capabilities of an ML model. For instance, it can serve as a toy dataset to investigate the capacity and output behavior of a deep neural network before testing it on real-world data.
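To illustrate the goal-oriented motion described above, here is a minimal NumPy sketch of a proportional-derivative controller driving one object's position toward another's. The gains `kp` and `kd`, step count, and positions are illustrative assumptions, not the dataset's actual generation parameters.

```python
import numpy as np

def pd_approach(start, target, kp=0.5, kd=0.2, steps=50):
    """Move an object from `start` toward `target` with a PD controller.

    The velocity at each step combines the current position error
    (proportional term) and its rate of change (derivative term),
    so the object decelerates smoothly as it closes in.
    """
    pos = np.array(start, dtype=float)
    tgt = np.array(target, dtype=float)
    prev_err = tgt - pos
    trajectory = [pos.copy()]
    for _ in range(steps):
        err = tgt - pos
        velocity = kp * err + kd * (err - prev_err)  # P term + D term
        pos += velocity
        prev_err = err
        trajectory.append(pos.copy())
    return np.stack(trajectory)

traj = pd_approach(start=(0.0, 0.0), target=(10.0, 5.0))
print(np.linalg.norm(traj[-1] - np.array([10.0, 5.0])) < 0.1)  # True
```

With these gains the position error contracts geometrically, so after 50 steps the moving object has effectively covered the target, mirroring the dataset's covering task.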
The i3-video dataset contains "is-it-instructional" annotations for 6.4k videos from Youtube-8M. Videos are considered instructional if they focus on real-world human actions accompanied by procedural language that explains what is happening on screen in reasonable detail.
Indian Institute of Science VIdeo Naturalness Evaluation (IISc VINE) is a database consisting of 300 videos, obtained by applying different prediction models on different datasets, and accompanying human opinion scores.
CASR is a dataset for cyclist arm signal recognition in videos. It contains 219 annotated arm signal actions on videos of approximately 10 seconds each, containing one or two actions per video.
This dataset consists of a number of sequences that were recorded with a VGA (640x480) event camera (Samsung DVS Gen3) and a conventional RGB camera (Huawei P20 Pro) placed on the windshield of a car driving through Zurich.
MIRACL-VC1 is a lip-reading dataset including both depth and color images. It can be used for diverse research fields like visual speech recognition, face detection, and biometrics. Fifteen speakers (five men and ten women) were positioned in the frustum of an MS Kinect sensor and uttered, ten times each, a set of ten words and ten phrases (see the table below). Each instance of the dataset consists of a synchronized sequence of color and depth images (both 640x480 pixels). The MIRACL-VC1 dataset contains a total of 3,000 instances.
A medium-scale synthetic 4D light field video dataset for depth (disparity) estimation, derived from the open-source movie Sintel. The dataset consists of 24 synthetic 4D LFVs with 1,024x436 pixels, 9x9 views, and 20–50 frames, and provides ground-truth disparity values, so it can be used for training deep learning-based methods. Each scene was rendered with a clean pass after modifying the production file of Sintel with reference to the MPI Sintel dataset.
A dataset for flying honeybee detection introduced in "A Method for Detection of Small Moving Objects in UAV Videos".
Synthetic Actors and Real Actions (SARA) is a 3D motion dataset for training a model to produce motion embeddings suitable for reasoning about motion similarity.
Motion similarity annotations for NTU RGB+D 120 dataset to evaluate motion similarity in the real world.
The POTUS Corpus is a Database of Weekly Addresses for the Study of Stance in Politics and Virtual Agents.