486 machine learning datasets
The TUT Sound Events 2018 dataset consists of real-life first-order Ambisonic (FOA) format recordings with stationary point sources, each associated with a spatial coordinate. The dataset was generated by collecting impulse responses (IRs) from a real environment using the Eigenmike spherical microphone array. The measurement was done by slowly moving a Genelec G Two loudspeaker, continuously playing a maximum length sequence, around the array in a circular trajectory at one elevation at a time. The playback volume was set 30 dB above the ambient sound level. The recording was done during working hours in a university corridor surrounded by classrooms. The IRs were collected at elevations from −40° to 40° in 10° increments at 1 m from the Eigenmike, and at elevations from −20° to 20° in 10° increments at 2 m.
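As a minimal illustration of the measurement grid described above, the following sketch (not part of any official tooling) simply enumerates the elevation/distance combinations at which the IRs were collected:

```python
# Illustrative only: the IR measurement grid described in the entry above.
# Elevations in 10-degree steps, at two distances from the Eigenmike.
ir_positions = [
    {"distance_m": 1.0, "elevation_deg": e} for e in range(-40, 41, 10)   # 9 elevations at 1 m
] + [
    {"distance_m": 2.0, "elevation_deg": e} for e in range(-20, 21, 10)   # 5 elevations at 2 m
]

for pos in ir_positions:
    print(pos)
```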
DementiaBank is a shared database of multimedia interactions for the study of communication in dementia. The dataset contains recordings of 117 people diagnosed with Alzheimer's Disease and 93 healthy people, each reading a description of an image. The principal benchmark task is to classify speakers into these two groups.
Yesno is an audio dataset consisting of 60 recordings of one individual saying yes or no in Hebrew; each recording is eight words long. It was created for the Kaldi audio project by an author who wishes to remain anonymous.
The Blind Audio-Visual Localization (BAVL) Dataset consists of 20 audio-visual recordings of sound sources, which can be talking faces or musical instruments. Nineteen of the recordings are videos from YouTube; the exception is V8, which comes from [1]. In addition, video V7 was also used in [2], and V16 in [3]. All 20 videos were annotated by us in a uniform manner. Details of the video sequences are listed in Table 1.
The Persian Consonant Vowel Combination (PCVC) dataset is a phoneme-based speech dataset and the first free Persian speech dataset, created to support Persian speech researchers. The dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of consonants and vowels (138 samples per speaker), each 30,000 data samples long. All speech samples have a sample rate of 48,000 Hz, i.e., 48,000 sound samples per second. In each sample, the sound starts with the consonant, followed by the vowel, and ends with silence; the length of the silence depends on the length of the consonant-vowel combination. For example, if the combination ends at the 20,000th data sample, the remaining 10,000 samples (up to the 30,000-sample length) are silence.
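A minimal sketch of how one might separate the consonant-vowel portion from the trailing silence in a fixed-length PCVC recording, assuming mono WAV files and using a naive amplitude threshold; the filename, threshold, and `soundfile` dependency are assumptions, not part of the dataset's official tooling:

```python
import numpy as np
import soundfile as sf  # assumed dependency; any WAV reader would do

SAMPLE_RATE = 48_000   # 48 kHz, as stated above
SAMPLE_LEN = 30_000    # fixed length of every PCVC recording

def split_cv_and_silence(path, silence_threshold=1e-3):
    """Rough split of a PCVC sample into the consonant+vowel part and
    the trailing silence, using a naive amplitude threshold."""
    audio, sr = sf.read(path)          # expects a mono recording
    assert sr == SAMPLE_RATE and len(audio) == SAMPLE_LEN
    # index just past the last sample whose amplitude exceeds the threshold
    voiced = np.nonzero(np.abs(audio) > silence_threshold)[0]
    end = voiced[-1] + 1 if voiced.size else 0
    return audio[:end], audio[end:]    # (consonant+vowel, silence)

cv, silence = split_cv_and_silence("pcvc_speaker01_ba.wav")  # hypothetical filename
print(len(cv), len(silence))
```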
The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recorded speech at the phoneme level. The annotations include word stress marks on the individual phonemes. The corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. It was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized using this corpus has produced a high-quality, natural voice.
The MIVIA audio events dataset comprises a total of 6,000 events for surveillance applications, namely glass breaking, gunshots, and screams. The 6,000 events are divided into a training set (4,200 events) and a test set (1,800 events).
The dataset includes recordings from 10 different users teaching the robot common kitchen objects. It consists of synchronized recordings from three cameras and a microphone mounted on the robot.
The MOBIO database consists of bi-modal (audio and video) data taken from 152 people. The database has a female-to-male ratio of nearly 1:2 (100 males and 52 females) and was collected from August 2008 until July 2010 at six different sites in five different countries. This led to a diverse bi-modal database with both native and non-native English speakers.
The Deeply Vocal Characterizer is a human nonverbal vocalization dataset. This sample dataset consists of about 0.6 hours (56.7 hours in the full set) of audio (16 kHz, 16-bit, mono) across 16 human nonverbal vocalization classes, including throat-clearing, coughing, laughing, panting, etc. The audio content was crowdsourced from the general public in South Korea.
MAEC is a new, large-scale, multimodal, text-audio paired earnings-call dataset based on S&P 1500 companies.
This dataset was created for a Fongbe automatic speech recognition task and contains about 3,979 recordings of 13 participants reading text written in Fongbe, one sentence at a time. Fongbe is a vernacular language spoken mainly in Benin by more than 50% of the population, and to a lesser extent in Togo and Nigeria. It is under-resourced: it lacks linguistic resources (speech corpora and text data), and very few websites provide textual data. In this dataset, each example contains an audio file and the associated text. The audio is high-quality (16-bit, 16 kHz), recorded using an Android app that we built for this purpose. The dataset is multi-speaker, containing recordings from 13 volunteers (male and female).
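A minimal sketch of iterating over the (audio, text) pairs described above, assuming each example is a WAV file with a same-named plain-text transcript alongside it; the directory layout and file naming are assumptions, and the actual dataset packaging may differ:

```python
import os
import wave

DATA_DIR = "fongbe_asr"  # hypothetical directory containing .wav/.txt pairs

def load_examples(data_dir=DATA_DIR):
    """Yield (raw_audio_bytes, transcript) pairs, checking the stated
    16 kHz / 16-bit audio format along the way."""
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".wav"):
            continue
        wav_path = os.path.join(data_dir, name)
        txt_path = wav_path[:-4] + ".txt"
        with wave.open(wav_path, "rb") as w:
            assert w.getframerate() == 16_000 and w.getsampwidth() == 2  # 16 kHz, 16-bit
            audio = w.readframes(w.getnframes())
        with open(txt_path, encoding="utf-8") as f:
            text = f.read().strip()
        yield audio, text
```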
STVD-FC is the largest public dataset for political content analysis and fact-checking tasks. It consists of more than 1,200 fact-checked claims scraped from a fact-checking service, with associated metadata. On the video side, the dataset contains nearly 6,730 TV programs with metadata, totalling 6,540 hours. These programs were collected during the 2022 French presidential election with a dedicated workstation and protocol. The dataset is delivered in several parts, with proper indexes, to keep its 2 TB of data accessible. More information about the STVD-FC dataset can be found in the publication [1].
This open-source dataset consists of 5.04 hours of transcribed English conversational speech beyond telephony, comprising 13 conversations.
This is the subset of audio samples from the AudioSet ontology that are licensed under Creative Commons. The set contains approximately 10,000 samples of 10-second clips and is freely modifiable and distributable. Each clip comes with its full label set and a unique ID.
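A small illustrative record for one clip, reflecting the per-clip information described above (unique ID, label set, audio); the field names and example values are assumptions, not the dataset's official schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AudioSetCCClip:
    """Illustrative record for one ~10-second clip."""
    clip_id: str                                       # unique ID
    labels: List[str] = field(default_factory=list)    # full label set
    audio_path: str = ""                               # path to the clip's audio file

example = AudioSetCCClip(
    clip_id="clip_000001",                 # hypothetical ID
    labels=["Speech", "Music"],            # hypothetical labels from the ontology
    audio_path="clips/clip_000001.wav",    # hypothetical path
)
print(example)
```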
The MCCS dataset is the first large-scale Mandarin Chinese Cued Speech dataset. It covers 23 major categories of scenarios (e.g., communication, transportation, and shopping) and 72 subcategories of scenarios (e.g., meeting, dating, and introduction). It was recorded by four skilled native Mandarin Chinese Cued Speech cuers using portable cameras on mobile phones. The Cued Speech videos are recorded at 30 fps with a resolution of 1280x720. We provide the raw Cued Speech videos, a text file (with 1,000 sentences), and the corresponding annotations, which contain two kinds of data annotation: continuous video annotations made with ELAN and discrete audio annotations made with Praat.
The volumetric representation of human interactions is one of the fundamental domains in the development of immersive media productions and telecommunication applications. Particularly in the context of the rapid advancement of Extended Reality (XR) applications, volumetric data has proven to be an essential technology for future XR development. In this work, we present a new multimodal database to help advance the development of immersive technologies. Our proposed database provides ethically compliant and diverse volumetric data, in particular 27 participants displaying posed facial expressions and subtle body movements while speaking, plus 11 participants wearing head-mounted displays (HMDs). The recording system consists of a volumetric capture (VoCap) studio, including 31 synchronized modules with 62 RGB cameras and 31 depth cameras. In addition to textured meshes, point clouds, and multi-view RGB-D data, we use one Lytro Illum camera for providing light field (LF) data simultaneously.
Mudestreda is a Multimodal Device State Recognition Dataset obtained from a real industrial milling device, with time-series and image data for classification, regression, anomaly detection, remaining useful life (RUL) estimation, signal drift measurement, zero-shot flank tool wear, and feature engineering purposes.
The Saarbruecken Voice Database contains voice and EGG recordings of patients diagnosed with voice disorders, as well as of healthy persons.
This publicly available data is synthesised audio for woodwind quartets, including renderings of each instrument in isolation. The data was created to be used as training data within Cadenza's second open machine learning challenge (CAD2) for the task of rebalancing classical music ensembles. The dataset is also intended for developing other music information retrieval (MIR) algorithms using machine learning. It was created because of the lack of large-scale datasets of classical woodwind music with separate audio for each instrument and a permissive license for reuse. Music scores were selected from the OpenScore String Quartet corpus. These were rendered for two woodwind ensembles of (i) flute, oboe, clarinet, and bassoon; and (ii) flute, oboe, alto saxophone, and bassoon. This was done by a professional music producer using industry-standard software. Virtual instruments were used to create the audio for each instrument using software that interpreted expression markings in the score.