486 machine learning datasets
35 recordings of Candombe music with beat and downbeat annotations.
S. W. Hainsworth and M. D. Macleod, “Particle filtering applied to musical tempo tracking,” EURASIP Journal on Advances in Signal Processing, vol. 2004, pp. 1–11, 2004.
Beats, downbeats, and functional structural annotations for 912 Pop tracks.
J. Hockman, M. E. Davies, and I. Fujinaga, “One in the jungle: Downbeat detection in hardcore, jungle, and drum and bass.” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2012.
Eremenko, E. Demirel, B. Bozkurt, and X. Serra, “Audio-aligned jazz harmony dataset for automatic chord transcription and corpus-based research,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018.
F. Gouyon, “A computational approach to rhythm description — Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing,” Ph.D. dissertation, Universitat Pompeu Fabra, 2006.
A. Holzapfel, M. E. Davies, J. R. Zapata, J. L. Oliveira, and F. Gouyon, “Selective sampling for beat tracking evaluation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2539–2548, 2012.
J. Driedger, H. Schreiber, W. B. de Haas, and M. Müller, “Towards automatically correcting tapped beat annotations for music recordings,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2019.
A novel audio-visual mouse saliency (AViMoS) dataset with the following key features:
A new large-scale, in-the-wild Mandarin dataset, CAS-VSR-S101, with 101.1 hours of data. The videos are sourced from broadcast news and conversational programs in Chinese, covering a highly diverse set of topics, speakers, and filming conditions. Utterance lengths are naturally distributed between 0.01 s and 10.57 s, and image quality and resolution vary. News accounts for 82.4% of the programs; 70.4% of the utterances depict news anchors, hosts, and correspondents, while 29.6% are those of interviewees and guests. In addition, male and female appearances are relatively balanced, at a ratio of approximately 1.5 : 1. The dataset is divided into train, validation, and test sets by TV channel to minimize speaker overlap, at a ratio of roughly 8 : 1 : 1.5 in terms of duration. The validation and test sets are composed of programs broadcast on provincial TV channels. The dataset is available for academic use under a license.
Characterising multimedia content with relevant, reliable, and discriminating tags is vital for multimedia information retrieval. With the rapid expansion of digital multimedia content, alternative methods to existing explicit tagging are needed to enrich the pool of tagged content. Currently, social media websites encourage users to tag their content. However, the users’ intent when tagging multimedia content does not always match information retrieval goals. A large portion of user-defined tags are either motivated by increasing a user’s popularity and reputation in an online community or based on individual and egoistic judgments. Moreover, users do not evaluate media content on the same criteria. Some might tag multimedia content with words that express their emotions, while others might use tags to describe the content. For example, a picture may receive different tags based on the objects in the image, the camera with which the picture was taken, or the emotion a user felt while looking at it.
Many research articles have explored the impact of surgical interventions on voice and speech evaluations, but advances are limited by the lack of publicly accessible datasets. To address this, a comprehensive corpus of 107 Castilian Spanish speakers was recorded, including control speakers and patients who underwent upper airway surgeries such as tonsillectomy, functional endoscopic sinus surgery, and septoplasty. The dataset contains 3,800 audio files, averaging 35.51 ± 5.91 recordings per patient. This resource enables systematic investigation of the effects of upper respiratory tract surgery on voice and speech. Previous studies using this corpus have shown no relevant changes in key acoustic parameters for sustained vowel phonation, consistent with initial hypotheses. However, the analysis of speech recordings, particularly nasalised segments, remains open for further research. Additionally, this dataset facilitates the study of the impact of upper airway surgery on speaker recognition.
A temporal dataset for indoor and in-vehicle thermal comfort estimation. Thermal comfort estimation is essential for enhancing user experience in static indoor environments and dynamic in-vehicle scenarios. While traditional datasets focus on buildings, their application to fast-changing conditions, such as in vehicles, remains unexplored. We address this gap by introducing two temporal datasets collected from (1) a self-built climatic chamber with 31 sensor signals and user-labeled ratings from 18 participants and (2) in-vehicle studies with 20 participants in a BMW 3 Series.
The United-Syn-Med dataset is a specialized medical speech dataset designed to evaluate and improve Automatic Speech Recognition (ASR) systems within the healthcare domain. It comprises English medical speech recordings, with a particular focus on medical terminology and clinical conversations. The dataset is well-suited for various ASR tasks, including speech recognition, transcription, and classification, facilitating the development of models tailored for medical contexts.
PC-GITA is a Spanish speech corpus designed to analyze speech impairments in individuals with Parkinson's Disease (PD).
Guitar-TECHS is a comprehensive dataset featuring a variety of guitar techniques, musical excerpts, chords, and scales, performed by diverse musicians across various recording settings. Guitar-TECHS incorporates recordings from two stereo microphones: an egocentric microphone positioned on the performer’s head and an exocentric microphone placed in front of the performer. It also includes direct-input recordings and microphoned amplifier outputs, offering a wide spectrum of audio inputs and recording qualities. All signals and MIDI labels are properly synchronized. Its multi-perspective and multi-modal content makes Guitar-TECHS a valuable resource for advancing data-driven guitar research and for developing robust guitar-listening algorithms.
We collect a dataset of 805 clean videos showing the action of pouring water into a container. Our dataset covers over 50 unique containers made of 5 different materials, in 4 different shapes, and with both hot and cold water.