ivrit.ai
License: ivrit.ai license. Introduced: 2023-07-17.
ivrit.ai is a database of Hebrew audio and text content.
audio-base contains the raw, unprocessed sources: about 13,000 hours of speech audio.
audio-vad contains audio snippets generated by applying Silero VAD (https://github.com/snakers4/silero-vad) to the audio-base dataset. v1 data is generated using silero-vad's default parameters. v2 data is generated using min_speech_duration_ms=2000 (milliseconds) and max_speech_duration_s=30 (seconds).
audio-transcripts contains transcriptions for each snippet in the audio-vad dataset.
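
As a rough illustration of the v2 audio-vad settings, the sketch below passes the two quoted parameters to the `silero-vad` Python package. It is not the project's actual pipeline: the 16 kHz sampling rate, the helper names `to_seconds` and `segment_file`, and the use of the pip-installable `silero_vad` module are all assumptions made for this example.

```python
# Parameters quoted in the dataset description for the v2 snippets.
V2_VAD_PARAMS = {
    "min_speech_duration_ms": 2000,  # discard speech chunks shorter than 2 s
    "max_speech_duration_s": 30,     # split speech chunks longer than 30 s
}

# Assumption: the dataset description does not state the sampling rate used.
SAMPLING_RATE = 16000


def to_seconds(timestamps, sampling_rate=SAMPLING_RATE):
    """Convert sample-offset dicts ({'start', 'end'}) to (start_s, end_s) pairs."""
    return [(t["start"] / sampling_rate, t["end"] / sampling_rate) for t in timestamps]


def segment_file(path, params=V2_VAD_PARAMS):
    """Run Silero VAD on one audio file and return speech spans in seconds."""
    # Imported lazily so the helper above works without silero-vad installed.
    from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

    model = load_silero_vad()
    wav = read_audio(path, sampling_rate=SAMPLING_RATE)
    timestamps = get_speech_timestamps(
        wav, model, sampling_rate=SAMPLING_RATE, **params
    )
    return to_seconds(timestamps)
```

Calling `segment_file("episode.wav")` (a hypothetical file name) would return a list of `(start, end)` pairs in seconds. Passing `params={}` instead falls back to silero-vad's defaults, which is how the v1 data is described.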
The full list of sources in this dataset is available at https://www.ivrit.ai/en/credits.