
The Storytelling Video Dataset is a human-reviewed multimodal dataset of over 700 full-body video recordings of native Russian speakers. Each video is at least 10 minutes long and includes synchronized speech, facial expressions, gestures, and emotional variation. The dataset is intended for research and development in:

🗣️ Speech-to-text & voice modeling (Russian)

😃 Emotion & gesture recognition

🤖 Multimodal learning & LLM alignment

🧍 Avatar generation & digital human training

Key Features:

Full-body framing (waist-up or knee-up)

Clear facial expressions and gestures

High-quality speech audio in Russian

Natural storytelling with emotional diversity

Preview (10 Participants): We’ve prepared a short compilation of 10 different speakers from the dataset. Each clip is taken from a unique 10-minute unscripted video, featuring full-body framing, gestures, and emotional speech.

📺 Watch the sample (video preview): https://drive.google.com/drive/folders/1QtC2il-Qb62nZNJlOtG8WvOf-SSkSGM-?usp=drive_link

Dataset Pages:

🐙 GitHub: github.com/MaratDV

📊 DataHub: datahub.io/@MaratDV/storytelling-video-dataset

Contact: email chinzad@gmail.com or Telegram @Marat_DV