YTD-18M

Texts, Videos | Apache-2.0 license | Introduced 2023-03-17

YTD-18M is a large-scale corpus of 18M video-based dialogues constructed from web videos. Crucial to the data collection pipeline is a pretrained language model that converts error-prone automatic transcripts into a cleaner dialogue format while preserving their meaning.

Source: CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

Image Source: https://seungjuhan.me/champagne/