YTD-18M
Texts | Videos | Apache-2.0 license | Introduced 2023-03-17
YTD-18M is a large-scale corpus of 18M video-based dialogues constructed from web videos. A crucial part of the data collection pipeline is a pretrained language model that converts error-prone automatic transcripts into a cleaner dialogue format while preserving their meaning.
Source: CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Image Source: https://seungjuhan.me/champagne/