MuST-Cinema
Introduced 2020-02-25
MuST-Cinema is a Multilingual Speech-to-Subtitles corpus ideal for building subtitle-oriented machine and speech translation systems. It comprises audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.
MuST-Cinema was built by annotating MuST-C with subtitle breaks based on the original subtitle files. Special symbols have been inserted in the aligned sentences to mark subtitle breaks as follows:
- <eob>: block break (a break between subtitle blocks)
- <eol>: line break (a break between lines inside the same subtitle block)
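The two symbols above can be turned back into a subtitle layout with plain string splitting. The following is a minimal sketch, not part of the corpus tooling: the function names `split_subtitles` and `strip_breaks` are hypothetical, the example sentence is made up for illustration, and it assumes the symbols appear in the text literally as `<eob>` and `<eol>`.

```python
from typing import List

EOB = "<eob>"  # block break: ends a subtitle block
EOL = "<eol>"  # line break: ends a line inside a subtitle block


def split_subtitles(sentence: str) -> List[List[str]]:
    """Split an annotated sentence into subtitle blocks, each a list of lines."""
    blocks = []
    for raw_block in sentence.split(EOB):
        # Split the block into lines and drop surrounding whitespace.
        lines = [line.strip() for line in raw_block.split(EOL)]
        # Discard empty pieces, e.g. the one left after a trailing <eob>.
        lines = [line for line in lines if line]
        if lines:
            blocks.append(lines)
    return blocks


def strip_breaks(sentence: str) -> str:
    """Remove both break symbols and normalize whitespace."""
    cleaned = sentence.replace(EOB, " ").replace(EOL, " ")
    return " ".join(cleaned.split())


# Made-up example sentence, not taken from the corpus.
example = "So I was thinking <eol> about what to say <eob> and then it hit me. <eob>"

for number, block in enumerate(split_subtitles(example), start=1):
    print(number)
    print("\n".join(block))
    print()
```

`strip_breaks` recovers the plain sentence, which may be useful when the same data is also used to train a system that does not predict subtitle breaks.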
Source: MuST-Cinema