CCMatrix

Unknown

CCMatrix uses ten snapshots of a curated Common Crawl corpus (Wenzek et al., 2019), totalling 32.7 billion unique sentences.

Source: CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB