CCMatrix
Unknown
CCMatrix uses ten snapshots of a curated Common Crawl corpus (Wenzek et al., 2019), totalling 32.7 billion unique sentences.
Source: CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB