ETH Py150 Open

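A deduplicated, openly licensed subset of the ETH Py150 corpus of Python files from GitHub. The accompanying paper also describes a separate, deduplicated pre-training corpus of 7.4M Python files from GitHub.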
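
As a rough illustration of what deduplicating a corpus of source files involves, the sketch below drops exact duplicates by hashing file contents. It is not the procedure used to build this dataset, which may rely on near-duplicate detection rather than exact matching. The function name `dedupe_by_content` and the sample paths are hypothetical.

```python
import hashlib


def dedupe_by_content(files):
    """Keep the first path seen for each distinct file content.

    `files` is an iterable of (path, source_text) pairs. This only catches
    exact duplicates; it is an assumption made for illustration, not the
    dataset authors' method.
    """
    seen = set()
    kept = []
    for path, source in files:
        # Hash the content so that large files are compared cheaply.
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept


# Hypothetical miniature corpus: the first two entries are identical.
corpus = [
    ("a/util.py", "def f():\n    return 1\n"),
    ("b/util_copy.py", "def f():\n    return 1\n"),
    ("c/other.py", "def g():\n    return 2\n"),
]
unique_paths = dedupe_by_content(corpus)
```

Exact hashing misses files that differ only in whitespace or comments, which is why corpus builders often use a similarity measure over tokens instead.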

Source: Kanade et al., "Learning and Evaluating Contextual Embedding of Source Code" (ICML 2020)