Probing Language Models on KAMEL

Metric: Average F1 (higher is better)
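
The page does not say how Average F1 is computed. As a minimal sketch, assume that each probe has a set of predicted answers and a set of gold answers, that F1 is computed per probe from set overlap, and that the per-probe scores are then averaged. The function names `set_f1` and `average_f1` are hypothetical and do not come from the KAMEL codebase; the official evaluation may normalize or match answers differently.

```python
def set_f1(predicted, gold):
    """F1 between predicted and gold answer sets (assumed exact-match overlap)."""
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        # Covers empty predictions, empty gold sets, and no shared answers.
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_f1(predictions, references):
    """Mean of per-probe F1 scores; higher is better."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    scores = [set_f1(p, g) for p, g in zip(predictions, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative inputs only, not taken from the dataset.
    preds = [["Paris"], ["a", "b"]]
    golds = [["Paris"], ["a", "c"]]
    print(average_f1(preds, golds))
```

Averaging per probe (a macro average) weights every probe equally regardless of how many gold answers it has; this is one common reading of "Average F1", not a confirmed detail of this leaderboard.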
