Papers With Code 2 | ML Benchmarks, SotA Results & Code

Education is increasingly data-driven, and the ability to analyse and adapt educational materials quickly and effectively helps keep them contemporary and engaging. Such approaches also have the potential to personalise learning experiences. One challenge in this domain is aligning new literature with the appropriate educational stage. This dataset aims to help address that challenge.

This dataset was generated from public-domain literature on Project Gutenberg and cross-referenced with the UK Key Stage equivalents from the Lexile Reading Framework.

The dataset contains 20,000 rows distributed evenly across four educational stages (5,000 rows each): Key Stage 2 (KS2), Key Stage 3 (KS3), Key Stage 4 (KS4), and Key Stage 5 (KS5).

The data has been split into Train (80%, 16,000 rows) and Test (20%, 4,000 rows) sets.
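As a rough illustration of how an 80/20 split of a balanced four-class dataset yields these counts, the sketch below performs a per-stage (stratified) split on placeholder rows. The description does not say whether the published split was stratified or how it was produced, so the function `stratified_split`, the `label` field name, and the seed are all assumptions made for illustration only.

```python
import random
from collections import defaultdict

STAGES = ["KS2", "KS3", "KS4", "KS5"]

def stratified_split(rows, label_key="label", test_fraction=0.2, seed=0):
    """Split rows into train/test, holding out test_fraction of each label.

    Hypothetical helper: the published split may have been made differently.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Placeholder rows standing in for the real excerpts (5,000 per stage).
rows = [{"text": f"excerpt {i}", "label": STAGES[i % 4]} for i in range(20_000)]
train, test = stratified_split(rows)
print(len(train), len(test))  # 16000 4000
```

Splitting within each stage keeps both sets balanced, with 4,000 training rows and 1,000 test rows per stage under these assumptions.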

The data is multimodal and contains:

Text - the excerpt of text, cropped to a limit of 512 tokens at the nearest complete sentence.
Linguistic Features - a set of features, each extracted from the text excerpt.
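The cropping step for the Text field can be sketched as follows. This is a minimal illustration, not the dataset authors' procedure: it assumes whitespace-separated tokens, a simple regex sentence splitter, and that "nearest complete sentence" means stopping at the last sentence that fits within the limit. The actual tokenizer and sentence segmentation used are not stated in the description, and `crop_to_sentences` is a hypothetical name.

```python
import re

def crop_to_sentences(text, max_tokens=512):
    """Keep whole sentences from the start of text without exceeding max_tokens.

    Assumptions: tokens are whitespace-separated words, and sentences end
    with '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, used = [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if used + n_tokens > max_tokens:
            break  # adding this sentence would exceed the limit
        kept.append(sentence)
        used += n_tokens
    return " ".join(kept)

# 300 three-word sentences: only 170 of them (510 tokens) fit within 512.
sample = "The cat sat. " * 300
excerpt = crop_to_sentences(sample)
print(len(excerpt.split()))  # 510
```

A subword tokenizer would give different counts from the whitespace split used here, so the 512-token limit in the real data may correspond to fewer words per excerpt.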

UK Key Stage Readability
