RuTermEval (Track 3)

CL-RuTerm3

Introduced 2025-04-23

The CL-RuTerm3 dataset is a novel resource featuring nested term annotations across six domains: computational linguistics (the main one), mathematics, medicine, economics, literature studies, and agrochemistry. It underlies the RuTermEval-2024 competition, designed to evaluate term extraction systems on this data. Comprising 1270 abstracts and 15 full-text articles (over 165k tokens with over 37k annotated entities), CL-RuTerm3 is the largest dataset of its kind for Russian scientific texts. Terms are classified into three categories based on lexical and domain specificity: specific terms, common terms, and nomens. The dataset's unique features, such as nested term markup and cross-domain coverage, enable more realistic evaluation of automatic term extraction (ATE) systems.

The third track is devoted to nested term extraction (in sequence labeling format) and classification (labels: specific, common, nomen) in a cross-domain setting.
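
One way to picture nested term annotation in a sequence labeling format is as a stack of BIO tag layers, with outer terms on the first layer and inner terms on deeper ones. The sketch below is illustrative only: the layered encoding, the function name `nested_spans_to_bio_layers`, the example sentence, and the labels assigned to it are all assumptions, not the competition's official data format or actual dataset annotations.

```python
from typing import List, Tuple

# A span is (start token index, end token index exclusive, label).
Span = Tuple[int, int, str]

# The three term classes named in the dataset description.
LABELS = {"specific", "common", "nomen"}


def nested_spans_to_bio_layers(n_tokens: int, spans: List[Span]) -> List[List[str]]:
    """Encode possibly nested spans as a stack of BIO tag sequences.

    Longer spans are placed first, so outer terms land on earlier layers.
    Each span goes on the first layer where all of its tokens are still "O".
    This layered scheme is an assumption made for illustration.
    """
    layers: List[List[str]] = []
    # Sort by descending length, then by start position.
    for start, end, label in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span out of range: {(start, end)}")
        for layer in layers:
            if all(tag == "O" for tag in layer[start:end]):
                break  # this layer has room for the span
        else:
            # No existing layer is free: open a new, deeper one.
            layer = ["O"] * n_tokens
            layers.append(layer)
        layer[start] = "B-" + label
        for i in range(start + 1, end):
            layer[i] = "I-" + label
    return layers


# Hypothetical example: "term" is nested inside "automatic term extraction".
tokens = ["automatic", "term", "extraction", "with", "BERT"]
spans = [(0, 3, "specific"), (1, 2, "common"), (4, 5, "nomen")]
for layer in nested_spans_to_bio_layers(len(tokens), spans):
    print(list(zip(tokens, layer)))
```

With this encoding, a flat sequence labeler could be trained per layer, while the cross-domain aspect of the track would be handled by how the data is split rather than by the tag scheme itself.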