Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

Xiaotao Gu, Zihan Wang, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang

2021-05-28 · Keyphrase Extraction · Phrase Ranking · Phrase Tagging · Language Modelling

Paper · PDF · Code (official)

Abstract

Identifying and understanding quality phrases from context is a fundamental task in text mining. The most challenging part of this task arguably lies in uncommon, emerging, and domain-specific phrases. The infrequent nature of these phrases significantly hurts the performance of phrase mining methods that rely on sufficient phrase occurrences in the input corpus. Context-aware tagging models, though not restricted by frequency, heavily rely on domain experts for either massive sentence-level gold labels or handcrafted gazetteers. In this work, we propose UCPhrase, a novel unsupervised context-aware quality phrase tagger. Specifically, we induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document. Compared with typical context-agnostic distant supervision based on existing knowledge bases (KBs), our silver labels root deeply in the input domain and context, thus having unique advantages in preserving contextual completeness and capturing emerging, out-of-KB phrases. Training a conventional neural tagger based on silver labels usually faces the risk of overfitting phrase surface names. Alternatively, we observe that the contextualized attention maps generated from a transformer-based neural language model effectively reveal the connections between words in a surface-agnostic way. Therefore, we pair such attention maps with the silver labels to train a lightweight span prediction model, which can be applied to new input to recognize (unseen) quality phrases regardless of their surface names or frequency. Thorough experiments on various tasks and datasets, including corpus-level phrase ranking, document-level keyphrase extraction, and sentence-level phrase tagging, demonstrate the superiority of our design over state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
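The silver-label mining step described above (inducing phrase spans from consistently co-occurring word sequences within each document) can be sketched roughly as follows. This is a minimal illustration of the idea, not the paper's released implementation; the function name, the maximum phrase length, and the frequency threshold are all illustrative assumptions.

```python
from collections import Counter


def mine_silver_phrases(sentences, max_len=4, min_count=3):
    """Collect contiguous word sequences that recur within one document.

    sentences: tokenized sentences of a single document (lists of words).
    Returns the set of n-grams (as tuples) occurring at least `min_count`
    times, keeping only maximal ones (not contained in a longer kept n-gram),
    which approximates "contextual completeness" of the mined spans.
    """
    counts = Counter()
    for sent in sentences:
        for n in range(2, max_len + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1

    frequent = {g for g, c in counts.items() if c >= min_count}
    # Prefer maximal spans: drop an n-gram that is a sub-span of a longer one.
    maximal = set()
    for g in frequent:
        contained = any(
            h != g and len(h) > len(g)
            and any(h[j:j + len(g)] == g for j in range(len(h) - len(g) + 1))
            for h in frequent
        )
        if not contained:
            maximal.add(g)
    return maximal
```

In the full method, spans mined this way serve as silver labels that are paired with transformer attention maps to train the lightweight span-prediction model, so recognition at inference time does not depend on phrase frequency or surface form.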

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Phrase Ranking | KP20k | P@50K | 98.5 | Wiki+RoBERTa
Phrase Ranking | KP20k | P@5K | 100 | Wiki+RoBERTa
Phrase Ranking | KP20k | P@50K | 96.5 | UCPhrase
Phrase Ranking | KP20k | P@5K | 96.5 | UCPhrase
Phrase Ranking | KP20k | P@50K | 78 | TopMine
Phrase Ranking | KP20k | P@5K | 81.5 | TopMine
Phrase Ranking | KPTimes | P@50K | 96.5 | Wiki+RoBERTa
Phrase Ranking | KPTimes | P@5K | 99 | Wiki+RoBERTa
Phrase Ranking | KPTimes | P@50K | 95.5 | UCPhrase
Phrase Ranking | KPTimes | P@5K | 96.5 | UCPhrase
Phrase Ranking | KPTimes | P@50K | 95.5 | AutoPhrase
Phrase Ranking | KPTimes | P@5K | 96.5 | AutoPhrase
Phrase Ranking | KPTimes | P@50K | 71 | TopMine
Phrase Ranking | KPTimes | P@5K | 85.5 | TopMine
Keyphrase Extraction | KP20k | F1@10 | 19.2 | Wiki+RoBERTa
Keyphrase Extraction | KP20k | Recall | 73 | Wiki+RoBERTa
Keyphrase Extraction | KP20k | F1@10 | 19.7 | UCPhrase
Keyphrase Extraction | KP20k | Recall | 72.9 | UCPhrase
Keyphrase Extraction | KP20k | F1@10 | 18.2 | AutoPhrase
Keyphrase Extraction | KP20k | Recall | 62.9 | AutoPhrase
Keyphrase Extraction | KP20k | F1@10 | 15.3 | Spacy
Keyphrase Extraction | KP20k | Recall | 59.5 | Spacy
Keyphrase Extraction | KP20k | F1@10 | 12.6 | PKE
Keyphrase Extraction | KP20k | Recall | 57.1 | PKE
Keyphrase Extraction | KP20k | F1@10 | 15 | TopMine
Keyphrase Extraction | KP20k | Recall | 53.3 | TopMine
Keyphrase Extraction | KP20k | F1@10 | 13.9 | StanfordNLP
Keyphrase Extraction | KP20k | Recall | 51.7 | StanfordNLP
Keyphrase Extraction | KPTimes | F1@10 | 10.9 | UCPhrase
Keyphrase Extraction | KPTimes | Recall | 83.4 | UCPhrase
Keyphrase Extraction | KPTimes | F1@10 | 10.3 | AutoPhrase
Keyphrase Extraction | KPTimes | Recall | 77.8 | AutoPhrase
Keyphrase Extraction | KPTimes | F1@10 | 9.4 | Wiki+RoBERTa
Keyphrase Extraction | KPTimes | Recall | 64.5 | Wiki+RoBERTa
Keyphrase Extraction | KPTimes | F1@10 | 8.5 | TopMine
Keyphrase Extraction | KPTimes | Recall | 63.4 | TopMine
Phrase Tagging | KPTimes | F1 | 73.5 | UCPhrase
Phrase Tagging | KPTimes | Precision | 69.1 | UCPhrase
Phrase Tagging | KPTimes | Recall | 78.9 | UCPhrase
Phrase Tagging | KPTimes | F1 | 63.2 | Wiki+RoBERTa
Phrase Tagging | KPTimes | Precision | 60.9 | Wiki+RoBERTa
Phrase Tagging | KPTimes | Recall | 65.6 | Wiki+RoBERTa
Phrase Tagging | KPTimes | F1 | 45.9 | AutoPhrase
Phrase Tagging | KPTimes | Precision | 44.2 | AutoPhrase
Phrase Tagging | KPTimes | Recall | 47.7 | AutoPhrase
Phrase Tagging | KPTimes | F1 | 34 | TopMine
Phrase Tagging | KPTimes | Precision | 32 | TopMine
Phrase Tagging | KPTimes | Recall | 36.3 | TopMine
Phrase Tagging | KP20k | F1 | 73.9 | UCPhrase
Phrase Tagging | KP20k | Precision | 69.9 | UCPhrase
Phrase Tagging | KP20k | Recall | 78.3 | UCPhrase
Phrase Tagging | KP20k | F1 | 61 | Wiki+RoBERTa
Phrase Tagging | KP20k | Precision | 58.1 | Wiki+RoBERTa
Phrase Tagging | KP20k | Recall | 64.2 | Wiki+RoBERTa
Phrase Tagging | KP20k | F1 | 49.7 | AutoPhrase
Phrase Tagging | KP20k | Precision | 55.2 | AutoPhrase
Phrase Tagging | KP20k | Recall | 45.2 | AutoPhrase
Phrase Tagging | KP20k | F1 | 40.6 | TopMine
Phrase Tagging | KP20k | Precision | 39.8 | TopMine
Phrase Tagging | KP20k | Recall | 41.4 | TopMine
