Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, Jose Camacho-Collados
We present WiC-TSV, a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, we introduce a framework for Target Sense Verification of Words in Context. Its uniqueness lies in two aspects: the task is formulated as binary classification, which makes it independent of external sense inventories, and it covers a variety of domains. This makes the dataset highly flexible for evaluating a diverse set of models and systems both within and across domains. WiC-TSV provides three evaluation settings, depending on the input signals provided to the model. We establish baseline performance on the dataset using state-of-the-art language models. Experimental results show that, although these models perform decently on the task, a gap remains between machine and human performance, especially in out-of-domain settings. The WiC-TSV data is available at https://competitions.codalab.org/competitions/23683
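The binary formulation described above can be made concrete with a minimal sketch: given a context containing the target word and a description of the candidate sense, a verifier outputs True (sense matches) or False. The sketch below uses a toy bag-of-words cosine similarity between context and sense description purely for illustration; it is a stand-in for the actual FastText and BERT baselines, and the example instance, threshold, and function names are hypothetical, not taken from the dataset.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def verify_sense(context: str, sense_description: str,
                 threshold: float = 0.1) -> bool:
    # Target Sense Verification as binary classification:
    # does the target word, as used in `context`, match the candidate
    # sense given by `sense_description`? No sense inventory is needed;
    # the decision is made per (context, sense) pair.
    ctx = Counter(context.lower().split())
    dfn = Counter(sense_description.lower().split())
    return cosine(ctx, dfn) >= threshold

# Hypothetical instance: target word "bank" in a financial context,
# checked against a financial-sense description.
print(verify_sense(
    "the bank raised interest rates on mortgages",
    "an institution where money is deposited and interest is paid",
))
```

A real system would replace the similarity heuristic with a trained classifier over contextualized representations, but the input/output contract, one yes/no decision per (context, candidate sense) pair, stays the same.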
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: all | 75.3 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: domain specific | 77.9 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: general purpose | 73.3 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: all | 71.7 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: domain specific | 74.7 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: general purpose | 68.6 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: all | 76.6 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: domain specific | 80.4 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: general purpose | 73.5 | BERT-base |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: all | 54.4 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: domain specific | 60.6 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: general purpose | 49.2 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: all | 62.8 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: domain specific | 69.1 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: general purpose | 57.6 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: all | 60.5 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: domain specific | 67.9 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: general purpose | 54.4 | Unsupervised BERT |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: all | 53.7 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: domain specific | 50.6 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: general purpose | 56.2 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: all | 52.7 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: domain specific | 47.7 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: general purpose | 56.8 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: all | 53.4 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: domain specific | 49.0 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: general purpose | 57.1 | FastText |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: all | 50.8 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: domain specific | 47.0 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 1 Accuracy: general purpose | 53.8 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: all | 50.8 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: domain specific | 47.0 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 2 Accuracy: general purpose | 53.8 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: all | 50.8 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: domain specific | 47.0 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: general purpose | 53.8 | All true |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: all | 85.3 | Human |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: domain specific | 89.2 | Human |
| Word Sense Disambiguation | WiC-TSV | Task 3 Accuracy: general purpose | 82.1 | Human |