Shikib Mehri, Maxine Eskenazi
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To address this, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.422 | USR |
| Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.4192 | USR |
| Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.4068 | USR - DR (x = c) |
| Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.3245 | USR - DR (x = c) |
| Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.3345 | USR - MLM |
| Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.3086 | USR - MLM |
| Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.3221 | USR - DR (x = f) |
| Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.1419 | USR - DR (x = f) |
| Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | 0.6087 | USR - DR (x = c) |
| Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | 0.4814 | USR - DR (x = c) |
| Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | 0.4115 | USR |
| Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | 0.4693 | USR |
| Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | 0.0788 | USR - MLM |
| Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | 0.0795 | USR - MLM |
| Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | -0.0454 | USR - DR (x = f) |
| Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | -0.0495 | USR - DR (x = f) |
| Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.422 | USR |
| Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.4192 | USR |
| Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.4068 | USR - DR (x = c) |
| Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.3245 | USR - DR (x = c) |
| Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.3345 | USR - MLM |
| Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.3086 | USR - MLM |
| Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.3221 | USR - DR (x = f) |
| Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.1419 | USR - DR (x = f) |
| Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | 0.6087 | USR - DR (x = c) |
| Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | 0.4814 | USR - DR (x = c) |
| Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | 0.4115 | USR |
| Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | 0.4693 | USR |
| Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | 0.0788 | USR - MLM |
| Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | 0.0795 | USR - MLM |
| Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | -0.0454 | USR - DR (x = f) |
| Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | -0.0495 | USR - DR (x = f) |
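
The table above reports Pearson and Spearman correlations between metric scores and human ratings, and the abstract distinguishes turn-level from system-level correlation. The following is a minimal sketch of how such correlations could be computed, not the authors' evaluation code. It assumes that turn-level correlation is taken over individual responses and that system-level correlation is taken over per-system averages. The records, system names, and function names (`pearson`, `ranks`, `spearman`, `system_means`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


def system_means(records):
    """Average metric and human scores per system (assumed aggregation)."""
    by_system = defaultdict(lambda: ([], []))
    for system, metric_score, human_score in records:
        by_system[system][0].append(metric_score)
        by_system[system][1].append(human_score)
    names = sorted(by_system)
    metric_means = [mean(by_system[n][0]) for n in names]
    human_means = [mean(by_system[n][1]) for n in names]
    return metric_means, human_means


# Hypothetical toy data: (system, metric score, human rating) per response.
records = [
    ("A", 0.2, 1), ("A", 0.4, 2),
    ("B", 0.5, 3), ("B", 0.7, 3),
    ("C", 0.9, 5), ("C", 0.8, 4),
]

metric_scores = [m for _, m, _ in records]
human_scores = [h for _, _, h in records]

# Turn-level: correlate over individual responses.
turn_pearson = pearson(metric_scores, human_scores)
turn_spearman = spearman(metric_scores, human_scores)

# System-level: correlate over per-system averages.
sys_metric, sys_human = system_means(records)
sys_spearman = spearman(sys_metric, sys_human)

print(f"turn-level Pearson:    {turn_pearson:.3f}")
print(f"turn-level Spearman:   {turn_spearman:.3f}")
print(f"system-level Spearman: {sys_spearman:.3f}")
```

With only a handful of systems, a system-level rank correlation can reach 1.0 whenever the metric orders the systems the same way humans do, even if individual responses are scored imperfectly. This is one reason turn-level and system-level values can differ as much as they do in the abstract.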