Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Shikib Mehri, Maxine Eskenazi

2020-05-01 | ACL 2020 | Tasks: Text Generation, Dialogue Evaluation, Open-Domain Dialog

Abstract

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48 and system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
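
The abstract reports correlation with human judgment at two granularities: turn level (one score per response) and system level (one aggregate score per dialog system). The following is a minimal sketch of how such correlations could be computed, not the authors' evaluation script. The toy records, the system names, and the helper names (`pearson`, `spearman`, `system_means`) are all hypothetical, and averaging per system is one assumed way to aggregate.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def system_means(records):
    """Average metric and human scores per system (assumed aggregation)."""
    grouped = defaultdict(list)
    for system, metric_score, human_score in records:
        grouped[system].append((metric_score, human_score))
    metric = [sum(m for m, _ in v) / len(v) for v in grouped.values()]
    human = [sum(h for _, h in v) / len(v) for v in grouped.values()]
    return metric, human


# Hypothetical records: (system, automatic metric score, human rating).
records = [
    ("A", 0.2, 1), ("A", 0.4, 3),
    ("B", 0.5, 2), ("B", 0.7, 4),
    ("C", 0.9, 5), ("C", 0.6, 3),
]

turn_metric = [m for _, m, _ in records]
turn_human = [h for _, _, h in records]
sys_metric, sys_human = system_means(records)

print("turn-level Spearman:", spearman(turn_metric, turn_human))
print("turn-level Pearson:", pearson(turn_metric, turn_human))
print("system-level Spearman:", spearman(sys_metric, sys_human))
```

With only a handful of systems, a system-level rank correlation reaches 1.0 as soon as the metric orders the systems the same way humans do, so it is a much coarser signal than the turn-level figure.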

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.422 | USR
Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.4192 | USR
Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.4068 | USR - DR (x = c)
Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.3245 | USR - DR (x = c)
Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.3345 | USR - MLM
Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.3086 | USR - MLM
Open-Domain Dialog | USR-TopicalChat | Pearson Correlation | 0.3221 | USR - DR (x = f)
Open-Domain Dialog | USR-TopicalChat | Spearman Correlation | 0.1419 | USR - DR (x = f)
Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | 0.6087 | USR - DR (x = c)
Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | 0.4814 | USR - DR (x = c)
Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | 0.4115 | USR
Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | 0.4693 | USR
Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | 0.0788 | USR - MLM
Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | 0.0795 | USR - MLM
Open-Domain Dialog | USR-PersonaChat | Pearson Correlation | -0.0454 | USR - DR (x = f)
Open-Domain Dialog | USR-PersonaChat | Spearman Correlation | -0.0495 | USR - DR (x = f)
Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.422 | USR
Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.4192 | USR
Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.4068 | USR - DR (x = c)
Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.3245 | USR - DR (x = c)
Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.3345 | USR - MLM
Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.3086 | USR - MLM
Dialogue Evaluation | USR-TopicalChat | Pearson Correlation | 0.3221 | USR - DR (x = f)
Dialogue Evaluation | USR-TopicalChat | Spearman Correlation | 0.1419 | USR - DR (x = f)
Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | 0.6087 | USR - DR (x = c)
Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | 0.4814 | USR - DR (x = c)
Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | 0.4115 | USR
Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | 0.4693 | USR
Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | 0.0788 | USR - MLM
Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | 0.0795 | USR - MLM
Dialogue Evaluation | USR-PersonaChat | Pearson Correlation | -0.0454 | USR - DR (x = f)
Dialogue Evaluation | USR-PersonaChat | Spearman Correlation | -0.0495 | USR - DR (x = f)
