Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra

2017-06-05 · NeurIPS 2017
Tasks: Visual Dialog · Metric Learning · Transfer Learning · Informativeness
Paper · PDF · Code (official)

Abstract

We present a novel training framework for neural sequence models, particularly for grounded dialog generation. The standard training paradigm for these models is maximum likelihood estimation (MLE), or minimizing the cross-entropy of the human responses. Across a variety of domains, a recurring problem with MLE-trained generative neural dialog models (G) is that they tend to produce 'safe' and generic responses ("I don't know", "I can't tell"). In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses. However, D is not useful in practice since it cannot be deployed to have real conversations with users. Our work aims to achieve the best of both worlds -- the practical usefulness of G and the strong performance of D -- via knowledge transfer from D to G. Our primary contribution is an end-to-end trainable generative visual dialog model, where G receives gradients from D as a perceptual (not adversarial) loss of the sequence sampled from G. We leverage the recently proposed Gumbel-Softmax (GS) approximation to the discrete distribution -- specifically, an RNN augmented with a sequence of GS samplers, coupled with the straight-through gradient estimator to enable end-to-end differentiability. We also introduce a stronger encoder for visual dialog, and employ a self-attention mechanism for answer encoding along with a metric learning loss to aid D in better capturing semantic similarities in answer responses. Overall, our proposed model outperforms state-of-the-art on the VisDial dataset by a significant margin (2.67% on recall@10). The source code can be downloaded from https://github.com/jiasenlu/visDial.pytorch.
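The straight-through Gumbel-Softmax trick mentioned in the abstract can be illustrated with a minimal NumPy sketch (not the authors' PyTorch implementation; the function names here are illustrative). The idea: perturb the logits with Gumbel noise and apply a temperature-scaled softmax to get a "soft" sample; the forward pass then snaps this to a discrete one-hot vector (what the discriminator D sees), while the backward pass would route gradients through the soft sample instead, keeping the pipeline end-to-end differentiable.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Soft sample from a categorical distribution via the
    Gumbel-Softmax relaxation: softmax((logits + Gumbel noise) / tau)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick; epsilons avoid log(0).
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + 1e-20) + 1e-20)
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

def straight_through_sample(logits, tau=1.0, rng=None):
    """Forward: discrete one-hot sample. Backward (in an autodiff
    framework): gradient of the soft sample -- the straight-through
    estimator. Returns both so the substitution is explicit."""
    soft = gumbel_softmax_sample(logits, tau, rng)
    hard = np.zeros_like(soft)
    hard[soft.argmax()] = 1.0
    return hard, soft
```

In an autodiff framework the substitution is typically written as `hard = stop_gradient(hard - soft) + soft`, so the forward value is discrete while gradients flow through `soft`; lower temperatures `tau` make the soft sample closer to one-hot at the cost of higher-variance gradients.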

Results

| Task          | Dataset          | Metric    | Value | Model        |
|---------------|------------------|-----------|-------|--------------|
| Dialogue      | VisDial v0.9 val | MRR       | 62.22 | HCIAE-NP-ATT |
| Dialogue      | VisDial v0.9 val | Mean Rank | 4.81  | HCIAE-NP-ATT |
| Dialogue      | VisDial v0.9 val | R@1       | 48.48 | HCIAE-NP-ATT |
| Dialogue      | VisDial v0.9 val | R@5       | 78.75 | HCIAE-NP-ATT |
| Dialogue      | VisDial v0.9 val | R@10      | 87.59 | HCIAE-NP-ATT |
| Visual Dialog | VisDial v0.9 val | MRR       | 62.22 | HCIAE-NP-ATT |
| Visual Dialog | VisDial v0.9 val | Mean Rank | 4.81  | HCIAE-NP-ATT |
| Visual Dialog | VisDial v0.9 val | R@1       | 48.48 | HCIAE-NP-ATT |
| Visual Dialog | VisDial v0.9 val | R@5       | 78.75 | HCIAE-NP-ATT |
| Visual Dialog | VisDial v0.9 val | R@10      | 87.59 | HCIAE-NP-ATT |

Related Papers

- RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
- Unsupervised Ground Metric Learning (2025-07-17)
- Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
- Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
- Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
- Robust-Multi-Task Gradient Boosting (2025-07-15)
- Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift (2025-07-12)
- $\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection (2025-07-11)