Visual Coreference Resolution in Visual Dialog using Neural Module Networks

Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
Visual dialog entails answering a series of questions grounded in an image, using the dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem, visual coreference resolution: determining which words, typically noun phrases and pronouns, co-refer to the same entity or object instance in an image. This is crucial especially for pronouns (e.g., "it"), as the dialog agent must first link the pronoun to a previous coreference (e.g., "boat"), and only then can it rely on the visual grounding of the coreference "boat" to reason about the pronoun "it". Prior work in visual dialog models visual coreference resolution either (a) implicitly, via a memory network over history, or (b) at a coarse level for the entire question, but not explicitly at a phrase level of granularity. In this work, we propose a neural module network architecture for visual dialog, introducing two novel modules, Refer and Exclude, that perform explicit, grounded coreference resolution at a finer, word-level granularity. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-wise complex dataset, by achieving near-perfect accuracy, and on VisDial, a large and challenging visual dialog dataset on real images, where our model outperforms other approaches and is qualitatively more interpretable, grounded, and consistent.
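The Refer idea described above can be illustrated with a toy sketch (this is not the authors' implementation; all embeddings, dimensions, and the dot-product scoring are made-up assumptions): a pronoun's embedding attends over a pool of previously grounded phrases from the dialog history, and the attention-weighted visual grounding is reused for the pronoun.

```python
# Toy illustration of phrase-level coreference resolution via attention
# (hypothetical sketch, not the CorefNMN implementation).
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def refer(query_emb, pool):
    """Attend over a reference pool of (phrase_emb, visual_grounding)
    pairs from dialog history; return attention weights and the
    attention-weighted visual grounding for the query phrase."""
    scores = [sum(q * k for q, k in zip(query_emb, key)) for key, _ in pool]
    weights = softmax(scores)
    dim = len(pool[0][1])
    blended = [sum(w * g[i] for w, (_, g) in zip(weights, pool))
               for i in range(dim)]
    return weights, blended

# Toy reference pool: 'boat' and 'man', each with a 2-region grounding.
pool = [
    ([1.0, 0.0], [0.9, 0.1]),  # 'boat' -> attention mass on region 0
    ([0.0, 1.0], [0.2, 0.8]),  # 'man'  -> attention mass on region 1
]
# The pronoun 'it' embeds close to 'boat', so Refer recovers a
# grounding concentrated on the boat's region.
weights, grounding = refer([0.9, 0.1], pool)
```

In the real model the pool holds learned phrase embeddings and spatial attention maps over image features, and a separate Exclude module handles phrases that must refer to a different entity than an earlier one; the retrieval-by-attention pattern above is the core intuition.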
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Dialogue | VisDial v0.9 val | MRR | 64.1 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v0.9 val | Mean Rank | 4.45 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v0.9 val | R@1 | 50.92 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v0.9 val | R@10 | 88.81 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v0.9 val | R@5 | 80.18 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v0.9 val | MRR | 63.6 | CorefNMN |
| Dialogue | VisDial v0.9 val | Mean Rank | 4.53 | CorefNMN |
| Dialogue | VisDial v0.9 val | R@1 | 50.24 | CorefNMN |
| Dialogue | VisDial v0.9 val | R@10 | 88.51 | CorefNMN |
| Dialogue | VisDial v0.9 val | R@5 | 79.81 | CorefNMN |
| Dialogue | VisDial v1.0 test-std | MRR (x 100) | 61.5 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v1.0 test-std | Mean Rank | 4.4 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v1.0 test-std | NDCG (x 100) | 54.7 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v1.0 test-std | R@1 | 47.55 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v1.0 test-std | R@10 | 88.8 | CorefNMN (ResNet-152) |
| Dialogue | VisDial v1.0 test-std | R@5 | 78.1 | CorefNMN (ResNet-152) |