Metric: R@5 (higher is better)
| # | Model | R@5 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | PaCE | 36.7 | No | PaCE: Unified Multi-modal Dialogue Pre-training ... | 2023-05-24 | Code |
| 2 | ViLT | 33.8 | No | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 3 | VLMo | 30 | No | VLMo: Unified Vision-Language Pre-Training with ... | 2021-11-03 | Code |
| 4 | SCAN | 27 | No | Stacked Cross Attention for Image-Text Matching | 2018-03-21 | Code |
| 5 | DE++ | 26.4 | No | PhotoChat: A Human-Human Dialogue Dataset with P... | 2021-07-06 | - |