Papers With Code 2 | ML Benchmarks, SotA Results & Code

This task stems from the observation that text embedded in images differs intrinsically from both common visual elements and natural language, because handling it requires aligning three modalities: vision, text, and text embedded in images.

OCR usually does not depend on context, while VQA usually lacks a unique answer against which the quality of responses can be evaluated. VCR (Visual Caption Restoration) bridges the gap between OCR and VQA: it reconstructs the unique text in an image while taking the context of the rest of the image into account.
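
Because the text to be restored is unique, a prediction can in principle be scored by direct comparison with the reference. The following is a minimal sketch, assuming a case- and whitespace-insensitive exact-match criterion; the function names and the normalization rule are illustrative assumptions, not the benchmark's official scoring code.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """Return True if the restored text equals the reference after normalization."""
    return normalize(prediction) == normalize(reference)


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

A score of this kind is only meaningful when each example has a single correct restoration, which is the property that distinguishes this setting from open-ended VQA.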

VCR-wiki