Visual Grounding

6 benchmarks, 571 papers

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG:

What is the main focus of the query?
How to understand an image?
How to locate an object?

Benchmarks

Visual Grounding on RefCOCO+ testB

Visual Grounding on RefCOCO+ testA

Metrics: Accuracy (%), IoU

Visual Grounding on RefCOCO+ val

Visual Grounding on RefCOCO testA

Visual Grounding on Who’s Waldo
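
The benchmarks above are reported in terms of accuracy and IoU (intersection over union). As a rough illustration of how the two relate, the sketch below scores a predicted box as correct when its IoU with the ground-truth box reaches a threshold. The 0.5 threshold, the (x1, y1, x2, y2) box format, and the function names `iou` and `grounding_accuracy` are assumptions made for this example and are not taken from any specific benchmark's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, adjust per benchmark
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(grounding_accuracy(predictions, ground_truth))
```

In this usage example the first prediction matches its target exactly, while the second overlaps its target only slightly, so one of the two predictions counts as correct under the assumed threshold.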