Individuation in Neural Models with and without Visual Grounding
Alexey Tikhonov, Lisa Bylinina, Ivan P. Yamshchikov
2024-09-27
Abstract
We show differences between a language-and-vision model, CLIP, and two text-only models, FastText and SBERT, in how they encode individuation information. We study the latent representations that CLIP provides for substances, granular aggregates, and varying numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.
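As a rough illustration of the kind of comparison the abstract describes, the sketch below (not the authors' code) extracts text embeddings for a few phrases that differ in individuation from a CLIP text encoder and an SBERT model, then probes each space with cosine similarities. The checkpoint names, the phrase list, and the choice of "three pebbles" as a reference point are all illustrative assumptions; FastText is omitted for brevity.

```python
# Hypothetical probe of individuation in CLIP vs. SBERT text embeddings.
# Checkpoints and phrases are illustrative assumptions, not the paper's setup.
import torch
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

phrases = ["water", "sand", "a pebble", "three pebbles", "many pebbles"]

# CLIP text encoder (language-and-vision model).
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    inputs = clip_proc(text=phrases, return_tensors="pt", padding=True)
    clip_emb = clip_model.get_text_features(**inputs)
clip_emb = torch.nn.functional.normalize(clip_emb, dim=-1)

# SBERT text-only encoder.
sbert = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sbert_emb = torch.tensor(sbert.encode(phrases, normalize_embeddings=True))

# Cosine similarity of every phrase to the most clearly individuated one,
# as a crude probe of how each space separates substances from countable objects.
ref = phrases.index("three pebbles")
for name, emb in [("CLIP", clip_emb), ("SBERT", sbert_emb)]:
    sims = emb @ emb[ref]
    print(name, {p: round(float(s), 3) for p, s in zip(phrases, sims)})
```

If CLIP embeddings indeed track individuation better, one would expect the similarity to the countable reference phrase to fall off more sharply for substances like "water" and "sand" in the CLIP space than in the SBERT space; the actual analyses in the paper are more involved than this single-reference probe.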