Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Gramian Multimodal Representation Learning and Alignment

Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, Danilo Comminiello

Published: 2024-12-16

Tasks: Video Retrieval, Representation Learning, Zero-Shot Video Retrieval, Text Retrieval, Contrastive Learning, Video Classification, Zero-Shot Audio Retrieval

Abstract

Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality. While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions are unsuitable when scaling to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities. In this paper, we structurally rethink the conventional pairwise approach to multimodal learning and present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns $n$ modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the $k$-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously. GRAM can replace cosine similarity in any downstream method, holding for any number of modalities from 2 to $n$ and providing more meaningful alignment than previous similarity measures. The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification. The project page, the code, and the pretrained models are available at https://ispamm.github.io/GRAM/.
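The geometric idea in the abstract can be sketched numerically. For $k$ unit-norm modality embeddings stacked as the columns of a matrix $A$, the volume of the parallelotope they span is $\sqrt{\det(A^\top A)}$, the Gram determinant; perfectly aligned modalities span zero volume, while mutually orthogonal ones span volume 1. The sketch below is an illustrative NumPy implementation of that quantity only (the function name `gram_volume` and the normalization details are assumptions, not taken from the official GRAM repository):

```python
import numpy as np

def gram_volume(embeddings: np.ndarray) -> float:
    """Volume of the k-dim parallelotope spanned by k modality vectors.

    embeddings: (n, k) array with one modality embedding per column.
    Returns sqrt(det(A^T A)) after L2-normalizing the columns; a smaller
    volume means the modalities are better aligned with each other.
    """
    A = embeddings / np.linalg.norm(embeddings, axis=0, keepdims=True)
    gram = A.T @ A                         # (k, k) matrix of pairwise inner products
    det = np.linalg.det(gram)
    return float(np.sqrt(max(det, 0.0)))   # clip tiny negative values from round-off

# Three identical (fully aligned) modality vectors span zero volume ...
v = np.array([1.0, 0.0, 0.0, 0.0])
aligned = np.stack([v, v, v], axis=1)
# ... while three mutually orthogonal ones span a unit cube (volume 1).
ortho = np.eye(4)[:, :3]
```

Unlike pairwise cosine similarity, this single scalar depends on all $k$ modality vectors jointly, which is what lets a GRAM-style loss align every modality with every other one at once rather than each one to a fixed anchor.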

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Retrieval | VATEX | text-to-video R@1 | 87.7 | GRAM |
| Video Retrieval | VATEX | text-to-video R@10 | 100 | GRAM |
| Video Retrieval | VATEX | video-to-text R@1 | 84.6 | GRAM |
| Video Retrieval | VATEX | video-to-text R@10 | 100 | GRAM |
| Video Retrieval | ActivityNet | text-to-video R@1 | 69.9 | GRAM |
| Video Retrieval | ActivityNet | text-to-video R@10 | 96.1 | GRAM |
| Video Retrieval | ActivityNet | video-to-text R@1 | 66.9 | GRAM |
| Video Retrieval | ActivityNet | video-to-text R@10 | 95.4 | GRAM |
| Video Retrieval | DiDeMo | text-to-video R@1 | 67.3 | GRAM |
| Video Retrieval | DiDeMo | text-to-video R@10 | 90.1 | GRAM |
| Video Retrieval | DiDeMo | video-to-text R@1 | 63.5 | GRAM |
| Video Retrieval | DiDeMo | video-to-text R@10 | 91.6 | GRAM |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 64 | GRAM |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 89.3 | GRAM |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 64.8 | GRAM |
| Video Retrieval | MSR-VTT | video-to-text R@10 | 91.5 | GRAM |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@1 | 83.9 | GRAM |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@10 | 99.5 | GRAM |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@1 | 82.7 | GRAM |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@10 | 99 | GRAM |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 54.8 | GRAM |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 83.9 | GRAM |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 52.9 | GRAM |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 82.9 | GRAM |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 54.2 | GRAM |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 80.7 | GRAM |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 52.3 | GRAM |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 80.3 | GRAM |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 59 | GRAM |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 91.2 | GRAM |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 50.9 | GRAM |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 85.8 | GRAM |

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)