Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

HADA: A Graph-based Amalgamation Framework in Image-text Retrieval

Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin

2023-01-11 · Image-text Retrieval · Text Retrieval · Image-to-Text Retrieval · Retrieval · Image Retrieval

Abstract

Many models have been proposed for vision-and-language tasks, especially image-text retrieval. State-of-the-art (SOTA) models for this task contain hundreds of millions of parameters and are pretrained on large external datasets, which has been shown to substantially improve overall performance. It is not easy to propose a new model with a novel architecture and train it intensively on a massive dataset with many GPUs in order to surpass the many SOTA models already available on the Internet. In this paper, we propose HADA, a compact graph-based framework that combines pretrained models to produce a better result rather than building a model from scratch. First, we create a graph structure whose nodes are the features extracted from the pretrained models, connected by edges. This graph structure is used to capture and fuse information across the pretrained models. A graph neural network is then applied to update the connections between the nodes and obtain a representative embedding vector for each image and text. Finally, we use cosine similarity to match images with their relevant texts, and vice versa, which ensures a low inference time. Our experiments show that, although HADA contains only a tiny number of trainable parameters, it improves baseline performance by more than 3.6% on the evaluation metrics of the Flickr30k dataset. Moreover, the proposed model is not trained on any external dataset, and thanks to its small number of parameters it requires only a single GPU to train. The source code is available at https://github.com/m2man/HADA.
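
The fusion idea in the abstract can be sketched in a few lines: treat the features from each pretrained backbone as graph nodes, run a simple message-passing update, pool into one embedding per image/text, and score pairs with cosine similarity. This is a minimal illustrative sketch, not the paper's implementation: the backbones, the fully connected adjacency, and the random stand-in weights are all assumptions; HADA learns its GNN weights by training.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """L2-normalize a vector so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_with_graph(node_feats, adj, weight, steps=2):
    """Toy message passing: each node averages its neighbours' features
    (via the adjacency matrix), applies a linear map and a ReLU.
    After `steps` rounds, mean-pool the nodes into one embedding."""
    h = node_feats
    deg = adj.sum(axis=1, keepdims=True)
    for _ in range(steps):
        h = np.maximum((adj @ h) / np.maximum(deg, 1) @ weight, 0.0)
    return h.mean(axis=0)

rng = np.random.default_rng(0)
d = 8
# Hypothetical features: one node per pretrained backbone (e.g. two models).
img_nodes = rng.normal(size=(2, d))
txt_nodes = rng.normal(size=(2, d))
adj = np.ones((2, 2))               # fully connected graph incl. self-loops
W = rng.normal(size=(d, d)) * 0.1   # stand-in for trained GNN weights

img_emb = l2_normalize(fuse_with_graph(img_nodes, adj, W))
txt_emb = l2_normalize(fuse_with_graph(txt_nodes, adj, W))
score = float(img_emb @ txt_emb)    # cosine similarity, in [-1, 1]
```

Because both embeddings are pre-normalized, ranking candidates at query time reduces to a single matrix-vector product, which is what keeps inference cheap.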

Results

Image Retrieval on Flickr30k

Model  | Recall@1 | Recall@5 | Recall@10
HADA   | 81.36    | 95.94    | 98.02
ALBEF  | 79.76    | 95.3     | 97.72
UNITER | 75.56    | 94.08    | 96.76

Image Retrieval on MSCOCO

Model  | Recall@1 | Recall@5 | Recall@10
HADA   | 58.46    | 82.85    | 89.66
BLIP   | 57.32    | 81.84    | 88.92
CLIP   | 37.02    | 61.66    | 71.5

Image-to-Text Retrieval on Flickr30k

Model  | Recall@1 | Recall@5 | Recall@10
ALBEF  | 92.6     | 99.3     | 99.9
UNITER | 87.3     | 98       | 99.2

Image Retrieval with Multi-Modal Query on Flickr30k

Model  | Image-to-text R@10
UNITER | 98
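
Recall@K, the metric reported above, counts the fraction of queries whose correct match appears among the top K retrieved candidates. A minimal sketch with a made-up similarity matrix (the scores and ground-truth indices below are illustrative, not from the paper):

```python
import numpy as np

def recall_at_k(sim, gt_index, k):
    """sim: (num_queries, num_candidates) similarity matrix.
    gt_index[q] is the index of the correct candidate for query q.
    Returns the fraction of queries whose ground truth ranks in the top k."""
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of k best scores
    hits = [gt_index[q] in topk[q] for q in range(sim.shape[0])]
    return float(np.mean(hits))

# Three queries scored against three candidates (cosine similarities).
sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.1, 0.8],
                [0.2, 0.7, 0.4]])
gt = [0, 2, 0]                     # correct candidate index per query

r1 = recall_at_k(sim, gt, 1)       # queries 0 and 1 hit at rank 1 -> 2/3
r3 = recall_at_k(sim, gt, 3)       # every ground truth is in the top 3 -> 1.0
```

On the benchmarks above the candidates are all captions (or all images) in the test split, and each reported value is this fraction expressed as a percentage.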

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)