Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

LXMERT Model Compression for Visual Question Answering

Maryam Hashemi, Ghazaleh Mahmoudi, Sara Kodeiri, Hadi Sheikhi, Sauleh Eetemadi

Date: 2023-10-23
Tasks: Model Compression; Visual Question Answering (VQA)

Abstract

Large-scale pretrained models such as LXMERT are becoming popular for learning cross-modal representations on text-image pairs for vision-language tasks. According to the lottery ticket hypothesis, NLP and computer vision models contain smaller subnetworks capable of being trained in isolation to full performance. In this paper, we combine these observations to evaluate whether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA task. In addition, we perform a model size cost-benefit analysis by investigating how much pruning can be done without significant loss in accuracy. Our experimental results demonstrate that LXMERT can be effectively pruned by 40%-60% in size with a 3% loss in accuracy.
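
The abstract does not spell out the pruning procedure, so the following is only a generic illustration of low-magnitude pruning with a lottery-ticket style reset, not the authors' implementation. It assumes a single layer's parameters flattened into a plain list; the function names `magnitude_prune` and `rewind` and the noise used as a stand-in for fine-tuning are hypothetical.

```python
import random


def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    weights  : flat list of floats (hypothetical stand-in for one layer)
    sparsity : fraction of weights to remove, e.g. 0.5 for 50%
    Returns (pruned_weights, mask), where mask[i] is 1 if weight i is kept.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices in ascending order of |w|; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


def rewind(initial_weights, mask):
    """Lottery-ticket style reset: keep the mask, restore the surviving
    weights to their initial values so the subnetwork can be retrained."""
    return [w0 * m for w0, m in zip(initial_weights, mask)]


random.seed(0)
initial = [random.gauss(0.0, 1.0) for _ in range(10)]
# Assumption: small random noise stands in for the effect of fine-tuning.
trained = [w + random.gauss(0.0, 0.1) for w in initial]

pruned, mask = magnitude_prune(trained, sparsity=0.5)
ticket = rewind(initial, mask)
print(sum(mask), "of", len(mask), "weights kept")
```

In a real setting the mask would be computed over model tensors and the rewound subnetwork would be fine-tuned again on the task; the sparsity value here is arbitrary and chosen only to fall inside the 40%-60% range mentioned in the abstract.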

Results

Task                             Dataset          Metric    Value  Model
Visual Question Answering (VQA)  VQA v2 test-dev  Accuracy  70.72  LXMERT (low-magnitude pruning)
Visual Question Answering (VQA)  VQA v2 test-std  Accuracy  70.87  LXMERT (low-magnitude pruning)
