Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Joint Sequence Fusion Model for Video Question Answering and Retrieval

Youngjae Yu, Jongseok Kim, Gunhee Kim

2018-08-07 · ECCV 2018

Tasks: Question Answering, Video Retrieval, Video Question Answering, Semantic Similarity, Semantic Textual Similarity, Retrieval, Visual Question Answering (VQA), Multiple-choice

Abstract

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequences (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of the two sequences into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model on three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks on the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.
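The core composition step in the abstract can be sketched in a few lines. Below is an illustrative toy version (NumPy, not the authors' code): the Joint Semantic Tensor is approximated by an elementwise product of every (frame, word) feature pair via broadcasting, and the Convolutional Hierarchical Decoder is replaced by a simple mean-pool placeholder, since the real decoder uses learned convolutions with hierarchical attention.

```python
import numpy as np

def joint_semantic_tensor(video_feats, word_feats):
    """Compose a dense pairwise representation of two sequences.

    video_feats: (T, d) array of frame features
    word_feats:  (N, d) array of word features
    Returns a (T, N, d) 3D tensor whose (t, n) slice combines
    frame t with word n (here via a Hadamard product).
    """
    return video_feats[:, None, :] * word_feats[None, :, :]

def similarity_score(jst):
    # Placeholder for the Convolutional Hierarchical Decoder:
    # mean-pool the 3D tensor to a single scalar score.
    return float(jst.mean())

# Toy sizes: 8 frames and 5 words, each with 16-dim features.
video = np.random.randn(8, 16)
words = np.random.randn(5, 16)
jst = joint_semantic_tensor(video, words)
print(jst.shape)              # (8, 5, 16)
print(similarity_score(jst))
```

In the paper, the pairwise slices are produced by a learned fusion rather than a raw Hadamard product, and the decoder scores hierarchical matches instead of pooling uniformly; this sketch only shows the tensor shapes involved.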

Results

Task             Dataset       Metric                      Value   Model
Video Retrieval  MSR-VTT-1kA   text-to-video Median Rank   13      JSFusion
Video Retrieval  MSR-VTT-1kA   text-to-video R@1           10.2    JSFusion
Video Retrieval  MSR-VTT-1kA   text-to-video R@5           31.2    JSFusion
Video Retrieval  MSR-VTT-1kA   text-to-video R@10          43.2    JSFusion
Video Retrieval  MSR-VTT       text-to-video Median Rank   13      JSFusion
Video Retrieval  MSR-VTT       text-to-video R@1           10.2    JSFusion
Video Retrieval  MSR-VTT       text-to-video R@10          43.2    JSFusion
Video Retrieval  MSR-VTT       video-to-text R@5           31.2    JSFusion
Video Retrieval  LSMDC         text-to-video Median Rank   36      JSFusion
Video Retrieval  LSMDC         text-to-video R@1           9.1     JSFusion
Video Retrieval  LSMDC         text-to-video R@5           21.2    JSFusion
Video Retrieval  LSMDC         text-to-video R@10          34.1    JSFusion

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)