Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering

Aman Chadha, Gurneet Arora, Navpreet Kaloty

2020-11-16 · Machine Translation · Question Answering · Common Sense Reasoning · Video Question Answering · Video Captioning · Dense Video Captioning

Paper · PDF

Abstract

Most prior art in visual understanding relies solely on analyzing the "what" (e.g., event recognition) and "where" (e.g., event localization), which, in some cases, fails to describe correct contextual relationships between events or leads to incorrect underlying visual attention. Part of what defines us as human and fundamentally different from machines is our instinct to seek causality behind any association, say an event Y that happened as a direct result of event X. To this end, we propose iPerceive, a framework capable of understanding the "why" between events in a video by building a common-sense knowledge base using contextual cues to infer causal relationships between objects in the video. We demonstrate the effectiveness of our technique using the dense video captioning (DVC) and video question answering (VideoQA) tasks. Furthermore, while most prior work in DVC and VideoQA relies solely on visual information, other modalities such as audio and speech are vital for a human observer's perception of an environment. We formulate DVC and VideoQA tasks as machine translation problems that utilize multiple modalities. By evaluating the performance of iPerceive DVC and iPerceive VideoQA on the ActivityNet Captions and TVQA datasets respectively, we show that our approach furthers the state-of-the-art. Code and samples are available at: iperceive.amanchadha.com.
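To make the "DVC as multi-modal machine translation" framing concrete, below is a minimal, hypothetical sketch (not the authors' released code; see iperceive.amanchadha.com for that): visual, audio, and speech feature streams for an event proposal are projected into a shared space, fused, and decoded into a caption with a standard Transformer. All dimensions, module names, and the fusion-by-concatenation choice are illustrative assumptions.

```python
# Illustrative sketch only; not the iPerceive implementation.
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    def __init__(self, d_visual=1024, d_audio=128, d_speech=768,
                 d_model=512, vocab_size=10000):
        super().__init__()
        # Per-modality projections into a shared embedding space.
        self.proj_visual = nn.Linear(d_visual, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_speech = nn.Linear(d_speech, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual, audio, speech, caption_tokens):
        # Simple early fusion: concatenate projected modality streams along
        # the time axis; the paper's actual fusion strategy may differ.
        memory = torch.cat([self.proj_visual(visual),
                            self.proj_audio(audio),
                            self.proj_speech(speech)], dim=1)
        memory = self.encoder(memory)
        tgt = self.token_embed(caption_tokens)
        out = self.decoder(tgt, memory)
        return self.lm_head(out)  # per-token vocabulary logits


# Toy usage: one event proposal with 20 visual, 30 audio, and 10 speech steps,
# decoding a 12-token caption prefix.
model = MultiModalCaptioner()
logits = model(torch.randn(1, 20, 1024), torch.randn(1, 30, 128),
               torch.randn(1, 10, 768), torch.randint(0, 10000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 10000])
```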

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | TVQA | Accuracy | 76.96 | iPerceive (Chadha et al., 2020) |
| Video Captioning | ActivityNet Captions | BLEU-3 | 2.93 | iPerceive (Chadha et al., 2020) |
| Video Captioning | ActivityNet Captions | BLEU-4 | 1.29 | iPerceive (Chadha et al., 2020) |
| Video Captioning | ActivityNet Captions | METEOR | 7.87 | iPerceive (Chadha et al., 2020) |
| Dense Video Captioning | ActivityNet Captions | BLEU-3 | 2.93 | iPerceive (Chadha et al., 2020) |
| Dense Video Captioning | ActivityNet Captions | BLEU-4 | 1.29 | iPerceive (Chadha et al., 2020) |
| Dense Video Captioning | ActivityNet Captions | METEOR | 7.87 | iPerceive (Chadha et al., 2020) |
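For context on the metric columns above, here is a toy illustration of what BLEU-3, BLEU-4, and METEOR measure, using NLTK. Official ActivityNet Captions numbers come from the dense-captioning evaluation toolkit, averaged over event proposals at multiple temporal-overlap thresholds, so the values printed here are not comparable to the table; the example sentences are made up.

```python
# Illustrative metric computation only; not the official evaluation pipeline.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs nltk.download('wordnet')

reference = "a man is slicing vegetables in the kitchen".split()
hypothesis = "a man slices vegetables in a kitchen".split()

smooth = SmoothingFunction().method1  # smoothing for short toy sentences
bleu3 = sentence_bleu([reference], hypothesis,
                      weights=(1/3, 1/3, 1/3), smoothing_function=smooth)
bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
meteor = meteor_score([reference], hypothesis)

print(f"BLEU-3: {bleu3:.3f}  BLEU-4: {bleu4:.3f}  METEOR: {meteor:.3f}")
```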

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)