Context-aware Visual Attention-based (CoVA) webpage object detection pipeline
The Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (CoVA) aims to learn a function f that predicts labels y = [y_1, y_2, ..., y_N] for a webpage containing N elements. The input to CoVA consists of:
1. a screenshot of the webpage,
2. a list of bounding boxes [x, y, w, h] of the web elements, and
3. neighborhood information for each element, obtained from the DOM tree.
This information is processed in four stages:
1. graph representation extraction for the webpage,
2. the Representation Network (RN),
3. the Graph Attention Network (GAT), and
4. a fully connected (FC) layer.
The graph representation extraction computes, for every web element i, its set N_i of K neighboring web elements. The RN consists of a Convolutional Neural Network (CNN) and a positional encoder, and learns a visual representation for each web element i ∈ {1, ..., N}. The GAT combines the visual representation of the web element i to be classified with those of its neighbors, i.e., ∀k ∈ N_i, to compute the contextual representation for web element i. Finally, the visual and contextual representations of the web element are concatenated and passed through the FC layer to obtain the classification output.
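The last two stages can be illustrated with a small, dependency-free sketch. It is not the CoVA implementation: the visual representations are random stand-ins for the RN output, the neighbor lists are hand-written, and plain scaled dot-product attention is swapped in for the learned GAT attention. All names (`contextual_representation`, `classify`, `fc_weights`, the toy sizes) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contextual_representation(i, visual, neighbors):
    """Attention-weighted sum of the visual representations of i's neighbors.

    Assumption: scaled dot-product scores stand in for the learned
    attention mechanism of a real GAT layer.
    """
    dim = len(visual[i])
    scores = [dot(visual[i], visual[k]) / math.sqrt(dim) for k in neighbors[i]]
    weights = softmax(scores)
    context = [
        sum(w * visual[k][d] for w, k in zip(weights, neighbors[i]))
        for d in range(dim)
    ]
    return context, weights


def classify(i, visual, neighbors, fc_weights, fc_bias):
    """Concatenate visual and contextual representations, then apply one FC layer."""
    context, _ = contextual_representation(i, visual, neighbors)
    features = visual[i] + context  # list concatenation: length 2 * dim
    logits = [dot(row, features) + b for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)


# Toy setup: N elements, D-dimensional representations, K neighbors, C classes.
rng = random.Random(0)
N, D, K, C = 4, 3, 2, 2
visual = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
neighbors = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [1, 2]}  # K neighbors each
fc_weights = [[rng.uniform(-1, 1) for _ in range(2 * D)] for _ in range(C)]
fc_bias = [0.0] * C

probs = [classify(i, visual, neighbors, fc_weights, fc_bias) for i in range(N)]
labels = [max(range(C), key=p.__getitem__) for p in probs]
```

Here `probs[i]` is a distribution over the C classes for web element i and `labels` plays the role of the predicted y. In a trained model, the representations, the attention parameters, and the FC weights would all be learned jointly rather than drawn at random.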