Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


LayoutReader

Natural Language Processing · Introduced 2021 · 1 paper
Source Paper

Description

**LayoutReader** is a sequence-to-sequence model for reading order detection that exploits both textual and layout information, using the layout-aware language model LayoutLM as the encoder. The generation step in the encoder-decoder structure is modified to generate the reading order sequence.

In the encoding stage, LayoutReader packs the pair of source and target segments into a contiguous input sequence of LayoutLM and carefully designs the self-attention mask to control the visibility between tokens. As shown in the Figure, LayoutReader allows the tokens in the source segment to attend to each other while preventing the tokens in the target segment from attending to the rightward context. If 1 means allowing and 0 means preventing, the mask $M$ is defined as follows:

$$M_{i, j}= \begin{cases}1, & \text{if } i<j \text{ or } i, j \in \operatorname{src} \\ 0, & \text{otherwise}\end{cases}$$

where $i, j$ are indices in the packed input sequence, so they may come from either the source or the target segment; $i, j \in \operatorname{src}$ means both tokens are from the source segment.
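The mask above can be sketched as a small NumPy function. This is a minimal illustration, not the official implementation: the function name and the assumption that source tokens come first in the packed sequence are ours, and the formula is implemented literally, reading $M_{i,j}=1$ as "token $j$ may attend to token $i$".

```python
import numpy as np

def build_seq2seq_mask(src_len, tgt_len):
    """Sketch of the LayoutReader-style self-attention mask (1 = visible).

    Assumes positions 0..src_len-1 are the source segment and the rest
    are the target segment, following the M_{i,j} formula above:
    visible iff i < j or both i and j lie in the source segment.
    """
    n = src_len + tgt_len
    mask = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        for j in range(n):
            if i < j or (i < src_len and j < src_len):
                mask[i, j] = 1
    return mask
```

With `src_len=2, tgt_len=2`, the source block is fully visible to itself, while each target token only sees positions to its left, matching the prose description.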

In the decoding stage, since the target is a reordering of the source sequence, the prediction candidates can be constrained to the source segment. The model is therefore asked to predict indices in the source sequence, with the probability calculated as follows:

$$\mathcal{P}\left(x_{k}=i \mid x_{<k}\right)=\frac{\exp \left(e_{i}^{T} h_{k}+b_{k}\right)}{\sum_{j} \exp \left(e_{j}^{T} h_{k}+b_{k}\right)}$$

where $i$ is an index in the source segment; $e_{i}$ and $e_{j}$ are the $i$-th and $j$-th input embeddings of the source segment; $h_{k}$ is the hidden state at the $k$-th time step; $b_{k}$ is the bias at the $k$-th time step.
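A single decoding step of this pointer-style prediction can be sketched as a softmax over dot products between the hidden state and the source embeddings. The function name and shapes are our assumptions for illustration; only the formula itself comes from the description above.

```python
import numpy as np

def predict_step(h_k, src_embeddings, b_k=0.0):
    """Sketch of one decoding step: P(x_k = i | x_<k) over source indices.

    h_k:            decoder hidden state at step k, shape (d,)
    src_embeddings: input embeddings e_i of the source segment, shape (n_src, d)
    b_k:            bias at step k
    Implements softmax_i(e_i^T h_k + b_k) as in the equation above.
    """
    logits = src_embeddings @ h_k + b_k
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()
```

Note that because the candidates are restricted to source indices, the output distribution has exactly `n_src` entries, regardless of vocabulary size.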

Papers Using This Method

LayoutReader: Pre-training of Text and Layout for Reading Order Detection (2021-08-26)