A Scale Aggregation Block concatenates feature maps at a wide range of scales. The feature maps for each scale are generated by a stack of downsampling, convolution and upsampling operations. The proposed scale aggregation block is a standard computational module that can readily replace any given transformation $Y = T(X)$, where $X \in \mathbb{R}^{H \times W \times C}$ and $Y \in \mathbb{R}^{H \times W \times C_o}$, with $C$ and $C_o$ being the numbers of input and output channels, respectively. $T$ is any operator, such as a convolution layer or a series of convolution layers. Assume we have $L$ scales. Each scale $l$ is generated by sequentially applying a downsampling operator $D_l$, a transformation $T_l$ and an upsampling operator $U_l$:
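\begin{equation}
X'_l = D_l(X), \label{eq:eqd}
\end{equation}
\begin{equation}
Y'_l = T_l(X'_l), \label{eq:eqtl}
\end{equation}
\begin{equation}
Y_l = U_l(Y'_l), \label{eq:equ}
\end{equation}
where $X'_l \in \mathbb{R}^{H_l \times W_l \times C}$, $Y'_l \in \mathbb{R}^{H_l \times W_l \times C_l}$, and $Y_l \in \mathbb{R}^{H \times W \times C_l}$.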
Notably, $T_l$ has a structure similar to that of $T$.
Concatenating all $L$ scales gives
\begin{equation}
Y' = \big\Vert_{l=1}^{L} \, U_l(T_l(D_l(X))), \label{eq:eqall}
\end{equation}
where $\Vert$ indicates concatenating feature maps along the channel dimension, and $Y' \in \mathbb{R}^{H \times W \times \sum_{l=1}^{L} C_l}$ is the final output feature map of the scale aggregation block.
In the reference implementation, the downsampling $D_l$ with factor $s$ is implemented by a max-pooling layer with kernel size $s \times s$ and stride $s$. The upsampling $U_l$ is implemented by resizing with nearest-neighbor interpolation.
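To make the data flow concrete, the block can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the reference implementation: the function names `max_pool`, `upsample_nearest` and `scale_aggregation` are hypothetical, each $T_l$ is replaced by a simple $1 \times 1$ channel projection (a stand-in for the convolution layers the block would actually use), and rows or columns left over when $H$ or $W$ is not divisible by $s$ are simply cropped before pooling.

```python
import numpy as np


def max_pool(x, s):
    """D_l: s x s max pooling with stride s on an (H, W, C) array."""
    h, w, c = x.shape
    hl, wl = h // s, w // s
    # Crop any remainder (assumption), then take the max over each s x s window.
    windows = x[:hl * s, :wl * s].reshape(hl, s, wl, s, c)
    return windows.max(axis=(1, 3))


def upsample_nearest(y, out_h, out_w):
    """U_l: nearest-neighbor resize of (H_l, W_l, C_l) to (out_h, out_w, C_l)."""
    hl, wl, _ = y.shape
    # Map every output row/column back to its nearest source index.
    rows = (np.arange(out_h) * hl) // out_h
    cols = (np.arange(out_w) * wl) // out_w
    return y[rows][:, cols]


def scale_aggregation(x, factors, weights):
    """Y' = concat_l U_l(T_l(D_l(X))) along the channel axis.

    factors: downsampling factor s for each scale l.
    weights: one (C, C_l) matrix per scale; a 1x1 projection standing in
             for T_l (illustrative assumption, not the real T_l).
    """
    h, w, _ = x.shape
    outputs = []
    for s, wgt in zip(factors, weights):
        x_l = max_pool(x, s)                          # X'_l: (H_l, W_l, C)
        y_l = x_l @ wgt                               # Y'_l: (H_l, W_l, C_l)
        outputs.append(upsample_nearest(y_l, h, w))   # Y_l:  (H, W, C_l)
    return np.concatenate(outputs, axis=-1)


# Usage: three scales with C_l = 4, 2, 2 on an 8 x 8 x 3 input.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
weights = [rng.standard_normal((3, c_l)) for c_l in (4, 2, 2)]
y = scale_aggregation(x, factors=[1, 2, 4], weights=weights)
print(y.shape)  # (8, 8, 8): spatial size preserved, channels = 4 + 2 + 2
```

The output keeps the input's spatial size $H \times W$, while its channel count is $\sum_{l=1}^{L} C_l$, matching the definition of $Y'$ above. Choosing the $C_l$ so that they sum to $C_o$ would let such a block stand in for a transformation $T$ with $C_o$ output channels.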