[Paper notes] Mask2Former for Video Instance Segmentation

This paper extends Mask2Former to video. The authors are largely the same as those of Mask2Former.
f:id:Ninhydrin:20211223094710p:plain

Method

There are three changes relative to Mask2Former.

Joint spatio-temporal masked attention

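In Mask2Former the masked attention ran over the height and width dimensions; here a time dimension is added (and that is all).
The mask therefore becomes equation (2) below, where $\textbf{M}_{l-1} \in \{0, 1\}^{N \times TH_lW_l}$.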
f:id:Ninhydrin:20211223094928p:plain
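
As a rough illustration of what attending over the flattened $T H_l W_l$ axis looks like, here is a minimal NumPy sketch. It assumes the usual Mask2Former convention that positions where the binarized mask is 1 get an additive bias of 0 and all other positions get $-\infty$. The function name `masked_attention`, the unbatched single-head shapes, and the fallback for queries with an empty mask are assumptions made for this sketch, not details taken from the paper. The residual connection and the learned projections are left out.

```python
import numpy as np

def masked_attention(Q, K, V, M_bin):
    """Single-head masked attention over a flattened (T*H*W) axis.

    Q:     (N, C)       query features, one row per object query
    K, V:  (T*H*W, C)   video features flattened over time, height, width
    M_bin: (N, T*H*W)   binarized mask prediction from the previous layer
    """
    M_bin = M_bin.astype(bool)
    # Assumption for this sketch: if a query's mask is empty, let it attend
    # everywhere so the softmax does not produce NaN.
    empty = ~M_bin.any(axis=1, keepdims=True)
    allowed = M_bin | empty
    # Additive bias: 0 inside the mask, -inf outside.
    bias = np.where(allowed, 0.0, -np.inf)
    logits = Q @ K.T + bias
    # Numerically stable softmax over the T*H*W axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

# Toy example: N=3 queries, a 2-frame clip of 2x2 feature maps, C=4 channels.
T, H, W, C, N = 2, 2, 2, 4, 3
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, C))
K = rng.standard_normal((T * H * W, C))
V = rng.standard_normal((T * H * W, C))
M_bin = rng.integers(0, 2, size=(N, T * H * W))
M_bin[0] = 0   # query 0 has an empty mask, which exercises the fallback
M_bin[1, 0] = 1  # make sure query 1 has at least one allowed position
out, weights = masked_attention(Q, K, V, M_bin)
```

The only difference from the image case is the length of the flattened axis: the same query attends to all frames of the clip at once, restricted to its own predicted mask.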

Temporal positional encoding

The positional encoding is extended along the time axis.
f:id:Ninhydrin:20211223095132p:plain

$e_{\verb|pos|} \in \mathbb{R}^{T \times H_l \times W_l \times C}$, and its two components are $e_{\verb|pos-t|} \in \mathbb{R}^{T \times 1 \times 1 \times C}$ and $e_{\verb|pos-xy|} \in \mathbb{R}^{1 \times H_l \times W_l \times C}$. $\oplus$ denotes numpy-style broadcasting. In other words, the encoding is unique for every combination of height, width, and time.
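
The broadcasting can be checked with a few lines of NumPy. This is only a sketch of the shapes: the random values stand in for the actual positional encodings, and the variable names and toy sizes are assumptions made for illustration.

```python
import numpy as np

# Toy sizes (hypothetical): 2 frames, 3x4 feature map, 8 channels.
T, H, W, C = 2, 3, 4, 8
rng = np.random.default_rng(0)

# Random placeholders for the temporal and spatial encodings.
e_pos_t = rng.standard_normal((T, 1, 1, C))   # varies only along time
e_pos_xy = rng.standard_normal((1, H, W, C))  # varies only along height/width

# numpy-style broadcasting: (T,1,1,C) + (1,H,W,C) -> (T,H,W,C)
e_pos = e_pos_t + e_pos_xy

# Count distinct encoding vectors across all (t, y, x) positions.
num_unique = np.unique(e_pos.reshape(-1, C), axis=0).shape[0]
```

With these placeholder values, `e_pos` has shape `(T, H, W, C)` and `num_unique` equals `T * H * W`, so every (time, height, width) position ends up with its own encoding vector.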