Definition

The Visual Attention model applies attention spatially: the hidden state of the LSTM (shape: 1, 1, channels) is used as the query, and the CNN feature map (shape: height, width, channels) is used as both the key and the value. The result of the attention (shape: 1, 1, channels) is used as the hidden state of the next step.
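
The definition above can be illustrated with a minimal sketch in plain Python. It assumes dot-product scoring between the query and each spatial location, followed by a softmax over all locations. The definition does not specify a scoring function, so this choice is an assumption. The function name spatial_attention is hypothetical, and the (1, 1, channels) query is represented as a flat list of channel values.

```python
import math

def spatial_attention(query, feature_map):
    """Attend over the spatial grid of a CNN feature map.

    query:       list of C floats (LSTM hidden state, shape 1 x 1 x C squeezed)
    feature_map: H x W x C nested list (CNN features, used as key and value)
    Returns (context, weights): context has C floats, weights is H x W.
    """
    width = len(feature_map[0])
    channels = len(query)
    # Flatten the H x W grid into a list of C-dimensional location vectors.
    locations = [vec for row in feature_map for vec in row]
    # Assumed scoring function: dot product of the query with each key.
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in locations]
    # Softmax over all spatial locations (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the values (the same feature vectors) per channel.
    context = [
        sum(w * vec[c] for w, vec in zip(weights, locations))
        for c in range(channels)
    ]
    # Reshape the flat weights back to H x W for inspection.
    weights_2d = [
        weights[i * width:(i + 1) * width] for i in range(len(feature_map))
    ]
    return context, weights_2d

# Toy example: a 2 x 2 feature map with 2 channels.
feature_map = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
hidden = [2.0, 0.0]  # stand-in for the LSTM hidden state
context, weights = spatial_attention(hidden, feature_map)
```

Here context has one value per channel, matching the (1, 1, channels) shape of the attention result, and weights shows how strongly each spatial location was attended. The sketch stops at the attention result and does not model the LSTM step that consumes it. A real model would typically use a tensor library and learned projections for the query, key, and value.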