Definition

Show, Attend and Tell (SAT) is an attention-based image captioning model.

Architecture

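The model is built around an LSTM decoder. A CNN extracts a feature map from the input image, and the feature map is fed into the attention LSTM. At each decoding step, the query (Q) is the previous hidden state, and the keys (K) and values (V) are the feature vectors at each spatial location of the feature map. The resulting attention weights combine the values into a context vector, which the LSTM uses to predict the next word of the caption.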
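
The query, key, and value roles described above can be illustrated with a minimal sketch of a single attention step. This is not the authors' implementation: the original model scores each location with a small learned network, whereas this sketch swaps in a plain dot product to stay short and dependency-free. The function name `soft_attention`, the toy feature vectors, and the assumption that the hidden state and the feature vectors share the same dimension are all hypothetical and chosen only for illustration.

```python
import math


def soft_attention(query, features):
    """One soft-attention step (illustrative sketch only).

    query:    previous hidden state, a list of D floats (Q)
    features: L feature vectors of D floats each, one per
              spatial location; used as both K and V
    Returns (weights, context).
    """
    # Score each location against the query.
    # Assumption: dot-product scoring, not the learned scorer of the paper.
    scores = [sum(q * k for q, k in zip(query, feat)) for feat in features]

    # Numerically stable softmax over the L locations.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the values.
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: 3 spatial locations, 2 channels each (made-up numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [2.0, 0.0]  # stands in for the previous LSTM hidden state
weights, context = soft_attention(h_prev, features)
```

In the toy example, the first and third locations align with the query and receive equal, larger weights than the second. In a full model the context vector would then be passed, together with the previous word, into the LSTM to produce the next hidden state.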