Definition

The model uses an LSTM architecture with attention for the video captioning task.

Architecture

The model is built around an LSTM. A CNN extracts feature maps from the input frames, and these feature maps are fed into the Attention LSTM. At each step, the query (Q) is the previous hidden state, and the key (K) and value (V) are both the extracted feature maps.
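
As a rough illustration of the data flow described above, here is a minimal pure-Python sketch of one Attention LSTM step. It is not the model's actual code, and several details are assumptions: scaled dot-product scoring (the description does not say how Q and K are compared), a hidden size equal to the feature size so that Q and K can be compared without a projection, and an LSTM input that consists only of the attended context (a captioning decoder would normally also take the previous word embedding). All function names, the weight matrix W, and the bias b are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over N feature vectors (assumed scoring)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the value vectors.
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
    return context, weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W maps [x; h_prev] to the four stacked gates (i, f, g, o)."""
    n = len(h_prev)
    z = [a + bi for a, bi in zip(matvec(W, x + h_prev), b)]
    i = [sigmoid(v) for v in z[0:n]]            # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]        # forget gate
    g = [math.tanh(v) for v in z[2 * n:3 * n]]  # candidate cell state
    o = [sigmoid(v) for v in z[3 * n:4 * n]]    # output gate
    c = [fi * cp + ii * gi for fi, cp, ii, gi in zip(f, c_prev, i, g)]
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h, c

def attention_lstm_step(features, h_prev, c_prev, W, b):
    # Q = previous hidden state; K = V = CNN feature vectors.
    context, weights = attend(h_prev, features, features)
    h, c = lstm_step(context, h_prev, c_prev, W, b)
    return h, c, weights

# Toy usage with random stand-ins for CNN features and LSTM weights.
random.seed(0)
d = 4             # hidden size == feature size (assumption)
num_regions = 6   # number of feature vectors taken from the feature maps
features = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(num_regions)]
W = [[random.uniform(-0.1, 0.1) for _ in range(2 * d)] for _ in range(4 * d)]
b = [0.0] * (4 * d)
h, c = [0.0] * d, [0.0] * d
for _ in range(3):
    h, c, weights = attention_lstm_step(features, h, c, W, b)
```

Because the query is the previous hidden state, the attention weights change from step to step, so the LSTM can look at different parts of the feature maps as it generates each word. If the hidden size and feature size differed, a learned projection of Q or K would be needed before the dot product.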