My Knowledge Base

Describing Videos by Exploiting Temporal Structure

Mar 12, 2026

machine_learning/deep_learning
machine_learning/multimodal_learning
machine_learning/computer_vision/video_understanding

Definition

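A video captioning model that pairs an LSTM with an attention mechanism: at each decoding step, the LSTM attends over features extracted from the video frames to generate the next word of the caption.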

Architecture

The model is built around an LSTM. A CNN extracts feature maps from the input frames, and these features are fed into the attention LSTM. At each step, the query (Q) is the previous hidden state, and the keys (K) and values (V) are the extracted feature maps, so the attention weights determine how much each frame's features contribute to the current step.
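
To make the Q/K/V roles concrete, here is a minimal sketch of a single attention step in plain Python. It rests on several assumptions that are not in the note: the function name `temporal_attention` is hypothetical, scores are computed with a scaled dot product (the note does not say which scoring function the model uses), each frame is reduced to one feature vector, and the hidden state and frame features share the same dimension so no learned projections are needed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(prev_hidden, frame_features):
    """One attention step over per-frame feature vectors.

    prev_hidden    -- previous LSTM hidden state, used as the query (Q)
    frame_features -- one feature vector per frame, used as keys (K) and values (V)

    Assumption: scaled dot-product scoring, with Q and K of equal dimension.
    Returns (weights, context), where context is the weighted sum of the values.
    """
    dim = len(prev_hidden)
    # Score each frame by the similarity between the query and that frame's key.
    scores = [
        sum(q * k for q, k in zip(prev_hidden, feat)) / math.sqrt(dim)
        for feat in frame_features
    ]
    # Normalize the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the frame features.
    feat_dim = len(frame_features[0])
    context = [
        sum(w * feat[j] for w, feat in zip(weights, frame_features))
        for j in range(feat_dim)
    ]
    return weights, context


# Toy input: three frames with 4-dimensional features (made-up values).
frame_features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
prev_hidden = [0.0, 2.0, 0.0, 0.0]

weights, context = temporal_attention(prev_hidden, frame_features)
```

In this toy input the query aligns with the second frame, so that frame receives the largest weight. In a full model the resulting context vector would be passed to the LSTM together with the previous word, and the step would repeat with the updated hidden state as the next query.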

