Definition

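VideoBERT extends the BERT architecture to model video and text jointly: both modalities are represented as sequences of discrete tokens and processed by a single Transformer.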
Architecture
Frames are sampled from the input video, S3D (a CNN) features are extracted from them, and the features are quantized by K-Means clustering into discrete visual tokens.
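The preprocessing step can be sketched as follows. This is a minimal illustration, not the actual pipeline: `split_into_clips` and `extract_features` are hypothetical helpers, the clip length of 30 frames is an assumed value, and a simple average over time and space stands in for the S3D network.

```python
import numpy as np

def split_into_clips(frames, clip_len):
    """Group sampled frames into non-overlapping clips of clip_len frames.

    Trailing frames that do not fill a whole clip are dropped.
    """
    n_clips = len(frames) // clip_len
    usable = frames[: n_clips * clip_len]
    return usable.reshape(n_clips, clip_len, *frames.shape[1:])

def extract_features(clips):
    """Stand-in for S3D (assumption): average each clip over time, height,
    and width, leaving one feature vector (one value per channel) per clip."""
    return clips.mean(axis=(1, 2, 3))

# Toy video: 65 frames of 8x8 RGB noise.
frames = np.random.default_rng(0).random((65, 8, 8, 3))
clips = split_into_clips(frames, clip_len=30)   # 2 clips, 5 frames dropped
features = extract_features(clips)              # one vector per clip
```

Each row of `features` corresponds to one clip and is the input that would later be mapped to a visual token.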
Linguistic-Visual Alignment Task
The model takes pairs of a video segment and a text as input and predicts whether each pair matches.
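One way to build training examples for this task is sketched below. It is an assumption-laden illustration: `make_alignment_examples` and `build_input` are hypothetical helpers, negatives are drawn by pairing a text with the video segment of a different pair, and the `[>]` separator between the text and video tokens is an assumed input convention.

```python
import random

def make_alignment_examples(pairs, seed=0):
    """For every (text, video) pair, emit one matching example (label 1)
    and one mismatched example (label 0) that reuses the text with the
    video tokens of a different pair. Requires at least two pairs."""
    rng = random.Random(seed)
    examples = []
    for i, (text, video) in enumerate(pairs):
        examples.append((text, video, 1))
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        examples.append((text, pairs[j][1], 0))
    return examples

def build_input(text_tokens, video_tokens):
    """Concatenate both modalities into one token sequence (assumed layout)."""
    return ["[CLS]"] + text_tokens + ["[>]"] + video_tokens + ["[SEP]"]

pairs = [
    (["add", "the", "onions"], ["v12", "v7"]),
    (["stir", "the", "sauce"], ["v3", "v44"]),
    (["slice", "the", "bread"], ["v9", "v21"]),
]
examples = make_alignment_examples(pairs)
sequences = [(build_input(text, video), label) for text, video, label in examples]
```

A binary classifier on top of the model's output for the first token would then be trained against the 0/1 labels.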
Masked Language Modeling (MLM)
The textual part is trained with the same masked language modeling objective as BERT.
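For reference, BERT-style masking can be sketched as below. The function name `mask_tokens` is hypothetical; the rates follow the original BERT recipe (15% of positions selected, of which 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged), and it is an assumption that the same rates apply here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels). labels[i] holds the original token at
    positions the model must predict and None everywhere else."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK           # replace with the mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # replace with a random token
        # otherwise keep the original token in place
    return corrupted, labels

tokens = "cut the steak into small pieces and season with salt".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab)
```

The loss is computed only at positions where `labels` is not `None`.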
Masked Frame Modeling (MFM)
The visual part is trained to predict, for each masked visual token, the cluster ID assigned to it by K-Means clustering.
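The clustering and the resulting training target can be sketched as follows. This is a toy version under stated assumptions: a flat K-Means with 8 clusters on random 16-dimensional features, hypothetical helper names (`kmeans`, `assign_tokens`, `cross_entropy`), and random logits standing in for the model's predictions.

```python
import numpy as np

def assign_tokens(features, centroids):
    """Map each feature vector to the index of its nearest centroid."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        labels = assign_tokens(features, centroids)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target classes."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))      # stand-in for clip features
centroids = kmeans(features, k=8)
tokens = assign_tokens(features[:10], centroids)   # visual token IDs of one video

mask = np.zeros(10, dtype=bool)
mask[[2, 7]] = True                        # positions hidden from the model
logits = rng.normal(size=(10, 8))          # stand-in for model outputs
loss = cross_entropy(logits[mask], tokens[mask])
```

Only the masked positions contribute to `loss`, mirroring how masked language modeling treats text tokens.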