Definition

VideoBERT extends the BERT architecture to handle both video and text data.

Architecture

The frames are sampled from an input video and the CNN (S3D) features of them are extracted and used as the visual tokens.

The model takes video segment-text pairs as input and determines if the pairs match.

The textual part is trained in the same way as the BERT’s masked language modeling

The visual part is trained to estimate the cluster of the visual tokens, assigned by K-Means Clustering.