Definition

Visual-Linguistic-BERT (VL-BERT) extends the BERT architecture for video captioning task.
Architecture
The model appended visual feature embedding to the BERT architecture. The text part is trained in the same way as the BERT’s masked language modeling, and the image part is trained to estimate the label of the masked visual feature.
Visual Feature Embedding
The visual features embedding of the text tokens are obtained from the entire input image, and those of the image tokens are obtained from the objects estimated by an object detection model (Faster R-CNN).