Definition

Visual-Linguistic-BERT (VL-BERT) extends the BERT architecture for video captioning task.

Architecture

The model appended visual feature embedding to the BERT architecture. The text part is trained in the same way as the BERT’s masked language modeling, and the image part is trained to estimate the label of the masked visual feature.

Visual Feature Embedding

The visual features embedding of the text tokens are obtained from the entire input image, and those of the image tokens are obtained from the objects estimated by an object detection model (Faster R-CNN).

My Knowledge Base

Explorer

Visual-Linguistic-BERT

Definition

Architecture

Visual Feature Embedding

Graph View

Table of Contents

Backlinks