Definition

Contrastive Bidirectional Transformer (CBT) applies contrastive learning to the training of BERT for the video captioning task.

Architecture

CBT consists of two streams: a video stream and a text stream. The video stream extracts frame features and the text stream extracts word features; these features are concatenated and passed through the cross-modal transformer to produce a prediction. The text stream uses a pre-trained BERT, while the video stream is trained with contrastive learning: two frame features form a positive pair if they originate from the same video and a negative pair otherwise.
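The contrastive objective for the video stream can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss with dot-product similarity; the function names, the temperature parameter, and the toy two-dimensional features are hypothetical and chosen for illustration only, not taken from the CBT implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """InfoNCE-style loss (an assumed form, for illustration).

    anchor, positive: frame features taken from the same video.
    negatives: frame features taken from other videos.
    The loss is small when the anchor is more similar to the positive
    than to any of the negatives.
    """
    # The positive pair's similarity goes first, then the negatives.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Toy features: the anchor matches the first candidate, not the second.
anchor = [1.0, 0.0]
same_video = [1.0, 0.0]
other_video = [0.0, 1.0]

loss_correct = contrastive_loss(anchor, same_video, [other_video])
loss_swapped = contrastive_loss(anchor, other_video, [same_video])
```

In this toy case `loss_correct` is lower than `loss_swapped`, which is the behavior the training signal relies on: features from the same video are pulled together and features from different videos are pushed apart.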

Cross-Modal Transformer

The cross-modal transformer takes pairs of a video segment and a text as input and determines whether each pair matches.
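The matching step described above can be sketched as a data flow: concatenate the two feature sequences, let every token attend to every other token, pool, and map the result to a match probability. This is a toy stand-in under stated assumptions, not the actual model: the single attention layer uses identity projections, and `self_attention`, `match_score`, and the weights `w` and `b` are hypothetical names introduced for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attention(seq):
    """Single-head self-attention with identity projections.

    A deliberately simplified stand-in for one transformer layer.
    """
    scale = math.sqrt(len(seq[0]))
    out = []
    for query in seq:
        scores = [dot(query, key) / scale for key in seq]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of all tokens, computed per feature dimension.
        out.append([
            sum(w / total * value[i] for w, value in zip(weights, seq))
            for i in range(len(query))
        ])
    return out


def match_score(video_feats, text_feats, w, b=0.0):
    """Return a probability-like score that the video and text match."""
    # Concatenate along the sequence axis so frames and words can
    # attend to each other.
    seq = video_feats + text_feats
    attended = self_attention(seq)
    dim = len(attended[0])
    # Mean-pool over all tokens, then apply a linear layer and a sigmoid.
    pooled = [sum(tok[i] for tok in attended) / len(attended) for i in range(dim)]
    return 1.0 / (1.0 + math.exp(-(dot(pooled, w) + b)))


# Two frame features and three word features, all of dimension 2.
video = [[1.0, 0.0], [0.9, 0.1]]
text = [[0.8, 0.2], [1.0, 0.0], [0.7, 0.3]]
score = match_score(video, text, w=[1.0, -1.0])
```

In a trained model the projection and classifier weights would be learned from matching and non-matching video-text pairs; here they are fixed by hand only to show the shape of the computation.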