Definition

Contrastive Bidirectional Transformer (CBT) applies contrastive learning to the training of BERT for the video captioning task.
Architecture
CBT consists of two streams: a video stream and a text stream. The video stream extracts frame features and the text stream extracts word features; these features are concatenated and passed through the cross-modal transformer to make a prediction. The text stream uses a pre-trained BERT, while the video stream is trained with contrastive learning: two frame features form a positive pair if they originate from the same video and a negative pair otherwise.
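The positive/negative rule for frame features can be illustrated with a small contrastive loss. This is a minimal sketch, not CBT's actual objective: it assumes an InfoNCE-style loss with dot-product similarity, and the names `frame_contrastive_loss`, `features`, `video_ids`, and `temperature` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def frame_contrastive_loss(features, video_ids, anchor, temperature=1.0):
    """InfoNCE-style loss for one anchor frame feature (assumed form).

    Frames that share the anchor's video id count as positives;
    frames from any other video count as negatives.
    """
    pos, total = 0.0, 0.0
    for i, (feat, vid) in enumerate(zip(features, video_ids)):
        if i == anchor:
            continue  # the anchor is not compared with itself
        score = math.exp(dot(features[anchor], feat) / temperature)
        total += score
        if vid == video_ids[anchor]:
            pos += score
    if pos == 0.0:
        raise ValueError("anchor needs at least one positive frame")
    # Loss is low when positives dominate the similarity mass.
    return -math.log(pos / total)


# Toy data: four 2-d frame features taken from two videos.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_true = frame_contrastive_loss(features, ["a", "a", "b", "b"], anchor=0)
loss_shuffled = frame_contrastive_loss(features, ["a", "b", "a", "b"], anchor=0)
```

With the true grouping, the anchor's positive is the frame most similar to it, so `loss_true` comes out lower than `loss_shuffled`, where a dissimilar frame is labeled as the positive.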
Cross-Modal Transformer
The cross-modal transformer takes pairs of a video segment and a text as input and determines whether each pair matches.
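The matching step can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the CBT model: it uses a single self-attention layer with identity projections, mean pooling, and a logistic output, and the names `self_attention`, `match_probability`, `w`, and `b` are hypothetical. Frame and word features are assumed to share the same dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def match_probability(frame_feats, word_feats, w, b=0.0):
    """Score how well a video segment and a text match (illustrative only)."""
    tokens = frame_feats + word_feats  # concatenate the two streams
    attended = self_attention(tokens)  # every token attends across both modalities
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in attended) / len(attended) for d in range(dim)]
    logit = sum(wi * pi for wi, pi in zip(w, pooled)) + b
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid gives a match probability


# Toy pair: two frame features and one word feature, with untrained weights.
p = match_probability([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], w=[0.5, -0.5])
```

In a trained model the weights would be learned so that matching pairs score near 1 and mismatched pairs near 0; here `p` only demonstrates the data flow from concatenated features to a single score.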