Definition

The Video-Audio-Text Transformer (VATT) is a model designed to process and understand information from video, audio, and text simultaneously.

Architecture

VATT is based on the Transformer architecture. It consists of three separate encoders for video, audio, and text inputs, followed by a joint multimodal encoder. The features of each modality are embedded into vectors of the same size and fed into a Transformer encoder. The model is trained with multimodal contrastive learning.
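The embedding step can be illustrated with a minimal sketch. It assumes a plain linear projection per modality; the feature sizes, the embedding size, and the names `make_projection`, `embed`, `EMBED_DIM`, and `INPUT_DIMS` are hypothetical and chosen only for illustration, not taken from the model itself.

```python
import random

EMBED_DIM = 8  # hypothetical shared embedding size
INPUT_DIMS = {"video": 12, "audio": 6, "text": 4}  # hypothetical raw feature sizes


def make_projection(in_dim, out_dim, seed):
    """Build a random (out_dim x in_dim) weight matrix as a list of rows."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


# One projection matrix per modality, all mapping to the same output size.
projections = {
    name: make_projection(in_dim, EMBED_DIM, seed=i)
    for i, (name, in_dim) in enumerate(INPUT_DIMS.items())
}


def embed(modality, feature):
    """Project a raw feature vector of one modality to the shared size."""
    weights = projections[modality]
    if len(feature) != len(weights[0]):
        raise ValueError(f"expected {len(weights[0])} values for {modality}")
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


video_vec = embed("video", [1.0] * 12)
audio_vec = embed("audio", [1.0] * 6)
text_vec = embed("text", [1.0] * 4)
```

After this step all three vectors have the same length, so the same Transformer encoder input format can be used regardless of modality.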

Multimodal Contrastive Learning

The training loss is the sum of two contrastive terms: one computed between video and audio, and one computed between text and video. For the text modality, the embeddings of all words in a sentence are summed into a single vector before the comparison.
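The structure of this loss can be sketched as follows. The text above does not specify the exact contrastive formulation, so a generic InfoNCE-style loss over cosine similarities is used here as a stand-in; the function names and the `temperature` value are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def sum_word_embeddings(word_vectors):
    """Sum the word embeddings of one sentence into a single vector."""
    return [sum(column) for column in zip(*word_vectors)]


def contrastive_loss(anchors, candidates, temperature=0.07):
    """InfoNCE-style loss: anchors[i] should match candidates[i]."""
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def total_loss(video, audio, sentences):
    """Sum of the video-audio term and the text-video term."""
    text = [sum_word_embeddings(words) for words in sentences]
    return contrastive_loss(video, audio) + contrastive_loss(text, video)


# Toy batch of two clips with 2-dimensional embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0]]
sentences = [[[0.5, 0.0], [0.5, 0.0]], [[0.0, 1.0]]]

aligned = total_loss(video, audio, sentences)
mismatched = total_loss(video, list(reversed(audio)), sentences)
```

In this toy batch, swapping the audio clips pairs each video with the wrong audio, so `mismatched` comes out larger than `aligned`.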