Definition

Hierarchical multi-modal encoder (HAMMER) is a model for moment localization tasks (identifying a short segment in a long video corpus that semantically matches a given text query). It simplified the task with Conditional Probability.

Architecture

Two-Stage MLVC (Moment Localization in Video Corpus)

The original problem is approximated to the simpler problem using the definition of Conditional Probability

Video Retrieval Task

The video retrieval task is trained through Contrastive Learning by promoting the correct video-query pairs.

Moment Localization in Single Video

The temporal localization task is trained as a classification problem with the three labels: begin, end, and other.

Hierarchical Visual Encoders

The frame encoder takes the frame sequence of a video clip and the query as input, and outputs the contextualized visual frame features. The CLS tokens of the frame encoder fed into the clip encoder, yielding a video representation