Definition

Explicitly separate appearance (spatial) and motion (temporal) in training, the features are combined at scoring stage. From the input video, model takes a single frame and feed it to spatial stream which has CNN structure. To the motion stream, feed pre-calculated optical flows, stacked as a channel, to learn temporal dynamics, independent of appearance.