Definition

The Dense Prediction Transformer (DPT) reassembles transformer features at multiple resolutions, unlike Segmenter, which uses only fixed-size patches.

Architecture

Reassemble

$\mathrm{Resample}_{s} : \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times D} \to \mathbb{R}^{\frac{H}{s} \times \frac{W}{s} \times \hat{D}}$

The resample operation reshapes the flattened feature vectors back into an image-like grid, then changes the spatial size and the channel count: a $1 \times 1$ convolution projects the features to $\hat{D}$ channels, and a strided $3 \times 3$ convolution (to downsample) or transposed convolution (to upsample) sets the resolution. In the process, the CLS token gets special handling: it is ignored, added to the patch tokens, or concatenated to them and projected.
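To make the shape arithmetic concrete, here is a minimal pure-Python sketch that computes the output shape of the resample step and names the kind of $3 \times 3$ layer that would produce it. The function name `resample_shape` is hypothetical, and the example values (a 384 x 384 image, $p = 16$, $\hat{D} = 256$, and $s \in \{4, 8, 16, 32\}$) are assumptions chosen for illustration, not values taken from this note.

```python
def resample_shape(height, width, patch, s, d_hat):
    """Return the (H/s, W/s, D_hat) output shape and the 3x3 layer type.

    Hypothetical helper: it only does the shape bookkeeping for
    Resample_s, it does not run any convolution.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    if height % s or width % s:
        raise ValueError("image size must be divisible by s")

    out_shape = (height // s, width // s, d_hat)

    if s >= patch:
        # Output grid is coarser than (or equal to) the patch grid:
        # a strided 3x3 convolution reduces the resolution.
        if s % patch:
            raise ValueError("s must be a multiple of the patch size")
        op = f"3x3 conv, stride {s // patch}"
    else:
        # Output grid is finer than the patch grid:
        # a strided 3x3 transposed convolution increases the resolution.
        if patch % s:
            raise ValueError("the patch size must be a multiple of s")
        op = f"3x3 transposed conv, stride {patch // s}"
    return out_shape, op


# Assumed example: 384x384 input, 16x16 patches, so the token grid is 24x24.
for s in (4, 8, 16, 32):
    shape, op = resample_shape(384, 384, patch=16, s=s, d_hat=256)
    print(f"s={s:2d}: output {shape} via {op}")
```

With these assumed values, smaller $s$ gives a larger feature map, which is how features from one fixed patch grid end up at several resolutions.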

Residual Convolutional Unit

The reassembled output of the transformer passes through a residual convolutional unit before being combined with the other features in the fusion module.

Fusion

The fusion module adds the features from the previous stage element-wise to the output of the residual convolutional unit, doubles the spatial size of the feature map, and projects the result so that the channel count matches the input.
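The residual convolutional unit and the fusion step can be sketched together in PyTorch. This is a rough sketch under stated assumptions, not the reference implementation: the class names `ResidualConvUnit` and `FusionBlock` are hypothetical, the two inputs are merged by element-wise addition (an assumption based on the RefineNet-style design), and the layer layout (two $3 \times 3$ convolutions per unit, bilinear upsampling, a $1 \times 1$ projection) is one plausible arrangement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualConvUnit(nn.Module):
    """Two 3x3 convolutions wrapped in a skip connection (assumed layout)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(F.relu(x))    # non in-place ReLU keeps x intact
        out = self.conv2(F.relu(out))
        return x + out                 # residual connection


class FusionBlock(nn.Module):
    """Merge reassembled features with the previous stage, then upsample x2."""

    def __init__(self, channels: int):
        super().__init__()
        self.rcu_in = ResidualConvUnit(channels)
        self.rcu_out = ResidualConvUnit(channels)
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, reassembled, previous=None):
        x = self.rcu_in(reassembled)
        if previous is not None:
            # Assumption: element-wise addition; shapes must match exactly.
            x = x + previous
        x = self.rcu_out(x)
        # Double the spatial size of the feature map.
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        # 1x1 projection leaves the channel count equal to the input's.
        return self.project(x)


if __name__ == "__main__":
    block = FusionBlock(channels=8)
    deep = torch.randn(1, 8, 6, 6)        # coarsest reassembled features
    fused = block(deep)                   # no previous stage yet
    shallow = torch.randn(1, 8, 12, 12)   # next, finer reassembled features
    print(block(shallow, fused).shape)
```

In a full decoder each stage would typically have its own `FusionBlock`; reusing one block here just keeps the usage example short.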
