Definition

The Dense Prediction Transformer (DPT) reassembles transformer features into image-like feature maps at multiple resolutions, unlike Segmenter, which works only with fixed-size patches at a single resolution.
Architecture
Reassemble

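The reassemble operation reshapes the flattened token vectors back into an image-like feature map, then changes the channel and spatial sizes with a 1x1 convolution followed by a strided convolution (to downsample) or a transposed convolution, also called a deconvolution (to upsample). Along the way, the CLS token is handled specially: it is ignored, added to every patch token, or concatenated to every patch token and projected back to the original feature size.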
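
The reassemble steps can be sketched in NumPy as below. This is a minimal illustration rather than the reference implementation: the function names (`read_tokens`, `to_grid`, `resample`) and all weight arrays are hypothetical, the CLS token is assumed to be the first token, and the spatial convolutions assume a kernel size equal to the stride to keep the code short.

```python
import numpy as np

def read_tokens(tokens, mode="ignore", proj=None):
    """Handle the CLS token. tokens: (1 + N, D), CLS assumed to come first."""
    cls, patches = tokens[0], tokens[1:]
    if mode == "ignore":
        return patches
    if mode == "add":
        return patches + cls  # broadcast the CLS token onto every patch token
    if mode == "project":
        # Concatenate CLS to every patch token, then project 2D -> D.
        # proj: (2D, D) stands in for learned weights; any nonlinearity is omitted.
        stacked = np.concatenate(
            [patches, np.broadcast_to(cls, patches.shape)], axis=1)
        return stacked @ proj
    raise ValueError(f"unknown mode: {mode}")

def to_grid(patches, grid_h, grid_w):
    """Reshape (N, D) tokens into an image-like (D, grid_h, grid_w) map."""
    n, d = patches.shape
    assert n == grid_h * grid_w, "token count must match the patch grid"
    return patches.reshape(grid_h, grid_w, d).transpose(2, 0, 1)

def resample(fmap, w_proj, w_spatial=None, mode="same"):
    """1x1 convolution, then optional strided or transposed convolution."""
    # 1x1 convolution: w_proj has shape (C_out, C_in).
    x = np.einsum("oc,chw->ohw", w_proj, fmap)
    if mode == "same":
        return x
    c, h, w = x.shape
    if mode == "up":
        # Transposed convolution, kernel = stride = s; w_spatial: (C_in, C_out, s, s).
        _, o, s, _ = w_spatial.shape
        y = np.einsum("coij,chw->ohiwj", w_spatial, x)
        return y.reshape(o, h * s, w * s)
    if mode == "down":
        # Strided convolution, kernel = stride = s; w_spatial: (C_out, C_in, s, s).
        o, _, s, _ = w_spatial.shape
        blocks = x.reshape(c, h // s, s, w // s, s)
        return np.einsum("ocij,chiwj->ohw", w_spatial, blocks)
    raise ValueError(f"unknown mode: {mode}")

# Usage with random stand-in weights: 24 patch tokens on a 4x6 grid, D = 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1 + 4 * 6, 8))
patches = read_tokens(tokens, "project", proj=rng.normal(size=(16, 8)))
fmap = to_grid(patches, 4, 6)                       # shape (8, 4, 6)
up = resample(fmap, rng.normal(size=(5, 8)),
              rng.normal(size=(5, 5, 2, 2)), "up")  # shape (5, 8, 12)
```

Calling `resample` with different modes and channel counts on different transformer layers is what yields feature maps at several resolutions.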
Residual Convolutional Unit

The reassembled output of the transformer passes through a residual convolutional unit before it is merged into the fusion module.
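
A residual convolutional unit can be sketched as follows. The text above does not spell out its internals, so this assumes the RefineNet-style layout (ReLU, 3x3 convolution, ReLU, 3x3 convolution, plus a skip connection). The names `conv3x3` and `residual_conv_unit` and the weight arrays are hypothetical, and biases are left out.

```python
import numpy as np

def conv3x3(x, weight):
    """3x3 convolution with zero padding 1 and stride 1.

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3). Output keeps H and W.
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    # Sum the contribution of each of the nine kernel positions.
    for i in range(3):
        for j in range(3):
            out += np.einsum("oc,chw->ohw",
                             weight[:, :, i, j], padded[:, i:i + h, j:j + w])
    return out

def residual_conv_unit(x, w1, w2):
    """Assumed layout: x + conv(relu(conv(relu(x)))). Channel count is unchanged."""
    y = conv3x3(np.maximum(x, 0.0), w1)
    y = conv3x3(np.maximum(y, 0.0), w2)
    return x + y
```

Because of the skip connection, the unit preserves the shape of its input, so its output can be merged directly with a feature map of the same size.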
Fusion

The fusion module combines the features from the previous fusion stage with the output of the residual convolutional unit, doubles the spatial size of the feature map, and projects the channel count back to the input size.
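
One fusion stage can be sketched as below, under several assumptions: the two inputs are merged by element-wise addition (as in RefineNet-style fusion), upsampling is nearest-neighbour instead of learned or bilinear, and the residual convolutional units are passed in as callables. The names `upsample2x` and `fuse` and all arguments are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(prev, skip, rcu_skip, rcu_out, w_proj):
    """One fusion stage.

    prev:     output of the previous fusion stage, or None for the first stage
    skip:     reassembled transformer features, same shape as prev
    rcu_skip: callable applied to skip (a residual convolutional unit)
    rcu_out:  callable applied after merging (a second residual unit)
    w_proj:   (C_out, C_in) weights of the final 1x1 projection
    """
    merged = rcu_skip(skip)
    if prev is not None:
        merged = prev + merged          # assumed merge: element-wise addition
    merged = rcu_out(merged)
    merged = upsample2x(merged)         # double the spatial size
    return np.einsum("oc,chw->ohw", w_proj, merged)  # 1x1 projection

# Usage with identity stand-ins for the residual units.
identity = lambda t: t
prev = np.ones((4, 2, 3))
skip = 2.0 * np.ones((4, 2, 3))
out = fuse(prev, skip, identity, identity, np.eye(4))  # shape (4, 4, 6)
```

Chaining such stages from the coarsest feature map to the finest doubles the resolution at each step, so each stage's output matches the size of the next reassembled feature map.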