Definition

Detection Transformer (DETR) applies a Transformer encoder-decoder architecture to the object detection problem.

Architecture

DETR extracts features from the input image using a CNN backbone. The resulting feature map is flattened into a sequence, positional encodings are added, and the sequence is fed into the encoder. The decoder takes the encoded image features and a set of learnable object queries as inputs. Each decoder output embedding corresponds to one candidate object in the image. The decoder outputs are passed through two prediction heads: a classification head, which predicts the class label (including a "no object" class), and a box head, which predicts the bounding box.
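
The data flow described above can be sketched in PyTorch. This is a minimal illustration, not the reference implementation. The class name TinyDETR, the two-layer convolutional stand-in for the backbone, the learned row/column positional encodings, and all sizes (32 channels, 10 queries, 5 classes) are assumptions chosen to keep the example small. It also simplifies by adding positional encodings only once at the encoder input and by feeding the object queries directly to the decoder.

```python
import torch
from torch import nn


class TinyDETR(nn.Module):
    """Hypothetical, heavily simplified DETR-style model for illustration."""

    def __init__(self, num_classes=5, d_model=32, num_queries=10, max_hw=16):
        super().__init__()
        # Stand-in backbone (assumption): two strided convs instead of a ResNet.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned row/column positional encodings (assumption).
        self.row_embed = nn.Parameter(torch.randn(max_hw, d_model // 2))
        self.col_embed = nn.Parameter(torch.randn(max_hw, d_model // 2))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=64, batch_first=True,
        )
        # One learnable embedding per object query.
        self.object_queries = nn.Parameter(torch.randn(num_queries, d_model))
        # Two prediction heads: class logits (+1 for "no object") and boxes.
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, images):
        feats = self.backbone(images)                      # (B, C, H, W)
        b, c, h, w = feats.shape
        # Build one positional vector per feature-map location.
        pos = torch.cat([
            self.col_embed[:w].unsqueeze(0).expand(h, w, -1),
            self.row_embed[:h].unsqueeze(1).expand(h, w, -1),
        ], dim=-1).flatten(0, 1)                           # (H*W, C)
        # Flatten the feature map into a sequence and add positions.
        src = feats.flatten(2).transpose(1, 2) + pos       # (B, H*W, C)
        tgt = self.object_queries.unsqueeze(0).expand(b, -1, -1)  # (B, N, C)
        hs = self.transformer(src, tgt)                    # (B, N, C)
        # Boxes are squashed to [0, 1] as normalized coordinates.
        return self.class_head(hs), self.box_head(hs).sigmoid()


model = TinyDETR().eval()
with torch.no_grad():
    logits, boxes = model(torch.randn(2, 3, 32, 32))
print(logits.shape, boxes.shape)  # (2, 10, 6) class logits, (2, 10, 4) boxes
```

Each of the 10 queries yields one class distribution and one box, so the number of predictions per image is fixed by the number of object queries rather than by the image content.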