Inverting Visual Representations with Detection Transformers (2025)

Jan Rathjens 1*, Shirin Reyhanian 1*, David Kappel 1,2†, Laurenz Wiskott 1†
*Joint first authorship. †Joint last authorship.
1 Institute for Neural Computation (INI), Faculty of Computer Science, Ruhr University Bochum, Germany
2 Center for Cognitive Interaction Technology (CITEC), Faculty of Technology, Bielefeld University, Germany
{jan.rathjens, laurenz.wiskott}@rub.de, {shirin.reyhanian, david.kappel}@ini.rub.de

Abstract

Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many prior approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we train inverse models to reconstruct input images from intermediate layers within a Detection Transformer, showing that feature inversion is efficient and feasible for transformer-based vision models. Through qualitative and quantitative evaluations of reconstructed images across model stages, we demonstrate critical properties of Detection Transformers, including contextual shape preservation, inter-layer correlation, and robustness to color perturbations, illustrating how these characteristics emerge within the model’s architecture. Our findings contribute to a deeper understanding of transformer-based vision models. The code for reproducing our experiments will be made available at github.com/wiskott-lab/inverse-detection-transformer.

1 Introduction

Recent advancements in deep neural networks (DNNs), particularly transformer-based architectures, have achieved remarkable results on vision tasks such as object detection, semantic segmentation, and image classification [4, 15, 1, 31, 29]. Despite their compelling performance, the internal mechanisms of these networks remain largely opaque, hindering a clear understanding of how predictions are made [30, 6, 14]. Enhancing network interpretability, i.e., understanding the mechanisms underlying the network’s functionality, is crucial for ensuring safety, optimizing performance, and identifying potential weaknesses.

Feature inversion, introduced by Dosovitskiy and Brox [3], is one of the first techniques developed to interpret the processing capabilities of DNNs for vision. Building on a substantial body of work on generating images from intermediate representations [5, 28, 18, 26], their method involved training an inverse network for each layer of a convolutional neural network (CNN) to reconstruct input images from intermediate representations. By analyzing these reconstructed images and their distinct characteristics from various layers, they uncovered important insights into the underlying mechanisms of CNNs.

While feature inversion provided valuable understanding of the CNN explored in their study, the approach has not seen widespread application to more advanced iterations of CNNs or to transformer-based vision models. This is mainly due to the computational cost inherent to feature inversion: training a separate inverse model for each intermediate layer of interest becomes impractical for today's large and complex DNNs.

In this work, we revisit feature inversion and extend it to a modern transformer-based vision DNN: the Detection Transformer (DETR). To address the significant computational challenges, we propose a modular approach to feature inversion. We invert distinct components or modules of DETR independently. This modular strategy significantly reduces the size of inverse networks required while maintaining the interpretability of reconstructed images. Notably, we observe that these inverse modules can be effectively applied across all intermediate layers, including those for which they were not explicitly optimized.

Leveraging these inverse modules, we qualitatively analyze reconstructed images from different stages of DETR. Hypotheses drawn from this qualitative evaluation are further validated through quantitative analyses, demonstrating key properties of DETR, such as contextual shape preservation, inter-layer correlation, and robustness to color perturbations.

2 Related work

In the context of DNNs for computer vision, synthesizing images from intermediate representations is a valuable technique for network interpretability. A central branch of image synthesis for network interpretability is activation maximization [5, 28, 26, 20, 21]. Activation maximization generates images that activate specific network components, such as neurons, channels, or entire layers. While the approaches vary in their techniques, the general goal is to synthesize images highlighting what specific parts of the network are responsive to. For example, synthesized images usually show simple features, such as edges, in lower layers, and more complex patterns in higher layers.

A related approach of synthesizing images concerns the feature inversion of intermediate representations within DNNs [18, 3]. Unlike activation maximization, which targets individual network components, feature inversion reconstructs input images from intermediate representations, enabling researchers to assess the information retained at each layer. This approach aids in interpreting network behavior by examining specific features in these reconstructions. For example, Dosovitskiy and Brox [3] argued that color information remains accessible in the top layer of AlexNet [13].

These techniques for image synthesis have primarily been applied to convolutional neural networks (CNNs). In recent years, transformer-based vision models have been introduced. The analysis of vision models based on transformer networks [4, 1, 12] has so far focused on other techniques, such as analyzing robustness against various image perturbations (e.g., occlusions or natural adversarial examples), representational similarity, and loss landscape analysis [23, 24, 19, 2, 22, 16]. Only recently has activation maximization been applied to vision transformers [9]. To the best of our knowledge, our work is the first to synthesize images from intermediate representations using feature inversion within a transformer-based vision model.

3 Methods


To efficiently apply feature inversion to DETR, we trained separate inverse models for distinct modules of the architecture. Specifically, for each of DETR’s modules (backbone, encoder, decoder, and prediction head), we trained corresponding inverse models ($\text{backbone}^{-1}$, $\text{encoder}^{-1}$, $\text{decoder}^{-1}$, $\text{prediction head}^{-1}$, and $\text{detection}^{-1}$) to reverse the path of information flow, as illustrated in Figure 2.

For a technical description of our method, we first outline the default forward path of information flow in the DETR architecture. In the forward path, an input image $\mathbf{X} \in \mathbb{R}^{H_0 \times W_0 \times 3}$ is processed by the ResNet-50 [10] backbone, which extracts feature maps and reduces the spatial dimensions by a factor of 32. With $H = \lceil \frac{H_0}{32} \rceil$ and $W = \lceil \frac{W_0}{32} \rceil$, the backbone feature map is then flattened and linearly projected to produce the backbone forward embedding $\mathbf{B} \in \mathbb{R}^{H \times W \times 256}$. Next, the DETR encoder processes $\mathbf{B}$ using six transformer blocks, generating contextualized encoder forward embeddings $\mathbf{E} \in \mathbb{R}^{H \times W \times 256}$. The DETR decoder module takes $\mathbf{E}$ along with 100 learnable object queries $\mathbf{Q} \in \mathbb{R}^{100 \times 256}$ and, through a series of self-attention and cross-attention layers, refines these object queries to produce decoder forward embeddings $\mathbf{D} \in \mathbb{R}^{100 \times 256}$. Finally, the prediction head uses $\mathbf{D}$ to generate object classification scores $\mathbf{Y}_{\text{cls}} \in \mathbb{R}^{100 \times C}$, where $C$ is the number of object classes, and bounding box coordinates $\mathbf{Y}_{\text{bbox}} \in \mathbb{R}^{100 \times 4}$.
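The shapes along this forward path can be summarized in a few lines of code. The sketch below is purely illustrative: the tensors are random placeholders, and the input size and number of class logits are assumptions, not values taken from the paper.

```python
import math
import torch

# Shape walk-through of DETR's forward path (illustrative placeholders only;
# input size and number of class logits are assumed, not from the paper).
H0, W0, C = 480, 640, 92
H, W = math.ceil(H0 / 32), math.ceil(W0 / 32)   # backbone downsamples by 32

X      = torch.rand(H0, W0, 3)     # input image X
B      = torch.rand(H, W, 256)     # backbone embedding B (flattened + projected)
E      = torch.rand(H, W, 256)     # encoder embedding E, same shape as B
Q      = torch.rand(100, 256)      # 100 learnable object queries
D      = torch.rand(100, 256)      # decoder embedding D
Y_cls  = torch.rand(100, C)        # classification scores Y_cls
Y_bbox = torch.rand(100, 4)        # bounding-box coordinates Y_bbox
```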

For the inverse path, all inverse modules were trained independently of each other, using the MSE loss between their outputs and the corresponding forward-path embeddings. Target forward-path embeddings were generated for the COCO 2017 dataset [8]. The pre-trained forward weights of DETR were kept fixed throughout, and each inverse module was trained in isolation.
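This training setup amounts to standard supervised regression against precomputed embeddings. The sketch below illustrates it under assumed names; the optimizer, learning rate, and data pipeline are our placeholders and are not specified in the paper.

```python
import torch

def train_inverse_module(inverse_module, pairs, epochs=1, lr=1e-4):
    """Train one inverse module against precomputed forward-path embeddings.

    `pairs` yields (module_input, forward_target) tensors, e.g. (E, B) for
    encoder^-1 or (B, X) for backbone^-1. Because the targets come from the
    frozen, pre-trained DETR, only the inverse module receives gradients.
    Optimizer choice and hyperparameters are assumptions for illustration.
    """
    optimizer = torch.optim.AdamW(inverse_module.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for _ in range(epochs):
        for module_input, forward_target in pairs:
            optimizer.zero_grad()
            loss = mse(inverse_module(module_input), forward_target)
            loss.backward()
            optimizer.step()
    return inverse_module
```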

The $\text{backbone}^{-1}$ takes the embedding $\mathbf{B}$ and uses a 6-layer CNN to reconstruct the original input image as $\mathbf{\tilde{X}} \in \mathbb{R}^{H_0 \times W_0 \times 3}$. The $\text{encoder}^{-1}$ takes $\mathbf{E}$ and, using the same architecture as the DETR encoder, reconstructs the backbone inverse embedding $\mathbf{\tilde{B}} \in \mathbb{R}^{H \times W \times 256}$. The $\text{decoder}^{-1}$ receives zero-initialized image queries $\mathbf{I} \in \mathbb{R}^{H \times W \times 256}$ along with $\mathbf{D}$ and refines these image queries gradually, employing a structure similar to the forward decoder but operating on image queries instead of object queries. It outputs the encoder inverse embeddings $\mathbf{\tilde{E}} \in \mathbb{R}^{H \times W \times 256}$. Finally, the $\text{prediction head}^{-1}$ takes $[\mathbf{Y}_{\text{cls}}, \mathbf{Y}_{\text{bbox}}]$ and reconstructs the decoder inverse embeddings $\mathbf{\tilde{D}} \in \mathbb{R}^{100 \times 256}$ using a multi-layer perceptron.
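As a concrete illustration of the $\text{backbone}^{-1}$ module, the snippet below shows one plausible 6-layer CNN that maps the $H \times W \times 256$ backbone embedding back to an $H_0 \times W_0 \times 3$ image. The paper only specifies a "6-layer CNN"; the use of five stride-2 transposed convolutions (giving the 32x upsampling) plus a final 3x3 convolution, and the channel widths, are our assumptions.

```python
import torch.nn as nn

# Hypothetical layout for backbone^-1: five stride-2 transposed convolutions
# (32x spatial upsampling) followed by a 3x3 convolution down to 3 channels.
backbone_inverse = nn.Sequential(
    nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=3, padding=1),
)
```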

With this final inverse module, we took an additional step by applying one-hot encoding suppression to the class scores, preserving only the predicted classes with the highest probabilities in $\mathbf{Y}_{\text{cls}}$ while retaining their bounding boxes. These were fed to the final $\text{detection}^{-1}$, which reconstructs $\mathbf{\tilde{D}}$. DETR can use a "no object" class to handle images where the number of objects is smaller than the 100 object queries. During training, if a query does not match any ground-truth object, it is classified as "no object" (i.e., background). Since not all possible object categories are included in COCO’s ground-truth annotations, this background class may also encompass objects that are unknown to DETR. The $\text{detection}^{-1}$ was designed to analyze the model’s behavior when only the highest-confidence classes, including "no object", are provided for inversion. In this way, reconstructions can partially preserve the scene’s complexity by assuming that something should occupy the corresponding predicted locations.
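The one-hot encoding suppression described above amounts to keeping only the most probable class per query while leaving the boxes untouched; a minimal sketch, with assumed tensor names, is shown below.

```python
import torch

def one_hot_suppression(y_cls, y_bbox):
    """Keep only the most probable class per object query (sketch; names assumed).

    y_cls:  (100, C) class scores, with "no object" as one of the classes
    y_bbox: (100, 4) bounding-box predictions, returned unchanged
    """
    probs = y_cls.softmax(dim=-1)
    winners = probs.argmax(dim=-1)                           # top class per query
    suppressed = torch.zeros_like(probs)
    suppressed[torch.arange(probs.shape[0]), winners] = 1.0  # one-hot rows
    return suppressed, y_bbox
```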

4 Results


4.1 Reconstructing modules’ embeddings

After training the set of inverse modules, we conducted a qualitative evaluation of the reconstructed images from various stages of DETR, as shown in Figure 2. We observed that structural and spatial information is generally well preserved in the reconstructions from backbone embeddings. Moreover, color shifts emerged as early as the encoder stage and became more pronounced in the decoder. These color shifts resulted in reconstructed objects appearing closer to prototypical colors associated with their classes. For example, red was often assigned to buses, stop signs, and apples, green to potted plants, and orange to oranges. Beyond color, the decoder stage also exhibited contextual shifts, where objects were reconstructed with prototypical semantic features even if these were not present in the original image. For instance, the presence of a tie often resulted in the model reconstructing the person wearing a suit, even though the original image contained none. This indicates that DETR’s deeper layers incorporate semantic features and contextual cues that align with learned object associations. Our analysis shows that most background information irrelevant to object detection is effectively removed after the decoder stage, allowing DETR to focus on features crucial for detection while discarding less useful details.

This selective filtering can also explain some detection errors. By comparing reconstructions before and after the decoder stage, we observed that false negatives can occur when certain objects, deemed irrelevant by the model, are entirely removed, leaving no trace in the reconstructions. Conversely, for false positive detections, the reconstructions showed cases where irrelevant features associated with an incorrect class are amplified, leading to misclassifications. For additional reconstruction results on such examples, please refer to Section 7 in the supplementary material.

When analyzing reconstructions from grayscale images, we observed that removing color information did not significantly impair DETR’s ability to detect objects. The reconstructions still displayed colors typical of the detected classes, emerging after the encoder.

Further analysis of reconstructions from the decoder and prediction head stages showed that, along this forward and inverse path, the model has learned to encode not only classes but also prototypical shapes and spatial relationships. For example, human body parts and their spatial relations were contextually well reconstructed. Even if the figures are not anatomically perfect, the reconstructions captured recognizable human forms and varied poses differing from the original image. Similarly, objects like buses were reconstructed with distinctive features, sometimes from angles that differ from the input, demonstrating the model’s ability to leverage class and bounding box information to generate plausible representations.

In the last experiment of this section, we applied one-hot encoding suppression to the class scores by preserving only the class with the highest confidence for each object query while retaining all bounding box predictions. This allowed us to investigate how the removal of lower-confidence class information affects the reconstruction. By limiting the class information to only the most confident predictions, we could observe the impact of class uncertainty on the reconstructed output and the resulting information loss. In some cases, this led to hallucinations of semantically relevant objects, such as additional chairs generated around a potted plant that were not present in the original image (Figure 2).

The diagram presented in Figure 3 shows the MSE loss of reconstructions across different stages of the DETR model, suggesting that the most significant information loss occurs within the decoder. This aligns with the model’s architecture, as the decoder has the role of transforming object queries into high-level abstract representations tailored for object classification and localization.

4.2 Contextualizing reconstructions

Similar to Dosovitskiy and Brox [3], we contextualized our method for network interpretability by comparing reconstructions obtained from our default model with reconstructions obtained from a DETR fine-tuned for reconstruction. For this purpose, we started with a pre-trained DETR, an inverse backbone, an inverse encoder, and an inverse decoder, and retrained all parameters of both DETR and the inverse modules end-to-end. Specifically, we used a weighted combination of the detection loss $L_{\text{Hungarian}}$ with default hyperparameters as defined by Carion et al. [1] and a reconstruction loss $L_{\text{MSE}}$, defined as the MSE between an input image and its reconstruction from the decoder embedding, comparable to the method proposed by Rathjens and Wiskott [25]. The total loss is expressed as:

$L = \lambda L_{\text{MSE}} + (1 - \lambda) L_{\text{Hungarian}}, \quad \lambda \in [0, 1]$   (1)

The hyperparameter $\lambda$ allows for a trade-off between the two losses. For example, when $\lambda = 0$, the model is optimized solely for detection performance, while setting $\lambda = 1$ optimizes the model exclusively for reconstruction performance.

Notably, this procedure differed from our default training approach in two key ways. First, it updated DETR’s parameters alongside the inverse modules. Second, images were reconstructed directly from the decoder embedding, rather than optimizing the inverse modules independently as described in Section 3.
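A single end-to-end fine-tuning step under Equation 1 might look as follows. This is a sketch under assumed interfaces: `detr` is assumed to expose its decoder embedding alongside its predictions, and `hungarian_loss` stands in for DETR's default set-prediction loss; neither corresponds to released code.

```python
import torch
import torch.nn.functional as F

def finetune_step(detr, inverse_path, images, targets, hungarian_loss, optimizer, lam):
    """One end-to-end fine-tuning step for the weighted objective in Eq. (1).

    Unlike the default setup, both DETR and the inverse modules receive
    gradients, and images are reconstructed directly from the decoder
    embedding. All interfaces here are hypothetical placeholders.
    """
    optimizer.zero_grad()
    outputs, decoder_embedding = detr(images)            # assumed to return D as well
    reconstruction = inverse_path(decoder_embedding)     # decoder embedding -> image
    l_mse = F.mse_loss(reconstruction, images)
    l_hungarian = hungarian_loss(outputs, targets)
    loss = lam * l_mse + (1.0 - lam) * l_hungarian       # Eq. (1)
    loss.backward()
    optimizer.step()
    return loss.detach()
```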


Figure 4 illustrates reconstructions of various images with fine-tuned models as well as with our default model. The left column displays the original images, while the subsequent columns show reconstructions generated with progressively smaller $\lambda$ values. Across all examples, a consistent pattern emerges: for high $\lambda$ values, the reconstructions are of high quality, faithfully capturing details and colors. As $\lambda$ decreases, reconstruction quality deteriorates, with increasing blurriness, loss of detail, and noticeable color shifts. This degradation is particularly pronounced in the last two columns. Interestingly, there are apparent differences between these last two columns, as the $\lambda = 0$ column introduces a gray tone to the reconstructions. While both DETRs corresponding to these columns were optimized purely for object detection, the inverse models were optimized differently: one represents our default approach of training inverse modules independently, whereas the other was trained using the method described in Equation 1. These differences highlight the significant impact of the two optimization techniques.

We quantitatively analyzed this pattern in Figure 5, which shows the MSE alongside the average precision (AP) for our default and fine-tuned models. The results align with the qualitative assessment of the reconstructed images: as $\lambda$ decreases, the reconstruction error increases. Notably, as $\lambda$ decreases, object detection performance improves, highlighting a trade-off between reconstruction quality and object detection performance in DETR.


From the contextualization of our approach, we draw the following conclusions. Firstly, feature inversion is a viable method for interpreting DETR, as reconstruction performance is inherently linked to object detection performance, meaning that the properties of reconstructed images reflect the information processing within DETR. Secondly, training inverse modules independently offers distinct advantages over optimizing the inverse model to reconstruct images from the decoder embedding directly. While the latter approach achieves better overall reconstruction performance, it diminishes the contextual characteristics of the reconstructions. Visual inspection of these images, as shown in Figure 4, along with the strong reconstruction performance of an average-guessing baseline (see Figure 3), suggests that this optimization tends to push reconstructions toward a grayish average image of the dataset, thereby reducing interpretability.

4.3 Case study: coloring objects

Having contextualized our method, we investigated the role of color information in DETR in greater detail. To this end, we recolored specific objects in input images and evaluated the influence of these modifications on reconstructions from intermediate embeddings at different stages of the network, as well as on object detection performance. For recoloring, we used the segmentation annotations from the COCO dataset to apply five different color filters in the HSV color space to various object categories. For each filter, we either set the hue of the specified objects to red, green, or blue, or shifted the hue values by 120 or 240 degrees. We then evaluated the reconstructions and the object detection performance with AP. Since segmentation annotations in COCO are most accurate for large objects, all evaluations were conducted on objects with segmentation masks of at least $96^2$ pixels.
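The recoloring procedure can be sketched as a simple HSV manipulation restricted to the segmentation mask. The function below is our own re-implementation sketch, not the authors' filter code; note that OpenCV stores 8-bit hue as degrees divided by two.

```python
import cv2
import numpy as np

def recolor_object(img_rgb, mask, mode="set", hue_deg=0.0):
    """Apply one of the described color filters to a masked object (sketch).

    img_rgb: uint8 HxWx3 RGB image; mask: boolean HxW segmentation mask.
    mode="set" forces a fixed hue (0/120/240 deg for red/green/blue),
    mode="shift" rotates the hue of the masked pixels by `hue_deg` degrees.
    """
    hsv = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2HSV).astype(np.int32)
    hue = hsv[..., 0]                       # OpenCV hue range: [0, 180)
    if mode == "set":
        hue[mask] = int(hue_deg / 2) % 180
    else:  # "shift"
        hue[mask] = (hue[mask] + int(hue_deg / 2)) % 180
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
```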

Figure 6 illustrates recolored images and their influence on DETR with several examples (additional examples are provided in Section 7 of the supplementary material). Each row presents an example from a different object category (from top to bottom: stop sign, bear, apple, and bus) with a different color filter. For each example, images are shown at different processing stages of DETR, displaying reconstructions from different embeddings. We observed that color changes were preserved in the backbone embedding for all objects and filters but faded or disappeared almost entirely in the encoder embedding. Almost no color information from the input remained in the decoder embedding. Instead, colors shifted towards prototypical representations (e.g., red for the stop sign and bus, brown for the bear, or red and yellow for the apples). This finding contrasts sharply with previous results of feature inversion analyses in CNNs, where color information was largely preserved throughout all layers [3].


While it is unclear whether DETR adjusts or removes colors entirely, resulting in prototypical fills, these transformations suggest that the model is robust to color changes. This robustness is further supported quantitatively in the last column, where we show AP performance for unchanged objects compared with all color filters across rows for the respective objects. The graphs confirm this observation, as AP values remain relatively consistent regardless of color changes or object category, with the exception of the apple category.

Building on our observations of DETR’s color robustness, we analyzed the individual influence of different color perturbations. To do this, we applied the previously described color perturbations to images and reconstructed them from various stages of DETR. Figure 7 provides an example of reconstructions for one image under different color perturbations. Consistent with earlier observations, color perturbations are noticeable in reconstructions from the backbone embedding but gradually diminish in reconstructions from later stages of DETR. Interestingly, reconstructions from the encoder and decoder embeddings converge to the same color rather than to individual prototypical colors, becoming increasingly similar. Additionally, while reconstructions from the backbone and encoder embeddings are broadly similar in shape, objects in the decoder reconstructions show shape distortions that vary depending on the applied filter.


We quantified this effect by calculating the average pairwise MSE between reconstructions of perturbed and unperturbed images at each embedding stage. Figure 8 presents these results. We observed that the average pairwise MSE decreases progressively from the input to reconstructions derived from the backbone and encoder embeddings, reflecting increasing similarity. However, the MSE returns to input levels for reconstructions from decoder embeddings. This observation aligns with our qualitative analysis, confirming that reconstructions converge to the same or similar colors the farther they progress through the DETR architecture. The increase in average pairwise MSE at the decoder stage is likely not due to color divergence but rather to distortions in object shapes.
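The average pairwise MSE used here can be computed as follows; the data layout (a list of equally shaped reconstructions per embedding stage) is our assumption.

```python
import itertools
import torch

def avg_pairwise_mse(reconstructions):
    """Average MSE over all unordered pairs of reconstructions (sketch).

    `reconstructions` is a list of equally shaped tensors, one per color
    perturbation (plus the unperturbed image), all obtained from the same
    embedding stage. Small values indicate the reconstructions have converged
    to similar images.
    """
    pairs = list(itertools.combinations(range(len(reconstructions)), 2))
    mses = [torch.mean((reconstructions[i] - reconstructions[j]) ** 2) for i, j in pairs]
    return torch.stack(mses).mean()
```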


4.4 Reconstructing intermediate representations

In contrast to the design philosophy of CNNs, where intermediate layers often vary in dimensionality (with exceptions, e.g., intermediate layers within ResNet building blocks [10]), transformer-based DNNs use a consistent dimensionality across intermediate layers within their encoders and decoders. This consistency enables using all intermediate representations as inputs to inverse modules, even if the inverse module is optimized for a different representation. For instance, all intermediate encoder representations can be fed into $\text{encoder}^{-1}$, even though $\text{encoder}^{-1}$ is optimized for the encoder embedding.

Using our inverse modules, we leveraged this feature to evaluate DETR’s intermediate encoder and decoder representations. Figure 9 shows an illustrative example (additional examples are provided in Section 8 of the supplementary material). Specifically, the first column depicts reconstructions obtained by feeding the encoder’s intermediate representations into $\text{backbone}^{-1}$. In the second column, the same intermediate representations were passed through $\text{encoder}^{-1}$ and then through $\text{backbone}^{-1}$. The last column displays results for intermediate decoder representations passed through $\text{decoder}^{-1}$, followed by $\text{encoder}^{-1}$ and $\text{backbone}^{-1}$.
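The chaining of inverse modules amounts to simple function composition, as sketched below. The module handles are placeholders for the trained inverse modules and the captured intermediate activations; in particular, the $\text{decoder}^{-1}$ call is simplified here, since in our setup it additionally receives zero-initialized image queries internally.

```python
def invert_encoder_layer(rep, encoder_inv, backbone_inv):
    """Reconstruct an image from an intermediate encoder representation by
    chaining encoder^-1 and backbone^-1 (second column of Figure 9)."""
    return backbone_inv(encoder_inv(rep))

def invert_decoder_layer(rep, decoder_inv, encoder_inv, backbone_inv):
    """Reconstruct an image from an intermediate decoder representation by
    chaining decoder^-1, encoder^-1, and backbone^-1 (last column of Figure 9)."""
    return backbone_inv(encoder_inv(decoder_inv(rep)))
```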


As expected, we obtained the best reconstruction performance for the layer that each inverse module was trained on: the encoder input (backbone embedding) for $\text{backbone}^{-1}$, encoder layer six (encoder embedding) for $\text{encoder}^{-1}$, and decoder layer six (decoder embedding) for $\text{decoder}^{-1}$. The quality of reconstructions gradually decreased as we moved farther away from the layers the inverse modules were optimized for, a pattern particularly evident at the input to $\text{decoder}^{-1}$, since object queries initially hold values independent of the input image.

Despite this degradation in quality, we generally observed strong shape preservation between intermediate layers, particularly when feeding intermediate encoder representations to $\text{encoder}^{-1}$. Most variations in $\text{backbone}^{-1}$ and $\text{encoder}^{-1}$ manifest as color shifts, while reconstructions from $\text{decoder}^{-1}$ show greater stability in color than in shape. The overall stability of reconstructions across layers is noteworthy, as inverse modules might be expected to produce only noisy outputs when applied to intermediate embeddings.

From these observations, we draw two conclusions. Firstly, intermediate embeddings in transformer-based DNNs change gradually across layers, as suggested by Raghu et al. for ViTs [24] and by Liu et al. for LLMs [17]. Secondly, inverse modules are practical tools for interpreting transformer-based DNNs, as a single inverse module can be applied across multiple layers, eliminating the need to train separate inverse modules for each layer.

5 Discussion

In this work, we took a closer look at DETR by inverting the three main modules of the network architecture: its backbone, encoder, and decoder. We used these inverse modules to study DETR’s internal representations by reconstructing input images from different stages. Following the premise that reconstruction quality reveals the information present in each layer, we evaluated the reconstructed images for distinctive features. We inferred that DETR preserves object representations within its encoder and decoder, is robust to color changes, and evolves its features gradually across the architecture.

Our findings align with previous analyses of vision transformers (ViTs), which have been studied in the literature using different tools. For instance, the gradual change of representations was reported for ViT by Raghu et al. [24]. Additionally, the robustness of ViT to various image perturbations [19, 23] is comparable to our finding of DETR’s robustness to color perturbations. This suggests that the properties we observed for DETR and those reported for ViT are not specific to these architectures but may reflect general patterns in transformer-based DNNs for vision.

Our approach draws inspiration from Dosovitskiy and Brox’s work [3], in which they inverted intermediate representations in AlexNet. Compared to their study, we find that DETR is more shape-preserving but less color-preserving than AlexNet. These differences align with recognized distinctions between CNNs and transformer-based vision models in the literature [24, 27, 12]. An important difference between our work and Dosovitskiy and Brox’s is that we did not train separate inverse models for each intermediate layer. Instead, we inverted DETR’s main components, allowing us to perform similar analyses with fewer inverse models, thanks to the consistent dimensionality across transformer layers and the gradual evolution of features within the architecture. This efficiency suggests that our method is well suited for studying transformer-based DNNs in vision tasks and opens pathways for future interpretability studies, such as exploring newer versions of DETR or ViT.

While our approach is efficient, a primary limitation of feature inversion, in general, is that properties observed across reconstructed images must be carefully interpreted. This caution arises from the uncertainty of whether a specific property originates from a transformer module itself or from its inverse counterpart. For instance, whether the prototypical colors observed in reconstructions are introduced during the forward pass through the transformer model or added by the inverse module to minimize the reconstruction error is unclear.

The success of our modular inversion method in DETR could have implications beyond computer vision. For instance, in computational neuroscience, generative models of episodic memory require the integration of both discriminative and generative processes[7]. Future iterations of such models might unify a transformer-based DNN and its inverse within a single architecture. Similarly, transformer-based DNNs could be promising candidates for biologically plausible learning architectures, as they can leverage local reconstruction losses[11].

Acknowledgments

This work was supported by a grant from the German Research Foundation (DFG), "Constructing scenarios of the past: A new framework in episodic memory", FOR 2812, project number 419039588, P5 (L.W.).

References

• Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers, 2020. arXiv:2005.12872 [cs].
• Chefer et al. [2021] Hila Chefer, Shir Gur, and Lior Wolf. Transformer Interpretability Beyond Attention Visualization. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, Nashville, TN, USA, 2021. IEEE.
• Dosovitskiy and Brox [2016] Alexey Dosovitskiy and Thomas Brox. Inverting Visual Representations with Convolutional Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4829–4837, Las Vegas, NV, USA, 2016. IEEE.
• Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2020.
• Erhan et al. [2009] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing Higher-Layer Features of a Deep Network. Technical Report, Université de Montréal, 2009.
• Fan et al. [2021] Fenglei Fan, Jinjun Xiong, Mengzhou Li, and Ge Wang. On Interpretability of Artificial Neural Networks: A Survey, 2021. arXiv:2001.02522.
• Fayyaz et al. [2022] Zahra Fayyaz, Aya Altamimi, Carina Zoellner, Nicole Klein, Oliver T. Wolf, Sen Cheng, and Laurenz Wiskott. A Model of Semantic Completion in Generative Episodic Memory. Neural Computation, 34(9):1841–1870, 2022.
• Fleet et al. [2014] D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars. Microsoft COCO: Common Objects in Context. ECCV 2014, LNCS, vol. 8693, 2014.
• Ghiasi et al. [2022] Amin Ghiasi, Hamid Kazemi, Steven Reich, Eitan Borgnia, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, and Tom Goldstein. What do Vision Transformers Learn? A Visual Exploration. 2022.
• He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA, 2016. IEEE.
• Kappel et al. [2023] David Kappel, Khaleelulla Khan Nazeer, Cabrel Teguemne Fokam, Christian Mayr, and Anand Subramoney. Block-local learning with probabilistic latent representations, 2023. arXiv:2305.14974 [cs].
• Khan et al. [2022] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in Vision: A Survey, 2022. arXiv:2101.01169.
• Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2012.
• Li et al. [2022] Xuhong Li, Haoyi Xiong, Xingjian Li, Xuanyu Wu, Xiao Zhang, Ji Liu, Jiang Bian, and Dejing Dou. Interpretable Deep Learning: Interpretation, Interpretability, Trustworthiness, and Beyond, 2022. arXiv:2103.10689.
• Li et al. [2023a] Yong Li, Naipeng Miao, Liangdi Ma, Feng Shuang, and Xingwen Huang. Transformer for object detection: Review and benchmark. Engineering Applications of Artificial Intelligence, 126:107021, 2023.
• Li et al. [2023b] Yiran Li, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh, Yan Zheng, Wei Zhang, and Kwan-Liu Ma. How Does Attention Work in Vision Transformers? A Visual Analytics Attempt, 2023. arXiv:2303.13731.
• Liu et al. [2023] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, and Beidi Chen. Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. In Proceedings of the 40th International Conference on Machine Learning, pages 22137–22176. PMLR, 2023. ISSN: 2640-3498.
• Mahendran and Vedaldi [2014] Aravindh Mahendran and Andrea Vedaldi. Understanding Deep Image Representations by Inverting Them, 2014. arXiv:1412.0035.
• Naseer et al. [2021] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H. Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing Properties of Vision Transformers. In Advances in Neural Information Processing Systems, pages 23296–23308. Curran Associates, Inc., 2021.
• Nguyen et al. [2016] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2016.
• Olah et al. [2017] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature Visualization. Distill, 2(11):e7, 2017.
• Park and Kim [2022] Namuk Park and Songkuk Kim. How Do Vision Transformers Work?, 2022. arXiv:2202.06709.
• Paul and Chen [2021] Sayak Paul and Pin-Yu Chen. Vision Transformers are Robust Learners, 2021. arXiv:2105.07581.
• Raghu et al. [2021] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do Vision Transformers See Like Convolutional Neural Networks? 2021.
• Rathjens and Wiskott [2024] Jan Rathjens and Laurenz Wiskott. Classification and Reconstruction Processes in Deep Predictive Coding Networks: Antagonists or Allies?, 2024. arXiv:2401.09237 [cs].
• Springenberg et al. [2015] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for Simplicity: The All Convolutional Net, 2015. arXiv:1412.6806.
• Tuli et al. [2021] Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L. Griffiths. Are Convolutional Neural Networks or Transformers more like human vision?, 2021. arXiv:2105.07197 [cs].
• Zeiler and Fergus [2014] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, pages 818–833, Cham, 2014. Springer International Publishing.
• Zhang et al. [2022] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022. arXiv:2203.03605 [cs].
• Zhang and Zhu [2018] Quanshi Zhang and Song-Chun Zhu. Visual Interpretability for Deep Learning: a Survey, 2018. arXiv:1802.00614.
• Zhu et al. [2021] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable Transformers for End-to-End Object Detection, 2021. arXiv:2010.04159 [cs].


Supplementary Material

6 Extended reconstructions from DETR stages


7 Extended reconstructions for color perturbations


8 Extended reconstructions from intermediate layers

