Condition-Invariant Semantic Segmentation

Read original: arXiv:2305.17349 - Published 7/23/2024 by Christos Sakaridis, David Bruggemann, Fisher Yu, Luc Van Gool


Overview

Adapting semantic segmentation networks to different visual conditions is crucial for robust perception in autonomous cars and robots.
Previous work has shown that most feature-level adaptation methods, which rely on adversarial training and were validated on synthetic-to-real adaptation, provide only marginal gains in condition-level adaptation and are outperformed by simple pixel-level adaptation via stylization.
Motivated by these findings, the paper proposes a method called Condition-Invariant Semantic Segmentation (CISS) that leverages stylization to perform feature-level adaptation.

Plain English Explanation

Autonomous cars and robots need to be able to accurately identify the objects and scenes around them, even in different lighting conditions or weather. Semantic segmentation is a computer vision technique that can do this, but it can struggle when the visual conditions change.

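Previous attempts to adapt the internal features of semantic segmentation networks to different conditions, typically through adversarial training, have provided only small improvements. A simpler technique called stylization, which restyles the input images themselves so that they resemble the target condition, has been shown to work better.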

The researchers in this paper decided to combine the benefits of stylization and feature-level adaptation. They developed a method called CISS that uses stylization to help the network extract features that are already invariant to the style of the input. This allows the network to focus on parsing these features, rather than having to deal with the specific style of the input.

The researchers implemented CISS on a state-of-the-art domain adaptation architecture and found that it sets a new state of the art on the popular daytime-to-nighttime Cityscapes→Dark Zurich benchmark. It also achieved the second-best performance on the normal-to-adverse Cityscapes→ACDC benchmark. CISS was also able to generalize well to domains unseen during training, like BDD100K-night and ACDC-night.

Technical Explanation

The paper proposes a novel feature-level adaptation method called Condition-Invariant Semantic Segmentation (CISS) that leverages stylization to align the internal network features extracted by the encoder from the original and stylized views of each input image. This is achieved through a new feature invariance loss that encourages the encoder to extract features that are already invariant to the style of the input, allowing the decoder to focus on parsing these features rather than abstracting from the specific style.
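The idea of comparing encoder features across an original and a stylized view can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the authors' implementation: `stylize` is a simple per-channel mean and standard deviation transfer standing in for whatever stylization the paper uses, `toy_encoder` is a hypothetical one-layer stand-in for the real encoder, and the loss is taken to be a mean squared difference, which may differ from the exact form of the paper's feature invariance loss.

```python
import numpy as np


def stylize(image, style_image, eps=1e-6):
    """Toy pixel-level stylization (assumed, not the paper's method):
    shift the per-channel mean/std of `image` to those of `style_image`."""
    mu_c = image.mean(axis=(0, 1), keepdims=True)
    sd_c = image.std(axis=(0, 1), keepdims=True)
    mu_s = style_image.mean(axis=(0, 1), keepdims=True)
    sd_s = style_image.std(axis=(0, 1), keepdims=True)
    return (image - mu_c) / (sd_c + eps) * sd_s + mu_s


def toy_encoder(image, weights):
    """Hypothetical stand-in encoder: per-pixel linear projection + tanh."""
    return np.tanh(image @ weights)


def feature_invariance_loss(feat_orig, feat_styl):
    """Assumed loss form: mean squared difference between the encoder
    features of the original view and of the stylized view."""
    return float(np.mean((feat_orig - feat_styl) ** 2))


rng = np.random.default_rng(0)
day = rng.uniform(0.4, 1.0, size=(8, 8, 3))    # bright "source" image (H, W, C)
night = rng.uniform(0.0, 0.3, size=(8, 8, 3))  # dark "target" image (H, W, C)
weights = rng.normal(size=(3, 4))              # toy encoder parameters

# Two views of the same scene: the original and its night-styled version.
day_as_night = stylize(day, night)

feat_day = toy_encoder(day, weights)
feat_styled = toy_encoder(day_as_night, weights)

loss_same = feature_invariance_loss(feat_day, feat_day)      # identical views
loss_views = feature_invariance_loss(feat_day, feat_styled)  # differing styles
print(f"same view: {loss_same:.4f}, original vs stylized: {loss_views:.4f}")
```

In training, a term like `loss_views` would be minimized with respect to the encoder parameters alongside the usual segmentation loss, so that the two views of an image map to similar features. The untrained toy encoder here simply shows that the term is zero for identical views and positive when only the style differs.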

CISS is implemented on top of a state-of-the-art domain adaptation architecture and evaluated on two challenging benchmarks: Cityscapes→Dark Zurich (daytime-to-nighttime adaptation) and Cityscapes→ACDC (normal-to-adverse weather adaptation). The results show that CISS sets a new state of the art on the Cityscapes→Dark Zurich benchmark and achieves the second-best performance on the Cityscapes→ACDC benchmark. Additionally, CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night.

Critical Analysis

The paper provides a compelling solution to the challenge of adapting semantic segmentation networks to different visual conditions, which is crucial for the robustness of autonomous systems. By leveraging stylization to perform feature-level adaptation, the proposed CISS method sets a new state of the art on Cityscapes→Dark Zurich and ranks second on Cityscapes→ACDC, demonstrating its effectiveness.

One potential limitation of the work is that it has only been evaluated on a limited number of target domains (nighttime and adverse weather). It would be valuable to assess the performance of CISS on a wider range of visual conditions, such as different seasons or camera perspectives, to further validate its generalization capabilities.

Additionally, the paper does not provide a detailed analysis of the internal feature representations learned by the encoder. Understanding how the feature invariance loss influences the network's feature extraction process could lead to further insights and potential improvements.

Overall, the CISS method represents a significant contribution to the field of domain adaptation for semantic segmentation, and the results suggest that it could be a valuable tool for building robust perception systems for autonomous cars and robots.

Conclusion

This paper introduces a novel feature-level adaptation method called Condition-Invariant Semantic Segmentation (CISS) that leverages stylization to align the internal network features extracted by the encoder from the original and stylized views of each input image. By encouraging the encoder to extract features that are invariant to the style of the input, CISS allows the decoder to focus on parsing these features, rather than having to adapt to the specific visual conditions.

The results of the paper demonstrate that CISS sets a new state of the art on the Cityscapes→Dark Zurich benchmark for daytime-to-nighttime adaptation, and achieves the second-best performance on the Cityscapes→ACDC benchmark for normal-to-adverse weather adaptation. Additionally, CISS has shown the ability to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night.

The success of CISS highlights the importance of feature-level adaptation for building robust semantic segmentation systems that can operate reliably in diverse visual conditions, a critical capability for autonomous cars, robots, and other applications that rely on computer vision.



Related Papers


Condition-Invariant Semantic Segmentation

Christos Sakaridis, David Bruggemann, Fisher Yu, Luc Van Gool

Adaptation of semantic segmentation networks to different visual conditions is vital for robust perception in autonomous cars and robots. However, previous work has shown that most feature-level adaptation methods, which employ adversarial training and are validated on synthetic-to-real adaptation, provide marginal gains in condition-level adaptation, being outperformed by simple pixel-level adaptation via stylization. Motivated by these findings, we propose to leverage stylization in performing feature-level adaptation by aligning the internal network features extracted by the encoder of the network from the original and the stylized view of each input image with a novel feature invariance loss. In this way, we encourage the encoder to extract features that are already invariant to the style of the input, allowing the decoder to focus on parsing these features and not on further abstracting from the specific style of the input. We implement our method, named Condition-Invariant Semantic Segmentation (CISS), on the current state-of-the-art domain adaptation architecture and achieve outstanding results on condition-level adaptation. In particular, CISS sets the new state of the art in the popular daytime-to-nighttime Cityscapes→Dark Zurich benchmark. Furthermore, our method achieves the second-best performance on the normal-to-adverse Cityscapes→ACDC benchmark. CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night. Code is publicly available at https://github.com/SysCV/CISS.

7/23/2024

Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation

Anqi Zhang, Guangyu Gao

Class Incremental Semantic Segmentation (CISS), within Incremental Learning for semantic segmentation, targets segmenting new categories while reducing the catastrophic forgetting on the old categories. Besides, background shifting, where the background category changes constantly in each step, is a special challenge for CISS. Current methods with a shared background classifier struggle to keep up with these changes, leading to decreased stability in background predictions and reduced accuracy of segmentation. For this special challenge, we designed a novel background adaptation mechanism, which explicitly models the background residual rather than the background itself in each step, and aggregates these residuals to represent the evolving background. Therefore, the background adaptation mechanism ensures the stability of previous background classifiers, while enabling the model to concentrate on the easy-learned residuals from the additional channel, which enhances background discernment for better prediction of novel categories. To precisely optimize the background adaptation mechanism, we propose Pseudo Background Binary Cross-Entropy loss and Background Adaptation losses, which amplify the adaptation effect. Group Knowledge Distillation and Background Feature Distillation strategies are designed to prevent forgetting old categories. Our approach, evaluated across various incremental scenarios on Pascal VOC 2012 and ADE20K datasets, outperforms prior exemplar-free state-of-the-art methods with mIoU of 3.0% in VOC 10-1 and 2.0% in ADE 100-5, notably enhancing the accuracy of new classes while mitigating catastrophic forgetting. Code is available at https://andyzaq.github.io/barmsite/.

7/16/2024

Task-conditioned adaptation of visual features in multi-task policy learning

Pierre Marza, Laetitia Matignon, Olivier Simonin, Christian Wolf

Successfully addressing a wide variety of tasks is a core ability of autonomous agents, requiring flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the perception modules. An analogical argument would be the human visual system, which uses top-down signals to focus attention determined by the current task. Similarly, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, compared to existing work, it can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given a few demonstrations.

5/7/2024

Mitigating Background Shift in Class-Incremental Semantic Segmentation

Gilhan Park, WonJun Moon, SuBeen Lee, Tae-Young Kim, Jae-Pil Heo

Class-Incremental Semantic Segmentation (CISS) aims to learn new classes without forgetting the old ones, using only the labels of the new classes. To achieve this, two popular strategies are employed: 1) pseudo-labeling and knowledge distillation to preserve prior knowledge; and 2) background weight transfer, which leverages the broad coverage of background in learning new classes by transferring background weight to the new class classifier. However, the first strategy heavily relies on the old model in detecting old classes while undetected pixels are regarded as the background, thereby leading to the background shift towards the old classes (i.e., misclassification of old class as background). Additionally, in the case of the second approach, initializing the new class classifier with background knowledge triggers a similar background shift issue, but towards the new classes. To address these issues, we propose a background-class separation framework for CISS. To begin with, selective pseudo-labeling and adaptive feature distillation are used to distill only trustworthy past knowledge. On the other hand, we encourage the separation between the background and new classes with a novel orthogonal objective along with label-guided output distillation. Our state-of-the-art results validate the effectiveness of these proposed methods.

7/17/2024