Neural Collapse versus Low-rank Bias: Is Deep Neural Collapse Really Optimal?

2405.14468

Published 5/24/2024 by Peter S'uken'ik, Marco Mondelli, Christoph Lampert

🧠

Abstract

Deep neural networks (DNNs) exhibit a surprising structure in their final layer known as neural collapse (NC), and a growing body of works has currently investigated the propagation of neural collapse to earlier layers of DNNs -- a phenomenon called deep neural collapse (DNC). However, existing theoretical results are restricted to special cases: linear models, only two layers or binary classification. In contrast, we focus on non-linear models of arbitrary depth in multi-class classification and reveal a surprising qualitative shift. As soon as we go beyond two layers or two classes, DNC stops being optimal for the deep unconstrained features model (DUFM) -- the standard theoretical framework for the analysis of collapse. The main culprit is a low-rank bias of multi-layer regularization schemes: this bias leads to optimal solutions of even lower rank than the neural collapse. We support our theoretical findings with experiments on both DUFM and real data, which show the emergence of the low-rank structure in the solution found by gradient descent.

Create account to get full access

Overview

Deep neural networks (DNNs) exhibit a surprising structure in their final layer called neural collapse (NC)
A growing body of research has investigated the propagation of neural collapse to earlier layers of DNNs, known as deep neural collapse (DNC)
Existing theoretical results are limited to special cases like linear models, two-layer networks, or binary classification
This paper focuses on non-linear models of arbitrary depth in multi-class classification and reveals a surprising qualitative shift in the nature of DNC

Plain English Explanation

As deep neural networks get better at tasks like image recognition or language processing, researchers have noticed an interesting pattern in how the final layer of the network behaves. This pattern is called "neural collapse," and it suggests that the network is doing something clever to simplify the way it represents information.

Researchers have been studying how this neural collapse can propagate back through the earlier layers of the network, a phenomenon called "deep neural collapse" (DNC). However, the existing theoretical work on DNC has been limited to fairly simple cases, like linear models or networks with only two layers.

In contrast, this paper looks at more complex, non-linear neural networks with many layers and multiple classes (e.g., a network that can recognize different types of animals). The researchers find that as soon as you move beyond two layers or two classes, the standard theoretical framework for analyzing DNC, called the "deep unconstrained features model" (DUFM), stops being the best way to understand what's happening.

The reason is that the regularization techniques used to train these networks (which help prevent overfitting) introduce a "low-rank bias" that leads to solutions with an even lower rank than what you'd expect from neural collapse alone. This low-rank structure emerges in both the theoretical analysis and in experiments on real-world data.

Technical Explanation

The paper focuses on understanding the propagation of neural collapse (NC) to earlier layers of deep neural networks, a phenomenon known as deep neural collapse (DNC). While existing theoretical results on DNC are limited to special cases like linear models, two-layer networks, or binary classification, this work examines the behavior of non-linear models with arbitrary depth in multi-class classification tasks.

The key finding is that as soon as we move beyond two layers or two classes, the standard theoretical framework for analyzing DNC, called the deep unconstrained features model (DUFM), stops being the optimal solution. The main reason for this is a low-rank bias introduced by multi-layer regularization schemes, which leads to solutions of even lower rank than what would be expected from neural collapse alone.

The paper supports these theoretical insights with experiments on both the DUFM and real-world datasets, which demonstrate the emergence of the low-rank structure in the solutions found by gradient descent optimization.

Critical Analysis

The paper provides a valuable contribution to the understanding of deep neural collapse by extending the analysis beyond the special cases explored in previous work. The finding that the standard DUFM framework breaks down in more complex settings is an important insight that highlights the need for more nuanced theoretical models to capture the behavior of deeper, multi-class neural networks.

The paper also notes that the low-rank bias introduced by regularization techniques is a potential culprit for this qualitative shift, but further research is needed to fully understand the underlying mechanisms.

One limitation of the work is that it focuses on the analysis of the final solution found by gradient descent, without delving into the dynamics of the training process. Exploring how the low-rank structure emerges during training could provide additional insights into the nature of deep neural collapse.

Additionally, the paper does not address the potential implications of these findings for the practical applications of deep learning. Understanding how the low-rank structure affects model performance, generalization, or interpretability could be an important area for future research.

Conclusion

This paper makes a significant contribution to the theoretical understanding of deep neural collapse by revealing a surprising qualitative shift in the nature of the phenomenon when moving beyond two-layer or two-class models. The key insight is that the low-rank bias introduced by common regularization techniques can lead to solutions with an even lower rank than what would be expected from neural collapse alone.

These findings underscore the need for more sophisticated theoretical frameworks to capture the complex behavior of deep neural networks, particularly in multi-class settings. Continued research in this direction could yield important insights into the inner workings of deep learning and help unlock its full potential across a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Feature Model

Hien Dang, Tho Tran, Tan Nguyen, Nhat Ho

The current paradigm of training deep neural networks for classification tasks includes minimizing the empirical risk that pushes the training loss value towards zero, even after the training error has been vanished. In this terminal phase of training, it has been observed that the last-layer features collapse to their class-means and these class-means converge to the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is termed as Neural Collapse (NC). To theoretically understand this phenomenon, recent works employ a simplified unconstrained feature model to prove that NC emerges at the global solutions of the training problem. However, when the training dataset is class-imbalanced, some NC properties will no longer be true. For example, the class-means geometry will skew away from the simplex ETF when the loss converges. In this paper, we generalize NC to imbalanced regime for cross-entropy loss under the unconstrained ReLU feature model. We prove that, while the within-class features collapse property still holds in this setting, the class-means will converge to a structure consisting of orthogonal vectors with different lengths. Furthermore, we find that the classifier weights are aligned to the scaled and centered class-means with scaling factors depend on the number of training samples of each class, which generalizes NC in the class-balanced setting. We empirically prove our results through experiments on practical architectures and dataset.

6/7/2024

cs.LG stat.ML

Progressive Feedforward Collapse of ResNet Training

Sicong Wang, Kuo Gai, Shihua Zhang

Neural collapse (NC) is a simple and symmetric phenomenon for deep neural networks (DNNs) at the terminal phase of training, where the last-layer features collapse to their class means and form a simplex equiangular tight frame aligning with the classifier vectors. However, the relationship of the last-layer features to the data and intermediate layers during training remains unexplored. To this end, we characterize the geometry of intermediate layers of ResNet and propose a novel conjecture, progressive feedforward collapse (PFC), claiming the degree of collapse increases during the forward propagation of DNNs. We derive a transparent model for the well-trained ResNet according to that ResNet with weight decay approximates the geodesic curve in Wasserstein space at the terminal phase. The metrics of PFC indeed monotonically decrease across depth on various datasets. We propose a new surrogate model, multilayer unconstrained feature model (MUFM), connecting intermediate layers by an optimal transport regularizer. The optimal solution of MUFM is inconsistent with NC but is more concentrated relative to the input data. Overall, this study extends NC to PFC to model the collapse phenomenon of intermediate layers and its dependence on the input data, shedding light on the theoretical understanding of ResNet in classification problems.

5/3/2024

cs.LG cs.AI

Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse

Yining Wang, Junjie Sun, Chenyue Wang, Mi Zhang, Min Yang

Recent studies have noted an intriguing phenomenon termed Neural Collapse, that is, when the neural networks establish the right correlation between feature spaces and the training targets, their last-layer features, together with the classifier weights, will collapse into a stable and symmetric structure. In this paper, we extend the investigation of Neural Collapse to the biased datasets with imbalanced attributes. We observe that models will easily fall into the pitfall of shortcut learning and form a biased, non-collapsed feature space at the early period of training, which is hard to reverse and limits the generalization capability. To tackle the root cause of biased classification, we follow the recent inspiration of prime training, and propose an avoid-shortcut learning framework without additional training complexity. With well-designed shortcut primes based on Neural Collapse structure, the models are encouraged to skip the pursuit of simple shortcuts and naturally capture the intrinsic correlations. Experimental results demonstrate that our method induces better convergence properties during training, and achieves state-of-the-art generalization performance on both synthetic and real-world biased datasets.

5/10/2024

cs.CV cs.LG

Kernel vs. Kernel: Exploring How the Data Structure Affects Neural Collapse

Vignesh Kothapalli, Tom Tirer

Recently, a vast amount of literature has focused on the Neural Collapse (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within class variability of the network's deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. In this paper, we provide a kernel-based analysis that does not suffer from this limitation. First, given a kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples' features (and consequently an NC1 metric). Then, we turn to focus on kernels associated with shallow NNs. First, we consider the NN Gaussian Process kernel (NNGP), associated with the network at initialization, and the complement Neural Tangent Kernel (NTK), associated with its training in the lazy regime. Interestingly, we show that the NTK does not represent more collapsed features than the NNGP for prototypical data models. As NC emerges from training, we then consider an alternative to NTK: the recently proposed adaptive kernel, which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels enables gaining insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.

7/1/2024

cs.LG cs.AI cs.IT stat.ML