Towards Stable Machine Learning Model Retraining via Slowly Varying Sequences

Read original: arXiv:2403.19871 - Published 5/24/2024 by Dimitris Bertsimas, Vassilis Digalakis Jr, Yu Ma, Phevos Paschalidis

Introduction

The paper discusses the importance of interpretability and stability of machine learning models when they are updated with new data. It highlights the challenges faced when models provide drastically different insights after retraining, which can lead to skepticism and hesitation in adoption, especially in high-stakes decision areas like healthcare.

The authors propose a framework for stabilizing the structure of machine learning models across retraining iterations, formulated as a mixed-integer optimization problem. The resulting algorithm aims to improve on existing greedy approaches by learning stable structures across multiple modeling techniques, including regression, classification trees, and gradient-boosted trees.

The main contributions of the paper are:

Developing a general framework for stabilizing machine learning model structures during retraining, formulated as a mixed-integer optimization problem.
Demonstrating a significant boost in model stability with small computational overhead and controlled sacrifice of performance.
Showing the applicability of the proposed framework in a real-world production pipeline used in a large US hospital for mortality prediction.

The paper is organized as follows: Section 2 outlines the problem setting and formulation of the retraining setting, Section 3 presents numerical experimental results on a real-world case study, and Section 4 discusses limitations and concludes the study.

A Methodology for Retraining Machine Learning Models

This section sets out the problem and its formulation: retraining machine learning models on new data while keeping them stable over time. Key points:

The goal is to find a sequence of models f1, f2, ..., fB that minimizes predictive error on each batch of data Db while ensuring that consecutive models do not differ significantly, as measured by a distance metric d.
This bi-objective optimization problem is first formulated as a mixed-integer program by pre-computing candidate model sets for each batch and selecting one model per batch to form the sequence.
It is shown that this model selection problem can be reduced to a shortest path problem on a graph and solved in polynomial time.
Two approaches for calculating the model distance d are discussed: structural distances (comparing model parameters/structure) and feature importance-based distances.
Finally, an adaptive retraining strategy is proposed: when a new batch DB+1 arrives, instead of greedily updating to the model closest to the current fB, the method constructs an entirely new sequence f1', f2', ..., fB+1' on the updated dataset DB+1. This achieves stability over longer horizons with multiple updates.
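The model selection step described in the list above can be written compactly. The following is a sketch under assumed notation, not the paper's exact formulation: F_b is the pre-computed candidate set for batch b, z_{b,i} is a binary variable that equals 1 when candidate f_{b,i} is chosen, l_b is the predictive loss on batch b, and alpha is an accuracy tolerance that bounds how far a chosen model may fall behind the best candidate of its batch.

```latex
\begin{aligned}
\min_{z}\quad & \sum_{b=1}^{B-1} \sum_{i \in \mathcal{F}_b} \sum_{j \in \mathcal{F}_{b+1}} d(f_{b,i}, f_{b+1,j})\, z_{b,i}\, z_{b+1,j} \\
\text{s.t.}\quad & \sum_{i \in \mathcal{F}_b} z_{b,i} = 1, && b = 1, \dots, B, \\
& \sum_{i \in \mathcal{F}_b} \ell_b(f_{b,i})\, z_{b,i} \le (1 + \alpha)\, \min_{i \in \mathcal{F}_b} \ell_b(f_{b,i}), && b = 1, \dots, B, \\
& z_{b,i} \in \{0, 1\}.
\end{aligned}
```

In this sketch the two objectives are handled by bounding the error and minimizing the total distance. Sweeping alpha would then trace out different points on the trade-off between predictive power and stability.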

Overall, the framework updates models gradually as new data arrives while controlling deviations from previous models, so that the updating process stays stable over time.
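The shortest-path reduction mentioned above can be illustrated with a small dynamic program over a layered graph, where each layer holds the candidate models of one batch and each edge is weighted by the distance between two candidates in consecutive batches. This is a minimal sketch, not the authors' implementation: the function name `slowly_varying_sequence`, the tolerance rule (a candidate is admissible if its loss is within a factor of 1 + tolerance of the best loss in its batch, with positive losses), and the toy coefficient vectors are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def slowly_varying_sequence(
    losses: Sequence[Sequence[float]],
    distance: Callable[[int, int, int], float],
    tolerance: float,
) -> Tuple[List[int], float]:
    """Pick one candidate per batch, minimizing total adjacent-model distance.

    losses[b][i]      loss of candidate i on batch b (positive, lower is better)
    distance(b, i, j) distance between candidate i of batch b and
                      candidate j of batch b + 1
    tolerance         assumed rule: keep candidates whose loss is at most
                      (1 + tolerance) times the best loss in their batch
    """
    admissible = [
        [i for i, loss in enumerate(batch) if loss <= (1 + tolerance) * min(batch)]
        for batch in losses
    ]
    # cost[i] = cheapest path cost ending at candidate i of the current batch
    cost: Dict[int, float] = {i: 0.0 for i in admissible[0]}
    back: List[Dict[int, int]] = []  # back[b][j] = best predecessor of j
    for b in range(len(losses) - 1):
        new_cost: Dict[int, float] = {}
        pointers: Dict[int, int] = {}
        for j in admissible[b + 1]:
            prev = min(cost, key=lambda i: cost[i] + distance(b, i, j))
            new_cost[j] = cost[prev] + distance(b, prev, j)
            pointers[j] = prev
        cost = new_cost
        back.append(pointers)
    # Walk the predecessor pointers backwards from the cheapest end node.
    last = min(cost, key=lambda i: cost[i])
    total = cost[last]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total


# Toy data (hypothetical): two candidate linear models per batch, three batches.
coefs = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.8, 0.2]],
    [[0.1, 0.9], [0.7, 0.3]],
]
losses = [[0.20, 0.21], [0.20, 0.21], [0.20, 0.21]]


def l1_distance(b: int, i: int, j: int) -> float:
    """Structural distance: L1 norm between coefficient vectors."""
    return sum(abs(x - y) for x, y in zip(coefs[b][i], coefs[b + 1][j]))


stable_path, stable_total = slowly_varying_sequence(losses, l1_distance, 0.10)
greedy_path, greedy_total = slowly_varying_sequence(losses, l1_distance, 0.0)
print(stable_path, round(stable_total, 3))
print(greedy_path, round(greedy_total, 3))
```

With a tolerance of zero only the best-performing candidate of each batch is admissible, so the routine reduces to the greedy sequence. On the toy data, allowing a 10% tolerance lets it pick the slightly less accurate candidates whose coefficients barely move between batches. The run time grows with the number of batches times the square of the candidate-set size, in line with the polynomial-time claim.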

Numerical Experiments

This section discusses a case study on patient mortality risk prediction using a large hospital dataset from Connecticut. The key points are:

Data and Preprocessing:

The dataset contains daily patient data from January 2018 to May 2022, with 168,815 patients and 865,954 patient-day records.
The data is divided into training, validation, and testing sets, with the last 16 months (September 2021 to December 2022) used for testing.

Research Questions:

Evaluate the intra-sequence stability (pairwise distance between adjacent models) and inter-sequence stability (distance between final models of independent sequences) of the slowly varying machine learning (SVML) methodology.
Determine the ideal update frequency (interval length) for SVML.
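The two stability notions in the first research question can be made concrete with a short sketch. It assumes each model is summarized by a feature-importance vector, that distance is the Euclidean norm between those vectors, and that distances are averaged over pairs; the source only states which models are compared, so the averaging, the choice of norm, and the function names are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence

Vector = Sequence[float]


def l2(u: Vector, v: Vector) -> float:
    """Euclidean distance between two feature-importance vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def intra_sequence_distance(sequence: Sequence[Vector]) -> float:
    """Mean distance between adjacent models within one sequence."""
    pairs = list(zip(sequence, sequence[1:]))
    return sum(l2(u, v) for u, v in pairs) / len(pairs)


def inter_sequence_distance(final_models: Sequence[Vector]) -> float:
    """Mean pairwise distance between the final models of independent sequences."""
    pairs = list(combinations(final_models, 2))
    return sum(l2(u, v) for u, v in pairs) / len(pairs)


# Hypothetical importance vectors for one sequence of three models.
sequence = [[0.0, 0.0], [0.3, 0.4], [0.3, 0.4]]
# Hypothetical final models from three independently built sequences.
finals = [[0.3, 0.4], [0.3, 0.4], [0.0, 0.0]]

intra = intra_sequence_distance(sequence)
inter = inter_sequence_distance(finals)
print(round(intra, 3), round(inter, 3))
```

Lower values mean more stable behavior in both cases: a small intra-sequence value says consecutive retrained models tell a similar story, and a small inter-sequence value says independent runs converge to similar final models.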

Experimental Methodology:

For each data batch (one interval of the data), a set of candidate models is pre-computed.
Slowly varying sequences of models are constructed based on different accuracy tolerances, along with a "greedy sequence" of best-performing models.
The area under the receiver operating characteristic curve (AUC) and model distances are evaluated on the testing set.

Key Findings:

A 3-month interval length was found to be optimal for SVML, balancing performance and stability.
SVML sequences achieved similar AUC performance to greedy sequences but with significantly improved model stability (smaller pairwise distances between adjacent models).
SVML exhibited better inter-sequence stability compared to the greedy approach, consistently selecting models with stable feature importances.

No details on software implementation or experimental setup were provided.

Conclusion and Limitations

The paper proposes slowly varying machine learning (SVML), a mixed-integer optimization algorithm that trains globally optimal machine learning models across successive data batch updates so that model structure remains stable. The key idea is to build stability into the training process itself by imposing structural similarity between consecutive models.

The authors acknowledge that retraining does not always proceed sequentially by adding new data to all past historical data; in practice, models may be retrained on only the most recent years of data to avoid distributional shift. They also highlight the need to account for the computational demands of the problem and to apply efficient speed-ups so that the method scales.

In the existing case study, a significant portion of time is spent choosing medically-sound candidates to form a candidate pool, taking up to 2 weeks. Additional challenges may arise when the frequency of data transfer changes, affecting the data updating process.

The proposed SVML algorithm is presented as a solution that provides significant stability improvements over models trained solely for predictive performance. The authors demonstrate the applicability and versatility of their method through a real-world healthcare case study.
