Random Forests for time-fixed and time-dependent predictors: The DynForest R package

Read original: arXiv:2302.02670 - Published 4/12/2024 by Anthony Devaux (BPH, GIGH, UNSW), Cécile Proust-Lima (BPH), Robin Genuer (BPH)


Overview

The R package DynForest implements random forests for predicting continuous, categorical, or time-to-event outcomes using time-fixed and time-dependent predictors.
DynForest can handle time-dependent predictors that are endogenous (impacted by the outcome process), measured with error, and observed at subject-specific times.
The package uses flexible linear mixed models to summarize time-dependent predictors into individual features for the tree-building process.
DynForest returns predictions for continuous, categorical, or survival outcomes, as well as variable importance and minimal depth measures.
The paper provides step-by-step examples for fitting random forests using DynForest.

Plain English Explanation

DynForest is an R package that helps researchers make predictions about different types of outcomes, like continuous variables, categories, or the time it takes for an event to occur. What sets DynForest apart is its ability to work with predictors that change over time and may be influenced by the outcome itself.

For example, imagine you're studying how people's blood pressure changes over time. Blood pressure is the outcome, and factors like diet, exercise, and medication use are the predictors. But those predictors can also be affected by a person's blood pressure, creating a complex, intertwined relationship.

DynForest uses advanced statistical techniques to handle these time-dependent predictors and make accurate predictions. It can tell you things like the average blood pressure for a group, the most likely blood pressure category a person will fall into, or the probability of developing high blood pressure over time.

The package also provides tools to identify the most important predictors or groups of predictors, helping researchers understand the key factors driving the outcomes they're studying. This can be valuable for generating plausible counterfactual explanations and developing more accurate, explainable forecasting models.

Technical Explanation

The main innovation of the DynForest package is its ability to handle time-dependent predictors that can be endogenous (affected by the outcome process), measured with error, and observed at different times for each study participant. This is a common challenge in longitudinal and survival analysis studies.

To address this, DynForest uses flexible linear mixed models to summarize the time-dependent predictors into individual features that can be used in the tree-building process of the random forest algorithm. The specific model used to summarize the time-dependent predictors is pre-specified by the user.
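To make the idea of "summarizing a trajectory into individual features" concrete, here is a minimal Python sketch. It is not the DynForest API and it deliberately swaps in a simpler technique: an ordinary least-squares line fitted separately to each subject, rather than the flexible linear mixed models the package relies on. The subject data and the function name `summarize_trajectory` are hypothetical.

```python
import numpy as np

def summarize_trajectory(times, values):
    """Summarize one subject's repeated measurements as an intercept and a slope.

    Simplification: a per-subject least-squares line, standing in for the
    subject-specific random effects a linear mixed model would provide.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # Design matrix with a column of ones (intercept) and the measurement times.
    design = np.column_stack([np.ones_like(times), times])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return {"intercept": float(coef[0]), "slope": float(coef[1])}

# Hypothetical subjects, each measured at their own times.
subjects = {
    "A": ([0.0, 1.0, 2.5], [120.0, 124.0, 130.0]),
    "B": ([0.5, 3.0], [110.0, 105.0]),
}

# One row of time-fixed features per subject, usable as split candidates.
features = {sid: summarize_trajectory(t, y) for sid, (t, y) in subjects.items()}
print(features)
```

Each subject ends up with the same set of features regardless of how many measurements they have or when those were taken, which is what lets a tree split on them. A mixed model additionally borrows strength across subjects and accounts for measurement error, which this per-subject fit does not.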

The package can then make predictions for continuous, categorical, or time-to-event (survival) outcomes. For continuous outcomes, it returns the mean prediction. For categorical outcomes, it returns the category with the majority vote. And for survival outcomes, it returns the cumulative incidence function over time.
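The three aggregation rules described above can be sketched in a few lines of Python. This is an illustration only, not package code. It assumes that per-tree cumulative incidence curves are evaluated on a shared time grid and combined by pointwise averaging, a detail this summary does not spell out. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean

def aggregate_continuous(tree_predictions):
    """Continuous outcome: average the per-tree predictions."""
    return mean(tree_predictions)

def aggregate_categorical(tree_predictions):
    """Categorical outcome: return the category predicted by the most trees.

    On a tie, Counter.most_common keeps the category seen first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_cif(tree_cifs):
    """Survival outcome: pointwise average of per-tree cumulative incidence
    curves (assumed combination rule), all evaluated on the same time grid."""
    return [mean(values) for values in zip(*tree_cifs)]

# Toy predictions from a three-tree forest for a single subject.
print(aggregate_continuous([1.0, 2.0, 3.0]))
print(aggregate_categorical(["high", "normal", "high"]))
print(aggregate_cif([[0.0, 0.2, 0.4], [0.0, 0.4, 0.6], [0.0, 0.3, 0.5]]))
```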

In addition to the predictions, DynForest also computes variable importance and minimal depth measures to identify the most predictive variables or groups of variables. This can provide valuable insights into the key drivers of the outcome being studied.
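As a rough illustration of minimal depth, the Python sketch below records the shallowest level at which each variable is used to split within a tree, then averages that value across trees. The tree representation (nested dictionaries), the convention that the root sits at depth 0, and the choice to average only over trees that use a variable are all assumptions made for this example. The package's own conventions may differ.

```python
from statistics import mean

def minimal_depths(node, depth=0, found=None):
    """Return {variable: shallowest split depth} for one tree.

    A node is a dict with a "var" key and optional "left"/"right" children.
    A missing child (None) is treated as a leaf.
    """
    if found is None:
        found = {}
    if node is None:
        return found
    var = node["var"]
    if var not in found or depth < found[var]:
        found[var] = depth
    minimal_depths(node.get("left"), depth + 1, found)
    minimal_depths(node.get("right"), depth + 1, found)
    return found

def average_minimal_depth(trees):
    """Average each variable's minimal depth over the trees that use it."""
    per_variable = {}
    for tree in trees:
        for var, depth in minimal_depths(tree).items():
            per_variable.setdefault(var, []).append(depth)
    return {var: mean(depths) for var, depths in per_variable.items()}

# Two hypothetical trees splitting on hypothetical variables.
tree_a = {"var": "age", "left": {"var": "bp_slope"}, "right": {"var": "age"}}
tree_b = {"var": "bp_slope", "left": {"var": "age", "left": {"var": "sex"}}}

print(average_minimal_depth([tree_a, tree_b]))
```

A lower average means a variable tends to be chosen closer to the root, which is usually read as a sign that it is more predictive.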

Critical Analysis

The DynForest package addresses an important challenge in longitudinal and survival analysis by providing a flexible framework for incorporating time-dependent predictors. This is a significant advancement, as many real-world phenomena involve complex, time-varying relationships between predictors and outcomes.

However, the paper does not provide a detailed assessment of the package's performance compared to other methods for handling time-dependent predictors, such as joint modeling or dynamic prediction approaches. It would be helpful to see a comparative analysis to understand the relative strengths and weaknesses of DynForest.

Additionally, the paper focuses on the technical details of the package's implementation, but does not delve into potential limitations or areas for further research. For example, it would be interesting to explore how DynForest handles missing data in the time-dependent predictors, or how the package's performance scales with the complexity of the underlying data and model.

Overall, the DynForest package appears to be a valuable tool for researchers working with time-dependent predictors, but a more critical examination of its capabilities and limitations would help readers assess its suitability for their specific research needs.

Conclusion

The DynForest R package provides a robust and flexible framework for fitting random forests to predict continuous, categorical, or time-to-event outcomes using both time-fixed and time-dependent predictors. Its ability to handle endogenous, error-prone, and unevenly observed time-dependent predictors is a significant advancement in the field of longitudinal and survival analysis.

By using advanced statistical techniques to summarize the time-dependent predictors, DynForest enables researchers to make accurate predictions and gain insights into the key drivers of the outcomes they're studying. This can be valuable for developing more accurate and explainable forecasting models, as well as generating plausible counterfactual explanations to better understand the underlying relationships in the data.

While the paper provides a thorough technical explanation of the DynForest package, a more critical analysis of its performance and limitations would help researchers assess its suitability for their specific research needs. Overall, DynForest represents an important contribution to the field of longitudinal and survival analysis, with the potential to drive new insights and applications in a wide range of domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Random Forests for time-fixed and time-dependent predictors: The DynForest R package

Anthony Devaux (BPH, GIGH, UNSW), Cécile Proust-Lima (BPH), Robin Genuer (BPH)

The R package DynForest implements random forests for predicting a continuous, a categorical or a (multiple causes) time-to-event outcome based on time-fixed and time-dependent predictors. The main originality of DynForest is that it handles time-dependent predictors that can be endogenous (i.e., impacted by the outcome process), measured with error and measured at subject-specific times. At each recursive step of the tree building process, the time-dependent predictors are internally summarized into individual features on which the split can be done. This is achieved using flexible linear mixed models (thanks to the R package lcmm) whose specification is defined in advance by the user. DynForest returns the mean for a continuous outcome, the category with a majority vote for a categorical outcome, or the cumulative incidence function over time for a survival outcome. DynForest also computes variable importance and minimal depth to inform on the most predictive variables or groups of variables. This paper aims to guide the user with step-by-step examples for fitting random forests using DynForest.

4/12/2024

missForestPredict -- Missing data imputation for prediction settings

Elena Albu, Shan Gao, Laure Wynants, Ben Van Calster

Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occurs at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.

7/8/2024

forester: A Tree-Based AutoML Tool in R

Hubert Ruczyński, Anna Kozak

The majority of automated machine learning (AutoML) solutions are developed in Python, however a large percentage of data scientists are associated with the R language. Unfortunately, there are limited R solutions available. Moreover high entry level means they are not accessible to everyone, due to required knowledge about machine learning (ML). To fill this gap, we present the forester package, which offers ease of use regardless of the user's proficiency in the area of machine learning. The forester is an open-source AutoML package implemented in R designed for training high-quality tree-based models on tabular data. It fully supports binary and multiclass classification, regression, and partially survival analysis tasks. With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis.

9/10/2024

Randomized Spline Trees for Functional Data Classification: Theory and Application to Environmental Time Series

Donato Riccio, Fabrizio Maturo, Elvira Romano

Functional data analysis (FDA) and ensemble learning can be powerful tools for analyzing complex environmental time series. Recent literature has highlighted the key role of diversity in enhancing accuracy and reducing variance in ensemble methods. This paper introduces Randomized Spline Trees (RST), a novel algorithm that bridges these two approaches by incorporating randomized functional representations into the Random Forest framework. RST generates diverse functional representations of input data using randomized B-spline parameters, creating an ensemble of decision trees trained on these varied representations. We provide a theoretical analysis of how this functional diversity contributes to reducing generalization error and present empirical evaluations on six environmental time series classification tasks from the UCR Time Series Archive. Results show that RST variants outperform standard Random Forests and Gradient Boosting on most datasets, improving classification accuracy by up to 14%. The success of RST demonstrates the potential of adaptive functional representations in capturing complex temporal patterns in environmental data. This work contributes to the growing field of machine learning techniques focused on functional data and opens new avenues for research in environmental time series analysis.

9/14/2024