Step-by-step guide to the Coxmos pipeline

Pedro Salguero García

Polytechnic University of Valencia and Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
pedrosalguerog[at]gmail.com

Last updated: 22 marzo, 2024

Introduction

The Coxmos R package includes the following basic analysis blocks:

  1. Cross-validation and Modeling for High-Dimensional Data: The user can use the Coxmos package to select the optimal parameters for survival models in high-dimensional datasets. The package provides tools for estimating the best values for the parameters.

  2. Comparing Classical and High-Dimensional Survival Models: After obtaining multiple survival models, the user can compare them to determine which one gives the best results. The package includes several functions for comparing the models.

  3. Interpreting Results: After selecting the best model or models, the user can interpret the results. The package includes several functions for understanding the impact of the original variables on survival prediction, even when working with (s)PLS methods.

  4. Predicting New Observations: Finally, if a new dataset of patients is available, the user can use the model to make predictions for the new patients and compare the variables against the model coefficients to estimate the patients’ risk of an event.

Installation

Coxmos can be installed via CRAN:

install.packages("Coxmos")

Or from GitHub using devtools R package:

install.packages("devtools")
devtools::install_github("BiostatOmics/Coxmos")

Getting ready

To run the analyses in this vignette, you’ll first need to load Coxmos:

# load Coxmos
library(Coxmos)

In addition, we’ll require some additional packages for data formatting. Most of them are signaled as Coxmos dependencies, so they will already be installed in your system.

To generate better plots, we make use of the RColorConesa R package, but it is an optional R package.

# install.packages("RColorConesa")
library(RColorConesa)

Input data

Most survival methods in the Coxmos package require only two inputs to function correctly. First one must be a matrix of the features under study, and the second one a matrix with two columns called time and event for survival information.

After quality control, the data contains expression data for 150 observations and 369 proteins. The data could be load as follows:

# load dataset
data("X_proteomic")
data("Y_proteomic")

X <- X_proteomic
Y <- Y_proteomic

rm(X_proteomic, Y_proteomic)

X and Y are two data.frame objects. X is related to the explanatory variables. Rows are observations and columns are the different variables of interest. For Y matrix, rows are the observations with the same row names as X, and it has to have two columns. The first one is called “time” and the second one, “event”. “time” refers to the following up time until the event or until the last control if the observation is a right censored observation. The event could be TRUE/FALSE or 1/0 for those observations that have reach the event or are censored. An example could be show in the next code:

7529 7531 7534 1978 7158
TCGA-A2-A0SV-01A 0.025273 -0.030661 0.458350 1.369000 -0.40781
TCGA-A2-A0YT-01A 0.348210 0.347870 0.891570 -0.188540 -0.13896
TCGA-BH-A1F0-01A 0.132340 -0.348210 0.078923 0.227890 -0.49063
TCGA-B6-A0IK-01A 0.352670 0.376670 -0.285750 -0.380980 -1.43080
TCGA-E2-A1LE-01A 0.110140 0.094990 -0.159960 -0.098148 0.40160
time event
TCGA-A2-A0SV-01A 825 TRUE
TCGA-A2-A0YT-01A 723 TRUE
TCGA-BH-A1F0-01A 785 TRUE
TCGA-B6-A0IK-01A 571 TRUE
TCGA-E2-A1LE-01A 879 TRUE

As we said, the dimension of each object is 150x369 for X matrix, and 150x2 for Y matrix.

X
150
369
Y
150
2

Coxmos has a series of plots to understand the distribution of the data. One of them is the plot of events along time. The function requires the Y matrix. As optional arguments, the user can specify:

  1. max.breaks Number of breaks for the histogram.
  2. roundTo the minimum numeric value for round break-times values. E.g. 0.25, a value of 1.32 will be rounded to 1.25 (numbers shall be rounded off to multipliers of 0.25).
  3. categories the name for each category (character vector of length two)
  4. y.text the name for the y axis
ggp_density.event <- plot_events(Y = Y, 
                                 categories = c("Censored","Death"), #name for FALSE/0 (Censored) and TRUE/1 (Event)
                                 y.text = "Number of observations", 
                                 roundTo = 0.5, 
                                 max.breaks = 15)
ggp_density.event$plot

Survival models for low/high dimensional data sets

After loading the data, it may be of interest for the user to perform a survival analysis in order to examine the relationship between explanatory variables and the outcome. However, traditional methods are only applicable for low-dimensional datasets. To address this issue, we have developed a set of functions that make use of (s)PLS techniques in combination with Cox analysis for the analysis of high-dimensional datasets.

Coxmos provides the following methodologies:

More information for each approach could be found in the help section for each function. The function name for each methodology are:

To perform a survival analysis with our example, we will use methodologies that can work with high-dimensional data. These are the set of methodologies that use PLS and COX Elastic Net techniques.

The first thing we are going to do is split our data into a train and test set. This split will be made with a proportion of 70% of the data for training and 30% for testing.

TRAIN/TEST

We will use the function createDataPartition from the R package caret. We will use a 70% - 30% split for training and testing, respectively, and set a seed for reproducible results.

set.seed(123)
index_train <- caret::createDataPartition(Y$event,
                                          p = .7, # 70% train
                                          list = FALSE,
                                          times = 1)
X_train <- X[index_train,] #106x369
Y_train <- Y[index_train,]
X_test <- X[-index_train,] #44x369
Y_test <- Y[-index_train,]

Clasical approaches

We have already mentioned that we need to work with a high-dimensional methodology. However, if we try to run a classical analysis with one of our functions such as a Cox analysis using the entire matrix X, we will get an error due to the specification of the MIN_EPV parameter.

This parameter (set by default to 5) establishes a minimum ratio of variables that should be included in a survival model according to the number of patients who experience the event. The formula is \(\frac{\text{# events}}{\text{# variables in survival model}}\). According to literature, the higher values the better. As a recommendation, values greater than 10 should be used to obtain models with good predictions, but this value should not be lower than 5.

If we perform a classical cox survival analysis we will obtain an error.

# classical approach
cox_model <- cox(X = X_train, Y = Y_train, 
                 x.center = T, x.scale = F,
                 remove_near_zero_variance = T, remove_zero_variance = T, toKeep.zv = NULL, 
                 remove_non_significant = F, alpha = 0.05, 
                 MIN_EPV = 5, FORCE = F, returnData = T, verbose = F)

Despite specifying an EPV of 5, we should have a maximum of 10 variables in our final survival model in relation to the number of events found in our training set. To compute our EPV, we can used the function getEPV() that requires the matrix X and Y as input.

EPV <- getEPV(X_train, Y_train)
EPV
#> [1] 0.1382114

As previously mentioned, our data set is high dimensional in terms of variables and patients, but also has an EPV value of 0.138, which is much lower than the recommended value in the literature for a good prediction model.

Once the problem has been demonstrated, we will proceed to perform the methods that allow us to work with high dimensional data sets. To do this, it will not only be necessary to launch the methods themselves that calculate the survival model, but also a cross validation to estimate the optimal values for each methodology.

Cross Validation

In order to perform survival analysis with our high-dimensional data, we have implemented a series of methods that utilize techniques such as Cox Elastic Net, to select a lower number of features applying a penalty or partial least squares (PLS) methodology in order to reduce the dimensionality of the input data.

To evaluate the performance of these methods, we have implemented cross-validation, which allows us to estimate the optimal parameters for future predictions based on prediction metrics such as: AIC, C-INDEX, I.BRIER and AUC. By default AUC metric (Area under the ROC curve) is used with the “cenROC” evaluator as it has provided the best results in our tests. However, multiple AUC evaluators could be used: “risksetROC”, “survivalROC”, “cenROC”, “nsROC”, “smoothROCtime_C” and “smoothROCtime_I”.

Furthermore, a mix of multiple metrics could be used to obtain the optimal model. The user has to establish different weight for each metric and all of them will be consider to compute the optimal model (the total weight must sum 1).

In addition, we have included options for normalizing data, filtering variables, and setting the minimum EPV, as well as specific parameters for each method, such as the alpha value for Cox Elastic Net and the number of components for PLS models. Overall, our cross-validation methodology allows us to effectively analyze high-dimensional survival data and optimize our model parameters.

Cox Elastic Net

Cox Elastic Net is based on the glmnet R package for survival analysis. However, the structure of the object and the way the analysis is performed has been updated by using our cross-validation methodology to estimate the optimal parameters for future predictions.

During the cross validation, we will iterate over the different alpha values between 0 and 1 (lasso and ridge regression). We can also select the maximum number of variables we want to select. In this case, we are not going to fix a specific value and the algorithm will update the value to a new one that meet the requirements for the EPV restriction.

NOTE: When a penalty value of 0 is selected, full ridge penalty is used. In this case, any EPV or max.variables restriction is used and all variables could be selected. For high dimensional data sets, values between (0, 1] are recommended.

# run cv.coxEN
cv.coxen_res <- cv.coxEN(X = X_train, Y = Y_train,
                         EN.alpha.list = c(0.1, 0.5, 0.9),
                         max.variables = ncol(X_train),
                         n_run = 2, k_folds = 5,
                         x.center = T, x.scale = F,
                         remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL,
                         remove_variance_at_fold_level = F,
                         remove_non_significant = F, alpha = 0.05, 
                         w_AIC = 0, w_c.index = 0, w_AUC = 1, w_BRIER = 0, times = NULL, max_time_points = 15,
                         MIN_AUC_INCREASE = 0.01, MIN_AUC = 0.8, MIN_COMP_TO_CHECK = 3,
                         pred.attr = "mean", pred.method = "cenROC", fast_mode = F,
                         MIN_EPV = 5, return_models = F, 
                         returnData = F, 
                         PARALLEL = F, verbose = F, seed = 123)
cv.coxen_res

The cross validation will be omitted for time consumption, but one the best model is computed to user could see how many variables and which penalty has been selected as the best. Once the cross validation study is perform, the user must compute a survival model specifying the parameters. The user could select the values by the cross-validation object, but in this case we are going to select them by hand.

coxen_model <- coxEN(X = X_train, Y = Y_train, 
                     EN.alpha = 0.5, #cv.coxen_res$opt.EN.alpha,
                     max.variables = 8, #cv.coxen_res$opt.nvar,
                     x.center = T, x.scale = F,
                     remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL, 
                     remove_non_significant = F, alpha = 0.05, 
                     MIN_EPV = 5, returnData = T, verbose = F)
coxen_model
#> The method used is coxEN.
#> Survival model:
#>                coef exp(coef)  se(coef) robust se           z   Pr(>|z|)
#> `840`   -0.70459296 0.4943097 0.3700628 0.3623640 -1.94443433 0.05184310
#> `3897`   0.14405821 1.1549513 0.1349672 0.1356156  1.06225378 0.28812049
#> `83481`  0.10505359 1.1107701 0.1288665 0.1232818  0.85214192 0.39413533
#> `2194`  -0.15920751 0.8528194 0.1443478 0.1327259 -1.19952114 0.23032537
#> `2625`  -0.24789765 0.7804398 0.1259469 0.1144574 -2.16585117 0.03032256
#> `3620`  -0.42535528 0.6535376 0.3617911 0.2841891 -1.49673321 0.13446269
#> `4629`  -0.25749311 0.7729870 0.1644677 0.1563013 -1.64741554 0.09947266
#> `7535`  -0.01031489 0.9897381 0.2983250 0.2676666 -0.03853632 0.96926008

After obtaining the model, if we print it we could see the cox model with all the variables have been selected. However, some of the selected variables have not been significant. If we wish to calculate a model where all selected variables are significant, we should update the remove_non_significant parameter to TRUE.

coxen_model <- coxEN(X = X_train, Y = Y_train, 
                     EN.alpha = 0.5, #cv.coxen_res$opt.EN.alpha
                     max.variables = 8, #cv.coxen_res$opt.nvar
                     x.center = T, x.scale = F,
                     remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL, 
                     remove_non_significant = T, alpha = 0.05, 
                     MIN_EPV = 5, returnData = T, verbose = F)
coxen_model
#> The method used is coxEN.
#> A total of 5 variables have been removed due to non-significance filter inside cox model.
#> Survival model:
#>              coef exp(coef)  se(coef) robust se         z     Pr(>|z|)
#> `840`  -1.0335236 0.3557512 0.2982663 0.2818888 -3.666423 0.0002459674
#> `2625` -0.2526926 0.7767066 0.1155368 0.1051668 -2.402779 0.0162710096
#> `4629` -0.3361206 0.7145369 0.1372981 0.1454396 -2.311066 0.0208291919

After this change, a total of 8 variables have been selected. And a total of 5 have been removed due to non-significant. To check which variables have been removed, we can check if by the value “nsv” (no-significant variables).

coxen_model$nsv
#> [1] "7535"  "83481" "3897"  "2194"  "3620"

sPLS-ICOX

In the same way, we will also perform a cross validation for the PLS-based models, starting with the PLS-ICOX methodology. In this case, the internal construction of the weights of the PLS model for the calculation of the components of the X matrix is based on univariate cox models. After this, we are able to reduce the dimensionality of our data set to ultimately launch a cox model with the latent variables or principal components of the PLS model.

# run cv.plsicox
cv.splsicox_res <- cv.splsicox(X = X_train, Y = Y_train,
                               max.ncomp = 2, penalty.list = c(0.5, 0.9),
                               n_run = 2, k_folds = 5,
                               x.center = T, x.scale = F,
                               remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL,
                               remove_variance_at_fold_level = F,
                               remove_non_significant_models = F, alpha = 0.05, 
                               w_AIC = 0, w_c.index = 0, w_AUC = 1, w_BRIER = 0, times = NULL, max_time_points = 15,
                               MIN_AUC_INCREASE = 0.01, MIN_AUC = 0.8, MIN_COMP_TO_CHECK = 3,
                               pred.attr = "mean", pred.method = "cenROC", fast_mode = F,
                               MIN_EPV = 5, return_models = F, remove_non_significant = F, returnData = F, 
                               PARALLEL = F, verbose = F, seed = 123)
cv.splsicox_res

If you want to see the evolution of your metric, you can access to the plots inside the cross validation object.

# plot cv.plsicox
cv.splsicox_res$plot_AUC

We will generate a PLS-ICOX model with optimal number of principal components and its penalty based on the results obtained from the cross validation.

splsicox_model <- splsicox(X = X_train, Y = Y_train, 
                           n.comp = 1, #cv.splsicox_res$opt.comp, 
                           penalty  = 0.9, #cv.splsicox_res$opt.spv_penalty,
                           x.center = T, x.scale = F,
                           remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL,
                           remove_non_significant = T,
                           MIN_EPV = 5, returnData = T, verbose = F)

splsicox_model
#> The method used is sPLS-ICOX.
#> Survival model:
#>             coef exp(coef) se(coef) robust se        z     Pr(>|z|)
#> comp_1 0.9963425  2.708358 0.183951 0.1719039 5.795928 6.794439e-09

sPLS-DRCOX

Next, we will perform a cross validation for the sPLS-DRCOX methodology. In this case, an sPLS model is run using the deviance residuals of a null Cox model as the Y matrix. The penalties for variable selection follow the strategy used in the R package “plsRcox” from which the idea for the methodology was derived.

# run cv.splsdrcox
cv.splsdrcox_res <- cv.splsdrcox(X = X_train, Y = Y_train,
                                 max.ncomp = 2, penalty.list = c(0.5, 0.9),
                                 n_run = 2, k_folds = 5,
                                 x.center = T, x.scale = F,
                                 remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL,
                                 remove_non_significant_models = F, alpha = 0.05, 
                                 w_AIC = 0, w_c.index = 0, w_AUC = 1, w_BRIER = 0, times = NULL,
                                 MIN_AUC_INCREASE = 0.01, MIN_AUC = 0.8, MIN_COMP_TO_CHECK = 3,
                                 pred.attr = "mean", pred.method = "cenROC", fast_mode = F,
                                 MIN_EPV = 5, return_models = F,
                                 PARALLEL = F, verbose = F, seed = 123)
cv.splsdrcox_res
# plot cv.plsicox
cv.splsdrcox_res$plot_AUC

Based on the results obtained through cross validation, we will create a PLS-DRCOX model with a single component and no eta penalty as seen in the cross validation.

splsdrcox_model <- splsdrcox(X = X_train, Y = Y_train, 
                             n.comp = 2, #cv.splsdrcox_res$opt.comp, 
                             penalty = 0.5, #cv.splsdrcox_res$opt.eta,
                             x.center = T, x.scale = F,
                             remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL,
                             remove_non_significant = T,
                             MIN_EPV = 5, returnData = T, verbose = F)

splsdrcox_model
#> The method used is sPLS-DRCOX.
#> Survival model:
#>             coef exp(coef)  se(coef) robust se        z     Pr(>|z|)
#> comp_1 0.8557248  2.353079 0.1517433 0.1401724 6.104801 1.029287e-09
#> comp_2 0.4254844  1.530332 0.1176525 0.1026072 4.146729 3.372583e-05

sPLS-DRCOX Dynamic

For sPLS-DRCOX methodology we implemented another kind of penalty based on an heuristic variable selection and the MixOmics algorithms. In this case, the penalty is based on a vector of number of variables to test. We can select a specific value to select that number of variables, but if we keep the value to NULL, the number of variable will be selected automatically.

With this method, the user can specify the minimum and maximum number of variables and the number of cutpoints (how many number of variables to test between the minimum and the maximum number of variables) to be tested. After the first iteration, the algorithm will select the optimal number of variables and will further investigate better results between the existing cut points and the optimal value selected.

# run cv.splsdrcox
cv.splsdrcox_dynamic_res <- cv.splsdrcox_dynamic(X = X_train, Y = Y_train,
                                                 max.ncomp = 2, vector = NULL,
                                                 MIN_NVAR = 10, MAX_NVAR = 400, 
                                                 n.cut_points = 10, EVAL_METHOD = "AUC",
                                                 n_run = 2, k_folds = 5,
                                                 x.center = T, x.scale = F,
                                                 remove_near_zero_variance = T, remove_zero_variance = F, 
                                                 toKeep.zv = NULL,
                                                 remove_non_significant_models = F, alpha = 0.05,
                                                 remove_variance_at_fold_level = F, remove_non_significant = F, 
                                                 w_AIC = 0, w_c.index = 0, w_AUC = 1, w_BRIER = 0, 
                                                 times = NULL, max_time_points = 15, returnData = F,
                                                 MIN_AUC_INCREASE = 0.01, MIN_AUC = 0.8, MIN_COMP_TO_CHECK = 3,
                                                 pred.attr = "mean", pred.method = "cenROC", fast_mode = F,
                                                 MIN_EPV = 5, return_models = F,
                                                 PARALLEL = F, verbose = F, seed = 123)
cv.splsdrcox_dynamic_res

After the cross validation, the user can select the exact number of components and variables to use in their model.

splsdrcox_dynamic_model <- splsdrcox_dynamic(X = X_train, Y = Y_train, 
                                             n.comp = 2, #cv.splsdrcox_dynamic_res$opt.comp,
                                             vector = 369, #cv.splsdrcox_dynamic_res$opt.nvar,
                                             x.center = T, x.scale = F,
                                             remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL,
                                             MIN_NVAR = 10, MAX_NVAR = 1000, n.cut_points = 5,
                                             MIN_AUC_INCREASE = 0.01,
                                             EVAL_METHOD = "AUC", pred.method = "cenROC", max.iter = 200,
                                             remove_non_significant = T, 
                                             MIN_EPV = 5, returnData = T, verbose = F)

splsdrcox_dynamic_model
#> The method used is sPLS-DRCOX-Dynamic.
#> Survival model:
#>             coef exp(coef)   se(coef) robust se        z     Pr(>|z|)
#> comp_1 0.6359930  1.888897 0.09236977 0.1067533 5.957596 2.559747e-09
#> comp_2 0.2255657  1.253031 0.05637257 0.0556032 4.056703 4.977024e-05

sPLS-DACOX

Finally, we will launch a COX-based sPLS-DA model. This algorithm is the simplest of all and with the least statistical development, but depending on the data set it can provide better results than the previous methods. In this case, we launch an sPLS-DA on the classification of patients according to whether they have suffered or not the study event without taking into account the time until the event. Then, we launch a Cox model using the latent variables of the model and using the entire Y matrix with its times and events/censored.

# run cv.splsdrcox
cv.splsdacox_dynamic_res <- cv.splsdacox_dynamic(X = X_train, Y = Y_train, 
                                                 max.ncomp = 2, vector = NULL,
                                                 MIN_NVAR = 10, MAX_NVAR = 400, 
                                                 n.cut_points = 10, EVAL_METHOD = "AUC",
                                                 n_run = 2, k_folds = 5,
                                                 x.center = T, x.scale = F,
                                                 remove_near_zero_variance = T, remove_zero_variance = F, 
                                                 toKeep.zv = NULL,
                                                 remove_variance_at_fold_level = F, remove_non_significant = F, 
                                                 remove_non_significant_models = F, alpha = 0.05, 
                                                 w_AIC = 0, w_c.index = 0, w_AUC = 1, w_BRIER = 0, 
                                                 times = NULL, max_time_points = 15, returnData = F, 
                                                 MIN_AUC_INCREASE = 0.01, MIN_AUC = 0.8, MIN_COMP_TO_CHECK = 3,
                                                 pred.attr = "mean", pred.method = "cenROC", fast_mode = F,
                                                 MIN_EPV = 5, return_models = F, max.iter = 200,
                                                 PARALLEL = F, verbose = F, seed = 123)
cv.splsdacox_dynamic_res

Once the best model is obtained through cross validation, it must be calculated using the desired parameters.

splsdacox_dynamic_model <- splsdacox_dynamic(X = X_train, Y = Y_train, 
                                             n.comp = 2, #cv.splsdacox_dynamic_res$opt.comp, 
                                             vector = 330, #cv.splsdacox_dynamic_res$opt.nvar,
                                             x.center = T, x.scale = F,
                                             remove_near_zero_variance = T, remove_zero_variance = F, toKeep.zv = NULL,
                                             MIN_NVAR = 10, MAX_NVAR = 1000, n.cut_points = 5,
                                             MIN_AUC_INCREASE = 0.01,
                                             EVAL_METHOD = "AUC", pred.method = "cenROC", max.iter = 200,
                                             remove_non_significant = T, 
                                             MIN_EPV = 5, returnData = T, verbose = F)

splsdacox_dynamic_model
#> The method used is sPLS-DACOX-Dynamic.
#> Survival model:
#>             coef exp(coef)  se(coef)  robust se        z     Pr(>|z|)
#> comp_1 0.4152963  1.514820 0.1000264 0.09125035 4.551175 5.334720e-06
#> comp_2 0.2097277  1.233342 0.0662779 0.06560106 3.197017 1.388569e-03

Comparing multiple survival models

Next, we will analyze the results obtained from the multiple models to see which one obtains the best predictions based on our data. To do this, we will use the test set that has not been used for the training of any model.

Comparing for multiple evaluators at the same time.

Initially, we will compare the area under the curve (AUC) for each of the methods according to the evaluator we want. The function is developed to simultaneously evaluate multiple evaluators. However, we will continue working with a single evaluator. In this case “cenROC”. On the other hand, we must provide a list of the different models as well as the X and Y test set we want to evaluate.

When evaluating survival model results, we must indicate at which temporal points we want to perform the evaluation. As we already specified a NULL value for the “times” variable in the cross-validation, we are going to let the algorithm to compute them again.

lst_models <- list("COX-EN" = coxen_model,
                   "sPLS-ICOX" = splsicox_model,
                   "sPLS-DRCOX" = splsdrcox_model,
                   "sPLS-DRCOX-Dynamic" = splsdrcox_dynamic_model,
                   "sPLS-DACOX-Dynamic" = splsdacox_dynamic_model)

eval_results <- eval_Coxmos_models(lst_models = lst_models,
                                  X_test = X_test, Y_test = Y_test, 
                                  pred.method = "cenROC",
                                  pred.attr = "mean",
                                  times = NULL, max_time_points = 15, 
                                  PARALLEL = F)

In case you prefer to test multiple AUC evaluators, a list of them could be proportionate by using the library “purrr”.

lst_evaluators <- c(cenROC = "cenROC", risksetROC = "risksetROC")

eval_results <- purrr::map(lst_evaluators, ~eval_Coxmos_models(lst_models = lst_models,
                                                              X_test = X_test, Y_test = Y_test,
                                                              pred.method = .,
                                                              pred.attr = "mean",
                                                              times = NULL, 
                                                              max_time_points = 15,
                                                              PARALLEL = F))

We can print the results obtained in the console where we can see, for each of the selected methods, the training time and the time it took to be evaluated, as well as their AIC, C-Index and AUC metrics and at which time points it was evaluated.

eval_results
#> Evaluation performed for methods: COX-EN, sPLS-ICOX, sPLS-DRCOX, sPLS-DRCOX-Dynamic, sPLS-DACOX-Dynamic.
#> COX-EN:
#>  training.time: 0.0065
#>  evaluating.time: 0.0031
#>  AIC: 353.447
#>  c.index: 0.6906
#>  time: 116, 369.786, 623.572, 877.358, 1131.143, 1384.929, 1638.715, 1892.5, 2146.286, 2400.072, 2653.858, 2907.643, 3161.429, 3415.215, 3669
#>  AUC: 0.57097
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I.Brier: 0.17834
#> 
#> sPLS-ICOX:
#>  training.time: 0.0359
#>  evaluating.time: 0.0026
#>  AIC: 341.6841
#>  c.index: 0.6971
#>  time: 116, 369.786, 623.572, 877.358, 1131.143, 1384.929, 1638.715, 1892.5, 2146.286, 2400.072, 2653.858, 2907.643, 3161.429, 3415.215, 3669
#>  AUC: 0.52323
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I.Brier: 0.17053
#> 
#> sPLS-DRCOX:
#>  training.time: 0.0067
#>  evaluating.time: 0.0026
#>  AIC: 328.224
#>  c.index: 0.7811
#>  time: 116, 369.786, 623.572, 877.358, 1131.143, 1384.929, 1638.715, 1892.5, 2146.286, 2400.072, 2653.858, 2907.643, 3161.429, 3415.215, 3669
#>  AUC: 0.65051
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I.Brier: 0.1739
#> 
#> sPLS-DRCOX-Dynamic:
#>  training.time: 0.0067
#>  evaluating.time: 0.0028
#>  AIC: 320.6931
#>  c.index: 0.8187
#>  time: 116, 369.786, 623.572, 877.358, 1131.143, 1384.929, 1638.715, 1892.5, 2146.286, 2400.072, 2653.858, 2907.643, 3161.429, 3415.215, 3669
#>  AUC: 0.61159
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I.Brier: 0.19055
#> 
#> sPLS-DACOX-Dynamic:
#>  training.time: 0.006
#>  evaluating.time: 0.0026
#>  AIC: 342.8825
#>  c.index: 0.7181
#>  time: 116, 369.786, 623.572, 877.358, 1131.143, 1384.929, 1638.715, 1892.5, 2146.286, 2400.072, 2653.858, 2907.643, 3161.429, 3415.215, 3669
#>  AUC: 0.60048
#>  brier_time: 116, 186, 322, 377, 385, 454, 477, 489, 506, 518, 522, 606, 614, 616, 678, 703, 723, 747, 759, 825, 847, 904, 1015, 1048, 1150, 1174, 1189, 1233, 1275, 1324, 1508, 1528, 1611, 1694, 1742, 1793, 1935, 1972, 2632, 2854, 2866, 3001, 3472, 3669
#>  I.Brier: 0.1897
#> 

Plot comparison

However, we can also obtain graphical results where we can compare each method over time, as well as their average scores using the function “plot_evaluation” or “plot_evaluation.list” if multiple evaluators have been tested. The user could choose to plot the AUC for time prediction points or Brier Score. In case of use Brier Score, instead of uses the Integrative Brier Score for the boxplots, the mean or median is used (plot_evaluation parameter).

lst_eval_results <- plot_evaluation(eval_results, evaluation = "AUC", pred.attr = "mean")
lst_eval_results_BRIER <- plot_evaluation(eval_results, evaluation = "Brier", pred.attr = "mean")

After performing the cross-validation, we obtain a list in R that contains two new lists. The first of these refers to the evaluation over time for each of the methods used, as well as a variant where the average value of each of them is shown. On the other hand, we can compare the mean results of each method using: T-test, Wilcoxon-test, anova or Kruskal-Wallis.

lst_eval_results$lst_plots$lineplot.mean

lst_eval_results$lst_plot_comparisons$anova


# lst_eval_results$cenROC$lst_plots$lineplot.mean
# lst_eval_results$cenROC$lst_plot_comparisons$t.test

Computing time comparison (we use the cross validation models)

Another possible comparison is related to the computation times for cross-validation, model creation, and evaluation. In this case, cross-validations and methods are loaded.

lst_models_time <- list(#cv.coxen_res,
                        coxen_model,
                        #cv.splsicox_res,
                        splsicox_model,
                        #cv.splsdrcox_res,
                        splsdrcox_model,
                        #cv.splsdrcox_dynamic_res,
                        splsdrcox_dynamic_model,
                        #cv.splsdacox_dynamic_res,
                        splsdacox_dynamic_model, 
                        eval_results)
ggp_time <- plot_time.list(lst_models_time)
ggp_time

Individual interpretations of the results

Following the cross validation, we have selected the sPLS-DACOX methodology as the most suitable model for our data. We will now study and interpret its results based on the study variables or latent variables used. In this case, we will examine some graphs of the model.

Forest plots

A forest plot can be obtained as the first graph using the survminer R package. However, the function has been restructured to allow for simultaneous launch of an Coxmos class model or a list of Coxmos models using the plot_forest() or plot_forest.list() function.

#lst_forest_plot <- plot_forest.list(lst_models)
lst_forest_plot <- plot_forest(lst_models$`sPLS-DRCOX`)
#lst_forest_plot$`sPLS-DRCOX`
lst_forest_plot

PH Assumption

The following graph is related to one of the assumptions of the Cox models, called proportional hazard.

In a Cox proportional hazards model, the proportional hazards assumption states that the hazard ratio (the risk of experiencing the event of interest) is constant over time for a given set of predictor variables. This means that the effect of the predictors on the hazard ratio does not change over time. This assumption is important because it allows for the interpretation of the model’s coefficients as measures of the effect of the predictors on the hazard ratio. Violations of the proportional hazards assumption can occur if the effect of the predictors on the hazard ratio changes over time or if there is an interaction between the predictors and time. In these cases, the coefficients of the model may not accurately reflect the effect of the predictors on the hazard ratio and the results of the model may not be reliable.

In this way, to visualize and check if the assumption is violated, the function plot_proportionalHazard.list() or plot_proportionalHazard() can be called, depending on whether a list of models or a specific model is to be evaluated.

#lst_ph_ggplot <- plot_proportionalHazard.list(lst_models)
lst_ph_ggplot <- plot_proportionalHazard(lst_models$`sPLS-DRCOX`)

Variables or components with a significant P-Value indicate that the assumption is being violated.

#lst_ph_ggplot$`sPLS-DRCOX`
lst_ph_ggplot

Density plots

Another type of graph implemented for all models, whether they belong to the classical branch or to PLS-based models, is the visualization of observations by event according to the values predicted by the Cox models.

The R package “coxph” allows for several types of predictions to be made on a Cox model that we use in our function, which are:

According to the predicted value, we can classify the observations along their possible values and see their distribution for each of the different models.

#density.plots.lp <- plot_cox.event.list(lst_models, type = "lp")
density.plots.lp <- plot_cox.event(lst_models$`sPLS-DRCOX`, type = "lp")
# density.plots.lp$`sPLS-DRCOX`$plot.density
# density.plots.lp$`sPLS-DRCOX`$plot.histogram

density.plots.lp$plot.density

density.plots.lp$plot.histogram

Studying PLS model

For those models based on PLS components, the PLS could be studied in terms of loadings/scores. In order to get the plots, the function plot_PLS_Coxmos() has been developed where the user could specify to see “scores”, “loadings” or a “biplot” for a couple of latent variables. By default, if no factor is given, samples are grouped by event.

ggp_scores <- plot_PLS_Coxmos(model = lst_models$`sPLS-DRCOX`, 
                             comp = c(1,2), mode = "scores")
ggp_scores$plot

ggp_loadings <- plot_PLS_Coxmos(model = lst_models$`sPLS-DRCOX`, 
                               comp = c(1,2), mode = "loadings",
                               top = 10) #length from 0,0
ggp_loadings$plot

ggp_biplot <- plot_PLS_Coxmos(model = lst_models$`sPLS-DRCOX`, 
                             comp = c(1,2), mode = "biplot",
                             top = 15,
                             only_top = T,
                             overlaps = 20)
ggp_biplot$plot

Understanding the results in terms of the original variables (PLS models)

When a PLS-COX model is computed, the final survival model is based on the PLS latent variables. Although those new variables can explain the survival, in order to understand the disease, new coefficients could be computed in terms of the original variables.

Coxmos is able to transfer the component beta coefficient to original variables by decomposing the coefficients by using the weight of the variables in that latent variables.

Understanding the relative contribution per variable/component

However, before studying the original variables, if a PLS model is computed. Coxmos also proportionates plots to see how many % of AUC is computed per each component in order to see which components or latent variables are more related to the observation survival.

The % of AUC explanation per component could be calculated for TRAIN or TEST data. Train data was used to compute the best model, so the sum of variables/components will generate a better LP (linear predictor) performance. However, when test data is used, could happen that a specific variable or a component could be a better predictor than the full model.

variable_auc_results <- eval_Coxmos_model_per_variable(model = lst_models$`sPLS-DRCOX`, 
                                                       X_test = lst_models$`sPLS-DRCOX`$X_input, 
                                                       Y_test = lst_models$`sPLS-DRCOX`$Y_input,
                                                       pred.method = "cenROC", pred.attr = "mean",
                                                       times = NULL, max_time_points = 15,
                                                       PARALLEL = FALSE)

variable_auc_plot_train <- plot_evaluation(variable_auc_results, evaluation = "AUC")
variable_auc_plot_train$lst_plots$lineplot.mean

The plot shows the AUC for the full model (called LP), and then, the AUC per each variable (for classical methods) or components (for PLS methods).

Psudobeta

In order to improve the interpretability of a PLS model, a subset of the most influential variables can be selected. In this example, the top 20 variables are chosen. Additionally, non-significant PLS components are excluded by setting the “onlySig” parameter to “TRUE”.

# ggp.simulated_beta <- plot_pseudobeta.list(lst_models = lst_models, 
#                                            error.bar = T, onlySig = T, alpha = 0.05, 
#                                            zero.rm = T, auto.limits = T, top = 20,
#                                            show_percentage = T, size_percentage = 2, verbose = F)

ggp.simulated_beta <- plot_pseudobeta(model = lst_models$`sPLS-DRCOX`, 
                                      error.bar = T, onlySig = T, alpha = 0.05, 
                                      zero.rm = T, auto.limits = T, top = 20,
                                      show_percentage = T, size_percentage = 2)

The sPLS-ICOX model was computed using a total of 5 components. Although these components were classified as dangerous for the observations (based on coefficients greater than one), certain variables within the components may still have a protective effect, depending on their individual weights.

The following plot illustrates the pseudo-beta coefficient for the original variables in the sPSL-DRCOX-Dynamic model. As only the top 20 variables are shown, the plot represents only 25.07% of the total linear predictor total value. To view the complete model, all variables would need to be included in the plot by assigning the value NULL to the “top” parameter.

#ggp.simulated_beta$`sPLS-DRCOX`$plot
ggp.simulated_beta$plot

Kaplan-Meier

For a more intuitive understanding of the model, the user can also employ the getAutoKM.list() or getAutoKM() functions to generate Kaplan-Meier curves. These functions allow the user to view the KM curve for the entire model, a specific component of a PLS model, or for individual variables.

Full model

To run the full Kaplan-Meier model, the “type” parameter must be set to “LP” (linear predictors). This means that the linear predictor value for each patient will be used to divide the groups (in this case, the score value multiplied by the Cox coefficient). The surv_cutpoint() function from the R package “survminer” is used to determine the optimal cut-point. Other parameters are not used in this specific method.

# LST_KM_RES_LP <- getAutoKM.list(type = "LP",
#                                 lst_models = lst_models,
#                                 comp = 1:10,
#                                 top = 10,
#                                 ori_data = T,
#                                 BREAKTIME = NULL,
#                                 only_sig = T, alpha = 0.05)

LST_KM_RES_LP <- getAutoKM(type = "LP",
                           model = lst_models$`sPLS-DRCOX`,
                           comp = 1:10,
                           top = 10,
                           ori_data = T,
                           BREAKTIME = NULL,
                           only_sig = T, alpha = 0.05)

As a result, the Kaplan-Meier curve could be plotted.

#LST_KM_RES_LP$`sPLS-DRCOX`$LST_PLOTS$LP
LST_KM_RES_LP$LST_PLOTS$LP

After generating a Kaplan-Meier curve for the model, the cutoff value, which is used to divide the observations into two groups, can be used to evaluate how the test data is classified. The vector of cutoffs for multiple models, when getAutoKM.list() is applied, can be retrieved by using the function getCutoffAutoKM.list() and passing the output of getAutoKM.list() as a parameter.

Once the vector is obtained, the function getTestKM.list() or getTestKM() can be run with the list of models, the X test data, Y test data, the list of cutoffs or a single value, and the desired number of breaks for the new Kaplan-Meier plot.

A log-rank test will be displayed to determine if the chosen cutoff is an effective way to split the data into groups with higher and lower risk.

# lst_cutoff <- getCutoffAutoKM.list(LST_KM_RES_LP)
# LST_KM_TEST_LP <- getTestKM.list(lst_models = lst_models, 
#                                  X_test = X_test, Y_test = Y_test, 
#                                  type = "LP",
#                                  BREAKTIME = NULL, n.breaks = 20,
#                                  lst_cutoff = lst_cutoff)

lst_cutoff <- getCutoffAutoKM(LST_KM_RES_LP)
LST_KM_TEST_LP <- getTestKM(model = lst_models$`sPLS-DRCOX`, 
                            X_test = X_test, Y_test = Y_test, 
                            type = "LP",
                            BREAKTIME = NULL, n.breaks = 20,
                            cutoff = lst_cutoff)
#LST_KM_TEST_LP$`sPLS-DRCOX`
LST_KM_TEST_LP

Components

To generate a Kaplan-Meier curve for a specific component, the “type” parameter must be set to “COMP” (component). This means that the linear predictor is computed using only one component at a time to split the groups. In this case, the “comp” parameter can be used to specify which component should be computed (if multiple components, each one will be computed separately).

# LST_KM_RES_COMP <- getAutoKM.list(type = "COMP",
#                                   lst_models = lst_models,
#                                   comp = 1:10,
#                                   top = 10,
#                                   ori_data = T,
#                                   BREAKTIME = NULL,
#                                   n.breaks = 20,
#                                   only_sig = T, alpha = 0.05)

LST_KM_RES_COMP <- getAutoKM(type = "COMP",
                             model = lst_models$`sPLS-DRCOX`,
                             comp = 1:10,
                             top = 10,
                             ori_data = T,
                             BREAKTIME = NULL,
                             n.breaks = 20,
                             only_sig = T, alpha = 0.05)
# LST_KM_RES_COMP$`sPLS-DRCOX`$LST_PLOTS$comp_1
# LST_KM_RES_COMP$`sPLS-DRCOX`$LST_PLOTS$comp_2

LST_KM_RES_COMP$LST_PLOTS$comp_1

LST_KM_RES_COMP$LST_PLOTS$comp_2

# lst_cutoff <- getCutoffAutoKM.list(LST_KM_RES_COMP)
# LST_KM_TEST_COMP <- getTestKM.list(lst_models = lst_models, 
#                                    X_test = X_test, Y_test = Y_test, 
#                                    type = "COMP",
#                                    BREAKTIME = NULL, n.breaks = 20,
#                                    lst_cutoff = lst_cutoff)

lst_cutoff <- getCutoffAutoKM(LST_KM_RES_COMP)
LST_KM_TEST_COMP <- getTestKM(model = lst_models$`sPLS-DRCOX`, 
                              X_test = X_test, Y_test = Y_test, 
                              type = "COMP",
                              BREAKTIME = NULL, n.breaks = 20,
                              cutoff = lst_cutoff)
# all patients could be categorize into the same group
# LST_KM_TEST_COMP$`sPLS-DRCOX`$comp_1
# LST_KM_TEST_COMP$`sPLS-DRCOX`$comp_2

LST_KM_TEST_COMP$comp_1

LST_KM_TEST_COMP$comp_2

Original variables

To generate a Kaplan-Meier curve for original variables, the “type” parameter must be set to “VAR” (variable). In this case, the “ori_data” parameter can be used to determine whether the original or normalized values of the variables should be used.

Additionally, the “top” parameter can be used to plot a specific number of variables, sorted by P-Value (Log-Rank test). The best cutpoint is determined by the “surv_cutpoint” function. Both qualitative and quantitative variables are supported.

# LST_KM_RES_VAR <- getAutoKM.list(type = "VAR",
#                                  lst_models = lst_models,
#                                  comp = 1:10, #select how many components you want to compute for the pseudo beta
#                                  top = 10,
#                                  ori_data = T, #original data selected
#                                  BREAKTIME = NULL,
#                                  only_sig = T, alpha = 0.05)

LST_KM_RES_VAR <- getAutoKM(type = "VAR",
                            model = lst_models$`sPLS-DRCOX`,
                            comp = 1:10, #select how many components you want to compute for the pseudo beta
                            top = 10,
                            ori_data = T, #original data selected
                            BREAKTIME = NULL,
                            only_sig = T, alpha = 0.05)
# LST_KM_RES_VAR$`sPLS-DRCOX`$LST_PLOTS$`840`
# LST_KM_RES_VAR$`sPLS-DRCOX`$LST_PLOTS$`3897`

LST_KM_RES_VAR$LST_PLOTS$`840`

LST_KM_RES_VAR$LST_PLOTS$`3897`

# lst_cutoff <- getCutoffAutoKM.list(LST_KM_RES_VAR)
# LST_KM_TEST_VAR <- getTestKM.list(lst_models = lst_models, 
#                                   X_test = X_test, Y_test = Y_test, 
#                                   type = "VAR", ori_data = T,
#                                   BREAKTIME = NULL, n.breaks = 20,
#                                   lst_cutoff = lst_cutoff)

lst_cutoff <- getCutoffAutoKM(LST_KM_RES_VAR)
LST_KM_TEST_VAR <- getTestKM(model = lst_models$`sPLS-DRCOX`, 
                             X_test = X_test, Y_test = Y_test, 
                             type = "VAR", ori_data = T,
                             BREAKTIME = NULL, n.breaks = 20,
                             cutoff = lst_cutoff)
# LST_KM_TEST_VAR$`sPLS-DRCOX`$`840`
# LST_KM_TEST_VAR$`sPLS-DRCOX`$`3897`

LST_KM_TEST_VAR$`840`

LST_KM_TEST_VAR$`3897`

New patients

In addition, Coxmos can also manage new patients to perform predictions.

To demonstrate, an observation from a test dataset will be used.

new_pat <- X_test[1,,drop=F]

As shown, this is a censored patient who has the last observation at time 254.

knitr::kable(Y_test[rownames(new_pat),])
time event
TCGA-A2-A0SV-01A 825 TRUE

The function plot_pseudobeta_newObservation.list() or plot_pseudobeta_newObservation() allows the user to compare the characteristics of a new patient with the pseudo-beta values obtained from a specific model. The goal is to understand in a general way which variables are associated with an increased risk of an event or a decreased risk, in comparison to the variables predicted by the model.

# ggp.simulated_beta_newObs <- plot_pseudobeta_newObservation.list(lst_models = lst_models, 
#                                                              new_observation = new_pat,
#                                                              error.bar = T, onlySig = T, alpha = 0.05,
#                                                              zero.rm = T, auto.limits = T, show.betas = T, top = 20)

ggp.simulated_beta_newObs <- plot_pseudobeta_newObservation(model = lst_models$`sPLS-DRCOX`,
                                                        new_observation = new_pat,
                                                        error.bar = T, onlySig = T, alpha = 0.05,
                                                        zero.rm = T, auto.limits = T, show.betas = T, top = 20)

On the left, a linear predictor value is shown for the observation and each variable. On the right, the pseudo-beta coefficients for the model’s original variables are illustrated. This allows the user to compare the direction of the linear predictors per variable. A change in direction means that the variable’s value is below the mean for hazard variables or above the mean for protective ones. Having an opposite direction for hazard variables and maintaining the direction for protective variables makes the observation safer over time.

#ggp.simulated_beta_newObs$`sPLS-DRCOX`$plot

ggp.simulated_beta_newObs$plot