RegularizedSCA - an R package for regularized simultaneous component based data integration

Introduction

This R package is a collection of functions for performing multi-block data integration. Specifically, the functions in this package deal with concatenated data blocks that share the same observation units (i.e., rows; such as, subjects). For example, genomic data, denoted $$\mathbf{X}_{I \times J}$$, and behavioral data, $$\mathbf{X}_{I \times K}$$, with respect to the same persons can be concatenated $$\mathbf{X}_C=[\mathbf{X}_{I \times J}\; \mathbf{X}_{I \times K}]$$, and thus be jointly analyzed.

Notice that the common and distinctive processes (also refered to as common and distinctive components) are with respect to the estimated component loading matrix $$\mathbf{P}$$. As for the technical details, we refer to:

• Van Deun, K., Wilderjans, T.F., Van Den Berg, R.A., Antoniadis, A., & Van Mechelen, I. (2011). A flexible framework for sparse simultaneous component based data integration. BMC Bioinformatics. 12(1), 448.
• Gu, Z., & Van Deun, K. (2016). A variable selection method for simultaneous component based data integration. , 158, 187-199.

Data pre-processing

Raw data must be standardized (i.e., pre-processed) before analysis, and thus we provide a function mySTD(). The following paper provides a nice overview of how and why raw data should be pre-processed:

• Van Deun, K., Smilde, A.K., van der Werf, M.J., Kiers, H.A.L, & Mechelen, I.V. (2009). A structured overview of simultaneous component based data integration. BMC Bioinformatics, 10:246.

Identify common and distinctive components

Situation 1: No prior information on common and distinctive components

When no prior information is available, users may first try the function VAF() with various numbers of components. This function provides an overview of proportions of variance accounted for (VAF) for each component in each block. The idea is that we let VAF() do the analysis with an arbitrarily large number of components, say R*. Then the results of VAF() will show that only a smaller nuber is needed (i.e., R << R*), and thus we have found R. summary() is avaiable for summarizing the results of VAF(). For referenc e of VAF(), see:

• Schouteden, M., Van Deun, K., Wilderjans, T. F., & Van Mechelen, I. (2014). Performing DISCO-SCA to search for distinctive and common information in linked data. Behavior research methods, 46(2), 576-587.
• Schouteden, M., Van Deun, K., Pattyn, S., & Van Mechelen, I. (2013). SCA with rotation to distinguish common and distinctive information in linked data. Behavior research methods, 45(3), 822-833.

DISCOsca() tries all possible combinations of common and distinctive patterns in $$\mathbf{P}$$ matrix. Note that this algorithm utilizes a specific rule to determine common and distinctive processes: To put it simply, for each component across the entire data block, the algorithm calculates the distance among the sum of squares of the loadings per block (weighted by total variance of the block) to determine whether it is a common/distinctive component. summary() is avaiable for summarizing the results of DISCOsca(). For technical details, see

• Schouteden, M., Van Deun, K., Wilderjans, T. F., & Van Mechelen, I. (2014). Performing DISCO-SCA to search for distinctive and common information in linked data. Behavior research methods, 46(2), 576-587.

pca_gca() identifies common and distinctive components in a two-stage procedure: First perform principal component analysis on each data block, and then perform a canonical correlation analysis on the component loadings across all the data blocks. In case of more than 2 data blocks, users may need to repeatedly apply the function to see, for example, whether some of the blocks (but not all) share common components. For technical details, see

• Tenenhaus, A., & Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika, 76(2), 257-284.
• Smilde, A.K., Mage, I., Naes, T., Hankemeier, T., Lips, M.A., Kiers, H.A., Acar, E., & Bro, R. (2016). Common and distinct components in data fusion. arXiv preprint arXiv:1607.02328.

sparseSCA() is an algorithm for SCA models with a Lasso penalty and a Group Lasso penalty. The algorithm can identify common and distinctive components as long as the proper tuning parameters for Lasso and Group Lasso are chosen. Uers may use cv_sparseSCA() to find suitable values for the Lasso and Group Lasso tuning parameters and use plot() to check the cross-validation plot. maxLGlasso() helps to identy the max values for the Lasso and Group Lasso tuning parameters (that is, the smallest value for Lasso and Group Lasso penalties that generate $$\mathbf{P}=\mathbf{0}$$. Also noted that, sparseSCA() incorporates a multi-start procedure to deal with the local minima problem. summary() function is available for summarizing the results of cv_sparseSCA() and sparseSCA(). For technical details, see

• Friedman, J., Hastie, T., & Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736.
• Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.

Finally, the shrinkage of the non-zero loadings of the estimated $$\mathbf{P}$$ matrix can be undone by undoShrinkage(), and its results can be summarized by summary().

Situation 2: Prior information on common and distinctive components is available

This situation can happen, for example, when previous research has already provided some information on the common/distictive processes. In this case, researchers may want to specify a particular structure in the estimated $$\mathbf{P}$$ matrix; in particular, some elements in $$\mathbf{P}$$ are fixed to 0’s.

To this end, users may want to use the function structuredSCA() or cv_structuredSCA(). structuredSCA() allows for flexibly estimating $$\mathbf{P}$$ given a pre-defined structure in $$\mathbf{P}$$. The algorithm incorprates a Lasso penalty to achieve sparseness (and thus the Group Lasso is droped). To identify a suitable range of Lasso tuning parmaters, users are advised to try cv_structuredSCA(). summary() is avaiable for the summarizing the results of cv_structuredSCA() and structuredSCA(). Further plot() is available for showing the cross-validation plot for cv_structuredSCA(). For technical details, see

• Gu, Z., & Van Deun, K. (2016). A variable selection method for simultaneous component based data integration. , 158, 187-199.