1 Introduction

Weighted Quantile Sum (WQS) regression is a statistical approach for multivariate regression in high-dimensional data with complex correlation patterns commonly encountered in environmental exposures, epi/genomics, and metabolomic studies, among others. The model constructs an empirically weighted index estimating the mixture effect of predictor (e.g., exposure) variables on an outcome, which may then be used in a regression model with relevant covariates to test the association of the index with a dependent variable or outcome. The contribution of each individual predictor to the overall index effect may is assessed by the relative strength of the estimated weights since the components are quantiled and are therefore on the same scale.

The gWQS package extends WQS regression to applications with continuous and categorical outcomes and implements the random subset WQS and the repeated holdout WQS. In practical terms, the primary outputs of an analysis are the parameter estimates and significance tests for the overall index effect of predictor variables, and the estimated weights assigned to each predictor, which identify the relevant contribution of each variable to the relationship between the WQS index and the outcome variable.

For additional theoretical background on WQS regression and its extensions, see the references provided below.

2 How to use the gWQS package

The main functions of the gWQS package are gwqs and gwqs_multinom. The first extends WQS regression to applications with continuous, categorical and count outcomes; the second extends WQS regression to applications with categorical outcomes with more than 2 non-ordinal categories while both functions include the option rs that allows to apply a random subset implementation of WQS and the argument rh that if set greater than 1 implements a repeated holdout validation procedure. In this vignette we will only show the application of WQS to a continuous outcome. We created the wqs_data dataset (available once the package is installed and loaded) to show how to use this function. These data reflect 59 exposure concentrations simulated from a distribution of 34 PCB exposures and 25 phthalate biomarkers measured in subjects participating in the NHANES study (2001-2002). Additionally, 8 outcome measures were simulated applying different distributions and fixed beta coefficients to the predictors. In particular y and yLBX were simulated from a normal distribution, ybin and ybinLBX from a binomial distribution, ymultinom and ymultinomLBX from a multinomial distribution and ycount and ycountLBX from a Poisson distribution. The sex variable was also simulated to allow to adjust for a covariate in the model (see the wqs_data help page for more details). This dataset can thus be used to test the gWQS package by analyzing the mixture effect of the simulated chemicals on the different outcomes, with adjustments for covariates.

2.1 Example

2.1.1 Step 1

Following the algorithm proposed in Renzetti et al. (2023) we start fitting a WQS regression with two indices, one exploring the positive and the second exploring the negative direction specifying the terms pwqs and nwqs, respectively. We also opted for the repeated holdout validation procedure to get more stable results. The following script calls a WQS model for a continuous outcome using the function gwqs that returns an object of class gwqs; the three functions gwqs_barplot, gwqs_scatterplot and gwqs_fitted_vs_resid allows to plot the figures shown in figure 2.1:

library(gWQS)
## Welcome to Weighted Quantile Sum (WQS) Regression.
## If you are using a Mac you have to install XQuartz.
## You can download it from: https://www.xquartz.org/
library(ggplot2)
library(knitr)
library(kableExtra)
library(reshape2)

# we save the names of the mixture variables in the variable "PCBs"
PCBs <- names(wqs_data)[1:34]
# we run the model and save the results in the variable "results2i"
results2i <- gwqs(yLBX ~ pwqs + nwqs, mix_name = PCBs, data = wqs_data, 
                  q = 10, validation = 0.6, b = 5, b1_pos = TRUE, rh = 5,
                  family = "gaussian", seed = 2016)

This WQS model tests the relationship between our dependent variable, y, and a WQS index estimated from ranking exposure concentrations in deciles (q = 10); in the gwqs formula the wqs (when we want to build a single index) or the pwqs and nwqs (when we want to build a double index) terms must be included as if they were present in the dataset. The data were divided in 40% of the dataset for training and 60% for validation (validation = 0.6) and we repeated this split procedure 5 times (rh = 5), and 5 bootstrap samples (b = 5) for parameter estimation were assigned (in practical applications we suggest at least 100 repeated holdout validation and 100 bootstrap samples to be used). We first examined a bidirectional association including both the positive (pwqs) and negative (nwqs) indices. We linked our model to a gaussian distribution to test for relationships between the continuous outcome and exposures (family = "gaussian", other families available within the gwqs function are "binomial", "poisson", "quasipoisson" and "negbin" while the function gwqs_multinom allows to fit a WQS regression for multinomial outcomes), and fixed the seed to 2016 for reproducible results (seed = 2016).

To test the statistical significance of the association between the variables in the model, the following code has to be run as for a classical R regression function:

summary(results2i)
## 
## Call:
## gwqs(formula = yLBX ~ pwqs + nwqs, data = wqs_data, mix_name = PCBs, 
##     rh = 5, b = 5, b1_pos = TRUE, q = 10, validation = 0.6, family = "gaussian", 
##     seed = 2016)
## 
## 
## Coefficients:
##             Estimate Std. Error    2.5 % 97.5 %
## (Intercept) -4.58713    0.32556 -5.22523 -3.949
## pwqs         1.10822    0.10478  0.90286  1.314
## nwqs        -0.06928    0.05128 -0.16979  0.031
## 
## (Mean dispersion parameter for gaussian family taken to be 1.194416)
## 
##     Mean null deviance: 633.40  on 301  degrees of freedom
## Mean residual deviance: 352.68  on 299  degrees of freedom
## 
## Mean AIC:  902.91

From this output we can observe a statistically significant association in the positive (\(\beta\) = 1.108 95%CI 0.903, 1.314) but not in the negative direction (\(\beta\) = -0.069 95%CI -0.170, 0.031). We can also draw some useful plots using the following secondary functions built in the gWQS package:

# bar plot
gwqs_barplot(results2i)
# scatter plot y vs wqs
gwqs_scatterplot(results2i)
# scatter plot residuals vs fitted values
gwqs_fitted_vs_resid(results2i)
# boxplot of the weights estimated at each repeated holdout step
gwqs_boxplot(results2i)