The *standardize* package provides tools for controlling continuous variable scaling and factor contrasts. The goal of these standardizations is to keep the regression parameters on similar scales, and to ensure that the intercept (which is the predicted value of an observation when all other coefficients are multiplied by 0) represents the corrected mean (i.e. the predicted value for an observation which is average in every way, holding covariates at their mean values and averaging over group differences in factors). When the predictors are all on a similar scale, there are computational benefits for both frequentist and Bayesian approaches in mixed effects regressions, reasonable Bayesian priors are easier to specify, and regression output is easier to interpret.

Throughout this vignette, we will use the **ptk** dataset to demonstrate the use of the *standardize* package. A summary of the data can be seen below. We will first discuss scaling continuous variables with the **scale** function from base *R*, and with the **scale_by** function from *standardize*. Then we’ll discuss contrasts for unordered and ordered factors with **named_sum_contr** and **scaled_contr_poly** (respectively). Finally, we will use the **standardize** function to quickly use all of these tools at once.

```
library(standardize)
#>
#> ***********************************************************
#> Loading standardize package version 0.2.2
#> Call standardize.news() to see new features/changes
#> ***********************************************************
summary(ptk)
#> cdur vdur place stress prevowel
#> Min. : 61.0 Min. : 0.00 Bilabial:228 Post-Tonic:250 a:217
#> 1st Qu.:117.0 1st Qu.: 49.00 Dental :275 Tonic :252 e:169
#> Median :133.0 Median : 62.00 Velar :248 Unstressed:249 i:146
#> Mean :137.5 Mean : 60.69 o:151
#> 3rd Qu.:155.0 3rd Qu.: 77.00 u: 68
#> Max. :281.0 Max. :190.00
#>
#> posvowel wordpos wordfreq speechrate sex
#> a:212 Initial:322 Min. : 1 Min. : 2.000 Female:384
#> e:144 Medial :429 1st Qu.: 1104 1st Qu.: 5.000 Male :367
#> i: 86 Median : 4899 Median : 6.000
#> o:255 Mean : 18840 Mean : 5.812
#> u: 54 3rd Qu.: 26610 3rd Qu.: 7.000
#> Max. :158168 Max. :10.000
#>
#> speaker
#> s02 : 45
#> s03 : 45
#> s04 : 45
#> s07 : 45
#> s11 : 45
#> s16 : 44
#> (Other):482
```

Continuous variables include covariates (i.e. fixed effects which take on continuous values) and the dependent variable in linear regression. The **scale** function in base *R*, with its default arguments, places continuous variables on unit scale by subtracting the mean of the variable and dividing the result by the variable’s standard deviation (also sometimes called z-scoring or simply scaling). The result is that the values in the transformed variable have the same relationship to one another as in the untransformed variable, but the transformed variable has mean 0 and standard deviation 1. As an example, consider the simple linear regression on total consonant duration *cdur* with *speechrate* as the predictor when we don’t scale either variable:

```
mean(ptk$cdur)
#> [1] 137.5473
sd(ptk$cdur)
#> [1] 30.074
mean(ptk$speechrate)
#> [1] 5.811869
sd(ptk$speechrate)
#> [1] 1.294624
summary(lm(cdur ~ speechrate, ptk))
#>
#> Call:
#> lm(formula = cdur ~ speechrate, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -75.064 -19.688 -4.486 16.514 123.142
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 186.7623 4.7061 39.69 <2e-16 ***
#> speechrate -8.4680 0.7904 -10.71 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 28.02 on 749 degrees of freedom
#> Multiple R-squared: 0.1329, Adjusted R-squared: 0.1317
#> F-statistic: 114.8 on 1 and 749 DF, p-value: < 2.2e-16
```

In the output for the regression on the raw variables, the intercept is 186.8 ms. This is the predicted value when all of the other coefficients are multiplied by 0. In this case, this does not describe anything interpretable, since from the data summary we can see that the minimum speech rate is 2 nuclei per second (and a value of 0 nuclei per second is not possible). The coefficient for speech rate (-8.5) means that for each increase of 1 nucleus per second in speechrate, there is an expected decrease in total consonant duration of 8.5 ms. But is an increase of 1 nucleus per second a large increase? And is a decrease in 8.5 ms a large decrease? This information is not in the output. When we scale *cdur* and *speechrate* so that they both have mean 0 and standard deviation 1, the output becomes easier to interpret:

```
ptk$cdur_scaled <- scale(ptk$cdur)[, 1]
ptk$sr_scaled <- scale(ptk$speechrate)[, 1]
mean(ptk$cdur_scaled)
#> [1] -1.539605e-16
sd(ptk$cdur_scaled)
#> [1] 1
mean(ptk$sr_scaled)
#> [1] -1.573508e-16
sd(ptk$sr_scaled)
#> [1] 1
summary(lm(cdur_scaled ~ sr_scaled, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ sr_scaled, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.4960 -0.6547 -0.1492 0.5491 4.0946
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2.164e-16 3.400e-02 0.00 1
#> sr_scaled -3.645e-01 3.402e-02 -10.71 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9318 on 749 degrees of freedom
#> Multiple R-squared: 0.1329, Adjusted R-squared: 0.1317
#> F-statistic: 114.8 on 1 and 749 DF, p-value: < 2.2e-16
```

Looking at the output of the regression on the scaled variables, we see that the R-squared value and F-statistic are unchanged, but now the intercept term is 0 and the estimate for the effect of scaled speech rate is -0.36. In this case, a value of 0 for scaled speech rate indicates the average value, and so the intercept represents the predicted consonant duration on unit scale at an average speech rate. The effect of scaled speech rate now represents the expected change in standard deviations of consonant duration for a 1-standard-deviation increase in speech rate; that is, for a 1-SD increase in speech rate, we expect a 0.36-SD decrease in consonant duration, which we could call an effect of moderate magnitude. When we add more covariates into the model, so long as they are all scaled prior to regression, we can directly compare the effect sizes of the different coefficients with no further math required, since they all represent the expected change in dependent variable standard deviations for a 1-SD increase in the covariate.

In addition to the **scale** function in base *R*, the *standardize* package has the function **scale_by**, which allows a continuous variable to be placed on unit scale within each level of a factor (or the interaction of several factors). For example, say that we are interested in whether or not a speaker’s *relative* speech rate affects their total consonant durations rather than speech rate in general. In other words, some speakers may simply speak more quickly or more slowly in general, and some may exhibit more speech rate variation than others, and we are interested in modeling speech rate relative to the speaker’s tendencies. In this case, we can use the **scale_by** function:

```
ptk$sr_scaled_by_speaker <- scale_by(speechrate ~ speaker, ptk)
mean(ptk$sr_scaled_by_speaker)
#> [1] -3.607564e-17
sd(ptk$sr_scaled_by_speaker)
#> [1] 0.9886017
with(ptk, tapply(speechrate, speaker, mean))
#> s01 s02 s03 s04 s05 s06 s07 s08
#> 5.540706 5.650011 5.773467 6.049278 6.005344 5.619478 6.516664 5.802228
#> s09 s10 s11 s12 s13 s14 s15 s16
#> 5.855702 5.381143 5.702743 5.483161 6.290449 6.353387 5.358924 5.769699
#> s17 s18
#> 5.542354 5.777147
with(ptk, tapply(speechrate, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08
#> 1.084740 1.230526 1.136553 1.261346 1.294189 1.339326 1.242599 1.350230
#> s09 s10 s11 s12 s13 s14 s15 s16
#> 1.472762 1.184024 1.146887 1.160242 1.437768 1.145203 1.432073 1.254818
#> s17 s18
#> 1.357880 1.250270
with(ptk, tapply(sr_scaled, speaker, mean))
#> s01 s02 s03 s04 s05 s06
#> -0.209453180 -0.125023252 -0.029663172 0.183380177 0.149444392 -0.148607756
#> s07 s08 s09 s10 s11 s12
#> 0.544401163 -0.007446973 0.033857642 -0.332704012 -0.084291855 -0.253902532
#> s13 s14 s15 s16 s17 s18
#> 0.369666620 0.418281977 -0.349865920 -0.032573278 -0.208180769 -0.026820301
with(ptk, tapply(sr_scaled, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08
#> 0.8378801 0.9504890 0.8779020 0.9742948 0.9996641 1.0345287 0.9598147 1.0429513
#> s09 s10 s11 s12 s13 s14 s15 s16
#> 1.1375979 0.9145695 0.8858839 0.8961996 1.1105678 0.8845834 1.1061685 0.9692528
#> s17 s18
#> 1.0488603 0.9657396
with(ptk, tapply(sr_scaled_by_speaker, speaker, mean))
#> s01 s02 s03 s04 s05
#> -2.960268e-16 3.165268e-18 -3.119346e-16 1.004790e-16 2.183358e-16
#> s06 s07 s08 s09 s10
#> -7.174906e-17 -1.953684e-16 -2.200950e-16 1.468327e-16 -1.266348e-16
#> s11 s12 s13 s14 s15
#> -2.923792e-16 3.850162e-16 1.980309e-16 6.325185e-17 1.427623e-16
#> s16 s17 s18
#> -2.220446e-16 1.582818e-16 -2.505491e-16
with(ptk, tapply(sr_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
```

The overall mean of *sr_scaled_by_speaker* is still 0, and while the standard deviation is not exactly 1, it is very close at 0.99. When we look at the raw *speechrate* variable by speaker, we can see that speakers differ somewhat in their mean speech rate and in the amount that the deviate from their own mean speech rate. When we look at the *sr_scaled* variable that we created earlier with the base *R* **scale** function, these same differences persist, but now expressed on unit scale in terms of the overall dataset. For *sr_scaled_by_speaker*, however, each speaker’s mean value is 0 and each speaker’s standard deviation is 1, and the variable is interpreted as *how slowly or quickly the speaker was talking given their own speech rate tendencies*.

If we wanted the variables resulting from **scale** and **scale_by** to have a different standard deviation than 1, this can be easily accomplished in the following way:

```
ptk$sr_scaled_0.5 <- scale(ptk$speechrate) * 0.5
ptk$sr_scaled_by_speaker_0.5 <- scale_by(speechrate ~ speaker, ptk, scale = 0.5)
mean(ptk$sr_scaled_0.5)
#> [1] -7.867542e-17
sd(ptk$sr_scaled_0.5)
#> [1] 0.5
with(ptk, tapply(sr_scaled_by_speaker_0.5, speaker, mean))
#> s01 s02 s03 s04 s05
#> -1.480134e-16 1.582634e-18 -1.559673e-16 5.023952e-17 1.091679e-16
#> s06 s07 s08 s09 s10
#> -3.587453e-17 -9.768421e-17 -1.100475e-16 7.341635e-17 -6.331741e-17
#> s11 s12 s13 s14 s15
#> -1.461896e-16 1.925081e-16 9.901544e-17 3.162593e-17 7.138116e-17
#> s16 s17 s18
#> -1.110223e-16 7.914090e-17 -1.252746e-16
with(ptk, tapply(sr_scaled_by_speaker_0.5, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
```

Factors are variables which take on a defined set of categorical values called levels rather than continuous values. In regression, a factor with *K* levels is modeled through the use of *K - 1* dummy variables. Each level of the factor is assigned a value for each dummy variable based on a contrast matrix. So, for example, a factor with four levels has a contrast matrix with four rows (one for each level) and three columns (one for each dummy variable), with the values in the cells of the matrix determining the numerical expression for the factor levels in the dummy variables. There are two general types of factors, ordered and unordered, whose contrasts are treated differently.

Unordered factors take on two or more categorical values which are not intrinsically ordered (or have a somewhat ordered interpretation but there are only two categories, as is sometimes the case with factors coded as false vs. true, 0 vs. 1, or no vs. yes). For unordered factors, the default in *R* is to use treatment contrasts, where the first level is coded as 0 for all of the dummy variables, and the remaining levels each have a dummy variable for which they are coded +1, and are then coded as 0 for the other dummy variables, as can be seen in the following example using *prevowel* (the preceding vowel’s phoneme identity).

```
options(contrasts = c("contr.treatment", "contr.poly"))
contrasts(ptk$prevowel)
#> e i o u
#> a 0 0 0 0
#> e 1 0 0 0
#> i 0 1 0 0
#> o 0 0 1 0
#> u 0 0 0 1
summary(lm(cdur_scaled ~ prevowel, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3844 -0.6886 -0.1139 0.5714 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.06240 0.06765 0.922 0.3566
#> prevowele -0.11995 0.10224 -1.173 0.2411
#> prevoweli 0.09704 0.10667 0.910 0.3632
#> prevowelo -0.22329 0.10561 -2.114 0.0348 *
#> prevowelu -0.10358 0.13850 -0.748 0.4547
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared: 0.01219, Adjusted R-squared: 0.006891
#> F-statistic: 2.301 on 4 and 746 DF, p-value: 0.05722
```

With treatment contrasts, the intercept loses the interpretation of the corrected mean, since when all of the dummy variables in the contrast matrix above are multiplied by 0, the resulting value corresponds to a preceding /a/. The coefficients *prevowele … prevowelu* then represent the difference between each of the other preceding vowel categories and /a/. To avoid this (and to ensure that the coefficients for the dummy variables stay, on average, closer to zero, but without altering the ultimate interpretation of the results), sum contrasts are used. With sum contrasts in *R*, the first *K - 1* levels each get a dummy variable for which they are coded +1, and then are valued 0 for the other dummy variables. The last level is assigned a value of -1 for all of the dummy variables. Sum contrasts also have additional computational benefits in comparison to treatment contrasts for similar reasons as covariate scaling. Recoding the contrasts for *prevowel* with sum contrasts, we get:

```
options(contrasts = c("contr.sum", "contr.poly"))
contrasts(ptk$prevowel)
#> [,1] [,2] [,3] [,4]
#> a 1 0 0 0
#> e 0 1 0 0
#> i 0 0 1 0
#> o 0 0 0 1
#> u -1 -1 -1 -1
summary(lm(cdur_scaled ~ prevowel, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3844 -0.6886 -0.1139 0.5714 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.007554 0.039210 -0.193 0.8473
#> prevowel1 0.069957 0.065448 1.069 0.2855
#> prevowel2 -0.049994 0.071157 -0.703 0.4825
#> prevowel3 0.167001 0.074958 2.228 0.0262 *
#> prevowel4 -0.153338 0.074051 -2.071 0.0387 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared: 0.01219, Adjusted R-squared: 0.006891
#> F-statistic: 2.301 on 4 and 746 DF, p-value: 0.05722
```

With sum contrasts, the intercept maintains the interpretation of the corrected mean, since when all of the dummy variable coefficients are multiplied by 0, it averages over their effects (note that no row in the contrast matrix above has all 0’s, and thus multiplying all of the coefficients by 0 cannot describe any one level; rather, the mean of the values in each column is 0, and so multiplying all of the dummy variable coefficients by 0 averages over their effects). However, one downside to the default implementation is that the coefficients are numbered rather than named. The **named_contr_sum** function from the *standardize* package orders levels alphabetically, applies sum contrasts to them, and names the contrasts:

```
contrasts(ptk$prevowel) <- named_contr_sum(ptk$prevowel)
contrasts(ptk$prevowel)
#> a e i o
#> a 1 0 0 0
#> e 0 1 0 0
#> i 0 0 1 0
#> o 0 0 0 1
#> u -1 -1 -1 -1
summary(lm(cdur_scaled ~ prevowel, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ prevowel, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.3844 -0.6886 -0.1139 0.5714 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.007554 0.039210 -0.193 0.8473
#> prevowela 0.069957 0.065448 1.069 0.2855
#> prevowele -0.049994 0.071157 -0.703 0.4825
#> prevoweli 0.167001 0.074958 2.228 0.0262 *
#> prevowelo -0.153338 0.074051 -2.071 0.0387 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9965 on 746 degrees of freedom
#> Multiple R-squared: 0.01219, Adjusted R-squared: 0.006891
#> F-statistic: 2.301 on 4 and 746 DF, p-value: 0.05722
```

Note that the named output is identical in all other ways to the number output in this case. The intercept is the corrected mean, *prevowela … prevowelo* represent the difference between the named category and the corrected mean, and the /u/ category is obtained by subtracting all four the *prevowel* coefficients from the intercept.

The **named_contr_sum** function also accepts a *scale* argument by which the entire contrast matrix is multiplied. This allows the contrast matrix deviation magnitude to be defined. For example:

```
contrasts(ptk$prevowel) <- named_contr_sum(ptk$prevowel, scale = 0.5)
contrasts(ptk$prevowel)
#> a e i o
#> a 0.5 0.0 0.0 0.0
#> e 0.0 0.5 0.0 0.0
#> i 0.0 0.0 0.5 0.0
#> o 0.0 0.0 0.0 0.5
#> u -0.5 -0.5 -0.5 -0.5
```

One last note on **named_contr_sum** is that if there are only two levels and they are equal to (ignoring case) “F” and “T”, “FALSE” and “TRUE”, “N”, and “Y”, “NO”, and “YES”, or “0” and “1”, then their order is reversed. This makes it so the positive level gets the dummy coefficient rather than the negative level, yielding a more intuitive interpretation for the resulting coefficients. For example, if we were interested in differentiating only between high vowels (/i/ and /u/) and non-high vowels (/e/, /o/, and /a/), we could do the following (note also that *return_contr* can be set to FALSE to create a factor with the contrasts, which is useful when the original variable is not a factor):

```
ptk$prehigh <- ptk$prevowel %in% c("i", "u")
class(ptk$prehigh)
#> [1] "logical"
unique(ptk$prehigh)
#> [1] FALSE TRUE
ptk$prehigh <- named_contr_sum(ptk$prehigh, return_contr = FALSE)
class(ptk$prehigh)
#> [1] "factor"
levels(ptk$prehigh)
#> [1] "TRUE" "FALSE"
contrasts(ptk$prehigh)
#> TRUE
#> TRUE 1
#> FALSE -1
```

Ordered factors are variables which take on more than two categorical values, with the categories representing a hierarchy for which we may not expect the trend to be strictly linear. For example, say we are interested in the effect of preceding vowel height, considering three levels (/a/ = Low, /e o/ = Mid, /i u/ = High). We might hypothesize that the general trend is for relatively higher vowels to have shorter durations than relatively lower vowels, but at the same time not expect that the difference between *Low* and *Mid* is the same as the difference between *Mid* and *High*. In this case we could create an ordered factor *preheight* with levels *Low < Mid < High*. In *R*, the default contrasts for ordered factors are orthogonal polynomial contrasts. That is, if there are *K* factor levels, then the contrast matrix is an orthogonal polynomial of degree *K - 1*, where the first contrast column is the linear component, the second the quadratic, the third the cubic, the fourth the fourth power, etc. until the *K - 1*’th power is reached. Our hypothesis that the trend is in general negative then would be supported by a negative linear trend. This is exemplified in the following code:

```
ptk$preheight <- "Mid"
ptk$preheight[ptk$prevowel == "a"] <- "Low"
ptk$preheight[ptk$prevowel %in% c("i", "u")] <- "High"
ptk$preheight <- factor(ptk$preheight, ordered = TRUE, levels = c("Low",
"Mid", "High"))
head(ptk$preheight)
#> [1] Low High High High Mid High
#> Levels: Low < Mid < High
contrasts(ptk$preheight)
#> .L .Q
#> [1,] -7.071068e-01 0.4082483
#> [2,] -9.073800e-17 -0.8164966
#> [3,] 7.071068e-01 0.4082483
```

However, the scale of the contrast matrix columns depends on the number of levels. The mean of each contrast column is always 0, and the columns of the contrast matrix are completely uncorrelated, but the standard deviations of the contrast matrix columns decreases as the number of levels increases:

```
contr3 <- contr.poly(3)
contr5 <- contr.poly(5)
apply(contr3, 2, mean)
#> .L .Q
#> -6.723860e-17 3.700743e-17
apply(contr5, 2, mean)
#> .L .Q .C ^4
#> -6.591949e-18 2.219362e-17 4.148700e-17 2.783689e-18
apply(contr3, 2, sd)
#> .L .Q
#> 0.7071068 0.7071068
apply(contr5, 2, sd)
#> .L .Q .C ^4
#> 0.5 0.5 0.5 0.5
```

For this reason, the *standardize* package has a function **scaled_contr_poly** where the standard deviation of the contrast matrix columns for ordered factors can be specified through its **scale** argument (with default 1):

```
sc_1_contr3 <- scaled_contr_poly(3)
sc_0.5_contr3 <- scaled_contr_poly(3, scale = 0.5)
sc_1_contr3
#> .L .Q
#> [1,] -1.00000e+00 0.5773503
#> [2,] -3.32076e-17 -1.1547005
#> [3,] 1.00000e+00 0.5773503
apply(sc_1_contr3, 2, sd)
#> .L .Q
#> 1 1
sc_0.5_contr3
#> .L .Q
#> [1,] -5.00000e-01 0.2886751
#> [2,] -1.66038e-17 -0.5773503
#> [3,] 5.00000e-01 0.2886751
apply(sc_0.5_contr3, 2, sd)
#> .L .Q
#> 0.5 0.5
```

The resulting contrasts are still orthogonal polynomial contrasts; they are simply placed on a uniform scale regardless of the number of factor levels. This affects the magnitude of the resulting coefficients, but, as with the standardization of unordered factors and continuous variables discussed above, it does not alter the significance of the variable:

```
contrasts(ptk$preheight)
#> .L .Q
#> [1,] -7.071068e-01 0.4082483
#> [2,] -9.073800e-17 -0.8164966
#> [3,] 7.071068e-01 0.4082483
summary(lm(cdur_scaled ~ preheight, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ preheight, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.4390 -0.7099 -0.1139 0.5536 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.01726 0.03702 0.466 0.641
#> preheight.L 0.02354 0.06792 0.347 0.729
#> preheight.Q 0.15135 0.06007 2.519 0.012 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.997 on 748 degrees of freedom
#> Multiple R-squared: 0.008562, Adjusted R-squared: 0.005911
#> F-statistic: 3.23 on 2 and 748 DF, p-value: 0.04011
contrasts(ptk$preheight) <- scaled_contr_poly(ptk$preheight)
contrasts(ptk$preheight)
#> .L .Q
#> Low -1.00000e+00 0.5773503
#> Mid -3.32076e-17 -1.1547005
#> High 1.00000e+00 0.5773503
summary(lm(cdur_scaled ~ preheight, ptk))
#>
#> Call:
#> lm(formula = cdur_scaled ~ preheight, data = ptk)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.4390 -0.7099 -0.1139 0.5536 4.7076
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.01726 0.03702 0.466 0.641
#> preheight.L 0.01665 0.04803 0.347 0.729
#> preheight.Q 0.10702 0.04248 2.519 0.012 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.997 on 748 degrees of freedom
#> Multiple R-squared: 0.008562, Adjusted R-squared: 0.005911
#> F-statistic: 3.23 on 2 and 748 DF, p-value: 0.04011
```

The **standardize** function implements **scale**, **scale_by**, **named_contr_sum**, and **scaled_contr_poly** automatically, allowing regressions to be easily fit in a standardized space. It takes the following arguments:

- formula: A regression formula, possibly containing random effects and/or offsets.
- data: A data.frame containing the variables in the formula.
- family: A regression family (defaults to gaussian)
- scale: The desired scale for the predictors (defaults to 1).
- offset: An optional vector of offsets.

The function first calls **model.frame**. If **family** is gaussian (i.e. a linear model), then the response (i.e. dependent variable) is placed on unit scale (mean 0 and standard deviation 1) regardless of what the **scale** argument to **standardize** is; if the response contains a call to **scale_by**, then it is placed on unit scale within each level of the conditioning factor. For offsets (again, if **family* is gaussian), the values are divided by the standard deviation of the response variable prior to scaling (within-factor-level if **scale_by** is used on the response). Then, for all values of **family**, random effects groups (if any) are coerced to unordered factors, and any other predictors which are characters or which contain only two unique non-NA values (regardless of their class) are converted to unordered factors and assigned contrasts with **named_contr_sum**, passing along the **scale** argument. Ordered factors are assigned contrasts with **scaled_contr_poly**, passing along the **scale** argument. Continuous predictors which contain a call to **scale_by** are re-calculated passing along the **scale** argument. Finally, continuous predictors which do not contain a call to **scale_by** are scaled using the **scale** function, ensuring that the resulting variable has standard deviation equal to the **scale** argument to **standardize**.

The column names of the model frame are then renamed so that they are valid variable names, and the formula passed to **standardize** is updated with these variable names. A list with class **standardized** is then returned with the following elements:

- call: The call to
**standardize**which created the object. - scale: The
**scale**argument to**standardize**. - formula: The regression formula in standardized space (with new names) which can be used along with the
**data**element to fit regressions. - family: The regression family.
- data: A data frame containing the regression variables in the standardized space (renamed to have valid variable names corresponding to those in the
**formula**element). - pred: A list containing unevaluated function calls that allows the predict method to work.
- offset: If the
**offset**argument to**standardize**was used, then it is stored in the standardized object’s offset element (and will be scaled as described above for linear regression). - variables: A data frame with the name of the original variable, the corresponding name in the standardized data frame and formula, and the class of the variable in the standardized data frame.
- contrasts: A named list of contrasts for all factors in the standardized data frame (or NULL if the regression contains no factors).
- groups: A named list of levels for random effects grouping factors (or NULL if the regression contains no random effects).

To illustrate the use of the **standardize** function, we will fit a linear mixed effects regression with **lmer** in the *lme4* package with *cdur* as the response, *place*, *stress*, *preheight*, the natural log of *wordfreq*, and *speecrhate* scaled by *speaker* as fixed effects, and random intercepts for *speaker*. We begin by creating *preheight* and then calling **standardize**:

```
ptk$preheight <- "Mid"
ptk$preheight[ptk$prevowel == "a"] <- "Low"
ptk$preheight[ptk$prevowel %in% c("i", "u")] <- "High"
ptk$preheight <- factor(ptk$preheight, ordered = TRUE, levels = c("Low",
"Mid", "High"))
sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
scale_by(speechrate ~ speaker) + (1 | speaker), ptk)
```

Now lets examine the *sobj* object created by **standardize**:

```
is.standardized(sobj)
#> [1] TRUE
sobj
#>
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) +
#> scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> (1 | speaker)
#>
#> Variables:
#> Variable Standardized Name Class
#> cdur cdur response.numeric
#> place place factor
#> stress stress factor
#> preheight preheight ordered
#> log(wordfreq) log_wordfreq numeric
#> scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker scaledby
#> speaker speaker group
#>
#> Response has mean 0 and standard deviation 1.
#>
#> Continuous variables have mean 0 and standard deviation 1
#> (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 1
#> Ordered factors have orthogonal polynomial contrasts whose
#> columns have standard deviation 1
#> Grouping factors are coded as unordered factors with default contrasts
names(sobj)
#> [1] "call" "scale" "formula" "family" "data" "offset"
#> [7] "pred" "variables" "contrasts" "groups"
head(sobj$data)
#> cdur place stress preheight log_wordfreq
#> 1 0.58032620 Velar Unstressed Low -1.64753617
#> 2 0.21456174 Velar Post-Tonic High -1.64753617
#> 3 -0.11795140 Dental Unstressed High -0.05570344
#> 4 0.01505386 Dental Post-Tonic High -1.90076049
#> 5 0.91283934 Velar Tonic Mid -0.19203879
#> 6 1.51136300 Dental Post-Tonic High -0.22152148
#> speechrate_scaled_by_speaker speaker
#> 1 -0.4984662 s01
#> 2 1.3452937 s01
#> 3 -0.4984662 s01
#> 4 0.6454974 s01
#> 5 2.2671736 s01
#> 6 -2.3422261 s01
mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1
mean(sobj$data$log_wordfreq)
#> [1] 1.550764e-16
sd(sobj$data$log_wordfreq)
#> [1] 1
all.equal(scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#> s01 s02 s03 s04 s05
#> -2.960268e-16 3.165268e-18 -3.119346e-16 1.004790e-16 2.183358e-16
#> s06 s07 s08 s09 s10
#> -7.174906e-17 -1.953684e-16 -2.200950e-16 1.468327e-16 -1.266348e-16
#> s11 s12 s13 s14 s15
#> -2.923792e-16 3.850162e-16 1.980309e-16 6.325185e-17 1.427623e-16
#> s16 s17 s18
#> -2.220446e-16 1.582818e-16 -2.505491e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
sobj$contrasts
#> $place
#> Bilabial Dental
#> Bilabial 1 0
#> Dental 0 1
#> Velar -1 -1
#>
#> $stress
#> Post-Tonic Tonic
#> Post-Tonic 1 0
#> Tonic 0 1
#> Unstressed -1 -1
#>
#> $preheight
#> .L .Q
#> Low -1.00000e+00 0.5773503
#> Mid -3.32076e-17 -1.1547005
#> High 1.00000e+00 0.5773503
sobj$groups
#> $speaker
#> [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11" "s12"
#> [13] "s13" "s14" "s15" "s16" "s17" "s18"
```

We can see that all of the regression variables have been placed on a similar scale, using the default **scale** argument of 1. The majority of the predictors did not change name, but those which included function calls (*log(wordfreq)* and *scale_by(speechrate ~ speaker)*) have been altered so that they are valid variable names. If we were to call **standardize** with **scale = 0.5**, then *cdur* would still have mean 0 and standard deviation 1, but the predictors would all have scale 0.5:

```
sobj <- standardize(cdur ~ place + stress + preheight + log(wordfreq) +
scale_by(speechrate ~ speaker) + (1 | speaker), ptk, scale = 0.5)
sobj
#>
#> Call:
#> standardize(formula = cdur ~ place + stress + preheight + log(wordfreq) +
#> scale_by(speechrate ~ speaker) + (1 | speaker), data = ptk,
#> scale = 0.5)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Standardized Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> (1 | speaker)
#>
#> Variables:
#> Variable Standardized Name Class
#> cdur cdur response.numeric
#> place place factor
#> stress stress factor
#> preheight preheight ordered
#> log(wordfreq) log_wordfreq numeric
#> scale_by(speechrate ~ speaker) speechrate_scaled_by_speaker scaledby
#> speaker speaker group
#>
#> Response has mean 0 and standard deviation 1.
#>
#> Continuous variables have mean 0 and standard deviation 0.5
#> (within-factor-level if scale_by was used)
#> Unordered factors have sum contrasts with deviation 0.5
#> Ordered factors have orthogonal polynomial contrasts whose
#> columns have standard deviation 0.5
#> Grouping factors are coded as unordered factors with default contrasts
names(sobj)
#> [1] "call" "scale" "formula" "family" "data" "offset"
#> [7] "pred" "variables" "contrasts" "groups"
head(sobj$data)
#> cdur place stress preheight log_wordfreq
#> 1 0.58032620 Velar Unstressed Low -0.82376808
#> 2 0.21456174 Velar Post-Tonic High -0.82376808
#> 3 -0.11795140 Dental Unstressed High -0.02785172
#> 4 0.01505386 Dental Post-Tonic High -0.95038025
#> 5 0.91283934 Velar Tonic Mid -0.09601939
#> 6 1.51136300 Dental Post-Tonic High -0.11076074
#> speechrate_scaled_by_speaker speaker
#> 1 -0.2492331 s01
#> 2 0.6726468 s01
#> 3 -0.2492331 s01
#> 4 0.3227487 s01
#> 5 1.1335868 s01
#> 6 -1.1711131 s01
mean(sobj$data$cdur)
#> [1] -1.539605e-16
sd(sobj$data$cdur)
#> [1] 1
mean(sobj$data$log_wordfreq)
#> [1] 7.753821e-17
sd(sobj$data$log_wordfreq)
#> [1] 0.5
all.equal(0.5 * scale(log(ptk$wordfreq))[, 1], sobj$data$log_wordfreq[, 1])
#> [1] TRUE
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, mean))
#> s01 s02 s03 s04 s05
#> -1.480134e-16 1.582634e-18 -1.559673e-16 5.023952e-17 1.091679e-16
#> s06 s07 s08 s09 s10
#> -3.587453e-17 -9.768421e-17 -1.100475e-16 7.341635e-17 -6.331741e-17
#> s11 s12 s13 s14 s15
#> -1.461896e-16 1.925081e-16 9.901544e-17 3.162593e-17 7.138116e-17
#> s16 s17 s18
#> -1.110223e-16 7.914090e-17 -1.252746e-16
with(sobj$data, tapply(speechrate_scaled_by_speaker, speaker, sd))
#> s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18
#> 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
sobj$contrasts
#> $place
#> Bilabial Dental
#> Bilabial 0.5 0.0
#> Dental 0.0 0.5
#> Velar -0.5 -0.5
#>
#> $stress
#> Post-Tonic Tonic
#> Post-Tonic 0.5 0.0
#> Tonic 0.0 0.5
#> Unstressed -0.5 -0.5
#>
#> $preheight
#> .L .Q
#> Low -5.00000e-01 0.2886751
#> Mid -1.66038e-17 -0.5773503
#> High 5.00000e-01 0.2886751
sobj$groups
#> $speaker
#> [1] "s01" "s02" "s03" "s04" "s05" "s06" "s07" "s08" "s09" "s10" "s11" "s12"
#> [13] "s13" "s14" "s15" "s16" "s17" "s18"
```

The **standardized** object can be used to fit regression models, and the resulting regression model can then be used with functions from other packages such as *lme4*, *afex*, and *emmeans* with a few caveats.

We fit the mixed effects regression using the **standardized** object with **scale = 0.5** by simply passing its formula and data elements to the **lmer** function:

```
library(lme4)
#> Loading required package: Matrix
#> Warning: package 'Matrix' was built under R version 3.6.3
mod <- lmer(sobj$formula, sobj$data)
summary(mod)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula:
#> cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> (1 | speaker)
#> Data: sobj$data
#>
#> REML criterion at convergence: 2006.9
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -2.5479 -0.6919 -0.1014 0.5883 4.2981
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> speaker (Intercept) 0.09375 0.3062
#> Residual 0.79081 0.8893
#> Number of obs: 751, groups: speaker, 18
#>
#> Fixed effects:
#> Estimate Std. Error t value
#> (Intercept) 0.01576 0.07946 0.198
#> placeBilabial 0.01651 0.09553 0.173
#> placeDental -0.02341 0.09183 -0.255
#> stressPost-Tonic -0.12491 0.09681 -1.290
#> stressTonic 0.42435 0.09595 4.423
#> preheight.L 0.03529 0.09331 0.378
#> preheight.Q 0.11910 0.07848 1.518
#> log_wordfreq -0.07557 0.06763 -1.117
#> speechrate_scaled_by_speaker -0.61373 0.06643 -9.239
#>
#> Correlation of Fixed Effects:
#> (Intr) plcBlb plcDnt strP-T strssT prhg.L prhg.Q lg_wrd
#> placeBilabl 0.024
#> placeDental -0.039 -0.500
#> strssPst-Tn 0.004 0.053 -0.051
#> stressTonic -0.006 0.003 -0.003 -0.524
#> preheight.L 0.009 0.075 -0.106 -0.272 0.276
#> preheight.Q 0.083 -0.041 -0.127 0.029 -0.059 0.023
#> log_wordfrq 0.005 -0.092 0.062 -0.131 -0.029 0.106 0.093
#> spchrt_sc__ 0.007 0.015 -0.043 0.058 0.046 0.061 0.073 0.013
```

When predicting new data with the regression model, the new data first needs to be placed into the same standardized space as *sobj$data*. This can be done by calling **predict** with the **standardied** object as the first argument. The **predict** method for the **standardized** class also takes logical arguments **response** (whether or not the new data contains the response variable; default FALSE), **fixed** (whether or not the new data contains the fixed effects variables; default TRUE), and **random** (whether or not the new data contains the random effects variables; default TRUE). We will illustrate the use of the **standardized predict** method using the original data **ptk** as if it were new data:

```
newdata <- predict(sobj, ptk)
newdata_fe <- predict(sobj, ptk, random = FALSE)
newdata_re <- predict(sobj, ptk, fixed = FALSE)
head(newdata)
#> place stress preheight log_wordfreq speechrate_scaled_by_speaker speaker
#> 1 Velar Unstressed Low -0.82376808 -0.2492331 s01
#> 2 Velar Post-Tonic High -0.82376808 0.6726468 s01
#> 3 Dental Unstressed High -0.02785172 -0.2492331 s01
#> 4 Dental Post-Tonic High -0.95038025 0.3227487 s01
#> 5 Velar Tonic Mid -0.09601939 1.1335868 s01
#> 6 Dental Post-Tonic High -0.11076074 -1.1711131 s01
head(newdata_fe)
#> place stress preheight log_wordfreq speechrate_scaled_by_speaker
#> 1 Velar Unstressed Low -0.82376808 -0.2492331
#> 2 Velar Post-Tonic High -0.82376808 0.6726468
#> 3 Dental Unstressed High -0.02785172 -0.2492331
#> 4 Dental Post-Tonic High -0.95038025 0.3227487
#> 5 Velar Tonic Mid -0.09601939 1.1335868
#> 6 Dental Post-Tonic High -0.11076074 -1.1711131
head(newdata_re)
#> speaker
#> 1 s01
#> 2 s01
#> 3 s01
#> 4 s01
#> 5 s01
#> 6 s01
```

This standardized new data can then be used as the **newdata** argument to **predict** with the regression model as the first argument. The **predict** method may generate warnings about contrasts being dropped, but these warnings can be ignored (i.e. the predictions are still correct; warnings shouldn’t occur with the latest version of lme4, but may occur for older versions, and will occur for the **predict** method for fixed-effects-only models):

```
# predictions using both the fixed and random effects
preds <- predict(mod, newdata = newdata)
all.equal(preds, fitted(mod))
#> [1] TRUE
# predictions using only the fixed effects
preds_fe <- predict(mod, newdata = newdata_fe, re.form = NA)
head(preds)
#> 1 2 3 4 5 6
#> 0.36567215 -0.07756456 0.32566494 0.13159718 -0.26161147 0.98498387
head(preds_fe)
#> 1 2 3 4 5 6
#> 0.10143783 -0.34179887 0.06143062 -0.13263714 -0.52584579 0.72074955
```

When obtaining p-values for the predictors with the **mixed** function in the *afex* package, the **check_contrasts** argument should be set to FALSE to ensure that the correct contrasts are used (this shouldn’t matter when a regression model is passed to **mixed**, but if the formula and data elements of the **standardized** object are passed directly to **mixed**, it is necessary; it is best to just always specify **check_contrasts = FALSE**):

```
library(afex)
#> Registered S3 methods overwritten by 'car':
#> method from
#> influence.merMod lme4
#> cooks.distance.influence.merMod lme4
#> dfbeta.influence.merMod lme4
#> dfbetas.influence.merMod lme4
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
#> ************
#> Welcome to afex. For support visit: http://afex.singmann.science/
#> - Functions for ANOVAs: aov_car(), aov_ez(), and aov_4()
#> - Methods for calculating p-values with mixed(): 'KR', 'S', 'LRT', and 'PB'
#> - 'afex_aov' and 'mixed' objects can be passed to emmeans() for follow-up tests
#> - NEWS: library('emmeans') now needs to be called explicitly!
#> - Get and set global package options with: afex_options()
#> - Set orthogonal sum-to-zero contrasts globally: set_sum_contrasts()
#> - For example analyses see: browseVignettes("afex")
#> ************
#>
#> Attaching package: 'afex'
#> The following object is masked from 'package:lme4':
#>
#> lmer
pvals <- mixed(mod, data = sobj$data, check_contrasts = FALSE)
#> Formula (the first argument) converted to formula.
#> Fitting one lmer() model. [DONE]
#> Calculating p-values. [DONE]
pvals
#> Mixed Model Anova Table (Type 3 tests, KR-method)
#>
#> Model: cdur ~ place + stress + preheight + log_wordfreq + speechrate_scaled_by_speaker +
#> Model: (1 | speaker)
#> Data: $
#> Data: sobj
#> Data: data
#> Effect df F p.value
#> 1 place 2, 726.63 0.03 .967
#> 2 stress 2, 726.38 10.51 *** <.001
#> 3 preheight 2, 732.84 1.21 .299
#> 4 log_wordfreq 1, 731.67 1.25 .264
#> 5 speechrate_scaled_by_speaker 1, 725.10 85.35 *** <.001
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

The **emmeans** function in the *emmeans* package can be called as it would normally be called, but note that if the regression model contains polynomial covariates (i.e. if the formula passed to **standardize** includes calls to the **poly** function), then the results may be misleading (as is often the case). We will illustrate the use of **emmeans** with the **stress** factor (since the p-value for the overall factor was <.0001):

```
library(emmeans)
stress_comparisons <- emmeans(mod, pairwise ~ stress)
stress_comparisons
#> $emmeans
#> stress emmean SE df lower.CL upper.CL
#> Post-Tonic -0.0467 0.0932 32.5 -0.2365 0.1431
#> Tonic 0.2279 0.0926 31.6 0.0392 0.4166
#> Unstressed -0.1340 0.0924 31.4 -0.3223 0.0544
#>
#> Results are averaged over the levels of: place, preheight
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> (Post-Tonic) - Tonic -0.2746 0.0842 727 -3.263 0.0033
#> (Post-Tonic) - Unstressed 0.0873 0.0825 726 1.058 0.5406
#> Tonic - Unstressed 0.3619 0.0817 726 4.428 <.0001
#>
#> Results are averaged over the levels of: place, preheight
#> Degrees-of-freedom method: kenward-roger
#> P value adjustment: tukey method for comparing a family of 3 estimates
```

The *standardize* package provides several functions which aid in placing regression variables on similar scales, namely **scale_by**, **named_contr_sum**, and **scaled_contr_poly**. The **standardize** function offers a convenient way to make use of these functions (along with the **scale** function in base *R*) automatically, resulting in a **standardized** object which contains a standardized formula and data frame that can be passed to regression fitting functions. The most common use of the *standardize* package is to call the **standardize** function, passing the same **formula**, **data**, and **family** arguments as would normally be passed to a regression fitting function, leaving the **scale** argument at its default, and then passing the **formula** and **data** elements of the **standardized** object on to the regression fitting function with all other options for the regression fitting function specified as they would normally be specified.

If you use the *standardize* package in a publication, please cite:

`Eager, Christopher D. (2017). standardize: Tools for Standardizing Variables for Regression in R. R package version 0.2.1. https://CRAN.R-project.org/package=standardize`

If you analyze the **ptk** dataset in a publication, please cite:

`Eager, Christopher D. (2017). Contrast preservation and constraints on individual phonetic variation. Doctoral thesis. University of Illinois at Urbana-Champaign.`