The glmnetUtils
package provides a collection of tools to streamline the process of
fitting elastic net models with glmnet. I wrote the
package after a couple of projects where I found myself writing the same
boilerplate code to convert a data frame into a predictor matrix and a
response vector. In addition to providing a formula interface, it also
features a function `cva.glmnet`

to do crossvalidation for
both \(\alpha\) and \(\lambda\), as well as some utility
functions.

The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here are some simple examples for different types of data:

```
## Call:
## glmnetUtils:::glmnet.formula(formula = mpg ~ cyl + disp + hp,
## data = mtcars)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03326 0.11690 0.41003 1.02839 1.44125 5.05505
```

```
# multinomial logistic regression with specified elastic net alpha parameter
(irisMod <- glmnet(Species ~ ., data=iris, family="multinomial", alpha=0.5))
```

```
## Call:
## glmnetUtils:::glmnet.formula(formula = Species ~ ., data = iris,
## family = "multinomial", alpha = 0.5)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 0.5
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000870 0.0008707 0.0087093 0.0979220 0.0870709 0.8699915
```

```
# Poisson regression with an offset
(InsMod <- glmnet(Claims ~ District + Group + Age, data=MASS::Insurance,
family="poisson", offset=log(Holders)))
```

```
## Call:
## glmnetUtils:::glmnet.formula(formula = Claims ~ District + Group +
## Age, data = MASS::Insurance, family = "poisson", offset = log(Holders))
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02877 0.11613 0.46883 1.40515 1.89269 7.64083
```

Under the hood, glmnetUtils creates a model matrix and response
vector, and passes them to the glmnet package to do the actual model
fitting. A simple `print`

method is also provided, to show
the main model details at a glance. I’ll describe shortly what the
“sparse model matrix” and “use model.frame” options do.

Predicting from a model works as you’d expect: just pass a data frame
containing the new observations to the `predict`

method. You
can also specify any arguments that `predict.glmnet`

accepts.

```
# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)
# multinomial logistic regression: get predicted class
predict(irisMod, newdata=iris, type="class")
# Poisson regression: need to specify offset
predict(InsMod, newdata=MASS::Insurance, offset=log(Holders))
```

If you want, you can still use the original model matrix-plus-response syntax:

```
mtcarsX <- as.matrix(mtcars[c("cyl", "disp", "hp")])
mtcarsY <- mtcars$mpg
mtcarsMod2 <- glmnet(mtcarsX, mtcarsY)
summary(as.numeric(predict(mtcarsMod, mtcars) -
predict(mtcarsMod2, mtcarsX)))
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
```

This shows that the resulting models are identical, in terms of the predictions they make and the regularisation parameters used.

There are two ways in which glmnetUtils can generate a model matrix
out of a formula and data frame. The first is to use the standard R
machinery comprising `model.frame`

and
`model.matrix`

; and the second is to build the matrix one
variable at a time.

`model.frame`

This is the simpler option, and the one that is most compatible with
other R modelling functions. The `model.frame`

function takes
a formula and data frame and returns a *model frame*: a data
frame with special information attached that lets R make sense of the
terms in the formula. For example, if a formula includes an interaction
term, the model frame will specify which columns in the data relate to
the interaction, and how they should be treated. Similarly, if the
formula includes expressions like `exp(x)`

or
`I(x^2)`

on the RHS, `model.frame`

will evaluate
these expressions and include them in the output.

The major disadvantage of using `model.frame`

is that it
generates a `terms`

object, which encodes how variables and
interactions are organised. One of the attributes of this object is a
matrix with one row per variable, and one column per main effect and
interaction. At minimum, this is (approximately) a \(p \times p\) square matrix where \(p\) is the number of main effects in the
model. For wide datasets with \(p >
10000\), this matrix can approach or exceed a gigabyte in size.
Even if there is enough memory to store such an object, generating the
model matrix can be very slow.

Another issue with the standard R approach is the treatment of
factors. Normally, model.matrix will turn an \(N\)-level factor into an indicator matrix
with \(N-1\) columns, with one column
being dropped. This is necessary for unregularised models as fit with
`lm`

and `glm`

, since the full set of \(N\) columns is linearly dependent. With the
usual treatment
contrasts, the interpretation is that the dropped column represents
a baseline level, while the coefficients for the other columns represent
the difference in the response relative to the baseline.

This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.

To deal with the problems above, glmnetUtils by default will avoid
using `model.frame`

, instead building up the model matrix
term-by-term. This avoids the memory cost of creating a
`terms`

object, and can be noticeably faster than the
standard approach. It will also include one column in the model matrix
for *all* levels in a factor; that is, no baseline level is
assumed. In this situation, the coefficients represent differences from
the overall mean response, and shrinking them to zero *is*
meaningful (usually).

This works in an additive fashion, ie the formula
`~ a + b:c + d*e`

is treated as consisting of three terms,
`a`

, `b:c`

and `d*e`

each of which is
processed independently of the others. A dot in the formula includes all
main effect terms, ie `~ . + a:b + f(x)`

expands to
`~ a + b + x + a:b + f(x)`

(assuming a, b and x are the only
columns in the data). Note that a formula like
`~ (a + b) + (c + d)`

will be treated as two terms,
`a + b`

and `c + d`

.

To examine the speed impact of using `model.frame`

, let’s
do some simple comparisons of run times. We’ll generate sample datasets
with 100, 1,000 and 10,000 predictors, and then run `glmnet`

with both options for generating the model matrix.

```
# generate sample (uncorrelated) data of a given size
makeSampleData <- function(N, P)
{
X <- matrix(rnorm(N*P), nrow=N)
data.frame(y=rnorm(N), X)
}
# test for three sizes: 100/1000/10000 predictors
df1 <- makeSampleData(N=1000, P=100)
df2 <- makeSampleData(N=1000, P=1000)
df3 <- makeSampleData(N=1000, P=10000)
library(microbenchmark)
res <- microbenchmark(
glmnet(y ~ ., df1, use.model.frame=TRUE),
glmnet(y ~ ., df1, use.model.frame=FALSE),
glmnet(y ~ ., df2, use.model.frame=TRUE),
glmnet(y ~ ., df2, use.model.frame=FALSE),
glmnet(y ~ ., df3, use.model.frame=TRUE),
glmnet(y ~ ., df3, use.model.frame=FALSE),
times=10
)
print(res, unit="s", digits=2)
```

```
## Unit: seconds
## expr min lq mean median uq max neval
## glmnet(y ~ ., df1, use.model.frame = TRUE) 0.024 0.024 0.027 0.025 0.029 0.032 10
## glmnet(y ~ ., df1, use.model.frame = FALSE) 0.023 0.026 0.029 0.026 0.028 0.051 10
## glmnet(y ~ ., df2, use.model.frame = TRUE) 3.703 3.916 4.258 4.272 4.428 5.153 10
## glmnet(y ~ ., df2, use.model.frame = FALSE) 3.756 3.874 4.291 4.352 4.561 5.073 10
## glmnet(y ~ ., df3, use.model.frame = TRUE) 11.973 12.353 13.262 13.350 13.864 14.622 10
## glmnet(y ~ ., df3, use.model.frame = FALSE) 4.295 4.639 4.992 4.822 5.060 6.111 10
```

From this, we can see that for datasets with up to 1,000 predictors,
both methods are about as fast as each other. However, for 10,000
predictors (not uncommon these days), the `model.frame`

method takes three times as long as building the model matrix term by
term.

What happens if we take it up to 100,000 predictors? As seen below,
the standard approach of using `model.frame`

fails when R
runs out of memory: the data frame itself is about 800MB in size, but
trying to allocate the `terms`

object requires more then
67GB. However, building the model matrix by term still works, and (on
this machine) finishes in about two minutes.

```
## Warning in terms.formula(formula, data = data): Reached total allocation of
## 32666Mb: see help(memory.size)
## Error: cannot allocate vector of size 37.3 Gb
```

```
## Call:
## glmnet.formula(formula = y ~ ., data = df4, use.model.frame = FALSE)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002393 0.006506 0.017680 0.032470 0.048080 0.130700
```

As an option, glmnetUtils can also generate a *sparse* model
matrix, using the `sparse.model.matrix`

function provided in
the Matrix package. This works exactly the same as a regular model
matrix, but takes up significantly less memory if many of its entries
are zero. A scenario where this is the case would be where many of the
predictors are factors, each with a large number of levels. This can be
combined with both of the previously mentioned options for generating
model matrices.

One piece missing from the standard glmnet package is a way of
choosing \(\alpha\), the elastic net
mixing parameter, similar to how `cv.glmnet`

chooses \(\lambda\), the shrinkage parameter. To fix
this, glmnetUtils provides the `cva.glmnet`

function, which
uses crossvalidation to examine the impact on the model of changing
\(\alpha\) and \(\lambda\). The interface is the same as for
the other functions:

```
# Leukemia dataset from Trevor Hastie's website:
# https://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
leuk <- do.call(data.frame, Leukemia)
leukMod <- cva.glmnet(y ~ ., data=leuk, family="binomial")
leukMod
```

```
## Call:
## cva.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
## Number of crossvalidation folds for lambda: 10
```

`cva.glmnet`

uses the algorithm described in the help for
`cv.glmnet`

, which is to fix the distribution of observations
across folds and then call `cv.glmnet`

in a loop with
different values of \(\alpha\).
Optionally, you can parallelise this outer loop, by setting the
`outerParallel`

argument to a non-NULL value. Currently,
glmnetUtils supports the following methods of parallelisation:

- Via
`parLapply`

in the parallel package. To use this, set`outerParallel`

to a valid cluster object created by`makeCluster`

. - Via
`rxExec`

as supplied by the (now retired) RevoScaleR package from Microsoft R Server. To use this, set`outerParallel`

to a valid compute context created by`RxComputeContext`

, or a character string specifying such a context.

If the outer loop is run in parallel, `cva.glmnet`

can
check if the inner loop (over \(\lambda\)) is also set to run in parallel,
and disable this if it would lead to contention for cores.

Because crossvalidation is often a statistically noisy procedure, it doesn’t try to automatically choose \(\alpha\) and \(\lambda\) for you. Instead you can plot the output, to see how the results depend on the values of these parameters. Using this information, you can choose appropriate values for your data.

In this case, we see that values of \(\alpha\) close to \(1\) tend to lead to better accuracy. The
curves don’t have a well-defined minimum, but they do flatten out for
lower values of \(\lambda\). As the
`cv.glmnet`

documentation recommends though, it’s a good idea
to run `cva.glmnet`

multiple times to reduce the impact of
noise.

A `cva.glmnet`

object contains a list of individual
`cv.glmnet`

objects, corresponding to the different \(\alpha\) values tried. This lets you plot
the crossvalidation results easily for a given \(\alpha\):

The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi73@gmail.com.