\RequirePackage{fix-cm}
\documentclass[10pt]{scrartcl}
%%\VignetteIndexEntry{mboost Tutorial}
%%\VignetteDepends{mboost,lattice,RColorBrewer,TH.data}
\usepackage{threeparttable} % needed for \tnote (footnotes in table)
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{color}
\usepackage{xcolor}
\usepackage{amsmath, amssymb, amstext}
\usepackage{graphicx}
\usepackage{dsfont}
\usepackage{subfigure}
\usepackage{url}
\usepackage{bm} % bold Greek letters. Usage $\bm{\beta}$
\usepackage{booktabs}
\usepackage[authoryear]{natbib}
\usepackage{hyperref}
\definecolor{Blue}{rgb}{0,0,0.5}
\hypersetup{%
pdftitle = {Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost.},
pdfsubject = {},
pdfkeywords = {},
pdfauthor = {Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov and Mattthias Schmid},
%% change colorlinks to false for pretty printing
colorlinks = {true},
linkcolor = {Blue},
citecolor = {Blue},
urlcolor = {Blue},
hyperindex = {true},
linktocpage = {true},
}
\bibpunct{(}{)}{;}{a}{}{,}
\def\citeapos#1{\citeauthor{#1}'s (\citeyear{#1})} % possesive citation
\def\citepos#1{\citeauthor{#1}' (\citeyear{#1})} % possesive citation
\usepackage{enumerate} % needed for "[i]" as enumeration style etc.
\usepackage{ifpdf}
%%%%%%%%%%%%%%%%%%
% Newcommands %
%%%%%%%%%%%%%%%%%%
\newcommand{\mbf}[1]{\bm{#1}}
\newcommand{\eps}{\ensuremath{\varepsilon}}
\newcommand{\R}[1]{\texttt{#1}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\K}{\mathbf{K}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\uv}{\mathbf{u}}
\newcommand{\bv}{\mbf{\beta}}
\DeclareMathOperator{\df}{df}
\DeclareMathOperator{\argmin}{argmin}
%%%%%%%%%%%%%
% Layout %
%%%%%%%%%%%%%
\usepackage{geometry}
\geometry{verbose,a4paper,tmargin=2.2cm,bmargin=2.2cm,lmargin=2.2cm,rmargin=2.2cm}
%%%%%%%%%%%%%
% Sweave %
%%%%%%%%%%%%%
%% \usepackage{Sweave} is essentially
\RequirePackage[T1]{fontenc}
\RequirePackage{ae,fancyvrb}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,fontsize=\normalsize}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\normalsize}
\DefineVerbatimEnvironment{Scode}{Verbatim}{fontshape=sl,fontsize=\normalsize}
\newenvironment{Schunk}{}{}
\SweaveOpts{echo=true, results=verbatim, prefix.string=graphics/fig, keep.source=true}
<>=
## set up R session
options(prompt = "R> ", continue = "+ ", width = 80)
pd <- packageDescription("mboost")
if (any(is.na(pd))){
install.packages("mboost", repos = "http://cran.at.r-project.org")
pd <- packageDescription("mboost")
}
if (compareVersion(pd$Version, "2.1-0") < 0){ # must be mboost 2.1-X or newer
warning("Current version of mboost is installed!")
install.packages("mboost", repos = "http://cran.at.r-project.org")
}
options(digits = 3)
require("mboost")
set.seed(190781)
### make graphics directory if not existing
if (!file.exists("graphics"))
dir.create("graphics")
@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Begin Document %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\title{Model-based Boosting in \textsf{R}}
\subtitle{A Hands-on Tutorial Using the \textsf{R} Package mboost}
\author{Benjamin Hofner\thanks{E-mail: {benjamin.hofner@imbe.med.uni-erlangen.de}}
\thanks{Department of Medical Informatics, Biometry and Epidemiology,
Friedrich-Alexander-Universit\"at Erlangen-N\"urnberg} \and
Andreas Mayr\footnotemark[2] \and Nikolay Robinzonov\thanks{Department of
Statistics, Ludwig-Maximilians-Universit\"at M\"unchen} \and Matthias
Schmid\footnotemark[2]\\[1em]
}
\date{}
\maketitle
\begin{center}
\small
\textbf{This is an extended and (slightly) modified version of the manuscript}\\
Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov and Mattthias Schmid (2014),
Model-based Boosting in \textsf{R} -- A Hands-on Tutorial Using the \textsf{R}
Package mboost. \emph{Computational Statistics}, 29:3--35.\\
DOI \href{http://dx.doi.org/10.1007/s00180-012-0382-5}{10.1007/s00180-012-0382-5}.\\
The final publication is available at \url{http://link.springer.com}.\\[2em]
Changes to the results in the original manuscript are due to changes in some
defaults of \textbf{mboost} (most notably the definition of degrees of freedom
has changed in \textbf{mboost} 2.2-0). See NEWS for details.
\end{center}
\bigskip
\begin{abstract}
We provide a detailed hands-on tutorial for the \textsf{R} add-on package
\textbf{mboost}. The package implements boosting for optimizing general risk
functions utilizing component-wise (penalized) least squares estimates as
base-learners for fitting various kinds of generalized linear and generalized
additive models to potentially high-dimensional data. We give a theoretical
background and demonstrate how \textbf{mboost} can be used to fit
interpretable models of different complexity. As an example we use
\textbf{mboost} to predict the body fat based on anthropometric measurements
throughout the tutorial.
\end{abstract}
\section{Introduction}
\label{sec:intro}
A key issue in statistical research is the development of algorithms for model
building and variable selection \citep{elements2}. Due to the recent advances in
computational research and biotechnology, this issue has become even more
relevant \citep{Fan2010}. For example, microarray and DNA sequencing experiments
typically result in high-dimensional data sets with large numbers of predictor
variables but relatively small sample sizes. In these experiments, it is usually
of interest to build a good prognostic model using a small subset of marker
genes. Hence, there is a need for statistical techniques to select the most
informative features out of a large set of predictor variables. A related
problem is to select the appropriate modeling alternative for each of the
covariates \citep[``model choice'', ][]{Kneib:Hothorn:Tutz:modelchoice:2009}.
This problem even arises when data sets are not high-dimensional. For example, a
continuous covariate could be included into a statistical model using linear,
non-linear or interaction effects with other predictor variables.
In order to address these issues, a variety of regression techniques has been
developed during the past years (see, e.g., \citealt{elements2}). This progress
in statistical methodology is primarily due to the fact that classical
techniques for model building and variable selection (such as generalized linear
modeling with stepwise selection) are known to be unreliable or might even be
biased. In this tutorial, we consider \emph{component-wise gradient boosting}
(\citealt{Breiman98,Breiman99}, \citealt{friedmanetal2000},
\citealt{friedman2001}), which is a machine learning method for optimizing
prediction accuracy and for obtaining statistical model estimates via gradient
descent techniques. A key feature of the method is that it carries out variable
selection during the fitting process (\citealt{BuehlmannYu2003},
\citealt{Buehlmann2006}) without relying on heuristic techniques such as
stepwise variable selection. Moreover, gradient boosting algorithms result in
prediction rules that have the same interpretation as common statistical model
fits. This is a major advantage over machine learning methods such as random
forests (\citealt{rf}) that result in non-interpretable ``black-box''
predictions.
In this tutorial, we describe the \textsf{R} \citep{r-core:2012} add-on package
\textbf{mboost} \citep{Hothorn+Buehlmann+Kneib+Schmid+Hofner:mboost:2010,
pkg:mboost:CRAN:2.1}, which implements methods to fit generalized linear
models (GLMs), generalized additive models (GAMs, \citealt{hastietib}), and
generalizations thereof using component-wise gradient boosting techniques. The
\textbf{mboost} package can thus be used for regression, classification,
time-to-event analysis, and a variety of other statistical modeling problems
based on potentially high-dimensional data. Because of its user-friendly formula
interface, \textbf{mboost} can be used in a similar way as classical functions
for statistical modeling in \textsf{R}. In addition, the package implements
several convenience functions for hyper-parameter selection, parallelization of
computations, and visualization of results.
The rest of the paper is organized as follows: In Section \ref{sec:theory}, we
provide a brief theoretical overview of component-wise gradient boosting and its
properties. In Section \ref{sec:mboost}, the \textbf{mboost} package is
presented in detail. Here we present the infrastructure of the package and show
how to use \textbf{mboost} to obtain interpretable statistical model fits. All
steps are illustrated using a clinical data set that was collected by
\citet{garcia}. The authors conducted a study to predict the body fat content of
study participants by means of common anthropometric measurements. The response
variable was obtained by Dual X-Ray Absorptiometry (DXA). DXA is an accurate but
expensive method for measuring the body fat content. In order to reduce costs,
it is therefore desirable to develop a regression equation for predicting DXA
measurements by means of anthropometric measurements which are easy to obtain.
Using the \textbf{mboost} package, we provide a step-by-step illustration on how
to use gradient boosting to fit a prediction model for body fat. A summary of
the paper is given in Section~\ref{sec:summary}.
\section{A Brief Theoretical Overview of Component-Wise
Gradient Boosting}
\label{sec:theory}
Throughout the paper, we consider data sets containing the values of an outcome
variable $y$ and some predictor variables $x_1,\dots,x_p$. For example, in case
of the body fat data, $y$ is the body fat content of the study participants
(measured using DXA) and $x_1,\dots,x_p$ represent the following anthropometric
measurements: age in years, waist circumference, hip circumference, breadth of
the elbow, breadth of the knee, and four aggregated predictor variables that
were obtained from other anthropometric measurements.
The aim is to model the relationship between $y$ and
$\x:=(x_1,\dots,x_p)^{\top}$, and to obtain an ``optimal''
prediction of $y$ given $\x$. This is accomplished by minimizing a loss
function $\rho (y,f) \in \mathbb{R}$ over a prediction function~$f$
(depending on $\x$). Usually, for GLMs and GAMs, the loss function is simply
the negative log-likelihood function of the outcome distribution. Linear
regression with a continuous outcome variable $y \in\mathbb{R}$ is a
well-known example of this approach: Here, $\rho$ corresponds to the least
squares objective function (which is equivalent to the negative log-likelihood
of a Gaussian model), and $f$ is a parametric (linear) function of $\x$.
In the gradient boosting framework, the aim is to estimate the optimal
prediction function~$f^*$ that is defined by
\begin{eqnarray}\label{thmean}
&& f^{*} := \argmin_{f} \mathbb{E}_{Y,\X} \big[\rho (y, f(\x^\top)) \big] \ ,
\end{eqnarray}
where the loss function $\rho$ is assumed to be differentiable with respect to
$f$. In practice, we usually deal with realizations $(y_i, \x_i^\top)$,
$i=1,\dots,n$, of $(y, \x^{\top})$, and the expectation in (\ref{thmean}) is
therefore not known. For this reason, instead of minimizing the expected value
given in (\ref{thmean}), boosting algorithms minimize the observed mean
$\mathcal{R} := \sum_{i=1}^n \rho(y_i,f(\x_i^\top))$ (also called the
``empirical risk''). The following algorithm (``component-wise gradient
boosting'') is used
to minimize $\mathcal{R}$ over $f$:\\
\begin{enumerate}
\item Initialize the function estimate $\hat{f}^{[0]}$ with offset values. Note
that $\hat{f}^{[0]}$ is a vector of length $n$. In the following paragraphs,
we will generally denote the vector of function estimates at iteration $m$ by
$\hat{f}^{[m]}$.
\item Specify a set of \emph{base-learners}. Base-learners are simple regression
estimators with a fixed set of input variables and a univariate response. The
sets of input variables are allowed to differ among the base-learners.
Usually, the input variables of the base-learners are small subsets of the set
of predictor variables $x_1,\dots,x_p$. For example, in the simplest case,
there is exactly one base-learner for each predictor variable, and the
base-learners are just simple linear models using the predictor variables as
input variables. Generally, the base-learners considered in this paper are
either penalized or unpenalized least squares estimators using small subsets
of the predictor variables as input variables (see
Section~\ref{sec:baselearners} for details and examples). Each base-learner
represents a modeling alternative for the statistical model. Denote the number
of base-learners by $P$ and set $m=0$.
\item Increase $m$ by 1, where $m$ is the number of iterations.
\item
\begin{enumerate}
\item Compute the negative gradient $-\frac{\partial \rho}{\partial f}$ of the
loss function and evaluate it at $\hat{f}^{[m-1]} (\x_i^\top)$, $i=1,\dots,n$ (i.e.,
at the estimate of the previous iteration). This yields the negative gradient
vector
\begin{eqnarray*}
&& \uv^{[m]} = \left(u_{i}^{[m]}\right)_{i=1,\dots,n} := \left( -
\frac{\partial}{\partial f} \rho \left(y_i,\hat{f}^{[m-1]} (\x_i^\top)\right)
\right)_{i=1,\dots,n} \ .
\end{eqnarray*}
\item Fit each of the $P$ base-learners to the negative gradient vector, i.e.,
use each of the regression estimators specified in step 2 separately to fit
the negative gradient. The resulting $P$ regression fits yield $P$ vectors of
predicted values, where each vector is an estimate of the negative gradient
vector $\uv^{[m]}$.
\item Select the base-learner that fits $\uv^{[m]}$ best according to the
residual sum of squares (RSS) criterion and set $\hat{\uv}^{[m]}$ equal to the
fitted values of the best-fitting base-learner.
\item Update the current estimate by setting $\hat{f}^{[m]} = \hat{f}^{[m-1]} +
\nu \, \hat{\uv}^{[m]}$, where $0 < \nu \le 1$ is a real-valued step length
factor.
\end{enumerate}
\item Iterate Steps 3 and 4 until the stopping iteration $m_{\mathrm{stop}}$ is
reached (the choice of $m_{\mathrm{stop}}$ is discussed below).
\end{enumerate}
From step 4 it is seen that an estimate of the true negative gradient of
$\mathcal{R}$ is added to the current estimate of $f^*$ in each iteration.
Consequently, the component-wise boosting algorithm descends along the gradient
of the empirical risk $\mathcal{R}$. The empirical risk $\mathcal{R}$ is
thus minimized in a stage-wise fashion, and a structural (regression)
relationship between $y$ and $\x$ is established. This strategy corresponds to
replacing classical Fisher scoring algorithms for maximum likelihood estimation
of $f^*$ (\citealt{mccullagh}) by a gradient descent algorithm in function
space. As seen from steps 4(c) and 4(d), the algorithm additionally carries out
variable selection and model choice, as only one base-learner is selected for
updating~$\hat{f}^{[m]}$ in each iteration. For example, if each base-learner
corresponds to exactly one predictor variable (that is used as the input
variable of the respective base-learner), only {\em one} predictor variable is
selected in each iteration (hence the term ``component-wise''). Note that the
variable selection property of the above-described algorithm is not a general
feature of all boosting methods (where, in principle, any type of regression
estimator could be used to fit the negative gradient) but is caused by the
specific definition of the base-learning mechanism in steps 2 and 4.
Due to the additive update, the final boosting estimate in iteration $m_{\mathrm{stop}}$
can be interpreted as an additive prediction function. It is easily seen that
the structure of this function is equivalent to the structure of the additive
predictor of a GAM (see \citealt{hastietib}). This means that
\begin{equation}
\label{gamfun}
\hat{f} = \hat{f}_1 + \dots + \hat{f}_P \ ,
\end{equation}
where $\hat{f}_1,\dots,\hat{f}_P$ correspond to the functions specified by the
base-learners. Consequently, $\hat{f}_1,\dots,\hat{f}_P$ depend on the predictor
variables that were used as input variables of the respective base-learners.
Note that a base-learner can be selected multiple times in the course of the
boosting algorithm. In this case, its function estimate $\hat{f}_j$, $j \in
1,\dots,P$, is the sum of the individual estimates $\nu \cdot \hat{\uv}^{[m-1]}$
obtained in the iterations in which the base-learner was selected. Note also
that some of the $\hat{f}_j$, $j=1,\dots,P$, might be equal to zero, as the
corresponding base-learners might not have been selected in step 4(c). This can
then be considered as variable selection or model choice (depending on the
specification of the base-learners).
From step 4 of the component-wise gradient boosting algorithm it is clear that
the specification of the base-learners is crucial for interpreting the model
fit. As a general rule, due to the additive update in step 4(d), the estimate of
a function~$f_j$ at iteration $m_{\mathrm{stop}}$ has the same structure as the
corresponding base-learner. For example, $f_j$ is a linear function if the
base-learner used to model $f_j$ in step~4(b) is a simple linear model (see
\citealt{BuhlmannHothorn06}, p.~484, also see Section \ref{sec:mboost} for
details). Similarly, $f_j$ is a smooth function of the $j$th covariate $\x_j$ if
the corresponding base-learner is smooth as well. Note that it is generally
possible to incorporate observation weights into component-wise gradient
boosting. This is accomplished by using weighted regression techniques for each
of the base-learners in step 4.
A crucial issue is the choice of the stopping iteration
$m_{\mathrm{stop}}$. Several authors have argued that boosting algorithms should
generally not be run until convergence. Otherwise, overfits resulting in a
suboptimal prediction accuracy would be likely to occur (see
\citealt{BuhlmannHothorn06}). We therefore use a finite stopping iteration for
component-wise gradient boosting that optimizes the prediction accuracy (``early
stopping strategy''). As a consequence, the stopping iteration becomes a tuning
parameter of the algorithm, and cross-validation techniques or AIC-based
techniques can be used to estimate the optimal $m_{\mathrm{stop}}$. In
contrast to the choice of the stopping iteration, the choice of the step length
factor $\nu$ has been shown to be of minor importance for the predictive
performance of a boosting algorithm. The only requirement is that the value of
$\nu$ is small (e.g., $\nu=0.1$, see
\citealt{Schmid:Hothorn:boosting-p-Splines}). Small values of $\nu$ are
necessary to guarantee that the algorithm does not overshoot the minimum of the
empirical risk $\mathcal{R}$.
A major consequence of early stopping is that boosting optimizes prediction
accuracy by regularizing the estimates of $f^*$ via shrinkage techniques. This
is due to the fact that using a small step length $\nu$ ensures that effect estimates increase ``slowly'' in the course of the
boosting algorithm, and that the estimates stop increasing as soon as the
optimal stopping iteration $m_{\mathrm{stop}}$ is reached. Stopping a
component-wise gradient boosting algorithm at the optimal iteration therefore
implies that effect estimates are shrunken such that the predictive power of the
GAM is maximal. Shrinking estimates is a widely used strategy for
building prognostic models: Estimates tend to have a slightly increased bias but
a decreased variance, thereby improving prediction accuracy. In addition,
shrinkage generally stabilizes effect estimates and avoids multicollinearity
problems. Despite the bias induced by shrinking effect estimates, however, the
structure of function (\ref{gamfun}) ensures that results are
interpretable and that black-box estimates are avoided. The interpretation of
boosting estimates is essentially the same as those of classical maximum
likelihood estimates.
\section{The Package mboost}
\label{sec:mboost}
As pointed out above, the \textsf{R} add-on package \textbf{mboost} implements
model-based boosting methods as introduced above that result in interpretable
models. This is in contrast to other boosting packages such as \textbf{gbm}
\citep{Ridgeway:gbm:2010}, which implements tree-based boosting methods that
lead to black-box predictions. The \textbf{mboost} package offers a modular
nature that allows to specify a wide range of statistical models.
A generalized additive model is specified as the combination of a distributional
assumption and a structural assumption. The distributional assumption specifies
the conditional distribution of the outcome. The structural assumption specifies
the types of effects that are to be used in the model, i.e., it represents the
deterministic structure of the model. Usually, it specifies how the predictors
are related to the conditional mean of the outcome. To handle a broad class of
models within one framework, \textbf{mboost} also allows to specify effects for
conditional quantiles, conditional expectiles or hazard rates. The
distributional assumption, i.e., the loss function that we want to minimize, is
specified as a \R{family}. The structural assumption is given as a \R{formula}
using base-learners.
The loss function, as specified by the \R{family} is independent of the
estimation of the base-learners. As one can see in the component-wise boosting
algorithm, the loss function is only used to compute the negative gradient in
each boosting step. The predictors are then related to these values by penalized
ordinary least squares estimation, irrespective of the loss function. Hence, the
user can freely combine structural and distributional assumptions to tackle
new estimation problems.
In Section~\ref{sec:glmboost} we will derive a special case of component-wise
boosting to fit generalized linear models. In Section~\ref{sec:gamboost} we
introduce the methods to fit generalized additive models and give an
introduction to the available base-learners. How different loss functions can be
specified is shown in Section~\ref{sec:family}.
\subsection{Fitting Generalized Linear Models: \R{glmboost}}
\label{sec:glmboost}
The function \R{glmboost()} provides an interface to fit (generalized) linear
models. A (generalized) linear model of the covariates $\x = (x_1,
\ldots, x_p)^\top$ has the form
\begin{equation*}
g(\vec{\mu}) = \beta_0 + \beta_1 x_1 + \ldots + \beta_px_p
\end{equation*}
with the (conditional) expectation of the response $\vec{\mu} = \mathds{E}(y |
\vec{x})$, the link function $g$ and parameters $\vec{\beta}$. The resulting
models from \R{glmboost()} can be essentially interpreted the same way as
models that are derived from \R{glm()}. The only difference is that the
boosted generalized linear model additionally performs variable selection as
described in Section~\ref{sec:theory} and the effects are shrunken toward zero
if early stopping is applied. In each boosting iteration, \R{glmboost()} fits
simple linear models without intercept \emph{separately for each column of the
design matrix} to the negative gradient vector. Only the best-fitting linear
model (i.e., the best fitting base-learner) is used in the update step. If
factors are used in the model, this results in separate updates for the
effects of each factor level as these are (after dummy coding) represented by
separate columns in the design matrix. If one wants to treat factors or groups
of variables as one base-learner with a common update, one needs to use the
\R{gamboost()} function with a \R{bols()} base-learner for each factor or
group (see Section~\ref{sec:gamboost}, especially Table~\ref{tab:bols}, for
details).
The interface of \R{glmboost()} is in essence the same as for \R{glm()}. Before
we show how one can use the function to fit a linear model to predict the body
fat content, we give a short overview on the function\footnote{Note that here
and in the following we sometimes restrict the focus to the most important or
most interesting arguments of a function. Further arguments might exist. Thus,
for a complete list of arguments and their description refer to the respective
manual.}:
\begin{Schunk}
\begin{Sinput}
glmboost(formula, data = list(), weights = NULL,
center = TRUE, control = boost_control(), ...)
\end{Sinput}
\end{Schunk}
\noindent The model is specified using a \R{formula} as in \R{glm()} of the form
\R{response \~{} predictor1 + predictor2} and the data set is provided as a
\R{data.frame} via the \R{data} argument. Optionally, \R{weights} can be given
for weighted regression estimation. The argument \R{center} is specific for
\R{glmboost()}. It controls whether the data is internally centered. Centering
is of great importance, as this allows much faster ``convergence'' of the
algorithm or even ensures that the algorithm converges in the direction of the
true value at all. We will discuss this in detail at the end of
Section~\ref{sec:glmboost}. The second boosting-specific argument, \R{control},
allows to define the hyper-parameters of the boosting algorithm. This is done
using the function \R{boost\_control()}. For example one could specify:
\begin{Schunk}
\begin{Sinput}
R> boost_control(mstop = 200, ## initial number of boosting iterations. Default: 100
+ nu = 0.05, ## step length. Default: 0.1
+ trace = TRUE) ## print status information? Default: FALSE
\end{Sinput}
\end{Schunk}
Finally, the user can specify the distributional assumption via a \R{family},
which is ``hidden'' in the \,`\R{\ldots}'\, argument (see
\R{?mboost\_fit}\footnote{\R{glmboost()} merely handles the preprocessing of the
data. The actual fitting takes place in a unified framework in the function
\R{mboost\_fit()}.} for details and further possible parameters). The default
\R{family} is \R{Gaussian()}. Details on families are given in
Section~\ref{sec:family}. Ways to specify new families are described in the
Appendix.
\paragraph{Case Study: Prediction of Body Fat}\hfill\\
\noindent The aim of this case study is to compute accurate predictions for the
body fat of women based on available anthropometric measurements. Observations
of 71 German women are available with the data set \R{bodyfat} \citep{garcia}
included in \textbf{mboost}. We first load the package and the data set%
\footnote{The data set \R{bodyfat} has been moved to the package \textbf{TH.data}.}.
<>=
library("mboost") ## load package
data("bodyfat", package = "TH.data") ## load data
@
The response variable is the body fat measured by DXA (\R{DEXfat}), which can be
seen as the gold standard to measure body fat. However, DXA measurements are too
expensive and complicated for a broad use. Anthropometric measurements as waist
or hip circumferences are in comparison very easy to measure in a standard
screening. A prediction formula only based on these measures could therefore be
a valuable alternative with high clinical relevance for daily usage. The
available variables and anthropometric measurements in the data set are
presented in Table~\ref{tab:casepred}.
\begin{table}[h!]
\centering
\caption{Available variables in the \R{bodyfat} data, for details see \cite{garcia}.}
\label{tab:casepred}
\begin{tabular}{lp{0.5\linewidth}}
\toprule
Name & Description \\
\midrule
\R{DEXfat} &body fat measured by DXA (response variable) \\
\R{age} & age of the women in years \\
\R{waistcirc} & waist circumference \\
\R{hipcirc} & hip circumference \\
\R{elbowbreadth} & breadth of the elbow \\
\R{kneebreadth} & breadth of the knee \\
\R{anthro3a} & sum of logarithm of three anthropometric measurements \\
\R{anthro3b} & sum of logarithm of three anthropometric measurements \\
\R{anthro3c} & sum of logarithm of three anthropometric measurements \\
\R{anthro4} & sum of logarithm of four anthropometric measurements \\
\bottomrule
\end{tabular}
\end{table}
In the original publication \citep{garcia}, the presented prediction formula was
based on a linear model with backward-elimination for variable selection. The
resulting final model utilized hip circumference (\R{hipcirc}), knee breadth
(\R{kneebreadth}) and a compound covariate (\R{anthro3a}), which is defined as
the sum of the logarithmic measurements of chin skinfold, triceps skinfold and
subscapular skinfold:
<>=
## Reproduce formula of Garcia et al., 2005
lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
coef(lm1)
@
\noindent A very similar model can be easily fitted by boosting, applying
\R{glmboost()} with default settings:
<>=
## Estimate same model by glmboost
glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
coef(glm1, off2int=TRUE) ## off2int adds the offset to the intercept
@
\noindent Note that in this case we used the default settings in \R{control} and
the default family \R{Gaussian()} leading to boosting with the $L_2$ loss.
We now want to consider all available variables as potential predictors. One way
is to simply specify \R{"."} on the right side of the formula:
<>=
glm2 <- glmboost(DEXfat ~ ., data = bodyfat)
@
As an alternative one can explicitly provide the whole formula by using the
\R{paste()} function\footnote{Another alternative is given by the matrix
interface for \R{glmboost()} where one can directly use the design matrix as
an argument. For details see \R{?glmboost}.}. Therefore, one could essentially
call:
<>=
preds <- names(bodyfat[, names(bodyfat) != "DEXfat"]) ## names of predictors
fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+"))) ## build formula
fm
@
and provide \R{fm} to the \R{formula} argument in \R{glmboost()}. Note that a
solution using the \R{paste()} function is somewhat unavoidable when we intend
to combine different base-learners for plenty of predictors in \R{gamboost()}.
Note that at this iteration (\R{mstop} is still 100 as it is the default value)
\R{anthro4} is not included in the resulting model as the corresponding
base-learner was never selected in the update step. The function \R{coef()} by
default only displays the selected variables but can be forced to show all
effects by specifying \R{which = ""}:
<>=
coef(glm2, ## usually the argument 'which' is used to specify single base-
which = "") ## learners via partial matching; With which = "" we select all.
@
A plot of the coefficient paths, similar to the ones commonly known from the
LARS algorithm \citep{Efron2002}, can be easily produced by using \R{plot()} on
the \R{glmboost} object (see Figure~\ref{fig:coefpaths}):
<>=
plot(glm2, off2int = TRUE) ## default plot, offset added to intercept
## now change ylim to the range of the coefficients without intercept (zoom-in)
plot(glm2, ylim = range(coef(glm2, which = preds)))
@
\setkeys{Gin}{width = 0.45\textwidth}
\begin{figure}[ht]
\centering
<>=
## same as first plot from chunk "plot_glmboost" but with better margins
par(mar = c(4, 4, 0, 6) + 0.1)
plot(glm2, main="", off2int=TRUE)
@
\hfill
<>=
## same as second plot from chunk "plot_glmboost" but with better margins
par(mar = c(4, 4, 0, 6) + 0.1)
plot(glm2, ylim = range(coef(glm2, which = preds)), main = "")
@
\caption{Coefficients paths for the body fat data: Here, the default plot on the
left leads to hardly readable variable names due to the inclusion of the
intercept. For the plot on the right we therefore adjusted the y-scale to
avoid this problem. \label{fig:coefpaths}}
\end{figure}
\paragraph{Centering of Linear Base-learners Without Intercept}\hfill\\
\noindent For linear base-learners that are specified without
intercept,\footnote{If the fitting function \texttt{glmboost()} is used the
base-learners never contain an intercept. Furthermore, linear base-learners
without intercept can be obtained by specifying a base-learner \texttt{bols(x,
intercept = FALSE)} (see below).} it is of great importance to center the
covariates before fitting the model. Without centering of the covariates, linear
effects that result from base-learners without intercept are forced through the
origin (with no data lying there). Hence, the convergence will be very slow or
the algorithm will not converge to the ``correct'' solution even in very simple
cases. As an example, consider one normally distributed predictor $\x = (x_1,
\ldots, x_n)^\top$, and a model
\begin{equation*}
\y = \beta \x + \mbf{\eps},
\end{equation*}
with $\beta = 1$ and $\mbf{\eps} \sim \mathcal{N}(\mbf{0}, 0.3^2)$. Usually, a
model without intercept could be fitted to estimate $\beta$. However, if we
apply boosting with the L$_2$ loss the negative gradient in the first boosting
step is, by default, the centered response, i.e., $\uv^{[1]} = \y - 1/n \sum_{i
= 1}^n y_i$. For other loss functions the negative gradient in the first
boosting iteration is not exactly the mean-centered response. Yet, the negative
gradient in the first step is always ``centered'' around zero. In this
situation, the application of a base-learner without intercept (i.e., a simple
linear model without intercept) is not sufficient anymore to recover the effect
$\beta$ (see Figure~\ref{fig:without_centering}). The true effect is completely
missed. To solve this problem, it is sufficient to use a centered predictor
$\x$. Then, the center of the data is shifted to the origin (see
Figure~\ref{fig:with_centering}) and the model without intercept goes through
the origin as well.
%% figure is build at the end of the document
\setkeys{Gin}{width = 0.45\textwidth}
\begin{figure}[ht]
\centering
\subfigure[Without Centering \label{fig:without_centering}]{
\includegraphics{graphics/fig-center_false}
}
\subfigure[With Centering \label{fig:with_centering}]{
\includegraphics{graphics/fig-center_true}
}
\caption{L$_2$Boosting in the first boosting step, i.e., with centered response
variable as outcome. A base-learner without intercept misses the true effect
completely if $\x$ is not centered (left) and is able to capture the true
structure if $\x$ is centered (right).\label{fig:centering}}
\end{figure}
Centering the predictors does not change the estimated effects of the
predictors. Yet, the intercept needs to be corrected as can be seen from the
following example. Consider two predictors and estimate a model with centered
predictors, i.e,
\begin{align*}
\hat{\y} &= \hat{\beta}_0 + \hat{\beta}_1 (\x_1 - \bar{x}_1) + \hat{\beta}_2
(\x_2 -
\bar{x}_2) \quad\quad \Leftrightarrow\\
\hat{\y} &= \underbrace{(\hat{\beta}_0 - \hat{\beta}_1 \bar{x}_1 -
\hat{\beta}_2 \bar{x}_2)}_{= \hat{\beta}_0^*} + \hat{\beta}_1 \x_1 +
\hat{\beta}_2 \x_2.
\end{align*}
Hence, the intercept from a model without centering of the covariates equals
$\hat{\beta}_0^* = \hat{\beta}_0 - \sum_j \hat{\beta}_j \bar{x}_j$, where
$\hat{\beta}_0$ is the intercept estimated from a model with centered
predictors.
\subsection{Fitting Generalized Additive Models: \R{gamboost}}
\label{sec:gamboost}
Besides an interface to fit linear models, \textbf{mboost} offers a very
flexible and powerful function to fit additive models. An additive model of the
covariates $\x = (x_1, \ldots, x_p)^\top$ has, in general, the form
\begin{equation*}
g(\vec{\mu}) = \beta_0 + f_1 + \ldots + f_p
\end{equation*}
with the (conditional) expectation of the response $\vec{\mu} = \mathds{E}(y |
\vec{x})$, the link function $g$ and arbitrary functions $f_1, \ldots, f_p$ of
the covariates. These functions include simple, linear functions as well as
smooth, non-linear functions. The functions are specified by the base-learners
that are introduced in the following paragraphs.
The function \R{gamboost()} can be used to fit linear models or (non-linear)
additive models via component-wise boosting. The user additionally needs to
state which variable should enter the model in which fashion, e.g. as a linear
effect or as a smooth effect. In general, however, the interface of
\R{gamboost()} is very similar to \R{glmboost()}.
\begin{Schunk}
\begin{Sinput}
gamboost(formula, data = list(), ...)
\end{Sinput}
\end{Schunk}
Again, the function requires a formula to specify the model. Furthermore, a data
set needs to be specified as for linear models. Additional arguments that are
passed to \R{mboost\_fit()}\footnote{\R{gamboost()} also calls \R{mboost\_fit()}
for the actual boosting algorithm.} include \R{weights}, \R{control} and
\R{family}. These arguments can be used in the same way as described for
\R{glmboost()} above.
\subsubsection{Base-learners}\label{sec:baselearners}
The structural assumption of the model, i.e., the types of effects that are to
be used, can be specified in terms of base-learners. Each base-learner results
in a related type of effect. An overview of available base-learners is given in
the following paragraphs. An example of how these base-learners can then be
combined to formulate the model is given afterwards.
The base-learners should be defined such that the degrees of freedom of the
single base-learner are small enough to prevent overshooting. Typically one
uses, for example, 4 degrees of freedom (the default for many base-learners) or
less. Despite the small initial degrees of freedom, the final estimate that
results from this base-learner can adopt higher order degrees of freedom due to
the iterative nature of the algorithm. If the base-learner is chosen multiple
times, it overall adapts to an appropriate degree of flexibility and smoothness.
All base-learners considered in this tutorial are simple penalized least squares
models of the form
\begin{equation*}
\hat{\uv} = \X (\X^\top\X + \lambda
\K)^{-1}\X^\top \uv = \mathcal{S} \uv,
\end{equation*}
where the hat-matrix is defined as $\mathcal{S} = \X (\X^\top\X + \lambda
\K)^{-1}\X^\top$, with design matrix $\X$, penalty parameter $\lambda$ and
penalty matrix $\K$. The design and penalty matrices depend on the type of
base-learner. A penalty parameter $\lambda = 0$ results in unpenalized
estimation.
\paragraph{Linear and Categorical Effects}\hfill\\
\noindent The \R{bols()} function\footnote{the name refers to \textbf{o}rdinary
\textbf{l}east \textbf{s}quares base-learner} allows the definition of
(penalized) ordinary least squares base-learners. Examples of base-learners of
this type include (a) linear effects, (b) categorical effects, (c) linear
effects for groups of variables $\x = (x_1, x_2, \ldots, x_p)^\top$, (d)
ridge-penalized effects for (b) and (c), (e) varying coefficient terms (i.e.,
interactions). If a penalized base-learner is specified, a special penalty based
on the differences of the effects of adjacent categories is automatically used
for ordinal variables (\R{x <- as.ordered(x)}) instead of the ridge penalty
\citep{Hofner:unbiased:2011}. Figure~\ref{fig:expl_bols} shows two effect
estimates that result from \R{bols()} base-learners, a simple linear effect and
an effect estimate for a factor variable. The call to an ordinary penalized
least squares base-learner is as follows:
\begin{Schunk}
\begin{Sinput}
bols(..., by = NULL, intercept = TRUE, df = NULL, lambda = 0)
\end{Sinput}
\end{Schunk}
The variables that correspond to the base-learner are specified in the
\,`\R{\ldots}'\, argument, separated by commas. If multiple variables are
specified, they are treated as \emph{one group}. A linear model is fitted for
all given variables together and they are either all updated or not at all. An
additional variable that defines varying coefficients can optionally
be given in the \R{by} argument. The logical variable \R{intercept} determines
whether an intercept is added to the design matrix (\R{intercept = TRUE}, the
default). If \R{intercept = FALSE}, continuous covariates should be (mean-)
centered as discussed above. This must be done `by hand' in advance of fitting
the model. The impact of the penalty in case of penalized OLS (ordinary least squares) base-learners can
be determined either via the degrees of freedom \R{df} or the penalty parameter
\R{lambda}. If degrees of freedom are specified, the penalty parameter
\R{lambda} is computed from \R{df}\footnote{If \R{df} is specified in
\R{bols()}, \R{lambda} is always ignored.}. Note that per default unpenalized
linear models are used. Two definitions of degrees of freedom are implemented in
\textbf{mboost}: The first uses the trace of the hat-matrix
($\text{trace}(\mathcal{S})$) as degrees of freedom, while the second definition
uses $\text{trace}(2 \mathcal{S} - \mathcal{S}^\top\mathcal{S})$. The latter
definition is tailored to the comparison of models based on residual sums of
squares and hence is better suitable in the boosting context \citep[see][for
details]{Hofner:unbiased:2011}.
%% figure is build at the end of the document
\setkeys{Gin}{width = 0.45\textwidth}
\begin{figure}[ht]
\centering
\includegraphics{graphics/fig-bolsx1}
\hfill
\includegraphics{graphics/fig-bolsx2}
\caption{Examples for linear and categorical effects estimated using \R{bols()}:
Linear effect (left; $f(x_1) = 0.5 x_1$) and categorical effect (right;
$f(x_2) = 0 x_2^{(1)} - 1 x_2^{(2)} + 0.5 x_2^{(3)} + 3 x_2^{(4)}$). The
effects can be estimated with base-learners \R{bols(x1)} and \R{bols(x2)},
where \R{x1} is a continuous covariate and \R{x2} is a categorical covariate
coded as \R{factor} in \textsf{R}. \label{fig:expl_bols}}
\end{figure}
Table~\ref{tab:bols} shows some calls to \R{bols()} and gives the resulting type
of effect. To gain more information on a specific base-learner it is possible to
inspect the base-learners in \textbf{mboost} using the function \R{extract()} on
a base-learner or a boosting model. With the function one can extract, for
example, the design matrices, the penalty matrices and many more characteristics
of the base-learners. To extract, for example, the design matrix that results
from a linear base-learner of a factor \R{z}, say \R{z <- factor(1:3)}, we can
use
\begin{Schunk}
\begin{Sinput}
R> extract(bols(z))
\end{Sinput}
\begin{Soutput}
(Intercept) z2 z3
1 1 0 0
2 1 1 0
3 1 0 1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$z
[1] "contr.treatment"
\end{Soutput}
\end{Schunk}
Thus, we see that base-learners of factors use treatment contrasts per default.
For a detailed instruction on how to use the function see
the manual for \R{extract()} and especially the examples therein.
\begin{table}[h!]
\centering
\caption{Some examples of effects that result from \R{bols()}}
\label{tab:bols}
\begin{tabular}{lp{0.5\linewidth}}
\toprule
Call & Type of Effect\\
\midrule
\R{bols(x)} & linear effect: $\x^\top\bv$ with $\x^\top = (1, x)$\\
\R{bols(x, intercept = FALSE)} & linear effect without intercept:
$\beta \cdot x$\\
\R{bols(z)} & OLS fit with factor $z$ (i.e., linear effect
after dummy coding)\\
\R{bols(z, df = 1)} & ridge-penalized OLS fit with one degree of freedom
and factor $z$; If $z$ is an ordered factor a difference penalty is used
instead of the ridge penalty.\\
\R{bols(x1, x2, x3)} & one base-learner for three variables
(group-wise selection):\newline $\x^\top\bv$ with $\x^\top = (1, x_1, x_2, x_3)$\\
\R{bols(x, by = z)} & interaction: $\x^\top\bv \cdot z$ (with continuous
variable $z$). If $z$ is a factor, a separate effect is estimated for each
factor level; Note that in this case, the main effect needs to be specified
additionally via \R{bols(x)}.\\
\bottomrule
\end{tabular}
\end{table}
\paragraph{Smooth Effects}\hfill\\
\noindent The \R{bbs()} base-learner\footnote{the name refers to B-splines with
penalty, hence the second b} allows the definition of smooth effects (i.e.,
functions of the covariate that are required to be sufficiently smooth, see
Figure~\ref{fig:expl_bbs}) based on B-splines \citep{deBoor:splines:1978} with
difference penalty \citep[i.e., P-splines, cf.][]{eilers1996}. P-splines can be
seen as a versatile modeling tool for smooth effects. A wide class of effects
can be specified using P-splines. Examples include (a) smooth effects, (b)
bivariate smooth effects (e.g., spatial effects), (c) varying coefficient terms,
(d) cyclic effects (= periodic effects) and many more. Two examples of smooth
function estimates fitted using a \R{bbs()} base-learner are given in
Figure~\ref{fig:expl_bbs}. The call to a P-spline base-learner is as follows:
\begin{Schunk}
\begin{Sinput}
bbs(..., by = NULL, knots = 20, boundary.knots = NULL,
degree = 3, differences = 2, df = 4, lambda = NULL,
center = FALSE, cyclic = FALSE)
\end{Sinput}
\end{Schunk}
As for all base-learners, the variables that correspond to the base-learner can
be specified in the \,`\R{\ldots}'\, argument. Usually, only one variable is
specified here for smooth effects, and at maximum two variables are allowed for
\R{bbs()}. Varying coefficient terms $f(x) \cdot z$ can be specified using
\R{bbs(x, by = z)}. In this case, $x$ is called the effect modifier of the
effect of $z$. The \R{knots} argument can be used to specify the number of
equidistant knots, or the positions of the interior knots. In case of two
variables, one can also use a named list to specify the number or the position
of the interior knots for each of the variables separately. For an example of
the usage of a named list see Section~\ref{sec:model_building}. The location of
boundary knots (default: range of the data) can be specified using
\R{boundary.knots}. Usually, no user-input is required here. The only exception
is given for cyclic splines (see below). The degree of the B-spline bases
(\R{degree}) and the order of the difference penalty (\R{differences}; $\in \{0,
1, 2, 3\}$) can be used to specify further characteristics of the spline
estimate. The latter parameter specifies the type of boundary effect. For
\R{differences = 2}, for example, deviations from a linear function are
penalized. In the case of first order differences, deviations from a constant
are subject to penalization. The smoothness of the base-learner can be specified
using degrees of freedom (\R{df}) or the smoothing parameter (\R{lambda})\footnote{If
\R{lambda} is specified in \R{bbs()}, \R{df} is always ignored.}.
%% figure is build at the end of the document
\setkeys{Gin}{width = 0.45\textwidth}
\begin{figure}[ht]
\centering
\includegraphics{graphics/fig-bbsx1}
\hfill
\includegraphics{graphics/fig-bbsx2}
\caption{Examples for smooth effects estimated using \R{bbs(x1)} (left) and
\R{bbs(x2, knots = quantile(x2, c(0.25, 0.5, 0.75)), df = 5)} (right). Hence,
the left P-spline uses the default values of 20 equidistant inner knots and 4
degrees of freedom, while the right P-spline estimate is derived with 3 inner
knots placed at the quartiles and 5 degrees of freedom. True effects:
$f_1(x_1) = 3 \cdot \sin(x_1)$ (left) and $f_2(x_2) = x_2^2$
(right). \label{fig:expl_bbs}}
\end{figure}
An issue in the context of P-splines is that one cannot make the degrees of
freedom arbitrary small. A polynomial of order \R{differences - 1} always
remains unpenalized (i.e., the so-called null space). As we usually apply second
order differences, a linear effect (with intercept) remains unpenalized and thus
$\df \geq 2$ for all smoothing parameters. A solution is given by a P-spline
decomposition
\citep[see][]{Kneib:Hothorn:Tutz:modelchoice:2009,Hofner:unbiased:2011}. The
smooth effect $f(x)$ is decomposed in the unpenalized polynomial and a smooth
deviation from this polynomial. For differences of order 2 (= default), we thus
get:
\begin{equation}\label{eq:decomp}
f(x) = \underbrace{\beta_0 + \beta_1 x}_{\text{unpenalized polynomial}} +
\underbrace{f_{\text{centered}}(x)}_{\text{smooth deviation}}
\end{equation}
The unpenalized polynomial can then be specified using \R{bols(x, intercept =
FALSE)} and the smooth deviation is obtained by \R{bbs(x, center = TRUE)}.
Additionally, it is usually advised to specify an explicit base-learner for the
intercept (see Section~\ref{sec:model_building}).
A special P-spline base-learner is obtained if \R{cyclic = TRUE}. In this case,
the fitted values are forced to coincide at the boundary knots and the function
estimate is smoothly joined (see Figure~\ref{fig:expl_cyclic}). This is
especially interesting for time-series or longitudinal data were smooth,
periodic functions should be modeled. In this case it is of great importance
that the boundary knots are properly specified to match the points were the
function should be joined (due to subject matter knowledge).
An non-exhaustive overview of some usage scenarios for \R{bbs()} base-learners
is given in Table~\ref{tab:bbs}.
\begin{table}[h!]
\centering
\caption{Some examples of effects that result from \R{bbs()}}
\label{tab:bbs}
\begin{tabular}{p{0.5\linewidth}p{0.45\linewidth}}
\toprule
Call & Type of Effect\\
\midrule
\R{bbs(x, by = z)} & varying coefficient: $f(x) \cdot z = \beta(x)z$ (with
continuous variable $z$). If $z$ is a factor, a separate smooth effect is
estimated for each factor level; Note that in this case, the main effect
needs to be specified additionally via \R{bbs(x)}.\\
\R{bbs(x, knots = 10)} & smooth effect with 10 inner knots \\
\R{bbs(x, boundary.knots = c(0, 2 * pi),}\newline\hphantom{\R{bbs(}}\R{cyclic = TRUE)} &
periodic function with period $2\pi$\\
\R{bbs(x, center = TRUE, df = 1)} & P-spline decomposition (\R{center =
TRUE}), which is needed to specify arbitrary small \R{df} for P-splines\\
\bottomrule
\end{tabular}
\end{table}
%% figure is build at the end of the document
\setkeys{Gin}{width = 0.45\textwidth}
\begin{figure}[ht]
\centering
\includegraphics{graphics/fig-cyclic1}
\hfill
\includegraphics{graphics/fig-cyclic2}
\caption{Example for cyclic splines. The unconstrained effect in the left figure
is fitted using \R{bbs(x, knots = 12)}, the cyclic effect in the right is
estimated using \R{bbs(x, cyclic = TRUE, knots = 12, boundary.knots = c(0,
2*pi))}. True effect: $f(x) = \sin(x)$.}.\label{fig:expl_cyclic}
\end{figure}
\paragraph{Smooth Surface Estimation}\hfill\\
\noindent An extension of P-splines to two dimensions is given by bivariate
P-splines. They allow to fit spatial effects and smooth interaction surfaces. An
example of a two dimensional function estimate is given in
Figure~\ref{fig:expl_bspatial}. For further details on bivariate P-splines in
the boosting context see \citet{Kneib:Hothorn:Tutz:modelchoice:2009}. The
effects can be obtained from a call to
\begin{Schunk}
\begin{Sinput}
bspatial(..., df = 6)
\end{Sinput}
\end{Schunk}
To specify two dimensional smooth effects, the \,`\R{\ldots}'\, argument
requires two variables, i.e., a call such as \R{bspatial(x, y)}. Note that
\R{bspatial()} is just a wrapper to \R{bbs()} with redefined degrees of
freedom\footnote{Note that \R{df = 4} was changed to \R{df = 6} in
\textbf{mboost} 2.1-0.}. Thus, all arguments from \R{bbs()} exist and can be
used for \R{bspatial()}. An example is given in
Section~\ref{sec:model_building}.
%% figure is build at the end of the document
\setkeys{Gin}{width=0.45\textwidth}
\begin{figure}[ht]
\centering
\includegraphics{graphics/fig-bspatial1} \hfill
\includegraphics{graphics/fig-bspatial2}
\caption{Example for interaction surface fitted using \R{bspatial()}. Two
displays of the same function estimate are presented (levelplot: left;
perspective plot: right). True function: $f(x_1, x_2) = \sin(x_1) \cdot
\sin(x_2)$. \label{fig:expl_bspatial}}
\end{figure}
\paragraph{Random Effects}\hfill\\
\\
\noindent To specify random intercept or random slope terms in \textbf{mboost},
one can use a random effects base-learner. Random effects are used to model
subject specific effects (in contrast to the usual -- fixed -- effects which can
be seen as population specific effects). Usually, one is only interested in
modeling the increased variability but not in the effect estimate of a specific
individual as this effect is not transferable. For a comprehensive overview of
random effects models we refer to \citet{PinheiroBates:2000}. Random effects are
modeled as ridge-penalized group-specific effects. One can show that these
coincide with standard random effects. The ridge penalty parameter is then
simply the ratio of the error variance $\sigma^2$ and the random effects
variance $\tau^2$. Hence, the penalty parameter is inversely linked to $\tau^2$
(and the degrees of freedom are directly linked to $\tau^2$). As for all
base-learners, we initially specify a small value for the degrees of freedom,
i.e., we set a small random effects variance (relative to the residual
variance). By repeatedly selecting the base-learner, the boosting algorithm can
then adapt to the ``true'' random effects variance. For more details see
\citet[Web Appendix]{Kneib:Hothorn:Tutz:modelchoice:2009}. We can specify random
effects base-learners with a call to
\begin{Schunk}
\begin{Sinput}
brandom(..., df = 4)
\end{Sinput}
\end{Schunk}
As \R{brandom()} is just a wrapper of \R{bols()} with redefined degrees of
freedom, one can use all arguments that are available for \R{bols()}. To specify
a random intercept for a group variable, say \R{id}, we simply call
\R{brandom(id)} (and could additionally specify the \emph{initial} error
variance via \R{df} or \R{lambda}). To specify a random slope for another
variable, say \R{time}, within the groups of \R{id}, one can use \R{brandom(id,
by = time)}, i.e., the first argument determines the grouping (as for the
random intercept) and the \R{by} argument then determines the variable for which
the effect is allowed to vary between the groups. In the notation of
\textbf{nlme} \citep{nlme2012} and \textbf{lme4} \citep{lme4} the random
intercept would be specified as \R{(1 | id)} and the random slope would be
specified as \R{(time - 1 | id)}, i.e., a random slope without random intercept
term.
\paragraph{Additional Base-learners in a Nutshell}\hfill\\
\noindent Tree-based base-learner (per default stumps) can be included in the
model using the \R{btree()} base-learner
\citep{Hothorn:2006:JCGS,Hothorn+Buehlmann+Kneib+Schmid+Hofner:mboost:2010}.
Note that this is the only base-learner which is not fitted using penalized
least squares. Radial basis functions (e.g., for spatial effects) can be
specified using the base-learner \R{brad()}. Details on this base-learner are
given in \citep{Hofner:PhDThesis:2011}. Monotonic effects of continuous or
ordered categorical variables can be estimated using the \R{bmono()}
base-learner which can also estimate convex or concave effects. Details on
monotonic effect estimation and examples of the estimation of boosting models
with monotonic effects are given in \citet{Hofner:monotonic:2011} in the context
of species distribution models. The base-learner \R{bmrf()} implements Markov
random fields, which can be used for the estimation of spatial effects of
regions \citep{Sobotka:2010,mayr:gamboostlss:2011}. With the base-learner
\R{buser()}, one can specify arbitrary base-learners with quadratic penalty.
This base-learner is dedicated to advanced users only. Additionally, special
concatenation operators for expert users exist in \textbf{mboost} that all
combine two or more base-learners to a single base-learner: \verb_%+%_
additively joins two or more arbitrary base-learners to be treated as one group,
\verb_%X%_ defines a tensor product of two base-learners and
\verb_%O%_ implements the Kronecker product of two base-learners. In all
cases the degrees of freedom increase compared to the degrees of freedom of the
single base-learners (additively in the first, and multiplicatively in the
second and third case). For more details on any of these base-learners and for
some usage examples see the manual.
\subsubsection{Building a Model -- or: How to Combine Different Base-learners}
\label{sec:model_building}
In general, the choice of effect for a given covariate has to be based on the
subject matter knowledge of the involved researchers. There is no general rule
when to use a certain effect. In order to allow researchers to make a decision
on which effect to use we exemplified the different modeling types in graphical
displays (see Figures~\ref{fig:expl_bols} to \ref{fig:expl_bspatial}).
Once this decision is made, it is relatively easy to estimate the model using
the described boosting methodology. Base-learners can be combined in a simple
manner to form the desired model. They are combined in the formula as a sum of
base-learners to specify the desired model. We will show this using a simple toy
example:
\begin{Schunk}
\begin{Sinput}
R> m1 <- gamboost(y ~ bols(x1) + bbs(x2) + bspatial(s1, s2) + brandom(id),
+ data = example)
\end{Sinput}
\end{Schunk}
In this case, a linear effect for \R{x1}, a smooth (P-spline) effect for \R{x2},
a spatial effect for \R{s1} and \R{s2} are specified together with a random
intercept for \R{id}. This model could be further refined and expanded as shown
in the following example:
\begin{Schunk}
\begin{Sinput}
R> m2 <- gamboost(y ~ bols(int, intercept = FALSE) +
+ bols(x1, intercept = FALSE) +
+ bols(x2, intercept = FALSE) +
+ bbs(x2, center = TRUE, df = 1) +
+ bspatial(s1, s2, knots = list(s1 = 10, s2 = 20)) +
+ brandom(id) + brandom(id, by = x1),
+ data = example)
\end{Sinput}
\end{Schunk}
Now, with \R{example\$int = rep(1, length(y))}, we specify a separate intercept
in the model. In the first formula (\R{m1}), the intercept was implicitly
included in the base-learners. Now we allow the boosting algorithm to explicitly
choose and update solely the intercept. Note that therefore we need to remove
the implicit intercept from the base-learner for \R{int} (\R{intercept = FALSE})
as otherwise we would get a linear base-learner with \emph{two} intercepts which
has no unique solution. We can now also remove the intercept from the
base-learner of \R{x1}. This leads to a base-learner of the form $x_1\beta_1$.
For the smooth effect of \R{x2} we now use the decomposition~\eqref{eq:decomp}.
Hence, we specify the unpenalized part as a linear effect without intercept (the
intercept is already specified in the formula) and the smooth deviation from the
linear effect (with one degree of freedom). Now the boosting algorithm can
choose how \R{x2} should enter the model: (i) not at all, (ii) as a linear
effect, (iii) as a smooth effect centered around zero, or (iv) as the
combination of the linear effect and the smooth deviation. In the latter case we
essentially get the same result as from \R{bbs(x2)}. The spatial base-learner in
the second formula (\R{m2}) explicitly specifies the numbers of knots in the
direction of \R{s1} (10 inner knots) and \R{s2} (20 inner knots). This is
achieved by a named list where the names correspond to the names of the
variables. In total we get $10 \cdot 20 = 200$ inner knots. Usually, one
specifies equal numbers of knots in both directions, which requires no named
lists. Instead one can simply specify, e.g., \R{knots = 10}, which results in 10
inner knots per direction (i.e., 100 inner knots in total). Note that, as for
the smooth, univariate base-learner of \R{x2} one could also specify a
decomposition for the spatial base-learner:
\begin{Schunk}
\begin{Sinput}
bols(int, intercept = FALSE) +
bols(s1, intercept = FALSE) + bols(s2, intercept = FALSE) +
bols(s1, by = s2, intercept = FALSE) +
bspatial(s1, s2, center = TRUE, df = 1)
\end{Sinput}
\end{Schunk}
where \R{bols(s1, by = s2, intercept = FALSE)} specifies the interaction of
\R{s1} and \R{s2} (without intercept). Finally, we added a random slope for \R{x1} (in the groups of \R{id}) to \R{m2}.
Note that the groups are the argument of the base-learner and the slope variable
is specified via the \R{by} argument.
\paragraph{Case Study (ctd.): Prediction of Body Fat}\hfill\\
% case2: Case study non-linear base-learner
\noindent Until now, we only included linear effects in the prediction formula
for body fat of women. It might be of interest to evaluate if there exists also
a non-linear relationship between some of the predictors and the DXA measurement
of body fat. To investigate this issue, we fit a model with the same predictors
as in \cite{garcia} but without assuming linearity of the effects. We apply
\R{gamboost()} with the P-spline base-learner \R{bbs()} to incorporate smooth
effects.
<>=
## now an additive model with the same variables as lm1
gam1 <- gamboost(DEXfat ~ bbs(hipcirc) + bbs(kneebreadth) + bbs(anthro3a),
data = bodyfat)
@
Using \R{plot()} on a \R{gamboost} object delivers automatically the partial
effects of the different base-learners:
<>=
par(mfrow = c(1,3)) ## 3 plots in one device
plot(gam1) ## get the partial effects
@
\setkeys{Gin}{width = \textwidth}
\begin{figure}[ht]
\centering
<>=
## same as plot from chunk "plot_gamboost" but with better margins
par(mfrow=c(1,3))
par(mar = c(4, 4, 0, 0) + 0.1)
plot(gam1)
@
\caption{Partial effects of three anthropometric measurements on the body fat of
women. \label{fig:partialplot}}
\end{figure}
\setkeys{Gin}{width = 0.45\textwidth}
\noindent From the resulting Figure~\ref{fig:partialplot}, it seems as if in the
center of the predictor-grid (where most of the observations lie), the
relationship between these three anthropometric measurements and the body fat is
quite close to a linear function. However, at least for \R{hipcirc}, it seems
that there are indications of the existence of a non-linear relationship for
higher values of the predictor.
Alternatively, one could apply decomposition~\eqref{eq:decomp} for each of the 3
base-learners, as described in Section~\ref{sec:model_building}, to distinguish
between modeling alternatives. In this case, we would have a more rigorous
treatment of the decision between (purely) linear and non-linear effects if we
stop the algorithm at an appropriate iteration $m_\text{stop}$.
\subsection{Early Stopping to Prevent Overfitting}
\label{sec:cvrisk}
As already discussed in Section~\ref{sec:theory}, the major tuning parameter of
boosting is the number of iterations $m_{\text{stop}}$. To prevent overfitting
it is important that the optimal stopping iteration is carefully chosen. Various
possibilities to determine the stopping iteration exist. One can use information
criteria such as AIC\footnote{see \R{?AIC.boost} for further details} to find
the optimal model. However, this is usually not recommended as AIC-based
stopping tends to overshoot the optimal $m_{\text{stop}}$ dramatically
\citep[see][]{Hast:comment:2007,Mayr:mstop:2012}. Instead, it is advised to use
cross-validated estimates of the empirical risk to choose an appropriate number
of boosting iterations. This approach aims at optimizing the prognosis on new
data. In \textbf{mboost} infrastructure exists to compute bootstrap estimates,
k-fold cross-validation estimates and sub-sampling estimates. The main function
to determine the cross-validated risk is
\begin{Sinput}
cvrisk(object, folds = cv(model.weights(object)),
papply = mclapply)
\end{Sinput}
In the simplest case, the user only needs to specify the fitted boosting model
as \R{object}. If one wants to change the default cross-validation method
(25-fold bootstrap) the user can specify a weight matrix that determines the
cross-validation samples via the \R{folds} argument. This can be done either by
hand or using the convenience function \R{cv()} (see below). Finally, the user
can specify a function of the \R{lapply} ``type'' to \R{papply}. Per default
this is either \R{mclapply} for parallel computing if package \textbf{multicore}
\citep{Urbanek:2011} is available, or the code is run sequentially using
\R{lapply}. Alternatively, the user can apply other parallelization methods such
as \R{clusterApplyLB} \citep[package \textbf{snow}][]{Tierny:Snow:2011} with
some further effort for the setup (see \R{?cvrisk}).
The easiest way to set up a variety of weight matrices for cross-validation is
the function
\begin{Sinput}
cv(weights, type = c("bootstrap", "kfold", "subsampling"),
B = ifelse(type == "kfold", 10, 25))
\end{Sinput}
One simply specifies the weights of the originally fitted model (e.g. using the
function \R{model.weights()} on the model) and the \R{type} of cross-validation.
This can be either \R{"bootstrap"} (default), \R{"kfold"} (k-fold
cross-validation) or \R{"subsampling"}\footnote{the percentage of observations
to be included in the learning samples for subsampling can be specified using
a further argument in \R{cv()} called \R{prob}. Per default this is 0.5.}. The
number of cross-validation replicates, per default, is chosen to be 10 for
k-fold cross-validation and 25 otherwise. However, one can simply specify
another number of replicates using the argument \R{B}.
To extract the appropriate number of boosting iterations from an object returned
by \R{cvrisk()} (or \R{AIC()}) one can use the extractor function \R{mstop()}.
Once an appropriate number of iterations is found, we finally need to stop the
model at this number of iterations. To increase or reduce the number of boosting
steps for the model \R{mod}, one can use the indexing / subsetting operator
directly on the model:
\begin{Schunk}
\begin{Sinput}
mod[i]
\end{Sinput}
\end{Schunk}
where \R{i} is the number of boosting iterations. Attention, the subset operator
differs in this context from the standard \textsf{R} behavior as it
\emph{directly changes the model} \R{mod}. Hence, there is no need to save
\R{mod} under a new name. This helps to reduce the memory footprint. Be aware
that even if you call something like
\begin{Schunk}
\begin{Sinput}
newmod <- mod[10]
\end{Sinput}
\end{Schunk}
you will change the boosting iteration of \R{mod}! Even more, if you now change
\R{mstop} for \R{newmod}, the model \R{mod} is also changed (and vice versa)!
This said, the good news is that nothing gets lost. If we reduce the model to a
lower value of $m_{\text{stop}}$, the additional boosting steps are kept
internally in the model object. Consider as an example the following scenario:
\begin{itemize}
\item We fit an initial model \R{mod} with \R{mstop = 100}.
\item We call \R{mod[10]}, which sets \R{mstop = 10}.
\item We now increase \R{mstop} to 40 with a call to \R{mod[40]}.
\end{itemize}
This now requires \emph{no re-computation} of the model as internally everything
was kept in storage. Again the warning, if we now extract anything from the
model, such as \R{coef(mod)}, we get the characteristics of the model with 40
iterations, i.e., here the coefficient estimates from the 40th boosting
iteration.
\paragraph{Case Study (ctd.): Prediction of Body Fat}\hfill\\
\noindent Until now, we used the default settings of \R{boost\_control()} with
\R{mstop = 100} for all boosted models. Now we want to optimize this tuning
parameter with respect to predictive accuracy, in order get the best prediction
for body fat. Note that tuning \R{mstop} also leads to models including only the
most informative predictors as variable selection is carried out simultaneously.
We therefore first fit a model with all available predictors and then tune
\R{mstop} by 25-fold bootstrapping. Using the \R{baselearner} argument in
\R{gamboost()}, we specify the default base-learner which is used for each
variable in the formula for which no base-learner is explicitly specified.
<>=
## every predictor enters the model via a bbs() base-learner (i.e., as smooth effect)
gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat,
control = boost_control(trace = TRUE))
@
<>=
## refit model to supress trace in cvrisk
## this is only required for the output in the PDF
gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat)
@
<>=
set.seed(123) ## set seed to make results reproducible
cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation
## if package 'multicore' is not available this will trigger a warning
cvm
plot(cvm) ## get the paths
@
\setkeys{Gin}{width = .45\textwidth}
\begin{figure}[ht]
\centering
<>=
## same as plot from chunk "mod_gamboost2_ctd"with better margins
par(mar = c(4, 4, 0, 0) + 0.1)
plot(cvm, main = "")
@
\caption{Cross-validated predictive risk with 25-fold bootstrapping.\label{fig:cvpaths}}
\end{figure}
\noindent The plot displays the predictive risk on the 25 bootstrap samples for
$m_{\text{stop}} = 1,...,100$ (see Figure~\ref{fig:cvpaths}). The optimal
stopping iteration is the one minimizing the average risk over all 25 samples.
We can extract this iteration via
<>=
mstop(cvm) ## extract the optimal mstop
@
<>=
gam2[ mstop(cvm) ] ## set the model automatically to the optimal mstop
@
<>=
m_cvm <- mstop(cvm)
@
We have now reduced the model of the object \R{gam2} to the one with only \Sexpr{m_cvm}
boosting iterations, without further assignment. However, as pointed out above,
the other iterations are not lost. To check which variables are now included in
the additive predictor we again use the function \R{coef()}:
<>=
## refit model to get trace again
## this is only required for the output in the PDF
gam2 <- gamboost(DEXfat ~ ., baselearner = "bbs", data = bodyfat,
control = boost_control(trace = TRUE))
gam2[ mstop(cvm) ]
@
<>=
n_coef <- length(coef(gam2))
coef_string <-
paste(c("zero", "one", "two", "three", "four",
"five", "six", "seven", "eight", "nine")[n_coef + 1],
if (n_coef == 1) "predictor" else "predictors")
@
<>=
names(coef(gam2)) ## displays the selected base-learners at iteration "mstop(cvm)"
## To see that nothing got lost we now increase mstop to 1000:
gam2[1000, return = FALSE] # return = FALSE just supresses "print(gam2)"
@
Although we earlier had reduced to iteration \Sexpr{m_cvm}, the fitting algorithm started
at iteration 101. The iterations \Sexpr{m_cvm+1}--100 are not re-computed.
<>=
names(coef(gam2)) ## displays the selected base-learners, now at iteration 1000
@
The stopping iteration \R{mstop} is the main tuning parameter for boosting and
controls the complexity of the model. Larger values for \R{mstop} lead to larger
and more complex models, while for smaller values the complexity of the model is
generally reduced. In our example the final model at iteration 1000 includes all
available variables as predictors for body fat, while the model at the optimal
iteration \Sexpr{m_cvm} included only \Sexpr{coef_string}. Optimizing the stopping iteration usually
leads to selecting the most influential predictors.
\subsection{Specifying the Fitting Problem: The \R{family}}
\label{sec:family}
The list of implemented families in \textbf{mboost} is diverse and wide-ranging.
At the time of writing this paper, the user has access to sixteen different
families. An overview is given in Table~\ref{tab:families}. A family (most
importantly) implements the loss function $\rho$ and the corresponding negative
gradient. A careful specification of the loss function leads to the estimation
of any desired characteristic of the conditional distribution of the response.
This coupled with the large number of base learners guarantees a rich set of
models that can be addressed by boosting. We can specify the connection between
the response and the covariates in a fairly modular nature such as
\begin{equation*}
\xi(y|\x) = \hat{f}_1 + \cdots + \hat{f}_P
\end{equation*}
having on the right hand side any desired combination of base learners. On the
left hand side, $\xi(.)$ describes \emph{some} characteristic of the conditional
distribution specified by the \R{family} argument. In the following subsections
we discuss major aspects related to the choice of the family.
\begin{table}[h!]
\begin{threeparttable}[b]
\centering
\caption{An overview on the currently implemented families in \textbf{mboost}.
\label{tab:families}}
\begin{tabular}{llllll}
\toprule
& continuous response & binary response & count data & ordered response & censored data \\
\cmidrule{2-6}
Standard regression & \R{Gaussian} & & & & \\
Median regression & \R{Laplace} & & & & \\
Quantile regression& \R{QuantReg} & & & & \\
Expectile regression & \R{ExprectReg} & & & & \\
Robust regression & \R{Huber} & & & & \\
Gamma regression\tnote{a} & \R{GammaReg} & & & & \\
Logistic regression& & \R{Binomial} & & & \\
AdaBoost classification & &\R{AdaExp} & & & \\
AUC regression & &\R{AUC} & & & \\
Poisson regression & &&\R{Poisson} & & \\
Negative binomial model & &&\R{NBinomial} & & \\
Proportional odds model & &&&\R{ProppOdds} & \\
Proportional hazards model & &&& &\R{CoxPH} \\
Weibull AFT\tnote{b}\;\, model & &&& &\R{Weibull} \\
Log-logistic AFT\tnote{b}\;\, model & &&& &\R{Loglog} \\
Log-normal AFT\tnote{b}\;\, model & &&& &\R{Lognormal} \\
\bottomrule
\end{tabular}
\begin{tablenotes}
\item[a] for non-negative continuous response
\item[b] accelerated failure time
\end{tablenotes}
\end{threeparttable}
\end{table}
\subsubsection{Families for Continuous Response}\label{sec:famcont}
\begin{figure}[t]
\subfigure[\R{family = Gaussian()}
\label{fig:fam_gauss}]{
\includegraphics[page=1]{graphics/fig-family}
}
\subfigure[\R{family = Laplace()}
\label{fig:fam_laplace}]{
\includegraphics[page=2]{graphics/fig-family}
}
\caption{The loss function allows flexible specification of the link between
the response and the covariates. The figure on the left hand side
illustrates the L$_2$ loss (the default in \textbf{mboost}), the figure on
the right hand side shows the L$_1$ loss function.}
\label{fig:family1}
\end{figure}
Until now the focus was on the conditional mean of a continuous response which
is the default setting: \R{family = Gaussian()}. In this case, our assumption is
that $Y|\x$ is normally distributed and the loss function is the negative
Gaussian log-likelihood which is equivalent to the $\text{L}_2$ loss
\begin{equation*}
\rho(y,f) =\frac{1}{2}(y-f)^2
\end{equation*}
(see Figure~\ref{fig:fam_gauss}). A plain \R{Gaussian()} call in the console
returns its definition
\begin{Sinput}
R> Gaussian()
\end{Sinput}
\begin{Soutput}
Squared Error (Regression)
Loss function: (y - f)^2
\end{Soutput}
The corresponding negative gradient is simply $(y-f)$ whose definition
can be displayed on the screen via
\begin{Sinput}
R> slot(Gaussian(),"ngradient")
\end{Sinput}
\begin{Soutput}
function (y, f, w = 1)
y - f
\end{Soutput}
If we are interested in the median of the conditional distribution, the
\R{Laplace()} family is the right choice. It implements a distribution free,
median regression approach especially useful for long-tailed error
distributions. In this case, we use the L$_1$ loss defined as
\begin{equation*}
\rho(y,f) = |y-f|
\end{equation*}
and shown in Figure~\ref{fig:fam_laplace}. Note that the L$_1$ loss is not
differentiable at $y = f$ and the value of the negative gradient at such points
is fixed at zero.
\begin{figure}[t]
\subfigure[\R{family = Huber()}
\label{fig:fam_huber}]{
\includegraphics[page=3]{graphics/fig-family}
}
\subfigure[\R{family = QuantReg()}
\label{fig:fam_quantreg}]{
\includegraphics[page=4]{graphics/fig-family}
}
\caption{The Huber loss function on the left hand side is useful when
robustness is a concern. In \textbf{mboost} it adaptively changes the
limit for L$_1$ penalization of outliers when \R{d = NULL} (the default).
The figure on right hand side illustrates several examples of the check
function loss with different quantiles (\R{tau = 0.5} is the default).}
\label{fig:family2}
\end{figure}
A compromise between the L$_1$ and the L$_2$ loss is the Huber loss function
shown in Figure~\ref{fig:fam_huber}. It is defined as
\begin{equation*}
\rho(y,f; \delta) =
\begin{cases}
(y-f)^2/2 & \text{if $|y - f| \le \delta$,}\\
\delta(|y-f|-\delta /2) &\text{if $|y - f| > \delta$}
\end{cases}
\end{equation*}
where the parameter $\delta$ limits the outliers which are subject to absolute
error loss. The Huber loss can be seen as a robust alternative to the L$_2$
loss. The user can either specify $\delta$ subjectively, e.g. \R{Huber(d = 2)},
or leave it adaptively chosen by the boosting algorithm (the default behaviour).
An adaptive specification of $\delta$, proposed by \citet{friedman2001}, means
that each boosting step produces a new $\delta^{[m]}$ matching the actual median
of the absolute values of the residuals, i.e.
\begin{equation*}
\delta^{[m]} = \text{median}\left(\Bigl|y_i - \hat{f}^{[m-1]}(x_i)\Bigr|, i=1,\ldots,n
\right).
\end{equation*}
Another alternative for settings with continuous response is modeling
conditional quantiles through quantile regression \citep{Koen2005} --
implemented in \textbf{mboost} with the \R{QuantReg()} family
\citep{Fenske:2011}. The main advantage of quantile regression is (beyond its
robustness towards outliers) that it does not rely on any distributional
assumptions on the response or the error terms. The appropriate loss function
here is the check-function shown in Figure~\ref{fig:fam_quantreg}. For the
special case of the 0.5 quantile both \R{QuantReg(0.5)} and \R{Laplace()} will
lead to median regression. A detailed description of the loss function of
quantile regression is given in the Appendix.
\paragraph{Case Study (ctd.): Prediction of Body Fat}\hfill\\
We reproduce the model of the original publication \citep{garcia}, but instead
of modeling the mean we focus on the median:
<>=
## Same model as glm1 but now with QuantReg() family
glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat,
family = QuantReg(tau = 0.5), control = boost_control(mstop = 500))
coef(glm3, off2int = TRUE)
@
Comparing the results to those of model \R{glm1} shows that \R{hipcirc} and
\R{anthro3a} have almost the same influence on the mean as on the median, yet,
the magnitude of the effect of \R{kneebreadth} is considerably smaller in
\R{glm3}. One should note that \R{mstop} generally needs to be larger for
quantile regression, as the single updates are smaller than in the mean
regression case. For a discussion see the Appendix.
\subsubsection{Families for Binary Response}
Analogously to Gaussian regression, the probability parameter of a binary
response can be estimated by minimizing the negative binomial log-likelihood
\begin{align}
\label{eq:binary}
\rho(y, f) & = -\left[y \:\log\bigl( \pi(f) \bigr) + (1-y)\:\log\bigl(
1-\pi(f) \bigr)\right] \notag \\
& = \log(1 + \exp(-2\,\tilde{y}f))
\end{align}
where $\tilde{y} = 2 y -1$ and $\pi(f) = \mathbb{P}(Y = 1| \x )$. For reasons of
computational efficiency the binary response $y \in\{0,1\}$ is converted into
$\tilde{y} = 2y -1$ where $\tilde{y} \in \{-1,1\}$. In
equation~\eqref{eq:binary}, $\tilde{y}f$ are the so-called margin values
(depicted in Figure~\ref{fig:fam_binary}) which are, roughly speaking, the
equivalent of the continuous residuals $y-f$ for the binomial case. This
internal re-coding\footnote{Note that in \textbf{mboost} the response must be
specified as a binary \R{factor}.} means that the negative binomial
log-likelihood loss (\R{family = Binomial()}) and the exponential loss
(\R{family = AdaExp()}) coincide in their population minimizer
\citep[see][Section 3]{BuhlmannHothorn06}.
Note that the transformation $\tilde{y} = 2y -1$ changes the interpretation of
the effect estimates because now we get the half of the log-odds ratio. One
implication is that the \R{coef()} output is half the estimates that result
from \R{glm()}. This means that the user has to double the coefficients
\emph{manually} in order to get the final standard estimates of a logistic
regression model.
\begin{figure}[t]\centering
\includegraphics[page=5]{graphics/fig-family}
\caption{The \R{Binomial} and the \R{AdaExp} families as functions of the
marginal values $\tilde{y}f$. Since $\tilde{y} \in \{-1,1\}$, a positive
product between $\tilde{y}$ and half the estimated log-odds ratio $f$
means correct classification.}
\label{fig:fam_binary}
\end{figure}
However, \textbf{mboost} automatically doubles the logits prior to the
reverse probability transformation. This means that calling \mbox{\R{predict(fit, newdata,
type = "response")}} produces the \emph{final} probability estimations. In
addition to the logit model, the user can also estimate probit models using
\R{family = Binomial(link = "probit")}.
Alternatively one can also use the exponential loss function $\rho(y,f) =
\exp(-\tilde{y}f)$ (\R{family = AdaExp()}). This essentially leads to the famous
AdaBoost algorithm by \citet{nr:freund.schapire:1996}. As can be seen in
Figure~\ref{fig:fam_binary}, this loss function is similar to \R{Binomial()}.
\subsubsection{Families for Count Data and Censored Response}
In \textbf{mboost}, currently two families handle data with count response. The
\R{Poisson()} family uses the negative Poisson log-likelihood with the natural
link function $\log(\mu) = \eta$. Alternatively, the negative binomial
distribution can be used to model overdispersed data. The
negative log-likelihood density of this distribution is implemented in
\R{NBinomial(nuirange = c(0, 100))} where the parameter
\R{nuirange} (accounting for overdispersion) is optimized additionally within each boosting iteration $m$. One
simply minimizes the empirical risk $\mathcal{R}$ over the overdispersion
parameter given the current boosting estimate $\hat{f}^{[m]}$ after step~4 of
the component-wise gradient boosting algorithm. A thorough introduction and the
detailed algorithm is given by \citet{Schmid:Multidim:2010}.
Survival models can also be considered in \textbf{mboost}: \R{CoxPH()},
\R{Weibull()}, \R{Loglog()} and \R{Lognormal()} all implement families for
censored data. \R{CoxPH()} implements the Cox proportional hazards model while
the other three families specify accelerated failure time (AFT) models
\citep[see][for further details]{Schmid:Hothorn:AFT-boost}.
\subsubsection{Further Families}
Additionally to the families discussed above, \textbf{mboost} implements some
further families: \R{AUC()} can be used to optimize the area under the ROC
curve, \R{GammaReg()} implements the negative Gamma log-likelihood with
logarithmic link function, \R{ExpectReg()} implements expectile regression
\citep{Sobotka:2010} and \R{PropOdds()} leads to proportional odds models for
ordinal outcome variables \citep{Schmid:2010:PropOdds}.
Despite the wide range of available families, users might wish to implement new
loss functions. One only needs a differentiable loss function and the
corresponding negative gradient $\delta \rho(y, f)\, / \, \delta f$, which both
are independent of the base-learners. Using the constructor function
\R{Family()}, one can then easily define new families and thus new estimation
problems as we show in the Appendix.
\section{Summary}
\label{sec:summary}
The \textsf{R} package \textbf{mboost} offers an easy entry into the world of
boosting. It implements a model-based boosting approach that results in
interpretable structured additive models of the same form most researchers will
feel familiar with. The interfaces of fitting functions are quite similar to
standard implementations like \R{lm()} or \R{glm()} and are hence relatively
easy to use. However, the fitting algorithms of \textbf{mboost} additionally
offer a high flexibility when it comes to the effect type of potential
predictors and the type of risk function to be optimized. There exist a large
number of pre-defined families for various risk functions as well as a large
number of pre-defined base-learners to specify various types of effects. As
\textbf{mboost} has a modular nature, both can be combined in any form as
desired by the user: For many model classes, \textbf{mboost} therefore offers
much more modeling alternatives than the classical fitting algorithms.
Additionally, the user can also easily extend \textbf{mboost} by implementing
new families or base-learners.
As seen in the case study, many functions for the manipulation and extraction of
the results are available. These allow the user to fit, tune and finally
interpret the model. We note that the present tutorial has been designed as an
introduction to the basic functionalities of the \textbf{mboost} package,
highlighting its usage from a practical perspective. Readers who are interested
in further illustrations, as well as in a more technical description of the
\textbf{mboost} package, are referred to the package manual and the vignettes
that are found at \url{http://cran.r-project.org/package=mboost} or in
\textsf{R} via \R{vignette(package = "mboost")}.
\bibliographystyle{abbrvnat}
\bibliography{mboost_tutorial} % name your BibTeX data base
\clearpage
\section*{Appendix: Building Your Own Family}
Via the constructor function \R{Family()}, in \textbf{mboost} there exists an
easy way for the user to set up new families. The main required arguments are
the \R{loss} to be minimized and the negative gradient (\R{ngradient}) of the
loss. The risk is then commonly defined as the sum of the loss over all
observations.
\begin{Schunk}
\begin{Sinput}
Family(ngradient, loss = NULL, risk = NULL, offset = function(y, w)
optimize(risk, interval = range(y), y = y, w = w)$minimum, ...)
\end{Sinput}
\end{Schunk}
% just to end the highlighting of math mode in latex $
We will demonstrate the usage of this function by (re-) implementing the family
to fit quantile regression (the pre-defined family is \R{QuantReg()}). In
contrast to standard regression analysis, quantile regression \citep{Koen2005}
does not estimate the conditional expectation of the conditional distribution
but the conditional quantiles. Estimation is carried out by minimizing the check
function $\rho_{\tau}(\cdot)$:
$$
\rho_{\tau}(y_i - f_{\tau i} ) = \left\{\begin{array}{ll}
(y_i - f_{\tau i} ) \cdot\tau & \quad (y_i - f_{\tau i} ) \geq 0 \\
(y_i - f_{\tau i} ) \cdot(\tau-1) & \quad (y_i - f_{\tau i} ) <0,
\end{array} \right.
$$
which is depicted in Figure~\ref{fig:fam_quantreg}. The \R{loss} for our new
family is therefore given as:
<>=
loss = function(y, f) tau * (y - f) * ((y - f) >= 0) +
(tau - 1) * (y - f) * ((y - f) < 0)
@
The check-function is not differentiable at the point 0. However in practice, as
the response is continuous, we can ignore this by defining:
$$ - \frac{\partial \rho_{\tau}(y_i, f_{\tau i})}{\partial f} =
\left\{\begin{array}{ll}
\tau & (y_i - f_{\tau i}) \geq 0 \\
\tau-1 & (y_i - f_{\tau i}) <0.
\end{array} \right. $$
The negative gradient of our loss is therefore\footnote{The unused
weights argument \R{w} is required to exist by \textbf{mboost} when the
function is (internally) called. It is hence 'specified' as \R{NULL}.}:
<>=
ngradient = function(y, f, w = NULL) tau * ((y - f) >= 0) +
(tau- 1) * ((y - f) < 0)
@
Of further interest is also the starting value for the algorithm, which is
specified via the \R{offset} argument. For quantile regression it was
demonstrated that the offset may be set to the median of the response
\citep{Fenske:2011}. With this information, we can already specify our new family
for quantile regression:
<>=
OurQuantReg <- function(tau = 0.5){ ## function to include dependency on tau
Family( ## applying the Family function
loss = function(y, f) ## loss as above
tau * (y - f) * ((y - f) >= 0) +
(tau - 1) * (y - f) * ((y - f) < 0) ,
ngradient = function(y, f, w = NULL) ## ngradient as above
tau * ((y - f) >= 0) + (tau - 1) * ((y - f) < 0),
offset = function(y, w = NULL) ## median as offset
quantile(y, p = 0.5),
name = "Our new family for quantile regression" )}
OurQuantReg()
@
\paragraph{Case Study (ctd.): Prediction of Body Fat}\hfill\\
To try our new family we go back to the case study regarding the prediction of
body fat. First, we reproduce the model for the median, computed with the
pre-defined \R{QuantReg()} family (see Section~\ref{sec:famcont}), to show that
our new family delivers the same results:
<>=
## Same model as glm3 but now with our new family
glm3b <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat,
family = OurQuantReg(tau = 0.5),
control = boost_control(mstop = 500))
identical(coef(glm3b), coef(glm3))
@
To get a better idea of the shape of the conditional distribution we model the
median, and the 0.05 and 0.95 quantiles in a small, illustrative example
containing only the predictor \R{hipcirc}:
<>=
glm4a <- glmboost(DEXfat ~ hipcirc, family = OurQuantReg(tau = 0.05), data = bodyfat,
control = boost_control(mstop = 2000))
glm4b <- glmboost(DEXfat ~ hipcirc, family = OurQuantReg(tau = 0.5), data = bodyfat,
control = boost_control(mstop = 2000))
glm4c <- glmboost(DEXfat ~ hipcirc, family = OurQuantReg(tau = 0.95), data = bodyfat,
control = boost_control(mstop = 2000))
@
Note that for different quantiles, fitting has to be carried out separately, as
$\tau$ enters directly in the loss. It is also important that fitting quantile
regression generally requires higher stopping iterations than standard
regression with the $L_2$ loss, as the negative gradients which are fitted to
the base-learners are vectors containing only small values, i.e., $\tau$ and
$1-\tau$.
<>=
ord <- order(bodyfat$hipcirc) ## order the data to avoid problems when plotting
plot(bodyfat$hipcirc[ord], bodyfat$DEXfat[ord]) ## observed data
lines(bodyfat$hipcirc[ord], fitted(glm4a)[ord], lty = 2, lwd = 2) ## 0.05 quantile
lines(bodyfat$hipcirc[ord], fitted(glm4b)[ord], lty = 1, lwd = 2) ## median
lines(bodyfat$hipcirc[ord], fitted(glm4c)[ord], lty = 2, lwd = 2) ## 0.95 quantile
@
\setkeys{Gin}{width = .45\textwidth}
\begin{figure}[ht]
\centering
<>=
## same plot but with better margings and parameters
par(mar = c(4, 4, 0, 0) + 0.1)
plot(bodyfat$hipcirc[ord], bodyfat$DEXfat[ord])
lines(bodyfat$hipcirc[ord], fitted(glm4a)[ord], lty=2,lwd=2)
lines(bodyfat$hipcirc[ord], fitted(glm4b)[ord], lty=1, lwd=2)
lines(bodyfat$hipcirc[ord], fitted(glm4c)[ord], lty=2, lwd=2)
@
\caption{Resulting quantile regression lines, for the median (solid line) and the 0.95 and 0.05 quantiles (upper and lower dashed lines). \label{fig:quantreg}}
\end{figure}
\noindent The resulting plot (see Figure~\ref{fig:quantreg}) shows how quantile regression
can be used to get a better impression of the whole conditional distribution
function in a regression setting. In this case, the upper and lower quantiles
are not just parallel lines to the median regression line but adapt nicely to
the slight heteroscedasticity found in this data example: For smaller values of
\R{hipcirc} the range between the quantiles is smaller than for higher values.
Note that the outer quantile-lines can be interpreted as prediction intervals for new
observations \citep{Meins2006, MayrPI}. For more on quantile regression in the context of
boosting we refer to \cite{Fenske:2011}.
\end{document}
%% not part of the document; only used to produce some of the figures;
<>=
################################################################################
## the following chunks produce the graphics that are NOT part of the bodyfat ##
## example ##
################################################################################
@
\SweaveOpts{echo=false, results=hide, fig=true, prefix.string=graphics/fig}
<>=
red <- rgb(103,0,31, max = 255) ## define red color
@
<>=
## load library mboost
library("mboost")
library("RColorBrewer")
## define colors
cols <- paste(brewer.pal(11,"RdBu")[c(10,2)], "E6", sep="")
## create graphics to show importance of centering
set.seed(1907)
x <- rnorm(50, mean = 5)
y <- rnorm(50, mean = x, sd = 0.3)
## subtract offset as usual in boosting
y <- y - mean(y)
par(ps = 8, cex=1, cex.lab=1, mar=c(3.1,3,0.5,0.1), mgp=c(2,1,0))
xrange <- range(0,x)
plot(y ~ x, xlim = xrange, cex = 1,
xlab = "x", ylab = "negative gradient in step m = 1",
pch = 20,
col = rgb(0.5, 0.5, 0.5, alpha = 0.8))
abline(h = 0, col = "gray", lwd = 0.5)
abline(v = 0, col = "gray", lwd = 0.5)
abline(lm(y ~ x -1), col = cols[1], lwd = 1.5)
points(0, 0, col = cols[2], lwd = 1.5)
points(mean(x), mean(y), col = cols[2], pch = 3, lwd = 1.5)
legend(0.1, 2.35, legend = c("origin", "center of data", "base-learner"),
pch = c(21, 3, -1), lty = c(-1, -1, 1),
col = c(cols[2], cols[2], cols[1]), lwd = 1.5, bg = "white", bty = "n")
@
<>=
## create graphics to show importance of centering
set.seed(1907)
x <- rnorm(50, mean = 5)
y <- rnorm(50, mean = x, sd = 0.3)
## subtract offset as usual in boosting
y <- y - mean(y)
## figure with centering of x
mx <- mean(x)
xrange <- range(0,x)
x <- x - mean(x)
par(ps = 8, cex=1, cex.lab=1, mar=c(3.1,3,0.5,0.1), mgp=c(2,1,0))
plot(y ~ x, xlim = xrange - mx, cex = 1,
xlab = "x (centered)", ylab = "negative gradient in step m = 1",
pch = 20,
col = rgb(0.5, 0.5, 0.5, alpha = 0.8))
abline(h = 0, col = "gray", lwd = 0.5)
abline(v = 0, col = "gray", lwd = 0.5)
abline(lm(y ~ x -1), col = cols[1], lwd = 1.5)
points(0, 0, col = cols[2], lwd = 1.5)
points(mean(x), mean(y), col = cols[2], pch = 3, lwd = 1.5)
@
<>=
### Simulate some data
set.seed(1907)
n <- 100
x1 <- rnorm(n)
x2 <- as.factor(sample(0:3, n, replace = TRUE))
y <- 0.5 * x1 + rnorm(n)
mod <- gamboost(y ~ bols(x1), control = boost_control(mstop = 25))
par(mar = c(4, 4, 0, 0) + 0.1)
plot(sort(x1), (0.5 * x1)[order(x1)], type = "l", lwd = 2,
xlab = expression(x[1]), ylab = expression(f(x[1])))
lines(sort(x1), fitted(mod, which = 1)[order(x1)], col = red,
lwd = 2, type = "b", pch = 20)
legend("topleft", c("true effect", "model"),
lty = c(1, 1), pch = c(-1, 20), merge = TRUE,
lwd = 2,
col = c("black", red), bty = "n")
@
<>=
beta <- c(0, -1, 0.5, 3)
y <- drop(model.matrix(~ x2) %*% beta + rnorm(n, sd = 0.3))
mod <- gamboost(y ~ bols(x2), control = boost_control(mstop = 50))
par(mar = c(4, 4, 0, 0) + 0.1)
betaPred <- coef(mod)[[1]]
betaPred[1] <- betaPred[1] + attr(coef(mod), "offset")
betaTrue <- c(0, -1, 0.5, 3)
plot(1:4, betaPred, type = "n", xaxt = "n",
xlab = expression(x[2]), ylab = expression(f(x[2])),
xlim = c(0.5, 4.5), ylim = c(-1.5, 3.5))
axis(1, at = 1:4, labels = expression(x[2]^(1), x[2]^(2), x[2]^(3), x[2]^(4)))
for (i in 1:4)
lines(i + c(-0.38, 0, 0.38), rep(betaPred[i], each = 3),
lwd = 2, col = red, type = "b", pch = 20)
for (i in 1:4)
lines(i + c(-0.4, 0.4), rep(betaTrue[i], each = 2),
lwd = 2, col = "black")
legend("topleft", c("true effect", "model"),
pch = c(-1,20), lty = c(1, 1), lwd = 2,
col = c("black", red), bty = "n")
@
<>=
set.seed(1907)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n) + 0.25 * x1
x3 <- as.factor(sample(0:1, n, replace = TRUE))
x4 <- gl(4, 25)
y <- 3 * sin(x1) + x2^2 + rnorm(n)
## specify knot locations for bbs(x2, ...)
knots.x2 <- quantile(x2, c(0.25, 0.5, 0.75))
## fit the model
mod <- gamboost(y ~ bbs(x1, knots = 20, df = 4) +
bbs(x2, knots = knots.x2, df = 5) +
bols(x3) + bols(x4))
## plot partial effects
par(mar = c(4, 4, 0, 0) + 0.1)
plot(sort(x1), (3 * sin(x1))[order(x1)], type = "l", lwd = 2,
xlab = expression(x[1]), ylab = expression(f[1](x[1])))
lines(sort(x1), fitted(mod, which = 1)[order(x1)], col = red, lwd = 2,
type = "b", pch = 20)
legend("topleft", c("true effect", "model"),
lty = c(1, 1), pch = c(-1,20), merge = TRUE, lwd = 2,
col = c("black", red), bty = "n")
@
<>=
par(mar = c(4, 4, 0, 0) + 0.1)
plot(sort(x2), (x2^2)[order(x2)], type = "l", lwd = 2,
xlab = expression(x[2]), ylab = expression(f[2](x[2])))
## offset needed to conserve correct level (otherwise center both effects around
## zero to make them comparable)
lines(sort(x2), fitted(mod, which = 2)[order(x2)] + mod$offset, col = red, lwd = 2,
type = "b", pch = 20)
@
<>=
### example for cyclic spline
set.seed(1907)
true_f <- function(x)
cos(x) + 0.25 * sin(4 * x)
x <- runif(150, 0, 2 * pi)
y <- rnorm(150, mean = true_f(x), sd=0.1)
newX <- seq(0, 2*pi, length = 200)
mod <- gamboost(y ~ bbs(x, knots = 20, degree = 4))
mod[3000]
mod_cyclic <- gamboost(y ~ bbs(x, cyclic=TRUE, knots = 20,
degree = 4, boundary.knots=c(0, 2*pi)))
mod_cyclic[3000]
par(mar = c(4, 4, 0, 0) + 0.1)
plot(x,y, ylab = "f(x)",
pch = 20, xlim = c(0, 7),
col = rgb(0.5, 0.5, 0.5, alpha = 0.8))
lines(newX, predict(mod, data.frame(x = newX)),
col = "black", lwd = 2)
lines(newX + 2 * pi, predict(mod, data.frame(x = newX)),
col = "black", lwd = 2)
legend("bottomleft", c("cyclic = FALSE"),
lty = c(1), col = c("black"), lwd = 2, bty = "n")
@
<>=
par(mar = c(4, 4, 0, 0) + 0.1)
plot(x, y, ylab = "f(x)",
pch = 20, xlim = c(0, 7),
col = rgb(0.5, 0.5, 0.5, alpha = 0.8))
lines(newX, predict(mod_cyclic, data.frame(x = newX)),
col = red, lwd = 2)
lines(newX + 2 * pi, predict(mod_cyclic, data.frame(x = newX)),
col = red, lwd = 2)
abline(v = 2 * pi, col = "gray", lty = "dashed", lwd = 2)
legend("bottomleft", c("cyclic = TRUE"),
lty = c(1), col = c(red), lwd = 2, bty = "n")
@
<>=
set.seed(1907)
x1 <- runif(250,-pi,pi)
x2 <- runif(250,-pi,pi)
y <- sin(x1) * sin(x2) + rnorm(250, sd = 0.4)
modspa <- gamboost(y ~ bspatial(x1, x2, knots = list(x1 = 12, x2 = 12)))
# possible to specify different knot mesh for x1 and x2:
# modspa <- gamboost(y ~ bspatial(x1, x2, knots = list(x1 = 12, x2 = 20)))
library(lattice)
library(RColorBrewer)
## set up colorscheme
nCols <- 50
cols <- colorRampPalette(rev(brewer.pal(11, "RdBu")), space = "Lab",
interpolate = "spline")(nCols)
## make new data on a grid:
nd <- expand.grid(x1 = seq(-pi, pi, length = 100),
x2 = seq(-pi, pi, length = 100))
preds <- predict(modspa, newdata = nd)
breaks <- seq(min(preds), max(preds), length = nCols + 1)
print(levelplot(preds ~ nd$x1 + nd$x2,
xlab = expression(x[1]),
ylab = expression(x[2]),
at = breaks,
col.regions = cols, cex = 0.7,
colorkey = list(space = "left", at = breaks, width = 1)))
@
<>=
x1 <- x2 <- seq(-pi, pi, length = 50)
nd <- expand.grid(x1 = x1,
x2 = x2)
preds <- predict(modspa, newdata = nd)
z <- matrix(preds, ncol = length(x1))
nrz <- nrow(z)
ncz <- ncol(z)
jet.colors <- colorRampPalette(paste(brewer.pal(11,"RdBu")[c(10,2)],
"E6", sep=""))
nbcol <- 10
color <- jet.colors(nbcol)
zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]
facetcol <- cut(zfacet, nbcol)
par(mar=c(1, 0.1, 0.1, 0.1), mgp = c(1.8,1,0))
persp(x1, x2, z,
theta = 45, phi = 30, expand = 0.5,
col = color[facetcol], ticktype = "detailed",
nticks = 3,
xlab = "x1", ylab = "x2")
@
<>=
pdf("./graphics/fig-family.pdf", width = 5, height = 4)
par(mar = c(4, 4, 0, 0) + 0.1)
## Gaussian
x <- seq(-2, 2, length = 200)
plot(x,x^2, type = "l", ylab = expression(rho), xlab = expression(y-f))
## Laplace
x <- seq(-2, 2, length = 200)
lossL1 <- function(x, param) abs(x)
mp <- 0.5
dat <- data.frame(x = rep(x, length(mp)),
y = as.numeric(sapply(mp, function(z) lossL1(x, param = z))),
param = rep(mp, each = length(x)))
dat$tau <- factor(dat$param)
plot(x, dat$y, type = "l", ylab = expression(rho), xlab = expression(y-f))
## Huber
x <- seq(-2, 2, length = 200)
lossH <- function(x, param) ifelse(abs(x) <= param, (1/2) * x^2, param * (abs(x) - param / 2))
mp <- c(0.2, 0.5, 1, 2, 10)
dat <- data.frame(x = x, sapply(mp, function(z) lossH(x, param = z)))
plot(x, dat$X1, type = "l", ylim = c(0,2), ylab = expression(rho),
xlab = expression(y-f))
legend(0, 2, xjust = 0.5, legend = paste(mp), title = expression(delta),
lty = 1, col = c(1,3,4,5,6), cex = 0.6, box.col = "gray")
for(i in 1:(length(mp)-1)){
lines(x, dat[,i+2], col = i+2)
}
## Quantile
x <- seq(-2, 2, length = 200)
lossL1 <- function(x, param) ifelse(x >= 0, param * x, (param - 1) * x)
mp <- c(5:9 / 10)
dat <- data.frame(x = x, sapply(mp, function(z) lossL1(x, param = z)))
plot(x, dat$X1, type = "l", ylab = expression(rho), ylim = c(0,2),
xlab = expression(y-f))
legend(0,2, xjust = 0.5, legend = paste(mp), title = expression(tau),
lty = 1, col = c(1,3,4,5,6), cex = 0.6, box.col = "gray")
for(i in 1:(length(mp)-1)){
lines(x, dat[,i+2], col = i+2)
}
## Binary
x <- seq(-2, 2, length = 200)
dat <- data.frame(x = x, Binomial = log(1+exp(-2*x), base=2), AdaExp = exp(-x))
plot(x, dat$Binomial, type = "l", lty = 1, col = 1, xlab = expression(tilde(y) * f),
ylab = expression(rho), ylim = c(0,7.5))
lines(x, dat$AdaExp, lty = 2, col = 2)
legend("topright", legend = c("Binomial","AdaExp"), xjust = 0.5, title = "loss",
lty = 1:2, col = 1:2, cex = 0.8, bty = "n")
dev.off()
@