\documentclass{article}
\usepackage{amstext}
\usepackage{amsfonts}
\usepackage{hyperref}
\usepackage[round]{natbib}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{rotating}
%%\usepackage[nolists]{endfloat}
%%\usepackage{Sweave}
%%\VignetteIndexEntry{mboost Illustrations}
%%\VignetteDepends{mboost, survival, TH.data}
\newcommand{\Rpackage}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\Roperator}[1]{\texttt{#1}}
\newcommand{\Rarg}[1]{\texttt{#1}}
\newcommand{\Rlevel}[1]{\texttt{#1}}
\newcommand{\RR}{\textsf{R}}
\renewcommand{\S}{\textsf{S}}
\newcommand{\df}{\mbox{df}}
\RequirePackage[T1]{fontenc}
\RequirePackage{graphicx,ae,fancyvrb}
\IfFileExists{upquote.sty}{\RequirePackage{upquote}}{}
\usepackage{relsize}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{baselinestretch=1}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier,
baselinestretch=1,
fontshape=it,
fontsize=\relsize{-1}}
\DefineVerbatimEnvironment{Scode}{Verbatim}{}
\newenvironment{Schunk}{}{}
\renewcommand{\baselinestretch}{1}
\SweaveOpts{engine=R,eps=FALSE}
\hypersetup{%
pdftitle = {mboost Illustrations},
pdfsubject = {package vignette},
pdfauthor = {Torsten Hothorn and Peter Buhlmann},
%% change colorlinks to false for pretty printing
colorlinks = {true},
linkcolor = {blue},
citecolor = {blue},
urlcolor = {red},
hyperindex = {true},
linktocpage = {true},
}
\begin{document}
\setkeys{Gin}{width=\textwidth}
\title{\Rpackage{mboost} Illustrations}
\author{Torsten Hothorn$^1$ and Peter B\"uhlmann$^2$}
\date{}
\maketitle
\noindent$^1$ Institut f\"ur Statistik \\
Ludwig-Maximilians-Universit\"at M\"unchen \\
Ludwigstra{\ss}e 33, D-80539 M\"unchen, Germany \\
\texttt{Torsten.Hothorn@R-project.org}
\newline
\noindent$^2$ Seminar f{\"u}r Statistik \\
ETH Z\"urich, CH-8092 Z{\"u}rich, Switzerland \\
\texttt{buhlmann@stat.math.ethz.ch}
\newline
\section{Illustrations}
This document reproduces the data analyses presented in
\cite{BuhlmannHothorn06}. For a description of the theory behind
applications shown here we refer to the original manuscript. The results
differ slightly due to technical changes or bugfixes in \textbf{mboost}
that have been implemented after the paper was printed. Most important,
\texttt{gamboost} uses penalized $B$-splines instead of smoothing
splines as baselearners. The computations are much faster
and the results differ only slightly \citep{Schmid+Hothorn:2008b}.
\SweaveOpts{prefix.string=figures/BH}
\paragraph{Illustration: Prediction of total body fat}
<>=
source("setup.R")
@
\citet{garcia2005} report on the development of predictive regression equations
for body fat content by means of $p = 9$ common anthropometric
measurements which were obtained for $n = 71$ healthy German women.
In addition, the women's body composition was measured by
Dual Energy X-Ray Absorptiometry (DXA). This reference method
is very accurate in measuring body fat but finds little applicability
in practical environments, mainly because of high costs and the
methodological efforts needed. Therefore, a simple regression equation
for predicting DXA measurements of body fat is of special interest for the practitioner.
Backward-elimination was applied to select
important variables from the available anthropometrical measurements and
\citet{garcia2005} report a final linear model utilizing
hip circumference, knee breadth and a compound covariate which is defined as
the sum of log chin skinfold, log triceps skinfold and log subscapular skinfold:
<>=
bf_lm <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat)
coef(bf_lm)
@
A simple regression formula which is easy to communicate, such as a
linear combination of only a few covariates, is of special interest in this
application: we employ the \Rcmd{glmboost} function from package
\Rpackage{mboost}
to fit a linear regression model by means of $L_2$Boosting with componentwise
linear least squares. By default, the function \Rcmd{glmboost} fits a linear model (with
initial $m_\text{stop} = 100$ and shrinkage parameter $\nu = 0.1$) by
minimizing squared error (argument \Rcmd{family = Gaussian()} is
the default):
<>=
bf_glm <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)
@
Note that, by default, the mean of the response variable is used as an
offset in the first step of the boosting algorithm. We center the covariates
prior to model fitting in addition.
As mentioned above, the special form of the base learner,
i.e., componentwise linear least squares,
allows for a reformulation of the boosting fit in terms of a linear combination
of the covariates which can be assessed via
<>=
coef(bf_glm)
@
\setkeys{Gin}{height=0.9\textheight, keepaspectratio}
\begin{figure}
\begin{center}
<>=
load(system.file("cache/bodyfat_benchmarks.rda", package = "mboost"))
aic <- AIC(bf_glm)
pdf("figures/bodyfat_glmboost-bodyfat-oob-plot.pdf", version = "1.4", width = 6, height = 10)
par(mai = par("mai") * c(1, 1, 0.5, 1))
mopt <- grid[which.min(colMeans(boob))]
layout(matrix(1:2, nrow = 2))
perfplot(boob, grid, ylab = "Out-of-bootstrap squared error",
xlab = "Number of boosting iterations", alpha = 0.05)
abline(h = mean(boobrest), lty = 2)
lines(c(which.min(colMeans(boob)), which.min(colMeans(boob))),
c(0, min(colMeans(boob))), lty = 2)
points(which.min(colMeans(boob)), min(colMeans(boob)))
plot(aic, ylim = c(3, 5.5))
dev.off()
@
\includegraphics{figures/bodyfat_glmboost-bodyfat-oob-plot.pdf}
\caption{\Robject{bodyfat} data: Out-of-bootstrap squared error for varying
number of
boosting iterations $m_\text{stop}$ (top). The dashed horizontal line
depicts the average out-of-bootstrap error of the linear model
for the pre-selected variables \Robject{hipcirc},
\Robject{kneebreadth}
and \Robject{anthro3a} fitted via ordinary least squares.
The lower part shows the corrected AIC criterion.
\label{bodyfat-oob-plot}}
\end{center}
\end{figure}
We notice that most covariates have been used for fitting
and thus no extensive variable selection was performed in the above model.
Thus, we need to investigate how many boosting iterations are appropriate. Resampling
methods such as cross-validation or the bootstrap can be used to estimate
the out-of-sample error for a varying number of boosting iterations. The
out-of-bootstrap mean
squared error for $100$ bootstrap samples is depicted in the upper part of
Figure~\ref{bodyfat-oob-plot}. The plot
leads to the impression that approximately $m_\text{stop} = \Sexpr{mopt}$ would be a sufficient
number of boosting iterations.
In Section~\ref{subsec.stopping}, a corrected version of the Akaike
information criterion (AIC) is proposed for determining the optimal number
of boosting iterations. This criterion attains its
minimum for
<>=
mstop(aic <- AIC(bf_glm))
@
boosting iterations, see the bottom part of
Figure~\ref{bodyfat-oob-plot} in addition.
The coef\-ficients of the linear model with
$m_\text{stop} = \Sexpr{mstop(aic)}$
boosting iterations are
<>=
coef(bf_glm[mstop(aic)])
@
<>=
cf <- coef(bf_glm[mopt])
nsel <- length(cf)
@
and thus \Sexpr{nsel} covariates have been selected for the final
model (intercept equal to zero occurs here for mean centered response and
predictors and hence,
%TORSTEN: PLEASE CHECK
$n^{-1} \sum_{i=1}^n Y_i = \Sexpr{round(bf_glm$offset, 3)}$
%TORSTEN
is the intercept in the uncentered model). Note that
the variables \Robject{hipcirc}, \Robject{kneebreadth} and
\Robject{anthro3a}, which
we have used for fitting a linear model at the beginning of this
paragraph, have been selected by the boosting algorithm as well.
\SweaveOpts{prefix.string=figures/BH}
\paragraph{Illustration: Prediction of total body fat (cont.)}
<>=
source("setup.R")
@
Being more flexible than the linear model which we fitted to the
\Robject{bodyfat} data in Section~\ref{subsec.compLS}, we estimate an additive
model
%for the mean-centered covariates
using the \Rcmd{gamboost} function from \Rpackage{mboost} (first with
pre-specified $m_\text{stop} = 100$
boosting iterations, $\nu = 0.1$ and squared error loss):
<>=
bf_gam <- gamboost(DEXfat ~ ., data = bodyfat, baselearner = "bss")
@
The degrees of freedom in the componentwise smoothing spline base procedure
can be defined by the \Rcmd{dfbase} argument, defaulting to $4$.
We can estimate the number of boosting iterations $m_\text{stop}$
using the corrected AIC criterion described in
Section~\ref{subsec.stopping}
%(see Figure~\ref{bodyfat-gamboost-AIC})
via
<>=
mstop(aic <- AIC(bf_gam))
@
Similar to the linear regression model, the partial contributions of the covariates
can be extracted from the boosting fit. For the most important variables,
the partial fits are given in Figure~\ref{bodyfat-gamboost-plot} showing some
slight non-linearity, mainly for \Robject{kneebreadth}.
%\begin{figure}
%\begin{center}
%<>=
%plot(aic, ylim = c(3, 5.5))
%@
%\caption{\Robject{bodyfat} data: Corrected AIC as a function of the number of
%boosting iterations $m_\text{stop}$
%for fitting an additive model via \Rmd{gamboost}. \label{bodyfat-gamboost-AIC}}
%\end{center}
%\end{figure}
\setkeys{Gin}{width=0.95\textwidth, keepaspectratio}
\begin{figure}[t]
\begin{center}
<>=
bf_gam <- bf_gam[mstop(aic)]
layout(matrix(1:4, ncol = 2))
plot(bf_gam, which = c("hipcirc", "kneebreadth", "waistcirc", "anthro3b"))
@
\caption{\Robject{bodyfat} data: Partial contributions of four covariates
in an additive model (without centering of estimated functions to mean
zero).
\label{bodyfat-gamboost-plot}}
\end{center}
\end{figure}
\SweaveOpts{prefix.string=figures/BH}
\paragraph{Illustration: Prediction of total body fat (cont.)}
<>=
source("setup.R")
library("splines")
indep <- names(bodyfat)[names(bodyfat) != "DEXfat"]
bsfm <- as.formula(paste("DEXfat ~ ", paste("bs(", indep, ")", collapse = " + ", sep = ""), sep = ""))
@
Such transformations and estimation of a corresponding linear model can be
done with the \Rcmd{glmboost} function,
%%\citep[an implementation of the original fractional polynomials approach is
%%available in package \Rpackage{mfp}, see][]{pkg:mfp,Sauerbreietal2006},
where the model formula performs the computations of all transformations by
means of the \Rcmd{bs} (B-spline basis) function from the package
\Rpackage{splines}.
First, we set up a formula transforming each covariate
<>=
bsfm
@
and then fit the complex linear model by using the \Rcmd{glmboost} function with
initial $m_\text{stop} = 5000$ boosting iterations:
<>=
ctrl <- boost_control(mstop = 5000)
bf_bs <- glmboost(bsfm, data = bodyfat, control = ctrl)
mstop(aic <- AIC(bf_bs))
@
The corrected AIC criterion (see Section~\ref{subsec.stopping}) suggests to
stop after $m_\text{stop} = \Sexpr{mstop(aic)}$
boosting iterations and the final model selects
$\Sexpr{sum(abs(coef(bf_bs[mstop(aic)])) > 0 )}$ (transformed) predictor
variables. Again, the partial
contributions of each of the $\Sexpr{length(indep)}$ original covariates
can be computed
easily and are shown in Figure~\ref{bodyfat-fpboost-plot} (for the same
variables as in Figure~\ref{bodyfat-gamboost-plot}). Note that the depicted
functional relationship derived from the model fitted above
(Figure~\ref{bodyfat-fpboost-plot}) is qualitatively the same as the one
derived from the additive model (Figure~\ref{bodyfat-gamboost-plot}).
%, which is computationally more demanding.
\setkeys{Gin}{width=0.95\textwidth, keepaspectratio}
\begin{figure}[t]
\begin{center}
<>=
layout(matrix(1:4, ncol = 2, byrow = TRUE))
par(mai = par("mai") * c(1, 1, 0.5, 1))
cf <- coef(bf_bs[mstop(aic)])
varorder <- c("hipcirc", "waistcirc", "kneebreadth", "anthro3b")
fpartial <- sapply(varorder, function(v) {
rowSums(predict(bf_bs, which = v))
})
out <- sapply(varorder, function(i) {
x <- bodyfat[,i]
plot(sort(x), fpartial[order(x),i], main = "", type = "b",
xlab = i, ylab = expression(f[partial]), ylim = max(abs(fpartial)) * c(-1, 1))
abline(h = 0, lty = 2, lwd = 0.5)
})
@
\caption{\Robject{bodyfat} data: Partial fits for a linear model fitted to
transformed covariates using B-splines (without centering of estimated
functions to mean zero). \label{bodyfat-fpboost-plot}}
\end{center}
\end{figure}
\SweaveOpts{prefix.string=figures/BH}
\paragraph{Illustration: Breast cancer subtypes}
<>=
source("setup.R")
### OK, this once required Biobase and is a very dirty hack...
data("Westbc", package = "TH.data")
westbc <- Westbc
exprs <- function(x) x$assay
pData <- function(x) x$pheno
p <- nrow(exprs(westbc))
n <- ncol(exprs(westbc))
ytab <- table(pData(westbc)$nodal.y)
@
Variable selection is especially important in high-dimensional situations.
As an example, we study a binary classification problem involving $p = \Sexpr{p}$
gene expression levels in $n = \Sexpr{n}$ breast cancer tumor samples
\citep[data taken from][]{west01}. For each sample, a binary response variable
describes the lymph node status (\Sexpr{ytab[1]} negative and
\Sexpr{ytab[2]} positive).
%of the patient is to be explained by the expression profiles.
The data are stored in form of an \Rclass{exprSet} object \Robject{westbc}
\citep[see][]{BioC2004}
and we first extract the matrix of expression levels and the response variable:
<>=
### extract matrix of expression levels and binary response
x <- t(exprs(westbc))
y <- pData(westbc)$nodal.y
@
We aim at using $L_2$Boosting for classification, see
Section~\ref{subsubsec.binclass},
with classical AIC based on the binomial log-likelihood for stopping the
boosting iterations. Thus, we first
transform the factor \Robject{y} to a numeric variable with $0/1$ coding:
%(different from the $-1/+1$-encoding in Section~\ref{subsec.binclass}):
<>=
### numeric 0/1 response variable
yfit <- as.numeric(y) - 1
@
The general framework implemented in \Rpackage{mboost} allows us to specify
the negative gradient (the
\Rcmd{ngradient} argument) corresponding to the surrogate loss function,
here the squared error loss implemented as a function \Rcmd{rho}, and a different
evaluating loss function (the \Rcmd{loss} argument), here the negative
binomial log-likelihood, with the \Rcmd{Family} function as follows:
<>=
### L2 boosting for classification with response in 0/1
### and binomial log-likelihood as loss function
### ATTENTION: use offset = 1/2 instead of 0!!!
rho <- function(y, f, w = 1) {
p <- pmax(pmin(1 - 1e-5, f), 1e-5)
-y * log(p) - (1 - y) * log(1 - p)
}
ngradient <- function(y, f, w = 1) y - f
offset <- function(y, w) weighted.mean(y, w)
L2fm <- Family(ngradient = ngradient,
loss = rho, offset = offset)
@
The resulting object (called \Rcmd{L2fm}), bundling the negative gradient,
the loss function and a function for computing an offset term (\Rcmd{offset}),
can now be passed to the \Rcmd{glmboost} function for boosting
with componentwise linear least squares (here initial $m_\text{stop} = 200$
iterations are used):
<>=
### fit a linear model with initial mstop = 200 boosting iterations
ctrl <- boost_control(mstop = 200)
west_glm <- glmboost(x, yfit, family = L2fm, center = TRUE,
control = ctrl)
@
<>=
### fit a linear model with initial mstop = 200 boosting iterations
### and record time
time <- system.time(west_glm <- glmboost(x, yfit, family = L2fm, center = TRUE,
control = boost_control(mstop = 200)))[1]
x <- t(exprs(westbc) - rowMeans(exprs(westbc)))
@
Fitting such a linear model to $p = \Sexpr{p}$ covariates for $n = \Sexpr{n}$
observations takes about \Sexpr{formatC(round(time, 1), digits = 2)}
seconds on a medium scale desktop computer
(Intel Pentium 4, 2.8GHz). Thus, this form of estimation and variable
selection is computationally very efficient.
The question how to choose $m_\text{stop}$ can be
addressed by the
classical AIC criterion
%(see Section~\ref{subsec.stopping})
as follows
<>=
### evaluate AIC based on binomial log-likelihood for _all_ boosting
### iterations m = 1, ..., mstop = 200
aic <- AIC(west_glm, method = "classical")
### where should one stop? mstop = 108 or 107
mstop(aic)
@
<>=
cf <- coef(west_glm[mstop(aic)], which = 1:ncol(x))
ps <- length(cf[abs(cf) > 0])
@
where the AIC is computed as -2(log-likelihood) + 2(degrees of freedom) = 2
(evaluating loss) + 2(degrees of freedom), see Formula~(\ref{aicl2}).
The notion of degrees of freedom
is discussed in Section~\ref{subsec.df}.
Figure~\ref{west-glmboost-plot} shows the AIC curve depending on
the number of boosting iterations. When we stop after $m_\text{stop} =
\Sexpr{mstop(aic)}$
boosting iterations, we obtain $\Sexpr{ps}$
genes with non-zero regression coefficients whose standardized values
$\hat{\beta}^{(j)}\sqrt{\widehat{\text{Var}}(X^{(j)})}$
are depicted in the left panel of Figure~\ref{west-glmboost-plot}.
\setkeys{Gin}{width=0.95\textwidth, keepaspectratio}
\begin{figure}
\begin{center}
<>=
### compute standard deviations of expression levels for each gene
sdx <- sqrt(rowSums((t(x) - colMeans(x))^2)/(nrow(x) - 1))
layout(matrix(1:2, ncol = 2))
cf <- cf * sdx
plot(sort(cf[abs(cf) > 0]), ylab = "Standardized coefficients")
abline(h = 0, lty = 2)
plot(aic, ylim = c(25, 40))
@
\caption{\Robject{westbc} data: Standardized regression coefficients
$\hat{\beta}^{(j)}\sqrt{\widehat{\text{Var}}(X^{(j)})}$ (left panel) for
$m_\text{stop} = \Sexpr{mstop(aic)}$ determined from the classical AIC
criterion shown in the right panel. \label{west-glmboost-plot}}
\end{center}
\end{figure}
Of course, we could also use BinomialBoosting for analyzing the data: the
computational CPU time would be of the same order of magnitude, i.e.,
only a few seconds.
\SweaveOpts{prefix.string=figures/BH}
\paragraph{Illustration: Wisconsin prognostic breast cancer}
<>=
source("setup.R")
n <- sum(complete.cases(wpbc))
p <- ncol(wpbc) - 2
@
Prediction models for recurrence events in breast cancer patients
based on covariates which have been computed from a digitized image of a
fine needle aspirate of breast tissue (those measurements describe
characteristics of the cell nuclei present in the image) have been studied
by \citet{street1995} \citep[the data is part of the UCI repository][]{uci1998}.
We first analyze this data as a binary prediction problem
(recurrence vs. non-recurrence) and later in Section~\ref{sec.survival}
by means of survival models. We are faced with many covariates ($p = \Sexpr{p}$)
for a limited number of observations without missing values
($n = \Sexpr{n}$), and variable selection is an important issue. We can
choose a classical logistic regression model
via AIC in a stepwise algorithm as follows
%(after centering the covariates)
<>=
### remove missing values and time variable
cc <- complete.cases(wpbc)
wpbc2 <- wpbc[cc, colnames(wpbc) != "time"]
### fit logistic regression model
wpbc_step <- step(glm(status ~ ., data = wpbc2, family = binomial()), trace = 0)
@
The final model consists of \Sexpr{length(coef(wpbc_step))} parameters with
<>=
logLik(wpbc_step)
AIC(wpbc_step)
@
and we want to compare this model to a logistic regression model fitted via gradient boosting.
We simply select the \Robject{Binomial} family (with default offset of $1/2
\log(\hat{p} / (1 - \hat{p}))$,
where $\hat{p}$ is the empirical proportion of recurrences) and we
initially use
$m_\text{stop} = 500$ boosting iterations
<>=
### fit logistic regression model via gradient boosting
ctrl <- boost_control(mstop = 500)
wpbc_glm <- glmboost(status ~ ., data = wpbc2, family = Binomial(),
center = TRUE, control = ctrl)
@
%The negative binomial log-likelihood is
%<>=
%logLik(wpbc_glm)
%@
The classical AIC criterion (-2 log-likelihood + 2 $\df$) suggests to stop after
<>=
aic <- AIC(wpbc_glm, "classical")
aic
@
boosting iterations.
%%Again, we want to investigate the out-of-bootstrap performance
%%of the boosted logistic regression model. Figure~\ref{wpbc-benchmarks-plot} depicts the
%%out-of-bootstrap (positive) binomial log-likelihood, standardized with
%%the number of out-of-bootstrap observations, for varying number of boosting iterations $m_\text{stop}$.
%%\begin{figure}
%%\begin{center}
%%<>=
%%load(system.file("cache/wpbc_benchmarks.rda", package = "mboost"))
%%grid <- seq(from = 5, to = 500, by = 5)
%%plot(c(0, grid[-1] - 5), -colMeans(boob), type = "b",
%% ylab = "Log-Likelihood / n",
%% xlab = "Number of Boosting Iterations")
%%mopt <- grid[which.max(colMeans(boob))]
%%@
%%\caption{\Robject{wpbc} data: Out-of-bootstrap (positive) binomial
%% log-likelihood. \label{wpbc-benchmarks-plot}}
%%\end{center}
%%\end{figure}
We now restrict the number of boosting iterations to
$m_\text{stop} = \Sexpr{mstop(aic)}$ and then obtain the estimated
coefficients via
<>=
### fit with new mstop
wpbc_glm <- wpbc_glm[mstop(aic)]
coef(wpbc_glm)[abs(coef(wpbc_glm)) > 0]
@
(because of using the offset-value $\hat{f}^{[0]}$, we have to add the value
$\hat{f}^{[0]}$ to the reported intercept estimate above for the logistic
regression model).
%%We then can
%%extract the fitted conditional probabilities
%%<>=
%%f <- fitted(wpbc_glm)
%%### probabilities
%%p <- exp(f) / (exp(f) + exp(-f))
%%@
%%which are depicted by a conditional density plot in
%%Figure~\ref{wpbc-glmboost-prob-plot}.
%\setkeys{Gin}{width=0.7\textwidth, keepaspectratio}
%\begin{figure}
%\begin{center}
%<>=
%cdplot(status ~ p, data = wpbc2)
%@
%\caption{\Robject{wpbc} data: Conditional density plot of the fitted
% probabilities of recurrence / non-recurrence. \label{wpbc-glmboost-prob-plot}}
%\end{center}
%\end{figure}
A generalized additive model adds more flexibility to the regression function but is still
interpretable. We fit a logistic additive model to the \Robject{wpbc} data as follows:
<>=
wpbc_gam <- gamboost(status ~ ., data = wpbc2, family = Binomial(), baselearner = "bss")
mopt <- mstop(aic <- AIC(wpbc_gam, "classical"))
aic
@
This model selected $\Sexpr{length(unique(wpbc_gam$xselect()))}$ out of $\Sexpr{ncol(wpbc2)-1}$
covariates. The partial contributions of the four most important variables
are depicted in
Figure~\ref{wpbc-gamboost-plot} indicating a remarkable degree of non-linearity.
\setkeys{Gin}{width=0.95\textwidth, keepaspectratio}
\begin{figure}[t]
\begin{center}
<>=
wpbc_gam <- wpbc_gam[mopt]
fpartial <- predict(wpbc_gam, which = 1:length(variable.names(wpbc_gam)))
layout(matrix(1:4, nrow = 2, byrow = TRUE))
par(mai = par("mai") * c(1, 1, 0.5, 1))
plot(wpbc_gam, which = rev(order(colMeans(abs(fpartial))))[1:4])
@
\caption{\Robject{wpbc} data: Partial contributions of four selected
covariates in an additive logistic model (without centering of
estimated functions to mean zero).
\label{wpbc-gamboost-plot}}
\end{center}
\end{figure}
\SweaveOpts{prefix.string=figures/BH}
\paragraph{Illustration: Wisconsin prognostic breast cancer (cont.)}
<>=
source("setup.R")
@
Instead of the binary response variable describing the recurrence status, we
make use of the additionally available time information for modeling
the time to recurrence, i.e., all observations with non-recurrence are censored.
First, we calculate IPC weights
<>=
library("survival")
### calculate IPC weights
censored <- wpbc$status == "R"
iw <- IPCweights(Surv(wpbc$time, censored))
wpbc3 <- wpbc[,names(wpbc) != "status"]
@
and fit a weighted linear model by boosting with componentwise linear
weighted least squares as base procedure:
<>=
ctrl <- boost_control(mstop = 500)
wpbc_surv <- glmboost(log(time) ~ ., data = wpbc3,
weights = iw, center = TRUE, control = ctrl)
mstop(aic <- AIC(wpbc_surv))
wpbc_surv <- wpbc_surv[mstop(aic)]
@
The following variables have been selected for fitting
<>=
names(coef(wpbc_surv)[abs(coef(wpbc_surv)) > 0])
@
and the fitted values are
depicted in Figure~\ref{wpbc-glmboost-censored-fit}, showing a reasonable model fit.
\setkeys{Gin}{width=0.7\textwidth, keepaspectratio}
\begin{figure}[t]
\begin{center}
<>=
plot(log(wpbc3$time), predict(wpbc_surv, newdata = wpbc3),
cex = iw, ylim = c(0, 5), xlim = c(0, 5),
xlab = "Time to recurrence (log-scale)",
ylab = "Predicted time to recurrence")
abline(a = 0, b = 1, lty = 2, lwd = 0.5)
@
\caption{\Robject{wpbc} data: Fitted values of an IPC-weighted linear
model, taking both time to recurrence and
censoring information into account. The radius of the circles is proportional to the
IPC weight of the corresponding observation, censored observations with IPC weight zero
are not plotted. \label{wpbc-glmboost-censored-fit}}
\end{center}
\end{figure}
Alternatively, a Cox model with linear predictor can be fitted using
$L_2$Boosting by
implementing the negative gradient
of the partial likelihood (see \cite{ridgew99}) via
%%<>=
%%glmboost(Surv(wpbc$time, wpbc$status == "N") ~ .,
%% data = wpbc, family = CoxPH(), center = TRUE)
%%@
\begin{Schunk}
\begin{Sinput}
R> glmboost(Surv(wpbc$time, wpbc$status == "R") ~ .,
data = wpbc, family = CoxPH(), center = TRUE)
\end{Sinput}
\end{Schunk}
For more examples, such as fitting an additive Cox model using
\Rpackage{mboost}, see \citep{Hothorn:2006:Bioinformatics:16940323}.
\clearpage
\bibliographystyle{plainnat}
\bibliography{boost}
\end{document}