Season tutorial

Adrian Barnett

2018-09-04

season is a package to analyse seasonal data that I developed whilst working on studies in environmental epidemiology. Here I describe some of the key functions.

Seasonal death data

We will use the data on the daily number of deaths from cardiovascular disease (CVD) in people aged 75 and over in Los Angeles for the years 1987 to 2000. Below we load the data and then use ggplot2 to draw a boxplot of the daily death counts by month.

data(CVDdaily)
ggplot(CVDdaily, aes(x=factor(month), y=cvd))+
 geom_boxplot()+
 ylab('Daily number of CVD deaths')+
 xlab('Month')+
 theme_bw()

There is a clear seasonal pattern, with more deaths in the winter months and fewer in the summer. There’s also evidence of a greater variance in the winter months, which we would expect in a count process, as the variance is proportional to the mean.

Plot of deaths over time

It is also useful to plot the data over time as below. To help show the seasonal pattern, we create a vertical reference line for the first day of each year. The plot shows the seasonal peak happened in every winter, although the size of the peak varied between years.

years = 1987:2001
Januarys = as.numeric(as.Date(paste(years, '-01-01', sep=''), origin='1970-01-01'))
ggplot(CVDdaily, aes(x=as.numeric(date), y=cvd))+
 geom_line()+
 scale_x_continuous(breaks=Januarys, labels=years)+
 ylab('Daily number of CVD deaths')+
 xlab('Time')+
 theme_bw()+
 theme(panel.grid.minor = element_blank())

Daily deaths and temperatures plot

Deaths increase in many countries around the world when the temperature is outside an optimal range, with the optimal range varying by climate. The plot below shows daily death counts against daily temperatures. Increases in deaths are apparent at both low and high temperatures, suggesting a non-linear association between temperature and cardiovascular deaths.

ggplot(CVDdaily, aes(x=tmpd, y=cvd))+
 geom_point()+
 ylab('Daily number of CVD deaths')+
 xlab('Temperature (degrees F)')+
 theme_bw()

Regression model

We now examine the association between temperature and death using a case-crossover model. This model compares the number of deaths on case and control days, and only uses controls that are near the case day. By choosing control days near case days, the model controls for long-term trends and seasonal patterns. Below we use the default of cases and controls selected from the same 28 day (4 week) windows. The model is fitted using conditional logistic regression. The technical details are in our book Analysing Seasonal Health Data.

To model a non-linear effect for temperature, we first create a spline for temperture with knots at 60 and 75 degrees Fahrenheit, which essentially means we expect a change in the association around these temperatures.

Deaths due to temperature can occur days after exposure. For example, when a person has a heart attack on a hot day, is admitted to hospital alive, but dies in hospital some days later. To account for this we include a lag of 14 days. By using a lagged temperature we lose a few observations at the start of the data, because we do not have temperature data from the year 1986. We use the [dlnm](https://cran.r-project.org/web/packages/dlnm/index.html) library to create the non-linear and lagged spline basis.

We include a categorical variable of day of the week, because there is evidence that deaths vary by day of the week.

The model takes a short while to run.

# make a spline basis that has a lag and is non-linear
tmpd.basis = crossbasis(CVDdaily$tmpd, lag=14, # 14 day lag
                arglag=list(fun='ns', df=3), # 3 degrees of freedom for lag; ns = natural spline
                argvar=list(fun='ns', knots=c(60, 75))) # knots at 65 and 75 degrees
# add the spline basis variables to the data
CVDdaily = cbind(CVDdaily, tmpd.basis[1:nrow(CVDdaily), ])
# create the regression formula
spline.names = colnames(tmpd.basis)
formula = paste('cvd ~', paste(spline.names, collapse = ' + '), '+ Mon + Tue + Wed + Thu + Fri + Sat')
model = casecross(as.formula(formula), data=CVDdaily)
summary(model)
#> Time-stratified case-crossover with a stratum length of 28 days
#> Total number of cases 229759 
#> Number of case days with available control days 5098 
#> Average number of control days per case day 23.2 
#> 
#> Parameter Estimates:
#>               coef exp(coef)    se(coef)          z     Pr(>|z|)
#> v1.l1 -0.020874306 0.9793421 0.011889922 -1.7556303 7.915156e-02
#> v1.l2 -0.066362396 0.9357917 0.008656561 -7.6661383 1.776357e-14
#> v1.l3  0.002178049 1.0021804 0.008944090  0.2435182 8.076040e-01
#> v2.l1 -0.191012092 0.8261226 0.038352025 -4.9804957 6.342163e-07
#> v2.l2  0.014749261 1.0148586 0.029740115  0.4959383 6.199380e-01
#> v2.l3 -0.014302130 0.9857997 0.029749686 -0.4807490 6.306949e-01
#> v3.l1 -0.150126292 0.8605993 0.023333994 -6.4338019 1.244510e-10
#> v3.l2  0.117420420 1.1245921 0.017741751  6.6183107 3.633271e-11
#> v3.l3 -0.058042887 0.9436095 0.018876807 -3.0748254 2.106258e-03
#> Mon    0.036431253 1.0371030 0.007820245  4.6585818 3.183953e-06
#> Tue    0.018269159 1.0184371 0.007841133  2.3299131 1.981074e-02
#> Wed   -0.011365776 0.9886986 0.007868064 -1.4445454 1.485856e-01
#> Thu   -0.008998068 0.9910423 0.007856868 -1.1452488 2.521061e-01
#> Fri    0.009599729 1.0096460 0.007855806  1.2219916 2.217108e-01
#> Sat    0.014931771 1.0150438 0.007861585  1.8993336 5.752063e-02

This is a large study with just under 230,000 deaths and over 5,000 case days. The coefficients are the log odds ratios. Here and elsewhere in this vignette, the estimates are quoted to many decimal places, but when presented in a paper we recommend using these guidelines.

We can see there are more deaths on Monday compared with the reference day of Sunday. The temperature estimates are hard to interpret and are best shown by reconstructing the nine spline estimates in a plot.

Confidence intervals

The display above does not give confidence intervals for the log odds ratios, but these can easily be created as follows (which gives 95% confidence intervals).

confint(model$c.model)
#>               2.5 %       97.5 %
#> v1.l1 -0.0441781242  0.002429512
#> v1.l2 -0.0833289446 -0.049395848
#> v1.l3 -0.0153520450  0.019708143
#> v2.l1 -0.2661806785 -0.115843505
#> v2.l2 -0.0435402932  0.073038815
#> v2.l3 -0.0726104439  0.044006183
#> v3.l1 -0.1958600798 -0.104392505
#> v3.l2  0.0826472273  0.152193613
#> v3.l3 -0.0950407498 -0.021045024
#> Mon    0.0211038533  0.051758652
#> Tue    0.0029008204  0.033637498
#> Wed   -0.0267868986  0.004055347
#> Thu   -0.0243972457  0.006401110
#> Fri   -0.0057973676  0.024996826
#> Sat   -0.0004766513  0.030340194

Plot of the non-linear association between temperature and death

We use the coefficients and their variance–covariance matrix to reconstruct a three-dimensional plot by lag and temperature. We examine temperatures over the range 45 to 85 degrees Fahrenheit. The estimates are centred using a reference temperature of 70 degrees.

# extract the coefficients and variance--covariance matrix for the spline terms
coef = coefficients(model$c.model)
index = names(coef) %in% spline.names
coef = coef[index]
vcov = vcov(model$c.model)[index, index]
for.plot = crosspred(basis=tmpd.basis, coef=coef, vcov=vcov, at=seq(45, 85, 1), model.link = 'log', cen=70)
par(mai=c(0.2, 0, 0, 0)) # reduce plot margins
plot(for.plot, xlab='Temperature (degrees F)', zlab='Odds ratio', ylab='Lag (days)')

The dominant feature is a large spike in deaths at high temperatures on the same day of exposure (lag day zero).

Plot of the temperature and death association averaging over all lags

Another useful plot is the overall change in risk which summarises the results across all lags. The plot shows the mean odds (solid line) and 95% confidence interval (shaded area). We first put the estimates into a data.frame so they can be used in ggplot2.

to.plot = data.frame(temperature = for.plot$predvar, 
                     mean = for.plot$allRRfit,
                     lower = for.plot$allRRlow,
                     upper = for.plot$allRRhigh)
ggplot(data=to.plot, aes(x=temperature, y=mean, ymin=lower, ymax=upper))+
  geom_hline(lty=2, yintercept = 1)+ # horizontal reference line at no change in odds
  geom_ribbon(alpha=0.3)+
  geom_line()+
  xlab('Temperature (degrees F)')+
  ylab('Odds ratio')+
  theme_bw()

The plot shows a U-shaped risk, with increased odds at low and high temperatures relative to the reference temperature of 70 degrees.

Non-stationary seasonal patterns

As shown in the second plot in this vignette, the seasonal pattern appeared to vary from year-to-year, with larger peaks in some years. This is a non-stationary seasonal pattern. We can model this pattern by splitting the time series into a trend, seasonal pattern(s) and residuals. Details on the method are available in this paper: Estimating trends and seasonality in coronary heart disease.

We use the monthly data rather than the daily data because we are primarily interested in seasonal patterns, and the daily data will take much longer to run. We use the response variable of adj as this is the adjusted monthly counts of CVD deaths which accounts for the differences in month lengths (e.g., 28 or 29 days in February and 31 in January).

This model takes a few minutes to run because it uses Markov chain Monte Carlo samples to estimate the model parameters.

set.seed(1234) # set the random seed to give repeatable results
data(CVD)
f = c(12) # a single twelve month cycle
tau = c(10, 50) # achieved via trial-and-error; small tau -> less variability
ns.season = nscosinor(data=CVD, response='adj', cycles=f, niters=2000,
         burnin=500, tau=tau, div=1000)
#> Iteration number 1000 of 2000 
#> Iteration number 2000 of 2000
summary(ns.season)
#> Statistics for non-stationary cosinor based on MCMC chains
#> Number of MCMC samples = 1501
#> 
#> Standard deviations
#> Residual, mean=112.7483, 95% CI [99.98868, 126.5613]
#> Cycle=12
#> Season, mean=0.1724519, 95% CI [0.04542425, 0.3440523]
#> 
#> Phase and amplitude
#> Cycle=12
#> Amplitude, mean=207.1926, 95% CI [181.6384, 231.9424]
#> Phase (radians), mean=0.6952055, 95% CI [0.5732958, 0.8216251]
plot(ns.season)

The estimated mean amplitude is 207, so on average there 207 extra deaths per month in the seasonal winter peak compared with the year-round average. The 95% confidence interval for the peak is from 182 to 232 monthly deaths.

The plot shows the long-term non-linear trend and non-stationary seasonal pattern. The seasonal peaks are higher in some years, including 1989 and 2000, which matches the above plot of the data.

Back transforming the phase

The phase, or timing of the seasonal peak, in the results above is given in radians (on a scale of 0 to 2\(\pi\)). We can transform this to a more useful time scale by transforming the summary statistics.

cat('Mean phase = ', round(invyrfraction(0.6952055/(2*pi), type='monthly', text=F), 2), ' months.\n', sep='')
#> Mean phase = 2.33 months.
cat('Lower 95% interval = ', round(invyrfraction(0.5732958/(2*pi), type='monthly', text=F), 2), ' months.\n', sep='')
#> Lower 95% interval = 2.09 months.
cat('Upper 95% interval = ', round(invyrfraction(0.8216251/(2*pi), type='monthly', text=F), 2), ' months.\n', sep='')
#> Upper 95% interval = 2.57 months.

The estimated peak in deaths is in early February.

Testing for non-linearity in time series

Another useful function is a test of non-linearity in time series using the third-order moment. This is the non-linear extension of the more familiar second-order tests of linearity, such as the autocorrelation function. Here we check for any remaining non-linearity in the time series of residuals from the non-stationary model of the seasonal pattern in monthly CVD deaths. We check up to a lag of 12 months.

set.seed(1234) # set the random seed to give repeatable results
ntest.residuals = nonlintest(ns.season$residuals, n.lag=12, n.boot=500)
ntest.residuals
#> Largest and smallest co-ordinates of the third-order moment outside the test limits
#> Largest positive difference at lags:
#> 4 4 
#> Size of largest largest difference:
#> 376966.7 
#> Largest negative difference at lags:
#> 0 2 
#> Size of largest negative difference:
#> -676267.8 
#> 
#> Bootstrap test of non-linearity using the third-order moment
#> Statistics for areas outside test limits:
#> observed     obs/SD     median-null    95%-null    p-value
#> 2806273 7.034735 291795.1 1231552 0

There is evidence of remaining non-linearity in the residuals at lags of 2 to 4 months.

plot = plot(ntest.residuals, plot=FALSE)
plot + scale_x_continuous(breaks = 0:12) +
  scale_y_continuous(breaks = 0:12) +
  theme_bw()+
  xlab('Lag 1 (months)')+
  ylab('Lag 2 (months)')+
  theme(panel.grid.minor = element_blank())

The plot of the third order moment shows the lags with the strongest non-linear interactions at (0,2), (4,4) and (10,10).