Introduction to the msSPChelpR package - from long dataset to SIR analyses

Marian Eberl

26 October 2020

Introduction

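This vignette explains how to use the following functions, in the order in which they appear in the examples below: renumber_time_id(), reshape_wide_tidyr(), pat_status(), calc_futime(), sir_byfutime() and summarize_sir_results().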

For some functions, the package offers several variants built on different frameworks. The variants give the same results but differ in execution time and memory use.

Theory behind SIRs

A future version of this vignette will explain in this chapter the theoretical considerations behind the calculation of SIRs.
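
Until then, the basic idea can be sketched in a few lines of base R. An SIR is the ratio of observed to expected cases, where the expected cases are the person-years at risk multiplied by the reference incidence rate, summed over strata. The function `sir_sketch()` and its arguments `pyar` and `ref_rate` are hypothetical names used only for illustration. The exact Poisson confidence interval shown here is one common choice and is an assumption; the package may calculate its limits differently.

```r
# Sketch: SIR = observed / expected, with an exact Poisson confidence interval.
# sir_sketch(), pyar and ref_rate are illustrative names, not part of msSPChelpR.
sir_sketch <- function(observed, pyar, ref_rate, alpha = 0.05) {
  # expected cases: person-years at risk times reference rate, summed over strata
  expected <- sum(pyar * ref_rate)
  obs <- sum(observed)
  sir <- obs / expected
  # exact Poisson limits for the observed count, expressed via chi-square quantiles
  lci <- stats::qchisq(alpha / 2, 2 * obs) / 2 / expected
  uci <- stats::qchisq(1 - alpha / 2, 2 * (obs + 1)) / 2 / expected
  data.frame(observed = obs, expected = expected,
             sir = sir, sir_lci = lci, sir_uci = uci)
}

# three hypothetical strata; rates are given per person-year
res <- sir_sketch(observed = c(4, 3, 5),
                  pyar     = c(1000, 2000, 1500),
                  ref_rate = c(0.001, 0.0015, 0.002))
res
```

An SIR above 1 means that more cases were observed than would be expected if the cohort had the same incidence as the reference population.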

Examples

SEER lung cancer

Step 1 - Long dataset

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
library(msSPChelpR)
#Load synthetic dataset of patients with cancer to demonstrate package functions
data("us_second_cancer")

#This dataset is in long format, so each tumor is a separate row in the data
us_second_cancer
#> # A tibble: 113,999 x 15
#>    fake_id SEQ_NUM registry   sex   race  datebirth  t_datediag t_site_icd t_dco
#>    <chr>     <int> <chr>      <chr> <chr> <date>     <date>     <chr>      <chr>
#>  1 100004        1 SEER Reg ~ Male  White 1926-01-01 1992-07-15 C50        hist~
#>  2 100004        2 SEER Reg ~ Male  White 1926-01-01 2004-01-15 C54        hist~
#>  3 100004        3 SEER Reg ~ Male  White 1926-01-01 2006-06-15 C34        hist~
#>  4 100004        4 SEER Reg ~ Male  White 1926-01-01 2018-06-15 C14        DCO ~
#>  5 100034        1 SEER Reg ~ Male  White 1979-01-01 2000-06-15 C50        hist~
#>  6 100037        1 SEER Reg ~ Fema~ White 1938-01-01 1996-01-15 C54        hist~
#>  7 100038        1 SEER Reg ~ Male  White 1989-01-01 1991-04-15 C50        hist~
#>  8 100038        2 SEER Reg ~ Male  White 1989-01-01 2000-03-15 C80        hist~
#>  9 100039        1 SEER Reg ~ Fema~ White 1946-01-01 2003-08-15 C50        hist~
#> 10 100039        2 SEER Reg ~ Fema~ White 1946-01-01 2011-04-15 C34        hist~
#> # ... with 113,989 more rows, and 6 more variables: fc_age <int>,
#> #   datedeath <date>, p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>,
#> #   t_yeardiag <chr>

Step 2 - Filter long dataset

#filter for lung cancer
ids <- us_second_cancer %>%
  #detect ids with any lung cancer
  filter(t_site_icd == "C34") %>%
  #extract the ids as a plain character vector
  pull(fake_id)

filtered_usdata <- us_second_cancer %>%
  #filter according to above detected ids with any lung cancer diagnosis
  filter(fake_id %in% ids) %>%
  arrange(fake_id)

filtered_usdata
#> # A tibble: 62,661 x 15
#>    fake_id SEQ_NUM registry   sex   race  datebirth  t_datediag t_site_icd t_dco
#>    <chr>     <int> <chr>      <chr> <chr> <date>     <date>     <chr>      <chr>
#>  1 100004        1 SEER Reg ~ Male  White 1926-01-01 1992-07-15 C50        hist~
#>  2 100004        2 SEER Reg ~ Male  White 1926-01-01 2004-01-15 C54        hist~
#>  3 100004        3 SEER Reg ~ Male  White 1926-01-01 2006-06-15 C34        hist~
#>  4 100004        4 SEER Reg ~ Male  White 1926-01-01 2018-06-15 C14        DCO ~
#>  5 100039        1 SEER Reg ~ Fema~ White 1946-01-01 2003-08-15 C50        hist~
#>  6 100039        2 SEER Reg ~ Fema~ White 1946-01-01 2011-04-15 C34        hist~
#>  7 100039        3 SEER Reg ~ Fema~ White 1946-01-01 2018-01-15 C80        hist~
#>  8 100073        1 SEER Reg ~ Male  White 1960-01-01 1993-11-15 C44        hist~
#>  9 100073        2 SEER Reg ~ Male  White 1960-01-01 2003-12-15 C34        hist~
#> 10 100143        1 SEER Reg ~ Male  White 1944-01-01 1992-03-15 C50        hist~
#> # ... with 62,651 more rows, and 6 more variables: fc_age <int>,
#> #   datedeath <date>, p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>,
#> #   t_yeardiag <chr>
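
The filter works on patients, not on tumors: every tumor of a patient is kept as soon as one of that patient's tumors is a lung cancer. The following base R sketch shows this on a small toy dataset. The data values and the object names `toy_long`, `lung_ids` and `toy_filtered` are made up for illustration.

```r
# toy long-format data: one row per tumor (hypothetical values)
toy_long <- data.frame(
  fake_id    = c("A", "A", "B", "C", "C"),
  t_site_icd = c("C50", "C34", "C50", "C34", "C18"),
  stringsAsFactors = FALSE
)

# ids of patients with at least one lung cancer (C34) diagnosis
lung_ids <- unique(toy_long$fake_id[toy_long$t_site_icd == "C34"])

# keep every tumor row of those patients, not only the C34 rows
toy_filtered <- toy_long[toy_long$fake_id %in% lung_ids, ]
toy_filtered
```

Patient B has no C34 tumor and is dropped entirely, while the non-lung tumors of patients A and C stay in the data.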

Step 3 - Renumber time_id

renumbered_usdata <- filtered_usdata %>%
  renumber_time_id(new_time_id_var = "t_tumid", 
                   dattype = "seer",
                   case_id_var = "fake_id")

renumbered_usdata %>%
   select(fake_id, sex, t_site_icd, t_datediag, t_tumid)
#> # A tibble: 62,661 x 5
#>    fake_id sex    t_site_icd t_datediag t_tumid
#>    <chr>   <chr>  <chr>      <date>       <int>
#>  1 100004  Male   C50        1992-07-15       1
#>  2 100004  Male   C54        2004-01-15       2
#>  3 100004  Male   C34        2006-06-15       3
#>  4 100004  Male   C14        2018-06-15       4
#>  5 100039  Female C50        2003-08-15       1
#>  6 100039  Female C34        2011-04-15       2
#>  7 100039  Female C80        2018-01-15       3
#>  8 100073  Male   C44        1993-11-15       1
#>  9 100073  Male   C34        2003-12-15       2
#> 10 100143  Male   C50        1992-03-15       1
#> # ... with 62,651 more rows
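
Conceptually, renumbering assigns each tumor a consecutive number within its patient. The base R sketch below illustrates the idea under the assumption that tumors are ordered by date of diagnosis; how renumber_time_id() orders tumors internally is not shown in this vignette and may differ. The toy data, including the gaps in SEQ_NUM, are hypothetical.

```r
# toy long-format data with gaps in the original sequence number (hypothetical)
toy_long <- data.frame(
  fake_id    = c("A", "A", "A", "B", "B"),
  SEQ_NUM    = c(1L, 3L, 4L, 2L, 3L),
  t_datediag = as.Date(c("1992-07-15", "2006-06-15", "2018-06-15",
                         "2003-12-15", "2011-04-15")),
  stringsAsFactors = FALSE
)

# order by patient and diagnosis date (assumed ordering rule)
toy_long <- toy_long[order(toy_long$fake_id, toy_long$t_datediag), ]

# number the tumors 1, 2, 3, ... within each patient
toy_long$t_tumid <- as.integer(
  ave(seq_along(toy_long$fake_id), toy_long$fake_id, FUN = seq_along)
)
toy_long
```

A gap-free time_id matters for the next step, because the time_id becomes the column name suffix in the wide dataset.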

Step 4 - Reshape to wide dataset

usdata_wide <- renumbered_usdata %>%
  reshape_wide_tidyr(case_id_var = "fake_id", time_id_var = "t_tumid", timevar_max = 10)

#Now the data is in the wide format required by many package functions.
#This means each case is one row, and the tumors of each case ID are
#added as new columns, using the time_id as column name suffix.
usdata_wide
#> # A tibble: 31,997 x 127
#>    fake_id SEQ_NUM.1 registry.1            sex.1 race.1 datebirth.1 t_datediag.1
#>    <chr>       <int> <chr>                 <chr> <chr>  <date>      <date>      
#>  1 100004          1 SEER Reg 20 - Detroi~ Male  White  1926-01-01  1992-07-15  
#>  2 100039          1 SEER Reg 02 - Connec~ Fema~ White  1946-01-01  2003-08-15  
#>  3 100073          1 SEER Reg 01 - San Fr~ Male  White  1960-01-01  1993-11-15  
#>  4 100143          1 SEER Reg 02 - Connec~ Male  White  1944-01-01  1992-03-15  
#>  5 100182          1 SEER Reg 02 - Connec~ Male  Other  1927-01-01  1991-09-15  
#>  6 100197          1 SEER Reg 02 - Connec~ Fema~ White  1945-01-01  2012-06-15  
#>  7 100208          1 SEER Reg 02 - Connec~ Male  White  1970-01-01  2019-11-15  
#>  8 100230          1 SEER Reg 01 - San Fr~ Male  White  1947-01-01  1992-11-15  
#>  9 100234          1 SEER Reg 01 - San Fr~ Male  White  1988-01-01  2010-02-15  
#> 10 100266          1 SEER Reg 01 - San Fr~ Fema~ White  1956-01-01  2010-07-15  
#> # ... with 31,987 more rows, and 120 more variables: t_site_icd.1 <chr>,
#> #   t_dco.1 <chr>, fc_age.1 <int>, datedeath.1 <date>, p_alive.1 <chr>,
#> #   p_dodmin.1 <date>, fc_agegroup.1 <chr>, t_yeardiag.1 <chr>,
#> #   SEQ_NUM.2 <int>, registry.2 <chr>, sex.2 <chr>, race.2 <chr>,
#> #   datebirth.2 <date>, t_datediag.2 <date>, t_site_icd.2 <chr>, t_dco.2 <chr>,
#> #   fc_age.2 <int>, datedeath.2 <date>, p_alive.2 <chr>, p_dodmin.2 <date>,
#> #   fc_agegroup.2 <chr>, t_yeardiag.2 <chr>, SEQ_NUM.3 <int>, registry.3 <chr>,
#> #   sex.3 <chr>, race.3 <chr>, datebirth.3 <date>, t_datediag.3 <date>,
#> #   t_site_icd.3 <chr>, t_dco.3 <chr>, fc_age.3 <int>, datedeath.3 <date>,
#> #   p_alive.3 <chr>, p_dodmin.3 <date>, fc_agegroup.3 <chr>,
#> #   t_yeardiag.3 <chr>, SEQ_NUM.4 <int>, registry.4 <chr>, sex.4 <chr>,
#> #   race.4 <chr>, datebirth.4 <date>, t_datediag.4 <date>, t_site_icd.4 <chr>,
#> #   t_dco.4 <chr>, fc_age.4 <int>, datedeath.4 <date>, p_alive.4 <chr>,
#> #   p_dodmin.4 <date>, fc_agegroup.4 <chr>, t_yeardiag.4 <chr>,
#> #   SEQ_NUM.5 <int>, registry.5 <chr>, sex.5 <chr>, race.5 <chr>,
#> #   datebirth.5 <date>, t_datediag.5 <date>, t_site_icd.5 <chr>, t_dco.5 <chr>,
#> #   fc_age.5 <int>, datedeath.5 <date>, p_alive.5 <chr>, p_dodmin.5 <date>,
#> #   fc_agegroup.5 <chr>, t_yeardiag.5 <chr>, SEQ_NUM.6 <int>, registry.6 <chr>,
#> #   sex.6 <chr>, race.6 <chr>, datebirth.6 <date>, t_datediag.6 <date>,
#> #   t_site_icd.6 <chr>, t_dco.6 <chr>, fc_age.6 <int>, datedeath.6 <date>,
#> #   p_alive.6 <chr>, p_dodmin.6 <date>, fc_agegroup.6 <chr>,
#> #   t_yeardiag.6 <chr>, SEQ_NUM.7 <int>, registry.7 <chr>, sex.7 <chr>,
#> #   race.7 <chr>, datebirth.7 <date>, t_datediag.7 <date>, t_site_icd.7 <chr>,
#> #   t_dco.7 <chr>, fc_age.7 <int>, datedeath.7 <date>, p_alive.7 <chr>,
#> #   p_dodmin.7 <date>, fc_agegroup.7 <chr>, t_yeardiag.7 <chr>,
#> #   SEQ_NUM.8 <int>, registry.8 <chr>, sex.8 <chr>, race.8 <chr>,
#> #   datebirth.8 <date>, t_datediag.8 <date>, t_site_icd.8 <chr>, t_dco.8 <chr>,
#> #   ...

Step 5 - Recalculate p_spc


usdata_wide <- usdata_wide %>%
  dplyr::mutate(p_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ "No SPC",
                         !is.na(t_site_icd.2)           ~ "SPC developed",
                         TRUE ~ NA_character_)) %>%
  #create the same information as numeric variable count_spc (1 = SPC developed, 0 = no SPC)
  dplyr::mutate(count_spc = dplyr::case_when(is.na(t_site_icd.2)   ~ 0,
                            TRUE ~ 1))
usdata_wide %>%
   dplyr::select(fake_id, sex.1, p_spc, count_spc, t_site_icd.1, 
                 t_datediag.1, t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 x 8
#>    fake_id sex.1  p_spc         count_spc t_site_icd.1 t_datediag.1 t_site_icd.2
#>    <chr>   <chr>  <chr>             <dbl> <chr>        <date>       <chr>       
#>  1 100004  Male   SPC developed         1 C50          1992-07-15   C54         
#>  2 100039  Female SPC developed         1 C50          2003-08-15   C34         
#>  3 100073  Male   SPC developed         1 C44          1993-11-15   C34         
#>  4 100143  Male   SPC developed         1 C50          1992-03-15   C34         
#>  5 100182  Male   SPC developed         1 C18          1991-09-15   C34         
#>  6 100197  Female SPC developed         1 C34          2012-06-15   C50         
#>  7 100208  Male   No SPC                0 C34          2019-11-15   <NA>        
#>  8 100230  Male   SPC developed         1 C44          1992-11-15   C34         
#>  9 100234  Male   No SPC                0 C34          2010-02-15   <NA>        
#> 10 100266  Female No SPC                0 C34          2010-07-15   <NA>        
#> # ... with 31,987 more rows, and 1 more variable: t_datediag.2 <date>

Step 6 - Determine patient status at end of FU

usdata_wide <- usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = FALSE, check = TRUE, 
             as_labelled_factor = TRUE)
#> # A tibble: 10 x 3
#>    p_alive.1 p_status                                                          n
#>    <chr>     <fct>                                                         <int>
#>  1 Alive     Patient alive after FC (with or without following SPC after ~  5940
#>  2 Alive     Patient alive after SPC                                       11316
#>  3 Alive     NA - Patient not born before end of FU                            4
#>  4 Alive     NA - Patient did not develop cancer before end of FU            849
#>  5 Dead      Patient alive after FC (with or without following SPC after ~   863
#>  6 Dead      Patient alive after SPC                                        1360
#>  7 Dead      Patient dead after FC                                          6208
#>  8 Dead      Patient dead after SPC                                         5325
#>  9 Dead      NA - Patient did not develop cancer before end of FU             68
#> 10 Dead      NA - Patient date of death is missing                            64
#> # A tibble: 7 x 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6803
#> 2 Patient alive after SPC                                                12676
#> 3 Patient dead after FC                                                   6208
#> 4 Patient dead after SPC                                                  5325
#> 5 NA - Patient not born before end of FU                                     4
#> 6 NA - Patient did not develop cancer before end of FU                     917
#> 7 NA - Patient date of death is missing                                     64

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_alive.1, datedeath.1, t_site_icd.1, t_datediag.1, 
                 t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 x 8
#>    fake_id p_status p_alive.1 datedeath.1 t_site_icd.1 t_datediag.1 t_site_icd.2
#>    <chr>   <fct>    <chr>     <date>      <chr>        <date>       <chr>       
#>  1 100004  Patient~ Alive     NA          C50          1992-07-15   C54         
#>  2 100039  Patient~ Alive     NA          C50          2003-08-15   C34         
#>  3 100073  Patient~ Dead      2005-06-01  C44          1993-11-15   C34         
#>  4 100143  Patient~ Alive     NA          C50          1992-03-15   C34         
#>  5 100182  Patient~ Dead      2007-05-01  C18          1991-09-15   C34         
#>  6 100197  Patient~ Alive     NA          C34          2012-06-15   C50         
#>  7 100208  NA - Pa~ Alive     NA          C34          2019-11-15   <NA>        
#>  8 100230  Patient~ Dead      2008-05-01  C44          1992-11-15   C34         
#>  9 100234  Patient~ Dead      2015-07-01  C34          2010-02-15   <NA>        
#> 10 100266  Patient~ Alive     NA          C34          2010-07-15   <NA>        
#> # ... with 31,987 more rows, and 1 more variable: t_datediag.2 <date>

#alternatively, you can impute the date of death using lifedatmin_var
usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
             status_var = "p_status", life_var = "p_alive.1",
             spc_var = "p_spc", birthdat_var = "datebirth.1",
             lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
             spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
             life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
             spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
             use_lifedatmin = TRUE, lifedatmin_var = "p_dodmin.1", 
             check = TRUE, as_labelled_factor = TRUE)
#> # A tibble: 9 x 3
#>   p_alive.1 p_status                                                           n
#>   <chr>     <fct>                                                          <int>
#> 1 Alive     Patient alive after FC (with or without following SPC after e~  5940
#> 2 Alive     Patient alive after SPC                                        11316
#> 3 Alive     NA - Patient not born before end of FU                             4
#> 4 Alive     NA - Patient did not develop cancer before end of FU             849
#> 5 Dead      Patient alive after FC (with or without following SPC after e~   867
#> 6 Dead      Patient alive after SPC                                         1361
#> 7 Dead      Patient dead after FC                                           6230
#> 8 Dead      Patient dead after SPC                                          5362
#> 9 Dead      NA - Patient did not develop cancer before end of FU              68
#> # A tibble: 6 x 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6807
#> 2 Patient alive after SPC                                                12677
#> 3 Patient dead after FC                                                   6230
#> 4 Patient dead after SPC                                                  5362
#> 5 NA - Patient not born before end of FU                                     4
#> 6 NA - Patient did not develop cancer before end of FU                     917
#> # A tibble: 31,997 x 130
#>    fake_id SEQ_NUM.1 registry.1            sex.1 race.1 datebirth.1 t_datediag.1
#>    <chr>       <int> <chr>                 <chr> <chr>  <date>      <date>      
#>  1 100004          1 SEER Reg 20 - Detroi~ Male  White  1926-01-01  1992-07-15  
#>  2 100039          1 SEER Reg 02 - Connec~ Fema~ White  1946-01-01  2003-08-15  
#>  3 100073          1 SEER Reg 01 - San Fr~ Male  White  1960-01-01  1993-11-15  
#>  4 100143          1 SEER Reg 02 - Connec~ Male  White  1944-01-01  1992-03-15  
#>  5 100182          1 SEER Reg 02 - Connec~ Male  Other  1927-01-01  1991-09-15  
#>  6 100197          1 SEER Reg 02 - Connec~ Fema~ White  1945-01-01  2012-06-15  
#>  7 100208          1 SEER Reg 02 - Connec~ Male  White  1970-01-01  2019-11-15  
#>  8 100230          1 SEER Reg 01 - San Fr~ Male  White  1947-01-01  1992-11-15  
#>  9 100234          1 SEER Reg 01 - San Fr~ Male  White  1988-01-01  2010-02-15  
#> 10 100266          1 SEER Reg 01 - San Fr~ Fema~ White  1956-01-01  2010-07-15  
#> # ... with 31,987 more rows, and 123 more variables: t_site_icd.1 <chr>,
#> #   t_dco.1 <chr>, fc_age.1 <int>, datedeath.1 <date>, p_alive.1 <chr>,
#> #   p_dodmin.1 <date>, fc_agegroup.1 <chr>, t_yeardiag.1 <chr>,
#> #   SEQ_NUM.2 <int>, registry.2 <chr>, sex.2 <chr>, race.2 <chr>,
#> #   datebirth.2 <date>, t_datediag.2 <date>, t_site_icd.2 <chr>, t_dco.2 <chr>,
#> #   fc_age.2 <int>, datedeath.2 <date>, p_alive.2 <chr>, p_dodmin.2 <date>,
#> #   fc_agegroup.2 <chr>, t_yeardiag.2 <chr>, SEQ_NUM.3 <int>, registry.3 <chr>,
#> #   sex.3 <chr>, race.3 <chr>, datebirth.3 <date>, t_datediag.3 <date>,
#> #   t_site_icd.3 <chr>, t_dco.3 <chr>, fc_age.3 <int>, datedeath.3 <date>,
#> #   p_alive.3 <chr>, p_dodmin.3 <date>, fc_agegroup.3 <chr>,
#> #   t_yeardiag.3 <chr>, SEQ_NUM.4 <int>, registry.4 <chr>, sex.4 <chr>,
#> #   race.4 <chr>, datebirth.4 <date>, t_datediag.4 <date>, t_site_icd.4 <chr>,
#> #   t_dco.4 <chr>, fc_age.4 <int>, datedeath.4 <date>, p_alive.4 <chr>,
#> #   p_dodmin.4 <date>, fc_agegroup.4 <chr>, t_yeardiag.4 <chr>,
#> #   SEQ_NUM.5 <int>, registry.5 <chr>, sex.5 <chr>, race.5 <chr>,
#> #   datebirth.5 <date>, t_datediag.5 <date>, t_site_icd.5 <chr>, t_dco.5 <chr>,
#> #   fc_age.5 <int>, datedeath.5 <date>, p_alive.5 <chr>, p_dodmin.5 <date>,
#> #   fc_agegroup.5 <chr>, t_yeardiag.5 <chr>, SEQ_NUM.6 <int>, registry.6 <chr>,
#> #   sex.6 <chr>, race.6 <chr>, datebirth.6 <date>, t_datediag.6 <date>,
#> #   t_site_icd.6 <chr>, t_dco.6 <chr>, fc_age.6 <int>, datedeath.6 <date>,
#> #   p_alive.6 <chr>, p_dodmin.6 <date>, fc_agegroup.6 <chr>,
#> #   t_yeardiag.6 <chr>, SEQ_NUM.7 <int>, registry.7 <chr>, sex.7 <chr>,
#> #   race.7 <chr>, datebirth.7 <date>, t_datediag.7 <date>, t_site_icd.7 <chr>,
#> #   t_dco.7 <chr>, fc_age.7 <int>, datedeath.7 <date>, p_alive.7 <chr>,
#> #   p_dodmin.7 <date>, fc_agegroup.7 <chr>, t_yeardiag.7 <chr>,
#> #   SEQ_NUM.8 <int>, registry.8 <chr>, sex.8 <chr>, race.8 <chr>,
#> #   datebirth.8 <date>, t_datediag.8 <date>, t_site_icd.8 <chr>, t_dco.8 <chr>,
#> #   ...
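
The status categories in the tables above follow from comparing the dates of SPC diagnosis and death with the end of follow-up. The base R sketch below is a strongly simplified illustration of that logic. The function `status_sketch()`, its shortened labels and its rules are assumptions made for this example; pat_status() handles more cases (for example patients not yet born or with missing date of death) and may decide edge cases differently.

```r
# Simplified status at end of follow-up (illustrative, not the package logic)
status_sketch <- function(fcdat, spcdat, deathdat, fu_end) {
  fu_end <- as.Date(fu_end)
  # events only count if they happened on or before the end of follow-up
  spc_in_fu  <- !is.na(spcdat) & spcdat <= fu_end
  dead_in_fu <- !is.na(deathdat) & deathdat <= fu_end
  status <- ifelse(dead_in_fu,
                   ifelse(spc_in_fu, "dead after SPC", "dead after FC"),
                   ifelse(spc_in_fu, "alive after SPC", "alive after FC"))
  # a first cancer diagnosed after fu_end lies outside the observation period
  status[fcdat > fu_end] <- NA_character_
  status
}

# five hypothetical patients
fcdat    <- as.Date(c("1992-07-15", "1993-11-15", "2010-02-15", "2019-11-15", "2010-07-15"))
spcdat   <- as.Date(c("2004-01-15", "2003-12-15", NA, NA, "2018-05-15"))
deathdat <- as.Date(c(NA, "2005-06-01", "2015-07-01", NA, NA))

toy_status <- status_sketch(fcdat, spcdat, deathdat, fu_end = "2017-12-31")
toy_status
```

The last toy patient develops an SPC only after the end of follow-up and therefore still counts as alive after the first cancer, in line with the label "with or without following SPC after end of FU" in the output above.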

Step 6b - Remove patients irrelevant to analysis depending on status

usdata_wide <- usdata_wide %>%
  dplyr::filter(!p_status %in% c("NA - Patient not born before end of FU",
                                 "NA - Patient did not develop cancer before end of FU",
                                 "NA - Patient date of death is missing"))

usdata_wide %>%
  dplyr::count(p_status)
#> # A tibble: 4 x 2
#>   p_status                                                                   n
#>   <fct>                                                                  <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU)  6803
#> 2 Patient alive after SPC                                                12676
#> 3 Patient dead after FC                                                   6208
#> 4 Patient dead after SPC                                                  5325

Step 7 - Calculate FU time

usdata_wide <- usdata_wide %>%
   calc_futime(., futime_var_new = "p_futimeyrs", fu_end = "2017-12-31",
               dattype = "seer", time_unit = "years", 
               lifedat_var = "datedeath.1", 
               fcdat_var = "t_datediag.1", spcdat_var = "t_datediag.2")
#> # A tibble: 4 x 5
#>   p_status                       mean_futime min_futime max_futime median_futime
#>   <fct>                                <dbl>      <dbl>      <dbl>         <dbl>
#> 1 Patient alive after FC (with ~        9.58     0.0438       27.0          8.29
#> 2 Patient alive after SPC               8.69     0            26.9          7.50
#> 3 Patient dead after FC                 8.54     0            25.8          7.47
#> 4 Patient dead after SPC                6.33     0            26.5          5.08

usdata_wide %>%
   dplyr::select(fake_id, p_status, p_futimeyrs, p_alive.1, datedeath.1, t_datediag.1, t_datediag.2)
#> # A tibble: 31,012 x 7
#>    fake_id p_status  p_futimeyrs p_alive.1 datedeath.1 t_datediag.1 t_datediag.2
#>    <chr>   <fct>           <dbl> <chr>     <date>      <date>       <date>      
#>  1 100004  Patient ~       11.5  Alive     NA          1992-07-15   2004-01-15  
#>  2 100039  Patient ~        7.67 Alive     NA          2003-08-15   2011-04-15  
#>  3 100073  Patient ~       10.1  Dead      2005-06-01  1993-11-15   2003-12-15  
#>  4 100143  Patient ~        3.33 Alive     NA          1992-03-15   1995-07-15  
#>  5 100182  Patient ~        7.08 Dead      2007-05-01  1991-09-15   1998-10-15  
#>  6 100197  Patient ~        4.83 Alive     NA          2012-06-15   2017-04-15  
#>  7 100230  Patient ~       11.0  Dead      2008-05-01  1992-11-15   2003-11-15  
#>  8 100234  Patient ~        5.37 Dead      2015-07-01  2010-02-15   NA          
#>  9 100266  Patient ~        7.46 Alive     NA          2010-07-15   NA          
#> 10 100274  Patient ~        2.38 Dead      2006-06-01  2004-01-15   NA          
#> # ... with 31,002 more rows
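
As a rough sketch, follow-up time can be thought of as the time from first cancer diagnosis to whichever comes first: SPC diagnosis, death, or the end of follow-up. The base R function `futime_sketch()` below is a hypothetical helper that implements this under the assumption that a year has 365.25 days; calc_futime() may convert days to years differently. The toy dates are hypothetical.

```r
# Follow-up time in years from first cancer to the earliest end event (sketch)
futime_sketch <- function(fcdat, spcdat, deathdat, fu_end) {
  # earliest of SPC diagnosis, death and end of follow-up, as day numbers;
  # missing dates are ignored
  end_day <- pmin(as.numeric(spcdat), as.numeric(deathdat),
                  as.numeric(as.Date(fu_end)), na.rm = TRUE)
  # assumption: one year equals 365.25 days
  (end_day - as.numeric(fcdat)) / 365.25
}

# three hypothetical patients: SPC, death without SPC, alive without SPC
fcdat    <- as.Date(c("1992-07-15", "2010-02-15", "2010-07-15"))
spcdat   <- as.Date(c("2004-01-15", NA, NA))
deathdat <- as.Date(c(NA, "2015-07-01", NA))

fu <- futime_sketch(fcdat, spcdat, deathdat, fu_end = "2017-12-31")
round(fu, 2)
```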

Step 8 - Calculate SIR

sircalc_results <- usdata_wide %>%
  sir_byfutime(
    dattype = "seer",
    ybreak_vars = c("race.1", "t_dco.1"),
    xbreak_var = "none",
    futime_breaks = c(0, 1/12, 2/12, 1, 5, 10, Inf),
    count_var = "count_spc",
    refrates_df = us_refrates_icd2,
    calc_total_row = TRUE,
    calc_total_fu = TRUE,
    region_var = "registry.1",
    age_var = "fc_agegroup.1",
    sex_var = "sex.1",
    year_var = "t_yeardiag.1",
    site_var = "t_site_icd.1", #using grouping by second cancer incidence
    futime_var = "p_futimeyrs",
    alpha = 0.05)
#> 
#>  [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed. 
#> # A tidytable: 4 x 10
#>   age    sex    region     year  race.1 t_site to1month i_observed i_pyar n_base
#>   <chr>  <chr>  <chr>      <chr> <chr>  <chr>     <dbl>      <dbl>  <dbl>  <dbl>
#> 1 05 - ~ Male   SEER Reg ~ 2015~ Other  C34           1          1      0      1
#> 2 15 - ~ Male   SEER Reg ~ 2005~ Black  C34           1          1      0      1
#> 3 85 - ~ Female SEER Reg ~ 1995~ Black  C34           1          2      0      2
#> 4 85 - ~ Male   SEER Reg ~ 2010~ Other  C34           1          1      0      1 
#> Check attribute `problems_not_empty` of results to see what strata are affected. 
#> This might be caused by cases where SPC occured at the same day as first cancer. 
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#> 
#>  [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed. 
#> # A tidytable: 4 x 10
#>   age    sex    region    year  race.1 t_site Total0toInfyears i_observed i_pyar
#>   <chr>  <chr>  <chr>     <chr> <chr>  <chr>             <dbl>      <dbl>  <dbl>
#> 1 05 - ~ Male   SEER Reg~ 2015~ Other  C34                   1          1      0
#> 2 15 - ~ Male   SEER Reg~ 2005~ Black  C34                   1          1      0
#> 3 85 - ~ Female SEER Reg~ 1995~ Black  C34                   1          2      0
#> 4 85 - ~ Male   SEER Reg~ 2010~ Other  C34                   1          1      0
#> # ... with 1 more variable: n_base <dbl> 
#> Check attribute `problems_not_empty` of results to see what strata are affected. 
#> This might be caused by cases where SPC occured at the same day as first cancer. 
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#> 
#>  [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed. 
#> # A tidytable: 1 x 10
#>   age    sex   region     year  t_dco.1 t_site to1month i_observed i_pyar n_base
#>   <chr>  <chr> <chr>      <chr> <chr>   <chr>     <dbl>      <dbl>  <dbl>  <dbl>
#> 1 80 - ~ Male  SEER Reg ~ 2010~ DCO ca~ C34           1          1      0      1 
#> Check attribute `problems_not_empty` of results to see what strata are affected. 
#> This might be caused by cases where SPC occured at the same day as first cancer. 
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#> 
#>  [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed. 
#> # A tidytable: 1 x 10
#>   age    sex   region    year  t_dco.1 t_site Total0toInfyears i_observed i_pyar
#>   <chr>  <chr> <chr>     <chr> <chr>   <chr>             <dbl>      <dbl>  <dbl>
#> 1 80 - ~ Male  SEER Reg~ 2010~ DCO ca~ C34                   1          1      0
#> # ... with 1 more variable: n_base <dbl> 
#> Check attribute `problems_not_empty` of results to see what strata are affected. 
#> This might be caused by cases where SPC occured at the same day as first cancer. 
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#> 
#>  There are observed cases in the results file that do not occur in the refrates_df. 
#> A possible explanation can be: 
#>  - DCO cases 
#>  - diagnosis of second cancer occured in different time period than first cancer 
#> The following strata are affected: 
#> # A tidytable: 166 x 21
#>    yvar_name yvar_label fu_time  age    sex   region     year  t_site i_observed
#>    <chr>     <chr>      <chr>    <chr>  <chr> <chr>      <chr> <chr>       <dbl>
#>  1 total_var Overall    1-5 yea~ 60 - ~ Male  SEER Reg ~ 2010~ C34            17
#>  2 total_var Overall    1-5 yea~ 65 - ~ Male  SEER Reg ~ 2015~ C34            18
#>  3 total_var Overall    5-10 ye~ 60 - ~ Male  SEER Reg ~ 2010~ C34            20
#>  4 total_var Overall    5-10 ye~ 60 - ~ Male  SEER Reg ~ 2010~ C34            22
#>  5 total_var Overall    5-10 ye~ 70 - ~ Male  SEER Reg ~ 2005~ C34            12
#>  6 total_var Overall    5-10 ye~ 75 - ~ Fema~ SEER Reg ~ 2005~ C34            13
#>  7 total_var Overall    10+ yea~ 25 - ~ Fema~ SEER Reg ~ 1995~ C34            14
#>  8 total_var Overall    10+ yea~ 60 - ~ Fema~ SEER Reg ~ 1995~ C34            18
#>  9 total_var Overall    10+ yea~ 60 - ~ Fema~ SEER Reg ~ 1995~ C34            16
#> 10 total_var Overall    10+ yea~ 65 - ~ Male  SEER Reg ~ 2000~ C34            29
#> # ... with 156 more rows, and 12 more variables: i_pyar <dbl>, n_base <dbl>,
#> #   race <chr>, incidence_cases <int>, population_pyar <int>,
#> #   incidence_crude_rate <dbl>, i_expected <dbl>, sir <dbl>, sir_lci <dbl>,
#> #   sir_uci <dbl>, fu_time_sort <int>, yvar_sort <int>
#> 
#>  Check attribute `notes_refcases` of results to see what strata are affected.

# sircalc_results %>% print(n = 100) # uncomment after tidytable 0.5.6 release
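
The argument futime_breaks divides follow-up time into intervals. One way to picture this is that each patient contributes person-years to every interval they pass through. The base R sketch below illustrates that allocation. It is an assumption about the general technique, not a description of the internals of sir_byfutime(), and `pyar_by_interval()` is a hypothetical helper.

```r
# same breaks as in the call above: 0, 1 month, 2 months, 1, 5, 10 years, Inf
futime_breaks <- c(0, 1/12, 2/12, 1, 5, 10, Inf)
lower <- head(futime_breaks, -1)
upper <- tail(futime_breaks, -1)

# person-years that a follow-up time `fu` contributes to each interval
pyar_by_interval <- function(fu) pmin(pmax(fu - lower, 0), upper - lower)

# three hypothetical patients with 0.05, 3 and 12 years of follow-up
pyar <- t(sapply(c(0.05, 3, 12), pyar_by_interval))
colnames(pyar) <- paste0("[", round(lower, 3), ", ", round(upper, 3), ")")
pyar
```

Each row sums to the patient's total follow-up time, so the split only distributes person-years and never creates or loses any.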

Step 9 - Summarize SIR results

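#The summarize function is versatile. Here, for example, is the summary across
#region, age, year, race and tumor site, which leaves one row per follow-up interval.

sircalc_results %>%
  #summarize results across region, age, year, race and t_site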
  summarize_sir_results(.,
                        summarize_groups = c("region", "age", "year", "race"),
                        summarize_site = TRUE,
                        output = "long",  output_information = "minimal",
                        add_total_row = "only",  add_total_fu = "no",
                        collapse_ci = FALSE,  shorten_total_cols = TRUE,
                        fubreak_var_name = "fu_time", ybreak_var_name = "yvar_name",
                        xbreak_var_name = "none", site_var_name = "t_site",
                        alpha = 0.05
                        ) %>%
  dplyr::select(-region, -age, -year, -race, -sex, -yvar_name)
#> Warning: The results file `sir_df` contains observed cases in i_observed that do not occur in the refrates_df (ref_inc_cases).
#> Therefore calculation of the variables n_base and ref_population_pyar is ambiguous. 
#> We take the first value of each variable. Expect small inconsistencies in the calculation of n_base, ref_population_pyar and ref_inc_crude_rate across strata. 
#> If you want to know more, please check the `warnings` column of `sir_df`.
#> # A tidytable: 7 x 8
#>   yvar_label fu_time       fu_time_sort t_site observed expected   sir sir_ci   
#>   <chr>      <chr>                <int> <chr>     <dbl>    <dbl> <dbl> <chr>    
#> 1 Overall    to 1 month               1 Total       327     18.1 18.1  16.19 - ~
#> 2 Overall    0.0833-0.167~            2 Total        80     17.9  4.46 3.54 - 5~
#> 3 Overall    0.167-1 years            3 Total       724    172.   4.2  3.9 - 4.~
#> 4 Overall    1-5 years                4 Total      2998    668.   4.49 4.33 - 4~
#> 5 Overall    5-10 years               5 Total      3089    531.   5.82 5.61 - 6~
#> 6 Overall    10+ years                6 Total      4241    438.   9.69 9.4 - 9.~
#> 7 Overall    Total 0 to I~            7 Total     11459   1845.   6.21 6.1 - 6.~

Built with

sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                    LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] msSPChelpR_0.8.7 magrittr_2.0.1   dplyr_1.0.6     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6        pillar_1.6.1      bslib_0.2.5.1     compiler_4.0.5   
#>  [5] jquerylib_0.1.4   prettyunits_1.1.1 progress_1.2.2    forcats_0.5.1    
#>  [9] tools_4.0.5       digest_0.6.27     jsonlite_1.7.2    lubridate_1.7.10 
#> [13] evaluate_0.14     lifecycle_1.0.0   tibble_3.1.2      pkgconfig_2.0.3  
#> [17] rlang_0.4.11      DBI_1.1.1         cli_2.5.0         rstudioapi_0.13  
#> [21] yaml_2.2.1        haven_2.4.1       xfun_0.23         stringr_1.4.0    
#> [25] knitr_1.33        hms_1.1.0         generics_0.1.0    vctrs_0.3.8      
#> [29] sass_0.4.0        sjlabelled_1.1.8  tidyselect_1.1.1  snakecase_0.11.0 
#> [33] glue_1.4.2        data.table_1.14.0 R6_2.5.0          fansi_0.5.0      
#> [37] rmarkdown_2.9     purrr_0.3.4       tidyr_1.1.3       ps_1.6.0         
#> [41] ellipsis_0.3.2    htmltools_0.5.1.1 insight_0.14.1    assertthat_0.2.1 
#> [45] tidytable_0.6.2   utf8_1.2.1        stringi_1.6.2     crayon_1.4.1