Getting started with getTBinR

Sam Abbott

2019-09-03

Using the package

First load the package. We also load several other packages to help quickly explore the data.

library(getTBinR)
library(ggplot2)
library(knitr)
library(magrittr)
library(dplyr)

Getting TB burden data

Get TB burden data with a single function call. This will download the data if it has never been accessed and then save a local copy to R’s temporary directory (see tempdir()). If a local copy exists from the current session then this will be loaded instead.

tb_burden <- get_tb_burden()
#> Loading data from: /tmp/Rtmp90o0YR/tb_burden.rds
#> Loading data from: /tmp/Rtmp90o0YR/mdr_tb.rds
#> Joining TB burden data and MDR TB data.

tb_burden
#> # A tibble: 3,850 x 68
#>    country iso2  iso3  iso_numeric g_whoregion  year e_pop_num e_inc_100k
#>    <chr>   <chr> <chr>       <int> <chr>       <int>     <int>      <dbl>
#>  1 Afghan… AF    AFG             4 Eastern Me…  2000  20093756        190
#>  2 Afghan… AF    AFG             4 Eastern Me…  2001  20966463        189
#>  3 Afghan… AF    AFG             4 Eastern Me…  2002  21979923        189
#>  4 Afghan… AF    AFG             4 Eastern Me…  2003  23064851        189
#>  5 Afghan… AF    AFG             4 Eastern Me…  2004  24118979        189
#>  6 Afghan… AF    AFG             4 Eastern Me…  2005  25070798        189
#>  7 Afghan… AF    AFG             4 Eastern Me…  2006  25893450        189
#>  8 Afghan… AF    AFG             4 Eastern Me…  2007  26616792        189
#>  9 Afghan… AF    AFG             4 Eastern Me…  2008  27294031        189
#> 10 Afghan… AF    AFG             4 Eastern Me…  2009  28004331        189
#> # … with 3,840 more rows, and 60 more variables: e_inc_100k_lo <dbl>,
#> #   e_inc_100k_hi <dbl>, e_inc_num <int>, e_inc_num_lo <int>,
#> #   e_inc_num_hi <int>, e_tbhiv_prct <dbl>, e_tbhiv_prct_lo <dbl>,
#> #   e_tbhiv_prct_hi <dbl>, e_inc_tbhiv_100k <dbl>,
#> #   e_inc_tbhiv_100k_lo <dbl>, e_inc_tbhiv_100k_hi <dbl>,
#> #   e_inc_tbhiv_num <int>, e_inc_tbhiv_num_lo <int>,
#> #   e_inc_tbhiv_num_hi <int>, e_mort_exc_tbhiv_100k <dbl>,
#> #   e_mort_exc_tbhiv_100k_lo <dbl>, e_mort_exc_tbhiv_100k_hi <dbl>,
#> #   e_mort_exc_tbhiv_num <int>, e_mort_exc_tbhiv_num_lo <int>,
#> #   e_mort_exc_tbhiv_num_hi <int>, e_mort_tbhiv_100k <dbl>,
#> #   e_mort_tbhiv_100k_lo <dbl>, e_mort_tbhiv_100k_hi <dbl>,
#> #   e_mort_tbhiv_num <int>, e_mort_tbhiv_num_lo <int>,
#> #   e_mort_tbhiv_num_hi <int>, e_mort_100k <dbl>, e_mort_100k_lo <dbl>,
#> #   e_mort_100k_hi <dbl>, e_mort_num <int>, e_mort_num_lo <int>,
#> #   e_mort_num_hi <int>, cfr <dbl>, cfr_lo <dbl>, cfr_hi <dbl>,
#> #   c_newinc_100k <dbl>, c_cdr <dbl>, c_cdr_lo <dbl>, c_cdr_hi <dbl>,
#> #   source_rr_new <chr>, source_drs_coverage_new <chr>,
#> #   source_drs_year_new <int>, e_rr_pct_new <dbl>, e_rr_pct_new_lo <dbl>,
#> #   e_rr_pct_new_hi <dbl>, e_mdr_pct_rr_new <int>, source_rr_ret <chr>,
#> #   source_drs_coverage_ret <chr>, source_drs_year_ret <int>,
#> #   e_rr_pct_ret <dbl>, e_rr_pct_ret_lo <dbl>, e_rr_pct_ret_hi <dbl>,
#> #   e_mdr_pct_rr_ret <int>, e_inc_rr_num <int>, e_inc_rr_num_lo <int>,
#> #   e_inc_rr_num_hi <int>, e_mdr_pct_rr <int>,
#> #   e_rr_in_notified_pulm <int>, e_rr_in_notified_pulm_lo <int>,
#> #   e_rr_in_notified_pulm_hi <int>
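
The download-once behaviour described above can be sketched in base R. This is only an illustration of the caching pattern, not getTBinR's implementation: load_cached(), fake_download(), and the demo_tb_burden file name are hypothetical, and the "download" is a stand-in that builds a tiny data frame.

```r
# Hypothetical helper: fetch a dataset once, then reuse the copy cached on disk.
load_cached <- function(name, fetch, dir = tempdir()) {
  path <- file.path(dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    message("Loading data from: ", path)
    return(readRDS(path))
  }
  message("Downloading data and saving to: ", path)
  data <- fetch()
  saveRDS(data, path)
  data
}

# Stand-in for a real download; counts how often it is called.
n_downloads <- 0
fake_download <- function() {
  n_downloads <<- n_downloads + 1
  data.frame(country = c("A", "B"), e_inc_100k = c(190, 12))
}

# Use a fresh folder inside the session's temporary directory for the demo.
cache_dir <- tempfile("demo_cache_")
dir.create(cache_dir)

first  <- load_cached("demo_tb_burden", fake_download, dir = cache_dir)  # "downloads"
second <- load_cached("demo_tb_burden", fake_download, dir = cache_dir)  # reads the cached copy
```

Because tempdir() is cleared when the R session ends, a cache of this kind only lasts for the current session.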

Getting additional datasets

In addition to the core datasets provided by default, getTBinR supports importing several other datasets. These include data on latent TB, HIV surveillance, intervention budgets, and outcomes. The currently supported datasets are listed below:

knitr::kable(available_datasets[, 1:4])
| dataset | description | timespan | default |
|---|---|---|---|
| Estimates | Generated estimates of TB mortality, incidence, case fatality ratio, and treatment coverage (previously called case detection rate). Data available split by HIV status. | 2000-2017 | yes |
| Estimates | Generated estimates for the proportion of TB cases that have rifampicin-resistant TB (RR-TB, which includes cases with multidrug-resistant TB, MDR-TB), RR/MDR-TB among notified pulmonary TB cases. | 2017 | yes |
| Incidence by age and sex | Generated estimates of TB incidence stratified by age and sex. This dataset is currently experimental. | 2017 | no |
| Latent TB infection | Generated estimates incidence of latent TB stratified by age. | 2017 | no |
| Notification | TB notification dataset linking to TB notifications as raw numbers. Age-stratified, with good data dictionary coverage but has large amounts of missing data. | 1980-2017 | no |
| Drug resistance surveillance | Country level drug resistance surveillance. Lists drug resistance data from country level reporting. Good data dictionary coverage but has large amounts of missing data. | 2017 | no |
| Non-routine HIV surveillance | Country level, non-routine HIV surveillance data. Good data dictionary coverage but with a large amount of missing data. | 2007-2017 | no |
| Outcomes | Country level TB outcomes data. Lists numeric outcome data, very messy but with good data dictionary coverage. | 1994-2017 | no |
| Budget | Current year TB intervention budgets per country. Many of the data fields are cryptic but has good data dictionary coverage. | 2018 | no |
| Expenditure and utilisation | Previous year expenditure on TB interventions. Highly detailed, with good data dictionary coverage but lots of missing data. | 2017 | no |
| Policies and services | Lists TB policies that have been implemented per country. Highly detailed, with good data dictionary coverage but lots of missing data. | 2017 | no |
| Community engagement | Lists community engagement programmes. Highly detailed, with good data dictionary coverage but lots of missing data. | 2013-2017 | no |
| Laboratories | Country specific laboratory data. Highly detailed, with good data dictionary coverage but lots of missing data. | 2009-2017 | no |

These datasets can be imported into R by supplying the name of the required dataset to the additional_datasets argument of get_tb_burden (or of any of the plotting and summary functions). Alternatively, they can all be imported in one go using additional_datasets = "all", as below:

get_tb_burden(additional_datasets = "all", verbose = FALSE)
#> # A tibble: 8,290 x 461
#>    country iso2  iso3  iso_numeric g_whoregion  year e_pop_num e_inc_100k
#>    <chr>   <chr> <chr>       <int> <chr>       <int>     <int>      <dbl>
#>  1 Afghan… AF    AFG             4 Eastern Me…  2000  20093756        190
#>  2 Afghan… AF    AFG             4 Eastern Me…  2001  20966463        189
#>  3 Afghan… AF    AFG             4 Eastern Me…  2002  21979923        189
#>  4 Afghan… AF    AFG             4 Eastern Me…  2003  23064851        189
#>  5 Afghan… AF    AFG             4 Eastern Me…  2004  24118979        189
#>  6 Afghan… AF    AFG             4 Eastern Me…  2005  25070798        189
#>  7 Afghan… AF    AFG             4 Eastern Me…  2006  25893450        189
#>  8 Afghan… AF    AFG             4 Eastern Me…  2007  26616792        189
#>  9 Afghan… AF    AFG             4 Eastern Me…  2008  27294031        189
#> 10 Afghan… AF    AFG             4 Eastern Me…  2009  28004331        189
#> # … with 8,280 more rows, and 453 more variables: e_inc_100k_lo <dbl>,
#> #   e_inc_100k_hi <dbl>, e_inc_num <int>, e_inc_num_lo <int>,
#> #   e_inc_num_hi <int>, e_tbhiv_prct <dbl>, e_tbhiv_prct_lo <dbl>,
#> #   e_tbhiv_prct_hi <dbl>, e_inc_tbhiv_100k <dbl>,
#> #   e_inc_tbhiv_100k_lo <dbl>, e_inc_tbhiv_100k_hi <dbl>,
#> #   e_inc_tbhiv_num <int>, e_inc_tbhiv_num_lo <int>,
#> #   e_inc_tbhiv_num_hi <int>, e_mort_exc_tbhiv_100k <dbl>,
#> #   e_mort_exc_tbhiv_100k_lo <dbl>, e_mort_exc_tbhiv_100k_hi <dbl>,
#> #   e_mort_exc_tbhiv_num <int>, e_mort_exc_tbhiv_num_lo <int>,
#> #   e_mort_exc_tbhiv_num_hi <int>, e_mort_tbhiv_100k <dbl>,
#> #   e_mort_tbhiv_100k_lo <dbl>, e_mort_tbhiv_100k_hi <dbl>,
#> #   e_mort_tbhiv_num <int>, e_mort_tbhiv_num_lo <int>,
#> #   e_mort_tbhiv_num_hi <int>, e_mort_100k <dbl>, e_mort_100k_lo <dbl>,
#> #   e_mort_100k_hi <dbl>, e_mort_num <int>, e_mort_num_lo <int>,
#> #   e_mort_num_hi <int>, cfr <dbl>, cfr_lo <dbl>, cfr_hi <dbl>,
#> #   c_newinc_100k <dbl>, c_cdr <dbl>, c_cdr_lo <dbl>, c_cdr_hi <dbl>,
#> #   source_rr_new <chr>, source_drs_coverage_new <chr>,
#> #   source_drs_year_new <int>, e_rr_pct_new <dbl>, e_rr_pct_new_lo <dbl>,
#> #   e_rr_pct_new_hi <dbl>, e_mdr_pct_rr_new <int>, source_rr_ret <chr>,
#> #   source_drs_coverage_ret <chr>, source_drs_year_ret <int>,
#> #   e_rr_pct_ret <dbl>, e_rr_pct_ret_lo <dbl>, e_rr_pct_ret_hi <dbl>,
#> #   e_mdr_pct_rr_ret <int>, e_inc_rr_num <int>, e_inc_rr_num_lo <int>,
#> #   e_inc_rr_num_hi <int>, e_mdr_pct_rr <int>,
#> #   e_rr_in_notified_pulm <int>, e_rr_in_notified_pulm_lo <int>,
#> #   e_rr_in_notified_pulm_hi <int>, source_hh <chr>, e_hh_size <dbl>,
#> #   prevtx_data_available <int>, newinc_con04_prevtx <int>,
#> #   ptsurvey_newinc <lgl>, ptsurvey_newinc_con04_prevtx <lgl>,
#> #   e_prevtx_eligible <dbl>, e_prevtx_eligible_lo <dbl>,
#> #   e_prevtx_eligible_hi <dbl>, e_prevtx_kids_pct <dbl>,
#> #   e_prevtx_kids_pct_lo <dbl>, e_prevtx_kids_pct_hi <dbl>, new_sp <int>,
#> #   new_sn <int>, new_su <int>, new_ep <int>, new_oth <int>,
#> #   ret_rel <int>, ret_taf <int>, ret_tad <int>, ret_oth <int>,
#> #   newret_oth <int>, new_labconf <int>, new_clindx <int>,
#> #   ret_rel_labconf <int>, ret_rel_clindx <int>, ret_rel_ep <int>,
#> #   ret_nrel <int>, notif_foreign <int>, c_newinc <int>, new_sp_m04 <int>,
#> #   new_sp_m514 <int>, new_sp_m014 <int>, new_sp_m1524 <int>,
#> #   new_sp_m2534 <int>, new_sp_m3544 <int>, new_sp_m4554 <int>,
#> #   new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_mu <int>, …

Once imported, these datasets can be used in the plotting and summary functions provided by getTBinR (by passing them to their df argument or using the additional_datasets argument in each function).
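
As a short sketch of both routes, the example below imports a single additional dataset by name and then hands the result to a plotting function through df. It assumes the dataset name must be given exactly as it appears in the table of available datasets, that getTBinR is installed, and that an internet connection is available on first use; tb_with_latent is just an illustrative object name.

```r
library(getTBinR)

# Import the core data plus one additional dataset, named as in the
# table of available datasets.
tb_with_latent <- get_tb_burden(additional_datasets = "Latent TB infection",
                                verbose = FALSE)

# Variables from the latent TB dataset are now columns of the result.
c("e_hh_size", "e_prevtx_kids_pct") %in% names(tb_with_latent)

# Reuse the imported data in a plotting function via its df argument.
plot_tb_burden_overview(df = tb_with_latent,
                        metric = "e_inc_100k",
                        interactive = FALSE)
```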

Searching for variable definitions

The WHO provides a large, detailed data dictionary for use with the TB burden data. However, searching through this dataset can be tedious. To streamline this process, getTBinR provides a search function that finds the definition of one or more variables. As with the burden data, the first call downloads the data dictionary to the temporary directory, and subsequent calls load the local copy.

vars_of_interest <- search_data_dict(var = c("country",
                                             "e_inc_100k",
                                             "e_inc_100k_lo",
                                             "e_inc_100k_hi"))
#> Loading data from: /tmp/Rtmp90o0YR/dictionary.rds
#> 4 results found for your variable search for country, e_inc_100k, e_inc_100k_lo, e_inc_100k_hi

knitr::kable(vars_of_interest)
| variable_name | dataset | code_list | definition |
|---|---|---|---|
| country | Country identification | | Country or territory name |
| e_inc_100k | Estimates | | Estimated incidence (all forms) per 100 000 population |
| e_inc_100k_hi | Estimates | | Estimated incidence (all forms) per 100 000 population, high bound |
| e_inc_100k_lo | Estimates | | Estimated incidence (all forms) per 100 000 population, low bound |

We might also want to search the variable definitions for key phrases, for example "mortality".

defs_of_interest <- search_data_dict(def = c("mortality"))
#> Loading data from: /tmp/Rtmp90o0YR/dictionary.rds
#> 9 results found for your definition search for mortality

knitr::kable(defs_of_interest)
| variable_name | dataset | code_list | definition |
|---|---|---|---|
| e_mort_100k | Estimates | | Estimated mortality of TB cases (all forms) per 100 000 population |
| e_mort_100k_hi | Estimates | | Estimated mortality of TB cases (all forms) per 100 000 population, high bound |
| e_mort_100k_lo | Estimates | | Estimated mortality of TB cases (all forms) per 100 000 population, low bound |
| e_mort_exc_tbhiv_100k | Estimates | | Estimated mortality of TB cases (all forms, excluding HIV) per 100 000 population |
| e_mort_exc_tbhiv_100k_hi | Estimates | | Estimated mortality of TB cases (all forms, excluding HIV), per 100 000 population, high bound |
| e_mort_exc_tbhiv_100k_lo | Estimates | | Estimated mortality of TB cases (all forms, excluding HIV), per 100 000 population, low bound |
| e_mort_tbhiv_100k | Estimates | | Estimated mortality of TB cases who are HIV-positive, per 100 000 population |
| e_mort_tbhiv_100k_hi | Estimates | | Estimated mortality of TB cases who are HIV-positive, per 100 000 population, high bound |
| e_mort_tbhiv_100k_lo | Estimates | | Estimated mortality of TB cases who are HIV-positive, per 100 000 population, low bound |

Finally, we could search both for a known variable and for key phrases in variable definitions.

vars_defs_of_interest <- search_data_dict(var = c("country"),
                                          def = c("mortality"))
#> Loading data from: /tmp/Rtmp90o0YR/dictionary.rds
#> 1 results found for your variable search for country
#> 9 results found for your definition search for mortality

knitr::kable(vars_defs_of_interest)
| variable_name | dataset | code_list | definition |
|---|---|---|---|
| country | Country identification | | Country or territory name |
| e_mort_100k | Estimates | | Estimated mortality of TB cases (all forms) per 100 000 population |
| e_mort_100k_hi | Estimates | | Estimated mortality of TB cases (all forms) per 100 000 population, high bound |
| e_mort_100k_lo | Estimates | | Estimated mortality of TB cases (all forms) per 100 000 population, low bound |
| e_mort_exc_tbhiv_100k | Estimates | | Estimated mortality of TB cases (all forms, excluding HIV) per 100 000 population |
| e_mort_exc_tbhiv_100k_hi | Estimates | | Estimated mortality of TB cases (all forms, excluding HIV), per 100 000 population, high bound |
| e_mort_exc_tbhiv_100k_lo | Estimates | | Estimated mortality of TB cases (all forms, excluding HIV), per 100 000 population, low bound |
| e_mort_tbhiv_100k | Estimates | | Estimated mortality of TB cases who are HIV-positive, per 100 000 population |
| e_mort_tbhiv_100k_hi | Estimates | | Estimated mortality of TB cases who are HIV-positive, per 100 000 population, high bound |
| e_mort_tbhiv_100k_lo | Estimates | | Estimated mortality of TB cases who are HIV-positive, per 100 000 population, low bound |

Searching for dataset details

search_data_dict can also be used to explore the details of the variables included in each dataset. For example, we could explore all the variables included in the Latent TB dataset:

dataset_of_interest <- search_data_dict(dataset = "Latent")
#> Loading data from: /tmp/Rtmp90o0YR/dictionary.rds
#> 12 results found for your dataset search for Latent

knitr::kable(dataset_of_interest)
| variable_name | dataset | code_list | definition |
|---|---|---|---|
| e_prevtx_kids_pct | Latent TB infection | | Estimated % of children received TB preventive therapy aged under 5 who are household contacts of TB cases and who are eligible for TB preventive therapy |
| e_prevtx_kids_pct_hi | Latent TB infection | | Estimated % of children received TB preventive therapy aged under 5 who are household contacts of TB cases and who are eligible for TB preventive therapy: High bound |
| e_prevtx_kids_pct_lo | Latent TB infection | | Estimated % of children received TB preventive therapy aged under 5 who are household contacts of TB cases and who are eligible for TB preventive therapy: Low bound |
| e_hh_size | Latent TB infection | | Estimated average household size |
| e_prevtx_eligible | Latent TB infection | | Estimated number of children aged under 5 who are household contacts of TB cases who are eligible for TB preventive therapy |
| e_prevtx_eligible_hi | Latent TB infection | | Estimated number of children aged under 5 who are household contacts of TB cases who are eligible for TB preventive therapy: high bound |
| e_prevtx_eligible_lo | Latent TB infection | | Estimated number of children aged under 5 who are household contacts of TB cases who are eligible for TB preventive therapy: low bound |
| newinc_con04_prevtx | Latent TB infection | | (If prevtx_data_available=60) Number of children aged under 5 started on TB preventive therapy who are household contacts of bacteriologically-confirmed new and relapse TB cases notified |
| prevtx_data_available | Latent TB infection | 0=No; 60= Yes available from the routine surveillance system; 61=Yes estimated from a survey of a random sample of medical records or treatment cards of TB patients representative of the national TB patient population | Are data available on the number of children aged under 5 who are household contacts of TB cases and started on TB preventive therapy? |
| ptsurvey_newinc | Latent TB infection | | (If prevtx_data_available=61) Number of bacteriologically-confirmed TB new and relapse cases notified in the reporting year whose medical records or treatment cards were included in a survey |
| ptsurvey_newinc_con04_prevtx | Latent TB infection | | (If prevtx_data_available=61) Number of children aged under 5 started on TB preventive therapy who are household contacts of the TB cases in ptsurvey_newinc |
| source_hh | Latent TB infection | | Source of estimate of average household size |
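
The three kinds of search shown above can be mimicked on a toy dictionary in base R. This is a rough sketch and not the package's implementation: toy_dict and toy_search() are hypothetical. It assumes variable names are matched exactly (consistent with the single result reported for e_inc_100k in the output elsewhere in this vignette), while definitions and dataset names are matched on a fixed substring.

```r
# Toy data dictionary using column names from the tables above.
toy_dict <- data.frame(
  variable_name = c("country", "e_inc_100k", "e_inc_100k_lo",
                    "e_mort_100k", "e_hh_size"),
  dataset = c("Country identification", "Estimates", "Estimates",
              "Estimates", "Latent TB infection"),
  definition = c(
    "Country or territory name",
    "Estimated incidence (all forms) per 100 000 population",
    "Estimated incidence (all forms) per 100 000 population, low bound",
    "Estimated mortality of TB cases (all forms) per 100 000 population",
    "Estimated average household size"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for search_data_dict(): rows matching any of the
# supplied criteria are returned.
toy_search <- function(dict, var = NULL, def = NULL, dataset = NULL) {
  keep <- rep(FALSE, nrow(dict))
  if (!is.null(var)) {
    keep <- keep | dict$variable_name %in% var  # exact name match (assumed)
  }
  for (phrase in def) {
    keep <- keep | grepl(phrase, dict$definition, fixed = TRUE)  # substring match
  }
  for (name in dataset) {
    keep <- keep | grepl(name, dict$dataset, fixed = TRUE)  # substring match
  }
  dict[keep, , drop = FALSE]
}

exact_hit  <- toy_search(toy_dict, var = "e_inc_100k")$variable_name
var_or_def <- toy_search(toy_dict, var = "country", def = "mortality")$variable_name
by_dataset <- toy_search(toy_dict, dataset = "Latent")$variable_name
```

In this toy version the exact match on e_inc_100k does not pick up e_inc_100k_lo, whereas the partial dataset name "Latent" is enough to find the "Latent TB infection" rows.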

Mapping Global Incidence Rates

To start exploring the WHO TB data, we map the most recently available global TB incidence rates. Mapping data can help identify spatial patterns.

getTBinR::map_tb_burden(metric = "e_inc_100k")
#> Loading data from: /tmp/Rtmp90o0YR/tb_burden.rds
#> Loading data from: /tmp/Rtmp90o0YR/mdr_tb.rds
#> Joining TB burden data and MDR TB data.
#> Loading data from: /tmp/Rtmp90o0YR/dictionary.rds
#> 1 results found for your variable search for e_inc_100k

Plotting Incidence Rates for All Countries

To showcase how quickly we can go from no data to informative graphs, we plot incidence rates for all countries in the WHO data.

getTBinR::plot_tb_burden_overview(metric = "e_inc_100k",
                                  interactive = FALSE)
#> Loading data from: /tmp/Rtmp90o0YR/tb_burden.rds
#> Loading data from: /tmp/Rtmp90o0YR/mdr_tb.rds
#> Joining TB burden data and MDR TB data.
#> Loading data from: /tmp/Rtmp90o0YR/dictionary.rds
#> 1 results found for your variable search for e_inc_100k

Another way to compare incidence rates in countries is to look at the annual percentage change. The plot below only shows countries whose incidence rate stayed above 5 per 100,000 in every year of the data (that is, countries with a minimum incidence rate above 5).

higher_burden_countries <- tb_burden %>% 
  group_by(country) %>% 
  summarise(e_inc_100k = min(e_inc_100k)) %>% 
  filter(e_inc_100k > 5) %>% 
  pull(country) %>% 
  unique
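
The filter above keeps a country only if its lowest yearly estimate exceeds 5. A toy example, with made-up countries and values, shows how that differs from filtering on the highest yearly estimate; toy_burden and the derived object names are hypothetical.

```r
library(dplyr)

# Made-up incidence rates for three fictional countries.
toy_burden <- data.frame(
  country = rep(c("Low", "Falling", "High"), each = 3),
  year = rep(2015:2017, times = 3),
  e_inc_100k = c(2, 3, 2, 8, 6, 4, 50, 48, 45)
)

by_country <- toy_burden %>%
  group_by(country) %>%
  summarise(min_inc = min(e_inc_100k),
            max_inc = max(e_inc_100k))

# min() > 5: incidence above 5 in every year.
above_5_every_year <- by_country %>% filter(min_inc > 5) %>% pull(country)

# max() > 5: incidence above 5 in at least one year.
above_5_any_year <- by_country %>% filter(max_inc > 5) %>% pull(country)
```

Here "Falling" passes the max() filter but not the min() filter, because its rate drops to 4 in the final year.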

getTBinR::plot_tb_burden_overview(metric = "e_inc_100k",
                                  interactive = FALSE,
                                  annual_change = TRUE,
                                  countries = higher_burden_countries)
#> Loading data from: /tmp/Rtmp90o0YR/tb_burden.rds
#> Loading data from: /tmp/Rtmp90o0YR/mdr_tb.rds
#> Joining TB burden data and MDR TB data.
#> Loading data from: /tmp/Rtmp90o0YR/dictionary.rds
#> 1 results found for your variable search for e_inc_100k
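
One common definition of annual percentage change is the difference from the previous year divided by the previous year's value, times 100. The sketch below assumes that definition and applies it to made-up data; how plot_tb_burden_overview calculates the change internally is not shown in this vignette and may differ. toy_rates and annual_change_pct are hypothetical names.

```r
library(dplyr)

# Made-up incidence rates for two fictional countries.
toy_rates <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year = rep(2015:2017, times = 2),
  e_inc_100k = c(200, 190, 171, 10, 11, 11)
)

toy_change <- toy_rates %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  # Change relative to the previous year, as a percentage; the first
  # year of each country has no previous value and is NA.
  mutate(annual_change_pct = 100 * (e_inc_100k - dplyr::lag(e_inc_100k)) /
           dplyr::lag(e_inc_100k)) %>%
  ungroup()
```

Grouping by country before calling lag() keeps the first year of one country from being compared with the last year of another.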