There has been a long relationship between Epidemiology / Public Health and Biostatistics. Students frequently find introductory textbooks explaining statistical methods and the math behind them, but how to implement those techniques on a computer is rarely explained.

One of the most popular statistical software’s in public health
settings is `Stata`

. `Stata`

has the advantage of
offering a solid interface with functions that can be accessed via the
use of commands or by selecting the proper input in the graphical unit
interface (GUI). Furthermore, `Stata`

offers *“Grad
Plans”* to postgraduate students, making it relatively affordable
from an economic point of view.

`R`

is a free alternative to `Stata`

. Its use
keeps growing, and its popularity is also increasing due to the
relatively high number of packages and textbooks available.

In epidemiology, some good packages are already available for
`R`

, including: `Epi`

, `epibasix`

,
`epiDisplay`

, `epiR`

and `epitools`

.
The `pubh`

package does not intend to replace any of them,
but to only provide a standard syntax for the most frequent statistical
analysis in epidemiology.

Most students and professionals from the disciplines of Epidemiology and Public Health analyse variables in terms of outcome, exposure and confounders. The following table shows the most common names used in the literature to characterise variables in a cause-effect relationships:

Response variable |
Explanatory variable(s) |
---|---|

Outcome | Exposure and confounders |

Outcome | Predictors |

Dependent variable | Independent variable(s) |

y | x |

In `R`

, `formulas`

declare relationships
between variables. Formulas are also common in classic statistical tests
and regression methods.

Formulas have the following standard syntax:

Where `y`

is the outcome or response variable,
`x`

is the exposure or predictor of interest, and
`my_data`

specifies the data frame’s name where
`x`

and `y`

can be found.

The symbol `~`

is used in `R`

for formulas. It
can be interpreted as *depends on*. In the most typical scenario,
`y ~ x`

means `y`

depends on `x`

or
`y`

is a function of `x`

:

`y = f(x)`

Using epidemiology friendly terms:

`Outcome = f(Exposure)`

It is worth noting that `Stata`

requires variables to be
given in the same order as in formulas: outcome first, predictors
next.

The `pubh`

package integrates well with other packages of
the `tidyverse`

which use formulas and the pipe operator
`|>`

to add layers over functions. In particular,
`pubh`

uses `ggformula`

as a graphical interface
for plotting and takes advantage of variable labels from
`sjlabelled`

. This versatility allows it to interact also
with tables from `crosstable`

.

One way to control for confounders is the use of stratification. In
`Stata`

, stratification is done by including the
`by`

option as part of the command. In the
`ggformula`

package, one way of doing stratification is with
a formula like:

Where `y`

is the outcome or response variable,
`x`

is the exposure or predictor of interest, `z`

is the stratification variable (a factor) and `my_data`

specifies the name of the data frame where `x`

,
`y`

and `z`

can be found.

`tidyverse`

The `tidyverse`

has become now the standard in data
manipulation in `R`

. The use of the pipe function
`|>`

allows for cleaner code. The principle of pipes is to
change the paradigm in coding. In standard codding, when many functions
are used, one goes from inner parenthesis to outer ones.

For example if we have three functions, a common syntax would look like:

`f3(f2(f1(..., data), ...), ...)`

With pipes, however, the code reads top to bottom and left to right:

Most of the functions from `pubh`

are compatible with
pipes and the `tidyverse`

.

`pubh`

packageThe main contributions of the `pubh`

package to students
and professionals from the disciplines of Epidemiology and Public Health
are:

- Use of a common syntax for the most used analysis.
- A function,
`glm_coef`

that displays coefficients from most common regression analysis in a way that can be easy interpreted and used for publications.

There are many options currently available for displaying descriptive
statistics. I recommend the function `crosstable`

from the
`crosstable`

package for constructing tables of descriptive
statistics where results need to be stratified.

The `estat`

function from the `pubh`

package
displays descriptive statistics of continuous outcomes; it can use
*labels* to display nice tables. To aid in understanding the
variability, `estat`

also shows the relative dispersion
(coefficient of variation), which is particularly interesting for
comparing variances between groups (factors).

Some examples. We start by loading packages.

```
rm(list = ls())
library(dplyr)
library(rstatix)
library(crosstable)
library(pubh)
library(sjlabelled)
```

We will use a data set about a study of onchocerciasis in Sierra Leone.

```
## # A tibble: 6 × 7
## id mf area agegrp sex mfload lesions
## <dbl> <fct> <fct> <fct> <fct> <dbl> <fct>
## 1 1 Infected Savannah 20-39 Female 1 No
## 2 2 Infected Rainforest 40+ Male 3 No
## 3 3 Infected Savannah 40+ Female 1 No
## 4 4 Not-infected Rainforest 20-39 Female 0 No
## 5 5 Not-infected Savannah 40+ Female 0 No
## 6 6 Not-infected Rainforest 20-39 Female 0 No
```

```
crosstable_options(
total = "row",
percent_pattern="{n} ({p_col})",
percent_digits = 1,
funs = c("Mean (std)" = meansd, "Median [IQR]" = mediqr)
)
```

A two-by-two contingency table:

```
Oncho |>
select(mf, area) |>
mutate(
mf = relevel(mf, ref = "Infected")
) |>
copy_labels(Oncho) |>
crosstable(by = area) |>
ctf()
```

```
## Warning: `cross_to_flextable()` was deprecated in crosstable 0.1.0.
## ℹ Please use `as_flextable()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```

label | variable | Residence | Total | |
---|---|---|---|---|

Savannah | Rainforest | |||

Infection | Infected | 281 (51.3%) | 541 (71.8%) | 822 (63.1%) |

Not-infected | 267 (48.7%) | 213 (28.2%) | 480 (36.9%) |

Table with all descriptive statistics except `id`

and
`mfload`

:

```
Oncho |>
select(- c(id, mfload)) |>
mutate(
mf = relevel(mf, ref = "Infected")
) |>
copy_labels(Oncho) |>
crosstable(by = area) |>
ctf()
```

label | variable | Residence | Total | |
---|---|---|---|---|

Savannah | Rainforest | |||

Infection | Infected | 281 (51.3%) | 541 (71.8%) | 822 (63.1%) |

Not-infected | 267 (48.7%) | 213 (28.2%) | 480 (36.9%) | |

Age group (years) | 5-9 | 93 (17.0%) | 109 (14.5%) | 202 (15.5%) |

10-19 | 72 (13.1%) | 146 (19.4%) | 218 (16.7%) | |

20-39 | 208 (38.0%) | 216 (28.6%) | 424 (32.6%) | |

40+ | 175 (31.9%) | 283 (37.5%) | 458 (35.2%) | |

Sex | Male | 247 (45.1%) | 369 (48.9%) | 616 (47.3%) |

Female | 301 (54.9%) | 385 (51.1%) | 686 (52.7%) | |

Severe eye lesions? | No | 481 (87.8%) | 620 (82.2%) | 1101 (84.6%) |

Yes | 67 (12.2%) | 134 (17.8%) | 201 (15.4%) |

Next, we use a data set about blood counts of T cells from patients
in remission from Hodgkin’s disease or in remission from disseminated
malignancies. We generate the new variable `Ratio`

and add
labels to the variables.

```
data(Hodgkin)
Hodgkin <- Hodgkin |>
mutate(Ratio = CD4/CD8) |>
var_labels(
Ratio = "CD4+ / CD8+ T-cells ratio"
)
Hodgkin |> head()
```

```
## # A tibble: 6 × 4
## CD4 CD8 Group Ratio
## <int> <int> <fct> <dbl>
## 1 396 836 Hodgkin 0.474
## 2 568 978 Hodgkin 0.581
## 3 1212 1678 Hodgkin 0.722
## 4 171 212 Hodgkin 0.807
## 5 554 670 Hodgkin 0.827
## 6 1104 1335 Hodgkin 0.827
```

Descriptive statistics for CD4+ T cells:

```
## N Min Max Mean Median SD CV
## 1 CD4+ T-cells 40 116 2415 672.62 528.5 470.49 0.7
```

In the previous code, the left-hand side of the formula is empty as it’s the case when working with only one variable.

For stratification, `estat`

recognises the following two
syntaxes:

`outcome ~ exposure`

`~ outcome | exposure`

where, `outcome`

is continuous and `exposure`

is a categorical (`factor`

) variable.

For example:

```
## Disease N Min Max Mean Median SD CV
## 1 CD4+ / CD8+ T-cells ratio Non-Hodgkin 20 1.10 3.49 2.12 2.15 0.73 0.34
## 2 Hodgkin 20 0.47 3.82 1.50 1.19 0.91 0.61
```

As before, we can report a table of descriptive statistics for all variables in the data set:

```
Hodgkin |>
mutate(
Group = relevel(Group, ref = "Hodgkin")
) |>
copy_labels(Hodgkin) |>
crosstable(by = Group) |>
ctf()
```

label | variable | Disease | Total | |
---|---|---|---|---|

Hodgkin | Non-Hodgkin | |||

CD4+ T-cells | Mean (std) | 823.2 (566.4) | 522.0 (293.0) | 672.6 (470.5) |

Median [IQR] | 681.5 [396.8;1131.0] | 433.0 [360.0;709.0] | 528.5 [375.0;916.0] | |

CD8+ T-cells | Mean (std) | 613.9 (397.9) | 260.0 (154.7) | 436.9 (347.7) |

Median [IQR] | 447.5 [304.8;817.2] | 231.5 [149.8;322.5] | 319.0 [209.0;588.0] | |

CD4+ / CD8+ T-cells ratio | Mean (std) | 1.5 (0.9) | 2.1 (0.7) | 1.8 (0.9) |

Median [IQR] | 1.2 [0.8;2.0] | 2.2 [1.6;2.7] | 1.7 [1.1;2.3] |

From the descriptive statistics of *Ratio*, we know that the
relative dispersion in the Hodgkin group is almost as high as the
relative dispersion in the non-Hodgkin group.

For new users of `R`

, it helps to have a standard syntax
in most of the commands, as `R`

could sometimes be
challenging and even intimidating. We can test if the difference in
variance is statistically significant with the `var.test`

command, which uses can use a formula syntax:

```
##
## F test to compare two variances
##
## data: Ratio by Group
## F = 0.6294, num df = 19, denom df = 19, p-value = 0.3214
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.2491238 1.5901459
## sample estimates:
## ratio of variances
## 0.6293991
```

What about normality? We can look at the QQ-plots against the standard Normal distribution.

Let’s say we choose a non-parametric test to compare the groups:

```
##
## Wilcoxon rank sum exact test
##
## data: Ratio by Group
## W = 298, p-value = 0.007331
## alternative hypothesis: true location shift is not equal to 0
```

For relatively small samples (for example, less than 30 observations
per group), it is a standard practice to show the actual data in dot
plots with error bars. The `pubh`

package offers two options
to show graphically differences in continuous outcomes among groups:

- For small samples:
`strip_error`

- For medium to large samples:
`bar_error`

For our current example:

The error bars represent 95% confidence intervals around mean values.

Adding a line on top is relatively easy to show that the two groups
are significantly different. The function `gf_star`

needs the
reference point on how to draw an horizontal line to display statistical
differences or to annotate a plot; in summary, `gf_star`

:

- Draws an horizontal line from \((x1, y2)\) to \((x2, y2)\).
- Draws a vertical line from \((x1, y1)\) to \((x1, y1)\).
- Draws a vertical line from \((x2, y1)\) to \((x2, y1)\).
- Writes a character (the legend or “star”) at the mid point between \(x1\) and \(x2\) at high \(y3\).

Thus:

\[ y1 < y2 < y3 \]

In our current example:

For larger samples we could use bar charts with error bars. For example:

```
data(birthwt, package = "MASS")
birthwt <- birthwt |>
mutate(
smoke = factor(smoke, labels = c("Non-smoker", "Smoker")),
Race = factor(race > 1, labels = c("White", "Non-white")),
race = factor(race, labels = c("White", "Afican American", "Other"))
) |>
var_labels(
bwt = 'Birth weight (g)',
smoke = 'Smoking status',
race = 'Race',
)
```

Quick normality check:

The (unadjusted) \(t\)-test:

```
## estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic p
## 1 283.7767 3055.696 2771.919 bwt Non-smoker Smoker 115 74 2.729886 0.007
## df conf.low conf.high method alternative
## 1 170.1002 78.57486 488.9786 T-test two.sided
```

The final plot with annotation to highlight statistical difference (unadjusted):

Both `strip_error`

and `bar_error`

can generate
plots stratified by a third variable, for example:

The `pubh`

package includes the function
`cosm_reg`

, which adds some cosmetics to objects generated by
`tbl_regression`

and `huxtable`

. The function
`multiple`

helps analyse and visualise multiple
comparisons.

For simplicity, here we show the analysis of the linear model of smoking on birth weight, adjusting by race (and not by other potential confounders).

```
## Parameter Coefficient Pr(>|t|)
## 1 Constant 3334.95 (3153.89, 3516.01) < 0.001
## 2 Smoking status: Smoker -428.73 (-643.86, -213.6) < 0.001
## 3 Race: Afican American -450.36 (-752.45, -148.27) 0.004
## 4 Race: Other -452.88 (-682.67, -223.08) < 0.001
```

In the last table of coefficients, confidence intervals and p-values
for race levels are not adjusted for multiple comparisons. We can use
functions from the `emmeans`

package to make the
corrections.

```
## contrast estimate SE t.ratio p.value lower.CL upper.CL
## 1 Afican American - White -450.36 153.12 -2.94 0.01 -810.85 -89.87
## 2 Other - White -452.88 116.48 -3.89 < 0.001 -727.10 -178.66
## 3 Other - Afican American -2.52 160.59 -0.02 1 -380.60 375.57
```