How to create frequency tables

Matthijs S. Berends

03 June 2019

Introduction

Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the freq() function, you can create univariate frequency tables. When multiple variables are given, they are pasted into one variable, so the result is always a univariate distribution. We take the septic_patients dataset (included in this AMR package) as an example.

Frequencies of one variable

To quickly review the content of a single variable, you can select it in several ways. Let’s say we want to get the frequencies of the gender variable of the septic_patients dataset:

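library(AMR)    # provides freq() and the septic_patients dataset
library(dplyr)  # provides %>% and distinct()

# Any of these will work: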
# freq(septic_patients$gender)
# freq(septic_patients[, "gender"])

# Using tidyverse:
# septic_patients$gender %>% freq()
# septic_patients[, "gender"] %>% freq()
# septic_patients %>% freq("gender")

# Probably the fastest and easiest:
septic_patients %>% freq(gender)  

Frequency table of gender from a data.frame (2,000 x 49)

Class: character
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 2

Shortest: 1
Longest: 1

     Item   Count  Percent  Cum. Count  Cum. Percent
1    M      1,031    51.6%       1,031         51.6%
2    F        969    48.4%       2,000        100.0%

This immediately shows the class of the variable, its length and availability (i.e. the number of NA values), the number of unique values and (most importantly) that among septic patients men are more prevalent than women.
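To make the meaning of the columns concrete, the same kind of table can be approximated with base R. This is only a minimal sketch and not how freq() is implemented; the gender vector is a made-up sample, not data from septic_patients.

```r
# Hypothetical sample; not taken from septic_patients
gender <- c("M", "F", "M", "M", "F", "M", "F", "M")

# Count each value and sort by descending count
counts <- sort(table(gender), decreasing = TRUE)

freq_tbl <- data.frame(
  item = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)

# Percent is the share of each item; the cumulative columns are running totals
freq_tbl$percent <- freq_tbl$count / sum(freq_tbl$count)
freq_tbl$cum_count <- cumsum(freq_tbl$count)
freq_tbl$cum_percent <- cumsum(freq_tbl$percent)

print(freq_tbl)
```

The last row of the cumulative columns always equals the total count and 100%, which is a quick sanity check for any frequency table.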

Frequencies of more than one variable

Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.

For illustration, we could add some more variables to the septic_patients dataset to learn about bacterial properties:

my_patients <- septic_patients %>% left_join_microorganisms()
# Joining, by = "mo"

Now all variables of the microorganisms dataset have been joined to the septic_patients dataset. The microorganisms dataset consists of the following variables:

colnames(microorganisms)
#  [1] "mo"         "col_id"     "fullname"   "kingdom"    "phylum"    
#  [6] "class"      "order"      "family"     "genus"      "species"   
# [11] "subspecies" "rank"       "ref"        "species_id" "source"    
# [16] "prevalence"

If we compare the dimensions of the old and new dataset, we can see that 15 variables were added (all 16 variables of microorganisms except mo, which was already present as the join key):

dim(septic_patients)
# [1] 2000   49
dim(my_patients)
# [1] 2000   64

So now the genus and species variables are available. A frequency table of these combined variables can be created like this:

my_patients %>%
  freq(genus, species, nmax = 15)

Frequency table of genus and species from a data.frame (2,000 x 64)

Columns: 2
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 95

Shortest: 8
Longest: 34

     Item                                Count  Percent  Cum. Count  Cum. Percent
1    Escherichia coli                      467    23.4%         467         23.4%
2    Staphylococcus coagulase-negative     313    15.6%         780         39.0%
3    Staphylococcus aureus                 235    11.7%       1,015         50.7%
4    Staphylococcus epidermidis            174     8.7%       1,189         59.4%
5    Streptococcus pneumoniae              117     5.8%       1,306         65.3%
6    Staphylococcus hominis                 81     4.0%       1,387         69.4%
7    Klebsiella pneumoniae                  58     2.9%       1,445         72.2%
8    Enterococcus faecalis                  39     2.0%       1,484         74.2%
9    Proteus mirabilis                      36     1.8%       1,520         76.0%
10   Pseudomonas aeruginosa                 30     1.5%       1,550         77.5%
11   Serratia marcescens                    25     1.2%       1,575         78.8%
12   Enterobacter cloacae                   23     1.2%       1,598         79.9%
13   Enterococcus faecium                   21     1.0%       1,619         81.0%
14   Staphylococcus capitis                 21     1.0%       1,640         82.0%
15   Bacteroides fragilis                   20     1.0%       1,660         83.0%

(omitted 80 entries, n = 340 [17.0%])
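The idea of pasting several variables into one before counting can be sketched in base R. This is an illustration under assumptions, not the package's implementation: the isolates data frame is made up, and the space separator is inferred from the items shown above.

```r
# Hypothetical isolates; the column names mirror the example above
isolates <- data.frame(
  genus = c("Escherichia", "Staphylococcus", "Escherichia",
            "Staphylococcus", "Escherichia"),
  species = c("coli", "aureus", "coli", "epidermidis", "coli"),
  stringsAsFactors = FALSE
)

# Paste the two columns into one value per row (separator assumed to be a space)
combined <- paste(isolates$genus, isolates$species)

# Count the combined values, most frequent first
counts <- sort(table(combined), decreasing = TRUE)

# Keep only the first rows, similar in spirit to nmax
top <- counts[seq_len(2)]
omitted_n <- sum(counts) - sum(top)

print(top)
cat("(omitted", length(counts) - length(top), "entries, n =", omitted_n, ")\n")
```

Because the combined value is a single character vector, the result stays a univariate frequency table no matter how many columns are pasted together.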

Frequencies of numeric values

Frequency tables can be created from any input.

For numeric values (like integers, doubles, etc.), additional descriptive statistics are calculated and shown in the header:

# get age distribution of unique patients
septic_patients %>% 
  distinct(patient_id, .keep_all = TRUE) %>% 
  freq(age, nmax = 5, header = TRUE)

Frequency table of age from a data.frame (981 x 49)

Class: numeric
Length: 981 (of which NA: 0 = 0.00%)
Unique: 73

Mean: 71.08
SD: 14.05 (CV: 0.20, MAD: 13.34)
Five-Num: 14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)
Outliers: 15 (unique count: 12)

     Item   Count  Percent  Cum. Count  Cum. Percent
1    83        44     4.5%          44          4.5%
2    76        43     4.4%          87          8.9%
3    75        37     3.8%         124         12.6%
4    82        33     3.4%         157         16.0%
5    78        32     3.3%         189         19.3%

(omitted 68 entries, n = 792 [80.7%])

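For numeric values, the header reports the mean, the standard deviation (SD), the coefficient of variation (CV, the SD divided by the mean), the median absolute deviation (MAD), Tukey's five-number summary (minimum, Q1, median, Q3, maximum), the interquartile range (IQR), the coefficient of quartile variation (CQV, calculated as (Q3 - Q1) / (Q3 + Q1)) and the outliers (total and unique count). NA values are always ignored.

For example, the header above shows at a glance that the median age of the patients is 74.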
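The header statistics can be approximated with base R functions. The sketch below is an assumption about how such values may be derived, not the package's own code; the ages vector is made up. The formulas are at least consistent with the header above: 14.05 / 71.08 ≈ 0.20 for the CV and (82 - 63) / (82 + 63) ≈ 0.13 for the CQV.

```r
# Hypothetical ages, including one missing value
ages <- c(14, 55, 63, 70, 74, 74, 78, 82, 83, 90, 97, NA)
x <- ages[!is.na(ages)]  # NA values are ignored

# Tukey's five numbers: minimum, lower hinge, median, upper hinge, maximum
five_num <- fivenum(x)
q1 <- five_num[2]
q3 <- five_num[4]

header_stats <- list(
  mean     = mean(x),
  sd       = sd(x),
  cv       = sd(x) / mean(x),        # coefficient of variation
  mad      = mad(x),                 # median absolute deviation (R's scaled default)
  five_num = five_num,
  iqr      = q3 - q1,                # taken from the hinges here (assumption)
  cqv      = (q3 - q1) / (q3 + q1),  # coefficient of quartile variation
  outliers = boxplot.stats(x)$out    # values beyond 1.5 * IQR from the hinges
)

str(header_stats)
```

Note that IQR() in base R uses quantiles rather than hinges, so it can differ slightly from the hinge-based value used in this sketch.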

Frequencies of factors

To sort the frequencies of a factor by its levels instead of by item count, set the sort.count parameter to FALSE.

sort.count is TRUE by default. Compare this default behaviour…

septic_patients %>%
  freq(hospital_id)

Frequency table of hospital_id from a data.frame (2,000 x 49)

Class: factor (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Levels: 4: A, B, C, D
Unique: 4

     Item   Count  Percent  Cum. Count  Cum. Percent
1    D        762    38.1%         762         38.1%
2    B        663    33.2%       1,425         71.2%
3    A        321    16.0%       1,746         87.3%
4    C        254    12.7%       2,000        100.0%

… to this, where items are now sorted on factor levels:

septic_patients %>%
  freq(hospital_id, sort.count = FALSE)

Frequency table of hospital_id from a data.frame (2,000 x 49)

Class: factor (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Levels: 4: A, B, C, D
Unique: 4

     Item   Count  Percent  Cum. Count  Cum. Percent
1    A        321    16.0%         321         16.0%
2    B        663    33.2%         984         49.2%
3    C        254    12.7%       1,238         61.9%
4    D        762    38.1%       2,000        100.0%
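The difference between the two orderings can be reproduced with base R, where table() on a factor returns counts in level order and sort() reorders them by count. This is a small sketch with a made-up hospital factor, not the package's implementation.

```r
# Hypothetical factor with explicit level order A < B < C < D
hospital <- factor(
  c("D", "B", "D", "A", "B", "D", "C"),
  levels = c("A", "B", "C", "D")
)

# table() keeps the level order, comparable to sort.count = FALSE
by_level <- table(hospital)

# Sorting by descending count, comparable to the default sort.count = TRUE
by_count <- sort(by_level, decreasing = TRUE)

print(by_level)
print(by_count)
```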

All classes are printed in the header. Variables with the new rsi class of this AMR package are actually ordered factors and have three classes (look at Class in the header):

septic_patients %>%
  freq(AMX, header = TRUE)

Frequency table of AMX from a data.frame (2,000 x 49)

Class: factor > ordered > rsi (numeric)
Length: 2,000 (of which NA: 771 = 38.55%)
Levels: 3: S < I < R
Unique: 3

Drug: Amoxicillin (AMX, J01CA04)
Group: Beta-lactams/penicillins
%SI: 44.43%

     Item   Count  Percent  Cum. Count  Cum. Percent
1    R        683    55.6%         683         55.6%
2    S        543    44.2%       1,226         99.8%
3    I          3     0.2%       1,229        100.0%
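The %SI value in the header appears to be the share of S and I results among all non-missing results. That reading is inferred from the numbers in the table, not stated in the text, so the following check is only a sketch under that assumption.

```r
# Counts copied from the frequency table above
counts <- c(R = 683, S = 543, I = 3)

# Share of S and I among all non-missing results (assumed definition of %SI)
pct_si <- sum(counts[c("S", "I")]) / sum(counts)

cat(sprintf("%%SI: %.2f%%\n", 100 * pct_si))
```

The 771 NA values are left out of the denominator here, which matches the 1,229 in the last row of the cumulative count column.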

Frequencies of dates

Frequencies of dates show the oldest and newest date in the data, and the number of days between them:

septic_patients %>%
  freq(date, nmax = 5, header = TRUE)

Frequency table of date from a data.frame (2,000 x 49)

Class: Date (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 1,140

Oldest: 2 January 2002
Newest: 28 December 2017 (+5,839)
Median: 31 July 2009 (47.39%)

     Item         Count  Percent  Cum. Count  Cum. Percent
1    2016-05-21      10     0.5%          10          0.5%
2    2004-11-15       8     0.4%          18          0.9%
3    2013-07-29       8     0.4%          26          1.3%
4    2017-06-12       8     0.4%          34          1.7%
5    2015-11-19       7     0.4%          41          2.0%

(omitted 1,135 entries, n = 1,959 [98.0%])
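The numbers in the date header can be checked with base R date arithmetic. The percentage next to the median seems to be the position of the median within the date range; that interpretation is inferred from the values and is an assumption, not something the text states.

```r
# Dates copied from the header above
oldest <- as.Date("2002-01-02")
newest <- as.Date("2017-12-28")
median_date <- as.Date("2009-07-31")

# Number of days between the oldest and newest date
span_days <- as.integer(newest - oldest)

# Position of the median within the range (assumed meaning of the percentage)
median_position <- as.integer(median_date - oldest) / span_days

cat("Span in days:", span_days, "\n")
cat(sprintf("Median position: %.2f%%\n", 100 * median_position))
```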

Assigning a frequency table to an object

A frequency table is actually a regular data.frame, with the exception that it contains an additional class.

my_df <- septic_patients %>% freq(age)
class(my_df)

# [1] "freq"       "data.frame"

Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:

dim(my_df)

# [1] 74  5
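The mechanism behind this is ordinary S3 dispatch: an extra class in front of data.frame can carry its own print method while the data stay untouched. The sketch below uses a made-up class name (freq_sketch) and a made-up table; it illustrates the technique and is not the package's print method.

```r
# Hypothetical full table with 20 rows
full_table <- data.frame(item = letters[1:20], count = 20:1)

# Prepend a made-up class; the object is still a data.frame underneath
class(full_table) <- c("freq_sketch", class(full_table))

# Print method for the made-up class: show only the first nmax rows
print.freq_sketch <- function(x, nmax = 5, ...) {
  plain <- x
  class(plain) <- "data.frame"  # drop the extra class to use the default printing
  print(head(plain, nmax), ...)
  if (nrow(plain) > nmax) {
    cat("(omitted", nrow(plain) - nmax, "entries)\n")
  }
  invisible(x)
}

print(full_table)  # prints 5 rows plus the note about omitted entries
dim(full_table)    # the object itself still holds all 20 rows
```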

Additional parameters

Parameter na.rm

With the na.rm parameter you can control whether NA values are removed from the frequency table. It defaults to TRUE, but the number of NA values is always shown in the header. With na.rm = FALSE, NA is listed as a separate item:

Frequency table of AMX from a data.frame (2,000 x 49)

Class: factor > ordered > rsi (numeric)
Length: 2,000 (of which NA: 771 = 38.55%)
Levels: 3: S < I < R
Unique: 4

Drug: Amoxicillin (AMX, J01CA04)
Group: Beta-lactams/penicillins
%SI: 44.43%

     Item   Count  Percent  Cum. Count  Cum. Percent
1    (NA)     771    38.6%         771         38.6%
2    R        683    34.2%       1,454         72.7%
3    S        543    27.2%       1,997         99.8%
4    I          3     0.2%       2,000        100.0%
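Keeping NA as an item changes the denominator, which is why R drops from 55.6% in the earlier AMX table to 34.2% here. The same effect can be shown with base R's table() and its useNA argument; the amx vector below is made up and the sketch is not how freq() handles NA internally.

```r
# Hypothetical susceptibility results with missing values
amx <- c("R", "S", NA, "R", "I", NA, "S", "R")

without_na <- table(amx)                 # NA dropped (the table() default)
with_na <- table(amx, useNA = "ifany")   # NA counted as its own item

# The share of R depends on whether NA is part of the denominator
pct_r_without_na <- without_na[["R"]] / sum(without_na)  # 3 of 6
pct_r_with_na <- with_na[["R"]] / sum(with_na)           # 3 of 8

print(without_na)
print(with_na)
```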

Parameter row.names

A frequency table shows row indices. To remove them, use row.names = FALSE:

Frequency table of hospital_id from a data.frame (2,000 x 49)

Class: factor (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Levels: 4: A, B, C, D
Unique: 4

Item   Count  Percent  Cum. Count  Cum. Percent
D        762    38.1%         762         38.1%
B        663    33.2%       1,425         71.2%
A        321    16.0%       1,746         87.3%
C        254    12.7%       2,000        100.0%
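Base R offers a comparable switch for plain data frames: print() accepts row.names = FALSE. The sketch below only illustrates that analogy with counts copied from the table above; whether freq() relies on it internally is not known from the text.

```r
# Counts copied from the hospital_id table above
tbl <- data.frame(
  item = c("D", "B", "A", "C"),
  count = c(762, 663, 321, 254)
)

print(tbl)                     # with row indices 1 to 4
print(tbl, row.names = FALSE)  # without row indices

# Capture the second variant to inspect it programmatically
printed <- capture.output(print(tbl, row.names = FALSE))
```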

Parameter markdown

The markdown parameter is TRUE by default in non-interactive sessions, such as reports created with R Markdown. In that case all rows are printed, unless nmax is set. Without markdown (as in a regular R session), a frequency table would print like: