The OMOP CDM is a person-centric model. The person table contains records that uniquely identify each individual, along with some of their demographic information. Below we create a mock CDM reference which, as is standard, has a person table with fields indicating each individual’s date of birth, gender, race, and ethnicity. Each of these, except for date of birth, is represented by a concept ID (and, as the person table contains one record per person, these fields are treated as time-invariant).
library(PatientProfiles)
library(duckdb)
library(dplyr)
cdm <- mockPatientProfiles(numberIndividuals = 10000)
cdm$person %>%
  dplyr::glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ gender_concept_id <int> 8507, 8532, 8507, 8507, 8507, 8507, 8532, 8532, 8…
## $ year_of_birth <int> 1971, 1902, 1973, 1930, 1959, 1975, 1939, 1909, 1…
## $ race_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
As well as the person table, every CDM reference will include an observation period table. This table contains spans of time during which an individual is considered to be under observation. Individuals can have multiple observation periods, but these cannot overlap.
cdm$observation_period %>%
  dplyr::glimpse()
## Rows: ??
## Columns: 5
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ observation_period_start_date <date> 1971-01-01, 1902-01-01, 1973-01-01, 193…
## $ observation_period_end_date <date> 2000-12-13, 1937-03-02, 2000-05-26, 195…
## $ period_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ observation_period_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
When performing analyses we will often be interested in working with the person and observation period tables to identify individuals’ characteristics on some date of interest. PatientProfiles provides a number of functions that can help us do this.
Let’s say we’re working with the condition occurrence table.
cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 6
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 1014, 8525, 9150, 2877, 8374, 11, 1067, 1027…
## $ condition_start_date <date> 1950-08-24, 1979-07-15, 1967-04-27, 2020-05…
## $ condition_end_date <date> 1965-04-03, 1982-10-03, 1967-08-30, 2020-10…
## $ condition_occurrence_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id <int> 3, 7, 6, 9, 6, 3, 5, 4, 2, 1, 7, 9, 5, 9, 10…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
This table contains diagnoses of individuals and we might, for example, want to identify their age on their date of diagnosis. This involves linking back to the person table, which contains their date of birth (split across three different columns). PatientProfiles provides a simple function for this: addAge() will add a new column to the table containing each patient’s age relative to the specified index date.
cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(indexDate = "condition_start_date")
cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 7
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 1014, 8525, 9150, 2877, 11, 1027, 6614, 4827…
## $ condition_start_date <date> 1950-08-24, 1979-07-15, 1967-04-27, 2020-05…
## $ condition_end_date <date> 1965-04-03, 1982-10-03, 1967-08-30, 2020-10…
## $ condition_occurrence_id <int> 1, 2, 3, 4, 6, 8, 9, 10, 13, 14, 15, 16, 17,…
## $ condition_concept_id <int> 3, 7, 6, 9, 3, 4, 2, 1, 5, 9, 10, 9, 5, 3, 1…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age <int> 1, 7, 10, 42, 3, 8, 35, 5, 4, 22, 3, 5, 10, …
As well as calculating age, we can also create age groups at the same time. Here we create three age groups: those aged 0 to 17, those 18 to 65, and those 66 or older.
cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(
    indexDate = "condition_start_date",
    ageGroup = list(
      "0 to 17" = c(0, 17),
      "18 to 65" = c(18, 65),
      ">= 66" = c(66, Inf)
    )
  )
cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 8
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 1014, 8525, 9150, 2877, 11, 1027, 6614, 4827…
## $ condition_start_date <date> 1950-08-24, 1979-07-15, 1967-04-27, 2020-05…
## $ condition_end_date <date> 1965-04-03, 1982-10-03, 1967-08-30, 2020-10…
## $ condition_occurrence_id <int> 1, 2, 3, 4, 6, 8, 9, 10, 13, 14, 15, 16, 17,…
## $ condition_concept_id <int> 3, 7, 6, 9, 3, 4, 2, 1, 5, 9, 10, 9, 5, 3, 1…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age <int> 1, 7, 10, 42, 3, 8, 35, 5, 4, 22, 3, 5, 10, …
## $ age_group <chr> "0 to 17", "0 to 17", "0 to 17", "18 to 65",…
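The grouping logic can be sketched in base R as follows. This is only an illustration of how a named list of ranges maps ages to labels, not the PatientProfiles implementation: `assign_age_group()` is a hypothetical helper, and it assumes that both bounds of each range are inclusive, which is consistent with the groups described above.

```r
# Hypothetical helper (not part of PatientProfiles): label each age with the
# name of the first range that contains it. Both bounds are assumed inclusive.
assign_age_group <- function(age, age_group) {
  out <- rep(NA_character_, length(age))
  for (name in names(age_group)) {
    bounds <- age_group[[name]]
    inside <- is.na(out) & age >= bounds[1] & age <= bounds[2]
    out[which(inside)] <- name
  }
  out
}

groups <- list(
  "0 to 17" = c(0, 17),
  "18 to 65" = c(18, 65),
  ">= 66" = c(66, Inf)
)
ages <- c(1, 17, 18, 42, 65, 66, 90)
labels <- assign_age_group(ages, groups)
labels
```

Ages that fall in none of the ranges are left as NA in this sketch.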
By default, when adding age the new column will be called “age” and will be calculated using all available information on date of birth contained in the person table. We can, however, alter these defaults. Here, for example, we impose that month of birth is January and day of birth is the 1st for all individuals.
cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addAge(
    indexDate = "condition_start_date",
    ageName = "age_from_year_of_birth",
    ageMissingMonth = 1,
    ageMissingDay = 1,
    ageImposeMonth = TRUE,
    ageImposeDay = TRUE
  )
cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 9
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 1014, 8525, 9150, 2877, 11, 1027, 6614, 4827…
## $ condition_start_date <date> 1950-08-24, 1979-07-15, 1967-04-27, 2020-05…
## $ condition_end_date <date> 1965-04-03, 1982-10-03, 1967-08-30, 2020-10…
## $ condition_occurrence_id <int> 1, 2, 3, 4, 6, 8, 9, 10, 13, 14, 15, 16, 17,…
## $ condition_concept_id <int> 3, 7, 6, 9, 3, 4, 2, 1, 5, 9, 10, 9, 5, 3, 1…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age <int> 1, 7, 10, 42, 3, 8, 35, 5, 4, 22, 3, 5, 10, …
## $ age_group <chr> "0 to 17", "0 to 17", "0 to 17", "18 to 65",…
## $ age_from_year_of_birth <int> 1, 7, 10, 42, 3, 8, 35, 5, 4, 22, 3, 5, 10, …
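To see why imposing a month and day of birth can change the result, here is a minimal base R sketch of one common definition of age: completed years at the index date. It is not the PatientProfiles implementation, which may differ in detail. `age_at()` is a hypothetical helper and the dates are made up for illustration.

```r
# Hypothetical helper (not part of PatientProfiles): age in completed years.
age_at <- function(birth_date, index_date) {
  birth <- as.POSIXlt(birth_date)
  index <- as.POSIXlt(index_date)
  years <- index$year - birth$year
  # One year fewer if the birthday has not yet been reached in the index year
  before_birthday <- (index$mon < birth$mon) |
    (index$mon == birth$mon & index$mday < birth$mday)
  as.integer(years - before_birthday)
}

index <- as.Date("2000-03-01")
# Same year of birth: recorded date of birth versus 1st of January imposed
birth <- as.Date(c("1980-09-15", "1980-01-01"))
ages <- age_at(birth, rep(index, 2))
ages
```

With the recorded date of birth the individual has not yet had their birthday on the index date, whereas with the 1st of January imposed they have, so the two ages differ by one year.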
As well as age at diagnosis, we might also want to identify patients’ sex. PatientProfiles provides the addSex() function, which will add this for us. Because sex is treated as time-invariant, we do not have to specify an index date.
cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addSex()
cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 10
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 8525, 9150, 11, 1027, 6614, 4827, 3585, 3096…
## $ condition_start_date <date> 1979-07-15, 1967-04-27, 1937-10-05, 1960-09…
## $ condition_end_date <date> 1982-10-03, 1967-08-30, 1938-02-11, 1970-06…
## $ condition_occurrence_id <int> 2, 3, 6, 8, 9, 10, 13, 14, 16, 17, 18, 22, 2…
## $ condition_concept_id <int> 7, 6, 3, 4, 2, 1, 5, 9, 9, 5, 3, 3, 6, 4, 9,…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ age_group <chr> "0 to 17", "0 to 17", "0 to 17", "0 to 17", …
## $ age_from_year_of_birth <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ sex <chr> "Female", "Male", "Female", "Female", "Male"…
Similarly, we could also identify whether an individual was in observation at the time of their diagnosis (i.e. had an observation period that overlaps with their diagnosis date), as well as identifying how much prior observation time they had on this date and how much they have following it.
cdm$condition_occurrence <- cdm$condition_occurrence %>%
  addInObservation(indexDate = "condition_start_date") %>%
  addPriorObservation(indexDate = "condition_start_date") %>%
  addFutureObservation(indexDate = "condition_start_date")
cdm$condition_occurrence %>%
  glimpse()
## Rows: ??
## Columns: 13
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 8525, 9150, 11, 1027, 6614, 4827, 3585, 3096…
## $ condition_start_date <date> 1979-07-15, 1967-04-27, 1937-10-05, 1960-09…
## $ condition_end_date <date> 1982-10-03, 1967-08-30, 1938-02-11, 1970-06…
## $ condition_occurrence_id <int> 2, 3, 6, 8, 9, 10, 13, 14, 16, 17, 18, 22, 2…
## $ condition_concept_id <int> 7, 6, 3, 4, 2, 1, 5, 9, 9, 5, 3, 3, 6, 4, 9,…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ age_group <chr> "0 to 17", "0 to 17", "0 to 17", "0 to 17", …
## $ age_from_year_of_birth <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ sex <chr> "Female", "Male", "Female", "Female", "Male"…
## $ in_observation <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ prior_observation <int> 2752, 3768, 1373, 3194, 13093, 1865, 1692, 8…
## $ future_observation <int> 6436, 213, 575, 8050, 2823, 8289, 5311, 1473…
For these functions, which work with information from the observation period table, it is important to note that the results are based on the observation period in which the index date falls. Moreover, if a patient is not under observation on the specified date, addPriorObservation() and addFutureObservation() will return NA.
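The behaviour described above can be sketched in base R for the simple case of one observation period per record. This is an illustration only, not the PatientProfiles implementation: `observation_summary()` is a hypothetical helper, the dates are made up, and prior and future observation are assumed to be counted in days from the period start and to the period end.

```r
# Hypothetical helper (not part of PatientProfiles). Assumes a single
# observation period per record and day-based differences.
observation_summary <- function(index_date, obs_start, obs_end) {
  in_obs <- index_date >= obs_start & index_date <= obs_end
  data.frame(
    in_observation = as.integer(in_obs),
    # NA when the index date falls outside the observation period
    prior_observation = ifelse(
      in_obs, as.integer(index_date - obs_start), NA_integer_
    ),
    future_observation = ifelse(
      in_obs, as.integer(obs_end - index_date), NA_integer_
    )
  )
}

res <- observation_summary(
  index_date = as.Date(c("2020-01-15", "2024-06-01")),
  obs_start = as.Date(c("1978-01-01", "1978-01-01")),
  obs_end = as.Date(c("2023-04-09", "2023-04-09"))
)
res
```

The second record has an index date after the end of the observation period, so it is flagged as not in observation and both day counts are NA.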
When checking whether someone is in observation, the default is to check whether they were in observation on the index date. We could, however, expand this and consider a window of time around this date. For example, here we add a variable indicating whether someone was in observation from 180 days before the index date to 30 days following it.
cdm$condition_occurrence %>%
  addInObservation(
    indexDate = "condition_start_date",
    window = c(-180, 30)
  ) %>%
  glimpse()
## Rows: ??
## Columns: 13
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 8525, 9150, 11, 1027, 6614, 4827, 3585, 3096…
## $ condition_start_date <date> 1979-07-15, 1967-04-27, 1937-10-05, 1960-09…
## $ condition_end_date <date> 1982-10-03, 1967-08-30, 1938-02-11, 1970-06…
## $ condition_occurrence_id <int> 2, 3, 6, 8, 9, 10, 13, 14, 16, 17, 18, 22, 2…
## $ condition_concept_id <int> 7, 6, 3, 4, 2, 1, 5, 9, 9, 5, 3, 3, 6, 4, 9,…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ age_group <chr> "0 to 17", "0 to 17", "0 to 17", "0 to 17", …
## $ age_from_year_of_birth <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ sex <chr> "Female", "Male", "Female", "Female", "Male"…
## $ prior_observation <int> 2752, 3768, 1373, 3194, 13093, 1865, 1692, 8…
## $ future_observation <int> 6436, 213, 575, 8050, 2823, 8289, 5311, 1473…
## $ in_observation <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
We can also specify a window and require only that an individual is in observation for some days within it, rather than for the whole window. Here we add a variable indicating whether the individual was in observation at least a year in the future.
cdm$condition_occurrence %>%
  addInObservation(
    indexDate = "condition_start_date",
    window = c(365, Inf),
    completeInterval = FALSE
  ) %>%
  glimpse()
## Rows: ??
## Columns: 13
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ person_id <int> 8525, 9150, 11, 1027, 6614, 4827, 3585, 3096…
## $ condition_start_date <date> 1979-07-15, 1967-04-27, 1937-10-05, 1960-09…
## $ condition_end_date <date> 1982-10-03, 1967-08-30, 1938-02-11, 1970-06…
## $ condition_occurrence_id <int> 2, 3, 6, 8, 9, 10, 13, 14, 16, 17, 18, 22, 2…
## $ condition_concept_id <int> 7, 6, 3, 4, 2, 1, 5, 9, 9, 5, 3, 3, 6, 4, 9,…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ age_group <chr> "0 to 17", "0 to 17", "0 to 17", "0 to 17", …
## $ age_from_year_of_birth <int> 7, 10, 3, 8, 35, 5, 4, 22, 5, 10, 13, 4, 3, …
## $ sex <chr> "Female", "Male", "Female", "Female", "Male"…
## $ prior_observation <int> 2752, 3768, 1373, 3194, 13093, 1865, 1692, 8…
## $ future_observation <int> 6436, 213, 575, 8050, 2823, 8289, 5311, 1473…
## $ in_observation <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
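The difference between requiring the whole window and requiring only part of it can be sketched in base R. This is one reading of the behaviour described above, not the PatientProfiles implementation: `in_observation_window()` and its `complete_interval` argument are hypothetical names, and a single observation period per record is assumed.

```r
# Hypothetical helper (not part of PatientProfiles). The window is given as
# c(lower, upper) in days relative to the index date.
in_observation_window <- function(index_date, obs_start, obs_end,
                                  window = c(0, 0),
                                  complete_interval = TRUE) {
  days_before <- as.numeric(index_date - obs_start) # observed days before index
  days_after <- as.numeric(obs_end - index_date)    # observed days after index
  if (complete_interval) {
    # The whole window must lie inside the observation period
    covered <- -days_before <= window[1] & days_after >= window[2]
  } else {
    # Any overlap between the window and the observation period is enough
    covered <- -days_before <= window[2] & days_after >= window[1]
  }
  as.integer(covered)
}

index <- as.Date("2000-01-01")
obs_start <- as.Date("1999-01-01")
# Two records with 6436 and 213 days of future observation respectively
obs_end <- index + c(6436, 213)

whole_window <- in_observation_window(
  index, obs_start, obs_end, window = c(-180, 30)
)
any_overlap <- in_observation_window(
  index, obs_start, obs_end, window = c(365, Inf), complete_interval = FALSE
)
whole_window
any_overlap
```

In this sketch the record with only 213 days of future observation is flagged as 0 for the window starting 365 days after the index date, in line with the second row of the output above.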
The above functions can be used on both standard OMOP CDM tables and cohort tables. Note that, as the default index date in these functions is “cohort_start_date”, we can now omit this argument.
cdm$cohort1 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 2, 1, 3, 1, 1, 1, 3, 1, 3, 2, 1, 2, 1, 2, 3, 3, 1…
## $ subject_id <int> 683, 3950, 8301, 2671, 8693, 8419, 5526, 6348, 33…
## $ cohort_start_date <date> 1997-08-23, 1966-10-24, 2004-10-03, 2032-03-27, …
## $ cohort_end_date <date> 1998-12-13, 1979-06-15, 2027-03-23, 2032-03-30, …
cdm$cohort1 <- cdm$cohort1 %>%
  addAge(ageGroup = list(
    "0 to 17" = c(0, 17),
    "18 to 65" = c(18, 65),
    ">= 66" = c(66, Inf)
  )) %>%
  addSex() %>%
  addInObservation() %>%
  addPriorObservation() %>%
  addFutureObservation()
cdm$cohort1 %>%
  glimpse()
## Rows: ??
## Columns: 10
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 2, 1, 3, 1, 1, 1, 1, 2, 1, 2, 3, 3, 1, 3, 2, 1, 2…
## $ subject_id <int> 683, 3950, 8301, 2671, 8693, 8419, 6348, 93, 6299…
## $ cohort_start_date <date> 1997-08-23, 1966-10-24, 2004-10-03, 2032-03-27, …
## $ cohort_end_date <date> 1998-12-13, 1979-06-15, 2027-03-23, 2032-03-30, …
## $ age <int> 20, 6, 24, 52, 4, 10, 23, 28, 23, 8, 18, 23, 11, …
## $ age_group <chr> "18 to 65", "0 to 17", "18 to 65", "18 to 65", "0…
## $ sex <chr> "Female", "Male", "Male", "Male", "Female", "Male…
## $ in_observation <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ prior_observation <int> 7539, 2488, 9042, 19079, 1565, 3967, 8506, 10446,…
## $ future_observation <int> 1542, 6961, 9161, 226, 15030, 8832, 2362, 5721, 6…
The above functions, which are chained together, each fetch the related information one by one. When we are interested in adding multiple characteristics, we can add them all at the same time using the more general addDemographics() function. This will be more efficient than adding characteristics one at a time, as it requires fewer joins between our table of interest and the person and observation period tables.
cdm$cohort2 %>%
  glimpse()
## Rows: ??
## Columns: 4
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 2, 1, 3, 1, 1, 2, 3, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3…
## $ subject_id <int> 4879, 9845, 2154, 7439, 6819, 7703, 1183, 5542, 7…
## $ cohort_start_date <date> 2020-01-15, 1960-10-27, 1946-06-15, 1949-12-23, …
## $ cohort_end_date <date> 2023-04-09, 1969-03-14, 1946-07-14, 1964-07-08, …
tictoc::tic()
cdm$cohort2 %>%
  addAge(ageGroup = list(
    "0 to 17" = c(0, 17),
    "18 to 65" = c(18, 65),
    ">= 66" = c(66, Inf)
  )) %>%
  addSex() %>%
  addInObservation() %>%
  addPriorObservation() %>%
  addFutureObservation()
## # Source: table<og_250_1726074251> [?? x 10]
## # Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## cohort_definition_id subject_id cohort_start_date cohort_end_date age
## <int> <int> <date> <date> <int>
## 1 2 4879 2020-01-15 2023-04-09 42
## 2 1 9845 1960-10-27 1969-03-14 25
## 3 3 2154 1946-06-15 1946-07-14 39
## 4 1 7439 1949-12-23 1964-07-08 7
## 5 1 6819 1925-11-29 1929-10-19 22
## 6 2 7703 1963-04-23 1969-12-31 4
## 7 3 1183 1991-08-26 1995-12-15 29
## 8 3 5542 1960-09-24 1969-06-17 2
## 9 1 7176 2012-11-27 2026-02-09 33
## 10 2 5478 2008-09-16 2010-05-04 28
## # ℹ more rows
## # ℹ 5 more variables: age_group <chr>, sex <chr>, in_observation <int>,
## # prior_observation <int>, future_observation <int>
tictoc::toc()
## 0.507 sec elapsed
tictoc::tic()
cdm$cohort2 %>%
  addDemographics(
    age = TRUE,
    ageName = "age",
    ageGroup = list(
      "0 to 17" = c(0, 17),
      "18 to 65" = c(18, 65),
      ">= 66" = c(66, Inf)
    ),
    sex = TRUE,
    sexName = "sex",
    priorObservation = TRUE,
    priorObservationName = "prior_observation",
    futureObservation = FALSE
  ) %>%
  glimpse()
## Rows: ??
## Columns: 8
## Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 2, 1, 3, 1, 1, 2, 3, 3, 1, 2, 3, 3, 3, 3, 2, 3, 1…
## $ subject_id <int> 4879, 9845, 2154, 7439, 6819, 7703, 1183, 5542, 7…
## $ cohort_start_date <date> 2020-01-15, 1960-10-27, 1946-06-15, 1949-12-23, …
## $ cohort_end_date <date> 2023-04-09, 1969-03-14, 1946-07-14, 1964-07-08, …
## $ age <int> 42, 25, 39, 7, 22, 4, 29, 2, 33, 28, 28, 1, 3, 3,…
## $ age_group <chr> "18 to 65", "18 to 65", "18 to 65", "0 to 17", "1…
## $ sex <chr> "Female", "Female", "Female", "Female", "Male", "…
## $ prior_observation <int> 15354, 9431, 14410, 2913, 8368, 1573, 10829, 997,…
tictoc::toc()
## 0.195 sec elapsed
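In our small mock dataset the single addDemographics() call took about 0.2 seconds, compared with about 0.5 seconds for the chained calls (the two are not strictly like for like, as the addDemographics() call here does not add the in-observation or future-observation columns). This difference will become much more noticeable when working with real data, which will typically be far larger.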