Data required for epidemiological investigations are usually stored
in different databases. Consolidating these datasets is often the first
step for these investigations. This vignette provides a brief
introduction to the basics of record linkage as implemented by
diyar
.
Let’s begin by reviewing missing_staff_id
- a sample
dataset containing incomplete staff information.
data(missing_staff_id)
missing_staff_id#> r_id staff_id age initials hair_colour branch_office source_1 source_2
#> 1 1 NA 30 G.D. Brown Republic of Ghana A 3
#> 2 2 NA 30 B.G. Teal France A 1
#> 3 3 NA 30 X.P. <NA> <NA> A 1
#> 4 4 NA 30 X.P. Green <NA> B 1
#> 5 5 NA 30 <NA> Green France A 1
#> 6 6 2 30 G.D. Dark brown Ghana A 1
#> 7 7 2 30 G.D. Brown Republic of Ghana B 2
A unique identifier that distinguishes one entity (staff) from
another is often unavailable or incomplete as is the case with
staff_id
in this example. links()
is used to
create one. The identifier is created as an S4
class
(pid
) with useful information about each group in its
slots.
The simplest strategy would be to select one attribute as a distinguishing characteristic for each entity. This is the simple deterministic approach to record linkage.
In the example below, we use initials
and
hair_colour
as distinguishing characteristics.
$p1 <- links(criteria = missing_staff_id$initials)
missing_staff_id$p2 <- links(criteria = missing_staff_id$hair_colour)
missing_staff_idc("initials", "hair_colour", "p1", "p2")]
missing_staff_id[#> initials hair_colour p1 p2
#> 1 G.D. Brown P.1 (CRI 001) P.1 (CRI 001)
#> 2 B.G. Teal P.2 (No hits) P.2 (No hits)
#> 3 X.P. <NA> P.3 (CRI 001) P.3 (No hits)
#> 4 X.P. Green P.3 (CRI 001) P.4 (CRI 001)
#> 5 <NA> Green P.5 (No hits) P.4 (CRI 001)
#> 6 G.D. Dark brown P.1 (CRI 001) P.6 (No hits)
#> 7 G.D. Brown P.1 (CRI 001) P.1 (CRI 001)
Unsurprisingly, the uniqueness of identifiers p1
and
p2
correspond to the uniqueness of the
initials
and hair_colour
respectively. In this
case both strategies represent different outcomes. For example,
p1
identifies records 3 and 4 as the same person, while
p2
has it as records 4 and 5.
To maximise coverage, links()
can implement an ordered
multistage deterministic approach to record linkage. For example, we can
say that records with matching initials
should be linked to
each other, then other records with a matching hair_colour
should then be added to the group.
$p3 <- links(criteria = as.list(missing_staff_id[c("initials", "hair_colour")]))
missing_staff_idc("initials", "hair_colour", "p1", "p2", "p3")]
missing_staff_id[#> initials hair_colour p1 p2 p3
#> 1 G.D. Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 B.G. Teal P.2 (No hits) P.2 (No hits) P.2 (No hits)
#> 3 X.P. <NA> P.3 (CRI 001) P.3 (No hits) P.3 (CRI 001)
#> 4 X.P. Green P.3 (CRI 001) P.4 (CRI 001) P.3 (CRI 001)
#> 5 <NA> Green P.5 (No hits) P.4 (CRI 001) P.3 (CRI 002)
#> 6 G.D. Dark brown P.1 (CRI 001) P.6 (No hits) P.1 (CRI 001)
#> 7 G.D. Brown P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
We see that p3
now identifies records 3, 4 and 5 as the
same person. The logic here is that since record 4 has the same
initial
as record 3 and also has the same
hair_colour
as record 5, all three are linked together as
records of the same entity. Records 3 and 5 are only considered part of
the same entity because of their share link with record 4. If record 4
is removed from this dataset, records 3 and 5 will considered part of
separate entities. This process is referred to as record expansion.
During record expansion the following rules are implemented.
tie_sort
argument.At each stage, additional matching criteria can be specified. This is
done through a sub_criteria
object. This is an
S3
class containing attributes to be compared and functions
for the comparisons. A sub_criteria
object is used for
evaluated, fuzzy and/or nested matches.
For example, we can compare hair_colour
and
branch_office
without any order (priority) to them. This is
the equivalent of saying matching hair colour OR/AND
branch
office.
<- sub_criteria(missing_staff_id$hair_colour,
scri_1 $branch_office,
missing_staff_idoperator = "or")
<- sub_criteria(missing_staff_id$hair_colour,
scri_2 $branch_office,
missing_staff_idoperator = "and")
$p4 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_1),
recursive = TRUE)
$p5 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_2),
recursive = TRUE)
c("hair_colour", "branch_office", "p4", "p5")]
missing_staff_id[#> hair_colour branch_office p4 p5
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.2 (CRI 001) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001)
There is no limit to the number of sub_criteria
that can
be specified but each sub_criteria
must be paired to a
criteria
. Any unpaired sub_criteria
will be
ignored.
As mentioned, a sub_criteria
can be nested. For example,
scri_3
below is the equivalent of saying
(scri_1
; matching hair colour OR
branch
office) AND
(matching initials OR
branch
office).
<- sub_criteria(scri_1,
scri_3 sub_criteria(missing_staff_id$initials,
$branch_office,
missing_staff_idoperator = "or"),
operator = "and")
$p6 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_3),
recursive = TRUE)
c("hair_colour", "branch_office", "p4", "p5", "p6")]
missing_staff_id[#> hair_colour branch_office p4 p5 p6
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits) P.2 (CRI 001)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.2 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits) P.2 (CRI 001)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
Evaluated matches can be implemented with user-defined functions. The only requirement is that they:
x
and y
,
where y
is the value for one observation being compared
against the value of all other observations - x
.TRUE
or FALSE
.For example, there are variations of the same
hair_colour
and branch_office
values in
missing_staff_id
. A quick look and we see that using the
last word of each value will improve the linkage result. We can pass a
custom function to the sub_criteria
object that will make
this comparison. After doing this below (p7
), we see that
record 6 has now been linked with records 1 and 7, which was not the
case earlier.
# A function to extract the last word in a string
<- function(x) tolower(gsub("^.* ", "", x))
last_word_wf # A logical test using `last_word_wf`.
<- function(x, y) last_word_wf(x) == last_word_wf(y)
last_word_cmp
<- sub_criteria(missing_staff_id$hair_colour,
scri_4 $branch_office,
missing_staff_idmatch_funcs = c(last_word_cmp, last_word_cmp),
operator = "or")
$p7 <- links(criteria = "place_holder",
missing_staff_idsub_criteria = list(cr1 = scri_4),
recursive = TRUE)
c("hair_colour", "branch_office", "p4", "p5", "p6", "p7")]
missing_staff_id[#> hair_colour branch_office p4 p5 p6
#> 1 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> 2 Teal France P.2 (CRI 001) P.2 (No hits) P.2 (CRI 001)
#> 3 <NA> <NA> P.3 (No hits) P.3 (No hits) P.3 (No hits)
#> 4 Green <NA> P.2 (CRI 001) P.4 (No hits) P.4 (No hits)
#> 5 Green France P.2 (CRI 001) P.5 (No hits) P.2 (CRI 001)
#> 6 Dark brown Ghana P.6 (No hits) P.6 (No hits) P.6 (No hits)
#> 7 Brown Republic of Ghana P.1 (CRI 001) P.1 (CRI 001) P.1 (CRI 001)
#> p7
#> 1 P.1 (CRI 001)
#> 2 P.2 (CRI 001)
#> 3 P.3 (No hits)
#> 4 P.2 (CRI 001)
#> 5 P.2 (CRI 001)
#> 6 P.1 (CRI 001)
#> 7 P.1 (CRI 001)
A sub_criteria
provides a lot flexibly in terms of how
attributes are compared however, it can cost a lot in processing time.
This is because links()
is an iterative function, comparing
batches of record-pairs in iterations. This keeps memory usage down but
at the cost of processing time. There are ways to mitigate this such as
using the recursive
and check_duplicate
arguments.
In contrast to this, links_wf_probabilistic()
creates
every possible record-pair of values for each attribute, which are then
compared. This is often faster but costs more in memory usage. With a
large enough dataset, we will eventually run out of memory, especially
when there is no blocking_attribute
. Below we see how
links_wf_probabilistic()
is used for the same fuzzy
matching.
<- links_wf_probabilistic(attribute = list(missing_staff_id$hair_colour,
p8 $branch_office),
missing_staff_idcmp_func = c(last_word_cmp, last_word_cmp),
probabilistic = FALSE)
p8#> [1] "P.1 (CRI 001)" "P.2 (CRI 001)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.2 (CRI 001)" "P.1 (CRI 001)" "P.1 (CRI 001)"
It’s worth noting that links_wf_probabilistic()
does not
have the step-wise or nested matching features of links()
however, merge_ids()
can be used to achieve the same result
by consolidating the result of multiple instances of
links_wf_probabilistic()
(pid
objects) in the
same way links()
would.
The advantage of links_wf_probabilistic()
over
links()
is it’s ability to implement a probabilistic
approach to record linkage. Its implementation of probabilistic record
linkage is based on Fellegi and Sunter (1969) model for deciding if two
records belong to the same entity. In summary,
m_probabilities
and u_probabilities
, which are
the probabilities of a true and false match respectively are used to
calculate a final match score for each record-pair. Records below or
above a certain score_threshold
are considered matches or
non-matches respectively. See help(links_wf_probabilistic)
for a more detailed explanation of the method. Below we see the same
analysis as above but as a probabilistic record linkage.
<- links_wf_probabilistic(attribute = list(missing_staff_id$hair_colour,
p9 $branch_office),
missing_staff_idcmp_func = c(last_word_cmp, last_word_cmp),
probabilistic = TRUE)
p9#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
As mentioned, with a sufficiently large dataset, we can easily run
out of memory. If memory usage is a concern, then use
links_wf_probabilistic()
instead.
links_wf_probabilistic()
is a wrapper function of
links()
with a specific sub_criteria
to
achieve probabilistic record linkage. It’s less memory intensive but is
usually slower. The reasons for this has already been described
above.
<- links_wf_probabilistic(attribute = list(missing_staff_id$hair_colour,
p10 $branch_office),
missing_staff_idcmp_func = c(last_word_cmp, last_word_cmp),
probabilistic = TRUE,
recursive = TRUE)
p10#> [1] "P.1 (CRI 001)" "P.2 (No hits)" "P.3 (No hits)" "P.4 (No hits)"
#> [5] "P.5 (No hits)" "P.1 (CRI 001)" "P.1 (CRI 001)"
links()
and links_wf_probabilistic()
can
perform record linkage across multiple datasets in one instance. To
achieve this, values from the different datasets should be passed to the
functions as single atomic vectors. The corresponding identifier for
each dataset should then be passed to the data_source
argument. The strata
argument is a related feature which
provides a blocking attribute, limiting the linkage process to subsets
of the dataset. See a demonstration of these below.
<- rbind(missing_staff_id[c(4:5, 7)],
triplicate c(4:5, 7)],
missing_staff_id[c(4:5, 7)])
missing_staff_id[$data_source <- c(rep("set_1", 7), rep("set_2", 7), rep("set_3", 7))
triplicate
$p1 <- links(as.list(triplicate[1:2]),
triplicatedata_source = triplicate$data_source,
strata = triplicate$source_1)
$p2 <- links(as.list(triplicate[1:2]),
triplicatestrata = triplicate$data_source,
data_source = triplicate$source_1)
triplicate#> initials hair_colour source_1 data_source p1 p2
#> 1 G.D. Brown A set_1 P.1 (CRI 001) P.01 (CRI 001)
#> 2 B.G. Teal A set_1 P.2 (CRI 001) P.02 (No hits)
#> 3 X.P. <NA> A set_1 P.3 (CRI 001) P.03 (CRI 001)
#> 4 X.P. Green B set_1 P.4 (CRI 001) P.03 (CRI 001)
#> 5 <NA> Green A set_1 P.5 (CRI 002) P.03 (CRI 002)
#> 6 G.D. Dark brown A set_1 P.1 (CRI 001) P.01 (CRI 001)
#> 7 G.D. Brown B set_1 P.7 (CRI 001) P.01 (CRI 001)
#> 8 G.D. Brown A set_2 P.1 (CRI 001) P.08 (CRI 001)
#> 9 B.G. Teal A set_2 P.2 (CRI 001) P.09 (No hits)
#> 10 X.P. <NA> A set_2 P.3 (CRI 001) P.10 (CRI 001)
#> 11 X.P. Green B set_2 P.4 (CRI 001) P.10 (CRI 001)
#> 12 <NA> Green A set_2 P.5 (CRI 002) P.10 (CRI 002)
#> 13 G.D. Dark brown A set_2 P.1 (CRI 001) P.08 (CRI 001)
#> 14 G.D. Brown B set_2 P.7 (CRI 001) P.08 (CRI 001)
#> 15 G.D. Brown A set_3 P.1 (CRI 001) P.15 (CRI 001)
#> 16 B.G. Teal A set_3 P.2 (CRI 001) P.16 (No hits)
#> 17 X.P. <NA> A set_3 P.3 (CRI 001) P.17 (CRI 001)
#> 18 X.P. Green B set_3 P.4 (CRI 001) P.17 (CRI 001)
#> 19 <NA> Green A set_3 P.5 (CRI 002) P.17 (CRI 002)
#> 20 G.D. Dark brown A set_3 P.1 (CRI 001) P.15 (CRI 001)
#> 21 G.D. Brown B set_3 P.7 (CRI 001) P.15 (CRI 001)
The results are different as per the different linkage strategy. We
can use as.data.frame
or as.list
to access the
contents of each identifier to inspect the difference.
as.data.frame(triplicate$p1)
#> pid sn pid_cri link_id pid_total iteration pid_dataset
#> 1 1 1 1 1 6 1 set_1,set_2,set_3
#> 2 2 2 1 2 3 1 set_1,set_2,set_3
#> 3 3 3 1 3 3 1 set_1,set_2,set_3
#> 4 4 4 1 4 3 1 set_1,set_2,set_3
#> 5 5 5 2 5 3 2 set_1,set_2,set_3
#> 6 1 6 1 1 6 1 set_1,set_2,set_3
#> 7 7 7 1 7 3 1 set_1,set_2,set_3
#> 8 1 8 1 1 6 1 set_1,set_2,set_3
#> 9 2 9 1 2 3 1 set_1,set_2,set_3
#> 10 3 10 1 3 3 1 set_1,set_2,set_3
#> 11 4 11 1 4 3 1 set_1,set_2,set_3
#> 12 5 12 2 5 3 2 set_1,set_2,set_3
#> 13 1 13 1 1 6 1 set_1,set_2,set_3
#> 14 7 14 1 7 3 1 set_1,set_2,set_3
#> 15 1 15 1 1 6 1 set_1,set_2,set_3
#> 16 2 16 1 2 3 1 set_1,set_2,set_3
#> 17 3 17 1 3 3 1 set_1,set_2,set_3
#> 18 4 18 1 4 3 1 set_1,set_2,set_3
#> 19 5 19 2 5 3 2 set_1,set_2,set_3
#> 20 1 20 1 1 6 1 set_1,set_2,set_3
#> 21 7 21 1 7 3 1 set_1,set_2,set_3
as.data.frame(triplicate$p2)
#> pid sn pid_cri link_id pid_total iteration pid_dataset
#> 1 1 1 1 1 3 1 A,B
#> 2 2 2 0 2 1 2 A
#> 3 3 3 1 3 3 1 A,B
#> 4 3 4 1 3 3 1 A,B
#> 5 3 5 2 4 3 2 A,B
#> 6 1 6 1 1 3 1 A,B
#> 7 1 7 1 1 3 1 A,B
#> 8 8 8 1 8 3 1 A,B
#> 9 9 9 0 9 1 2 A
#> 10 10 10 1 10 3 1 A,B
#> 11 10 11 1 10 3 1 A,B
#> 12 10 12 2 11 3 2 A,B
#> 13 8 13 1 8 3 1 A,B
#> 14 8 14 1 8 3 1 A,B
#> 15 15 15 1 15 3 1 A,B
#> 16 16 16 0 16 1 2 A
#> 17 17 17 1 17 3 1 A,B
#> 18 17 18 1 17 3 1 A,B
#> 19 17 19 2 18 3 2 A,B
#> 20 15 20 1 15 3 1 A,B
#> 21 15 21 1 15 3 1 A,B