We usually create fast-and-frugal trees (FFTs) from data by using the
FFTrees()
function (see the Main
guide: FFTrees overview and the vignette on Creating FFTs with FFTrees() for
instructions). However, we may occasionally want to design and evaluate
specific FFTs (e.g., to test a hypothesis, or to include or exclude
particular variables on theoretical grounds).
There are two ways to manually define fast-and-frugal trees with the
FFTrees() function:

- as a sentence, using the my.tree argument (the easier way), or
- as a data frame, using the tree.definitions argument (the harder way).
Both of these methods require some data to evaluate the performance of FFTs, but will bypass the tree construction algorithms built into the FFTrees package. As manually created FFTs are not optimized for specific data, the key difference between fitting and predicting disappears for such FFTs. Although we can still use two sets of ‘train’ vs. ‘test’ data, a manually defined FFT is not fitted and hence should not be expected to perform systematically better on ‘train’ data than on ‘test’ data.
my.tree
The first method for manually defining an FFT is to use the
my.tree
argument, where my.tree
is a sentence
describing a (single) FFT. When this argument is specified in
FFTrees()
, the function — or rather its auxiliary
fftrees_wordstofftrees()
function — attempts to interpret
the verbal description and convert it into a valid definition of an FFT
(as part of an FFTrees
object).
For example, let’s use the heartdisease
data to find out
how some predictor variables (e.g., sex
, age
,
etc.) predict the criterion variable (diagnosis
):
| sex | age | thal   | cp | ca | diagnosis |
|-----|-----|--------|----|----|-----------|
| 1   | 63  | fd     | ta | 0  | FALSE     |
| 1   | 67  | normal | a  | 3  | TRUE      |
| 1   | 67  | rd     | a  | 2  | TRUE      |
| 1   | 37  | normal | np | 0  | FALSE     |
| 0   | 41  | normal | aa | 0  | FALSE     |
| 1   | 56  | normal | aa | 0  | FALSE     |
Here’s how we could verbally describe an FFT by using the first three cues in conditional sentences:
```r
in_words <- "If sex = 1, predict True.
             If age < 45, predict False.
             If thal = {fd, normal}, predict True.
             Otherwise, predict False."
```
As we will see shortly, the FFTrees() function accepts
such descriptions (assigned here to a character string
in_words) as its my.tree argument, creates a
corresponding FFT, and evaluates it on the specified data.
Here are some instructions for manually specifying trees:
- Each node must start with the word “If” and should correspond to
the form: If <CUE> <DIRECTION> <THRESHOLD>, predict <EXIT>.
- Numeric thresholds should be specified directly (without
brackets), like age > 21.
- For categorical variables, factor thresholds must be specified
within curly braces, like sex = {male}. For factors with
sets of values, categories within a threshold should be separated by
commas, like eyecolor = {blue,brown}.
- To specify cue directions, the standard logical comparisons
=, !=, <, >= (etc.) are valid. For numeric cues, only use >,
>=, <, or <=. For factors, only use = or !=.
- Positive exits are indicated by True, while negative
exits are specified by False.
- The final node of an FFT is always bi-directional (i.e., has both
a positive and a negative exit). The description of the final node
always mentions its positive (True) exit first. The text
“Otherwise, predict EXIT” that we have included in the
example above is actually not necessary (and is ignored).
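To make these rules concrete, here is a minimal base-R sketch (not an FFTrees function, purely for illustration) that hand-codes the three-node FFT described in in_words and classifies individual cases:

```r
# Hand-coded version of the FFT described above (illustration only):
classify <- function(sex, age, thal) {
  if (sex == 1) return(TRUE)       # node 1: positive (True) exit
  if (age < 45) return(FALSE)      # node 2: negative (False) exit
  thal %in% c("fd", "normal")      # final node: both exits
}

classify(sex = 1, age = 63, thal = "fd")      # TRUE (exits at node 1)
classify(sex = 0, age = 41, thal = "normal")  # FALSE (exits at node 2)
```

Note that every case exits the tree at the first node whose condition matches its predicted class, which is what makes FFTs fast and frugal.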
Now, let’s use our verbal description of an FFT (assigned to
in_words
above) as the my.tree
argument of the
FFTrees()
function. This creates a corresponding FFT and
applies it to the heartdisease
data:
```r
# Create FFTrees from a verbal FFT description (as my.tree):
my_fft <- FFTrees(formula = diagnosis ~ .,
                  data = heartdisease,
                  main = "My 1st FFT",
                  my.tree = in_words)
```
Running FFTrees() with the my.tree argument
creates an object my_fft that contains a single FFT. A verbal
description of this tree can be printed by inwords(my_fft),
but to evaluate the tree’s performance on training or testing data,
we print or plot the object. Let’s see how well our manually
constructed FFT (my_fft) did on the training data:
```r
# Inspect FFTrees object:
plot(my_fft, data = "train")
```
When manually constructing a tree, the resulting FFTrees
object only contains a single FFT. Hence, the ROC plot (in the right
bottom panel of Figure 1) cannot show a range of FFTs,
but locates the constructed FFT in ROC space.
The formal definition of our new FFT is available from the
FFTrees
object my_fft
:
```r
# Get FFT definition(s):
my_fft$trees$definitions
#> # A tibble: 1 × 7
#>    tree nodes classes cues         directions thresholds     exits
#>   <int> <int> <chr>   <chr>        <chr>      <chr>          <chr>
#> 1     1     3 n;n;c   sex;age;thal =;>=;=     1;45;fd,normal 1;0;.5
```
Note that the 2nd node in this FFT (using the age
cue)
is predicting the noise outcome (i.e., a non-final exit value
of 0
or FALSE
, shown to the left). As our tree
definitions always refer to the signal outcome (i.e., a
non-final exit value of 1
or TRUE
, shown to
the right), the direction symbol of a left exit (i.e., the 2nd node in
Figure 1: if age < 45
,
predict 0
or noise) must be flipped relative to its
appearance in the tree definition (if age >= 45
,
predict 1
or signal). Thus, the plot and the formal
definition describe the same FFT.
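This equivalence is easy to verify directly: for any age value, the plotted condition (age < 45, predict noise) and the negation of the defined condition (age >= 45, predict signal) make identical decisions. A one-line base-R check, independent of FFTrees:

```r
# The plotted and the defined direction describe the same node:
age <- c(37, 41, 45, 56, 63, 67)   # some age values from the data
identical(age < 45, !(age >= 45))  # TRUE: both formulations agree
```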
As it turns out, the performance of our first FFT created from a verbal description is a mixed affair: The tree has a rather high sensitivity (of 91%), but its low specificity (of only 10%) allows for many false alarms. Consequently, its accuracy measures are only around baseline level.
Let’s see if we can come up with a better FFT. The following example
uses the cues thal
, cp
, and ca
in
the my.tree
argument:
```r
# Create a 2nd FFT from an alternative FFT description (as my.tree):
my_fft_2 <- FFTrees(formula = diagnosis ~ .,
                    data = heartdisease,
                    main = "My 2nd FFT",
                    my.tree = "If thal = {rd,fd}, predict True.
                               If cp != {a}, predict False.
                               If ca > 1, predict True.
                               Otherwise, predict False.")
```
As FFTrees aims to interpret the
my.tree
argument to the best of its abilities, there is
some flexibility in entering a verbal description of an FFT. For
instance, we also could have described our desired FFT in more flowery
terms:
```r
# Create the same 2nd FFT from a more flowery description (as my.tree):
my_fft_2 <- FFTrees(formula = diagnosis ~ .,
                    data = heartdisease,
                    main = "My 2nd FFT",
                    my.tree = "If thal equals {rd,fd}, we shall say True.
                               If Cp differs from {a}, let's predict False.
                               If CA happens to exceed 1, we insist on True.
                               Else, give up and go away.")
```
However, as the vocabulary of FFTrees is limited, it
is safer to enter cue directions in their symbolic form (i.e.,
using =, <, <=, >, >=, or !=).^{1} To verify
that FFTrees interpreted our my.tree
description as intended, let’s check whether
inwords(my_fft_2) yields a description that corresponds to
the FFT we had in mind:
```r
inwords(my_fft_2)
#> [1] "If thal = {rd,fd}, decide True."
#> [2] "If cp != {a}, decide False."
#> [3] "If ca > 1, decide True, otherwise, decide False."
```
As this seems (a more prosaic version of) what we wanted, let’s visualize the best training tree (to evaluate its performance) and briefly inspect its tree definition:
```r
# Visualize FFT:
plot(my_fft_2)

# FFT definition:
my_fft_2$trees$definitions
#> # A tibble: 1 × 7
#>    tree nodes classes cues       directions thresholds exits
#>   <int> <int> <chr>   <chr>      <chr>      <chr>      <chr>
#> 1     1     3 c;c;n   thal;cp;ca =;=;>      rd,fd;a;1  1;0;.5

# Note the flipped direction value for 2nd cue (exit = '0'):
# 'if (cp = a), predict 1' in the tree definition corresponds to
# 'if (cp != a), predict 0' in the my.tree description and plot.
```
This alternative FFT nicely balances sensitivity and specificity and performs much better overall. Nevertheless, it is still far from perfect, so check whether you can create even better ones!
tree.definitions
More experienced users may want to define and evaluate more than one
FFT at a time. To achieve this, the FFTrees() function
allows providing a set of tree.definitions (as a data
frame). However, as questions regarding specific trees usually arise
late in an exploration of FFTs, the tree.definitions
argument is mostly used in combination with an existing
FFTrees object x. In this case, the parameters of x
(e.g., its formula, data, and goals) are reused, but its tree definitions
(stored in x$trees$definitions) are replaced by those in
tree.definitions, and the object is re-evaluated for those
FFTs.
We illustrate a typical workflow by redefining some FFTs that were
built in the Tutorial: FFTs for heart
disease and evaluating them on the (full) heartdisease
data.
First, we use our default algorithms to create an
FFTrees
object heart.fft
:
```r
# Create an FFTrees object x:
x <- FFTrees(formula = diagnosis ~ .,      # criterion and (all) predictors
             data = heart.train,           # training data
             data.test = heart.test,       # testing data
             main = "Heart Disease 1",     # initial label
             decision.labels = c("low risk", "high risk"),  # exit labels
             quiet = TRUE)                 # hide user feedback
```
As we have seen in the Tutorial,
evaluating this expression yields a set of 7 FFTs. Rather than
evaluating them individually (by issuing print(x)
or
plot(x)
commands to inspect specific trees), we can obtain
both their definitions and their performance characteristics on a
variety of measures either by running summary(x)
or by
inspecting corresponding parts of the FFTrees
object. For
instance, the following alternatives would both show the current
definitions of the generated FFTs:
```r
# Tree definitions of x:
# summary(x)$definitions   # from summary()
x$trees$definitions        # from FFTrees object x
#> # A tibble: 7 × 7
#>    tree nodes classes cues             directions thresholds          exits
#>   <int> <int> <chr>   <chr>            <chr>      <chr>               <chr>
#> 1     1     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           1;0;0.5
#> 2     2     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;0;1;0.5
#> 3     3     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;1;0.5
#> 4     4     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;0;0.5
#> 5     5     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;0;0.5
#> 6     6     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 0;0;0;0.5
#> 7     7     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;1;0.5
```
Each line in these tree definitions defines an FFT in the context of
our current FFTrees
object x
(see the vignette
on Creating FFTs with FFTrees() for
help on interpreting tree definitions). As the “ifan” algorithm
responsible for creating these trees yields a family of highly similar
FFTs (the trees vary only in their exit structures, and some truncate the
last cue), we may want to examine alternative versions of these trees.
To demonstrate how to create and evaluate manual FFT definitions, we copy the existing tree definitions (as a data frame), select three FFTs (rows), and then create a 4th definition (with a different exit structure):
```r
# 0. Copy and choose some existing FFT definitions:
tree_df <- x$trees$definitions      # get FFT definitions (as df)
tree_df <- tree_df[c(1, 3, 5), ]    # filter 3 particular FFTs

# 1. Add a tree with 1;1;0.5 exit structure (a "rake" tree with Signal bias):
tree_df[4, ] <- tree_df[1, ]        # initialize new FFT #4 (as copy of FFT #1)
tree_df$exits[4] <- c("1; 1; 0.5")  # modify exits of FFT #4

tree_df$tree <- 1:nrow(tree_df)     # adjust tree numbers
# tree_df
```
Moreover, let’s define four additional FFTs that reverse the order of
the 1st and 2nd cues. As both cues are categorical (i.e., of
class c
) and have the same direction (i.e.,
=
), we only need to reverse the thresholds
(so
that they correspond to the new cue order):
```r
# 2. Change cue orders:
tree_df[5:8, ] <- tree_df[1:4, ]          # 4 new FFTs (as copies of existing ones)
tree_df$cues[5:8] <- "cp; thal; ca"       # modify order of cues
tree_df$thresholds[5:8] <- "a; rd,fd; 0"  # modify order of thresholds accordingly

tree_df$tree <- 1:nrow(tree_df)           # adjust tree numbers
# tree_df
```
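The threshold reordering can also be done programmatically rather than by retyping the string. The helper below is a purely illustrative base-R sketch (not part of FFTrees) for three-cue definitions; note that splitting on ";" leaves comma-separated factor levels such as rd,fd intact:

```r
# Swap the first two entries of a 3-part thresholds string:
swap_first_two <- function(s) {
  parts <- trimws(strsplit(s, ";", fixed = TRUE)[[1]])  # "rd,fd; a; 0" -> c("rd,fd", "a", "0")
  paste(parts[c(2, 1, 3)], collapse = "; ")             # reorder and reassemble
}

swap_first_two("rd,fd; a; 0")  # "a; rd,fd; 0"
```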
The resulting data frame tree_df
contains the
definitions of eight FFTs. The first three are copies of trees
in x
, but the other five are new.
We can evaluate this set by running the FFTrees()
function with the previous FFTrees
object x
(i.e., with its formula
and data
settings) and
specifying tree_df
in the tree.definitions
argument:
```r
# Create a modified FFTrees object y:
y <- FFTrees(object = x,                  # use previous FFTrees object x
             tree.definitions = tree_df,  # but with new tree definitions
             main = "Heart Disease 2")    # revised label
```
The resulting FFTrees
object y
contains the
decisions and summary statistics of all eight FFTs for the data
specified in x
. Although it is unlikely that one of the
newly created trees beats the automatically created FFTs, we find that
reversing the order of the first two cues has only minimal effects on
training accuracy (as measured by bacc
):
```r
y$trees$definitions  # tree definitions
#> # A tibble: 8 × 7
#>    tree nodes classes cues         directions thresholds  exits
#>   <int> <int> <chr>   <chr>        <chr>      <chr>       <chr>
#> 1     1     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   1;0;0.5
#> 2     2     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 1;0;0.5
#> 3     3     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   0;1;0.5
#> 4     4     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 0;1;0.5
#> 5     5     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   1; 1; 0.5
#> 6     6     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 1; 1; 0.5
#> 7     7     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   0;0;0.5
#> 8     8     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 0;0;0.5

y$trees$stats$train  # training statistics
#> # A tibble: 8 × 20
#>    tree     n    hi    fa    mi    cr  sens  spec    far   ppv   npv dprime
#>   <int> <int> <int> <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
#> 1     1   150    54    18    12    66 0.818 0.786 0.214  0.75  0.846   1.69
#> 2     2   150    55    20    11    64 0.833 0.762 0.238  0.733 0.853   1.66
#> 3     3   150    44     7    22    77 0.667 0.917 0.0833 0.863 0.778   1.79
#> 4     4   150    44     7    22    77 0.667 0.917 0.0833 0.863 0.778   1.79
#> 5     5   150    63    42     3    42 0.955 0.5   0.5    0.6   0.933   1.66
#> 6     6   150    63    42     3    42 0.955 0.5   0.5    0.6   0.933   1.66
#> 7     7   150    28     2    38    82 0.424 0.976 0.0238 0.933 0.683   1.74
#> 8     8   150    28     2    38    82 0.424 0.976 0.0238 0.933 0.683   1.74
#> # … with 8 more variables: acc <dbl>, bacc <dbl>, wacc <dbl>, cost_dec <dbl>,
#> #   cost_cue <dbl>, cost <dbl>, pci <dbl>, mcu <dbl>
```
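As a quick cross-check of these statistics, sens, spec, and bacc can be recomputed from the frequency counts hi, fa, mi, and cr, using the standard signal-detection definitions (the values below are the counts of FFT #1 from the table above):

```r
# Recompute accuracy statistics of FFT #1 from its 2x2 frequency counts:
hi <- 54; fa <- 18; mi <- 12; cr <- 66  # hits, false alarms, misses, correct rejections
sens <- hi / (hi + mi)     # sensitivity: hits among all true signal cases
spec <- cr / (cr + fa)     # specificity: correct rejections among all true noise cases
bacc <- (sens + spec) / 2  # balanced accuracy: mean of sens and spec

round(c(sens = sens, spec = spec, bacc = bacc), 3)
#>  sens  spec  bacc
#> 0.818 0.786 0.802
```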
Note that the trees in y
were sorted by their
performance on the current goal
(here bacc
).
For instance, the new rake tree with cue order cp; thal; ca
and exits 1; 1; 0.5
is now FFT #6. When examining its
performance on "test"
data (i.e., for prediction):
```r
# Print and plot FFT #6:
print(y, tree = 6, data = "test")
plot(y, tree = 6, data = "test")
```
we see that it has a balanced accuracy bacc
of 70%. More
precisely, its bias for predicting disease
(i.e., signal or
True) yields near-perfect sensitivity (96%), but very poor specificity
(44%).
If we wanted to change more aspects of x (e.g., use
different data or goal settings), we could
have created a new FFTrees object without supplying the
previous object x, as long as the FFTs defined in
tree.definitions match the settings of
formula and data.
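When assembling tree definitions from scratch rather than copying them from an existing object, a quick sanity check helps: the data frame should contain the seven columns that appear in the definition printouts above. The check below is plain base R (the column names are taken from this vignette's output; FFTrees itself may perform additional validation):

```r
# Check that a manual tree-definitions data frame has the expected columns:
needed <- c("tree", "nodes", "classes", "cues", "directions", "thresholds", "exits")

my_defs <- data.frame(tree = 1L, nodes = 3L, classes = "c;c;n",
                      cues = "thal;cp;ca", directions = "=;=;>",
                      thresholds = "rd,fd;a;0", exits = "1;0;0.5")

all(needed %in% names(my_defs))  # TRUE: all required columns are present
```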
Here is a complete list of the vignettes available in the FFTrees package:
|   | Vignette | Description |
|---|----------|-------------|
|   | Main guide: FFTrees overview | An overview of the FFTrees package |
| 1 | Tutorial: FFTs for heart disease | An example of using FFTrees() to model heart disease diagnosis |
| 2 | Accuracy statistics | Definitions of accuracy statistics used throughout the package |
| 3 | Creating FFTs with FFTrees() | Details on the main FFTrees() function |
| 4 | Manually specifying FFTs | How to directly create FFTs without using the built-in algorithms |
| 5 | Visualizing FFTs | Plotting FFTrees objects, from full trees to icon arrays |
| 6 | Examples of FFTs | Examples of FFTs from different datasets contained in the package |
1. Unambiguous my.tree descriptions must avoid
using “is” and “is not” without additional qualifications (like “equal”,
“different”, “larger”, “smaller”, etc.).↩︎