Manually specifying FFTs

We usually create fast-and-frugal trees (FFTs) from data by using the FFTrees() function (see the Main guide: FFTrees overview and the vignette on Creating FFTs with FFTrees() for instructions). However, we occasionally may want to design and evaluate specific FFTs (e.g., to test a hypothesis or include or exclude some variables based on theoretical considerations).

There are two ways to manually define fast-and-frugal trees with the FFTrees() function:

1. as a sentence using the my.tree argument (the easier way), or

2. as a data frame using the tree.definitions argument (the harder way).

Both of these methods require some data to evaluate the performance of FFTs, but will bypass the tree construction algorithms built into the FFTrees package. As manually created FFTs are not optimized for specific data, the key difference between fitting and predicting disappears for such FFTs. Although we can still use two sets of ‘train’ vs. ‘test’ data, a manually defined FFT is not fitted and hence should not be expected to perform systematically better on ‘train’ data than on ‘test’ data.

1. Using my.tree

The first method for manually defining an FFT is to use the my.tree argument, where my.tree is a sentence describing a (single) FFT. When this argument is specified in FFTrees(), the function (or rather its auxiliary function fftrees_wordstofftrees()) attempts to interpret the verbal description and convert it into a valid definition of an FFT (as part of an FFTrees object).

For example, let’s use the heartdisease data to find out how some predictor variables (e.g., sex, age, etc.) predict the criterion variable (diagnosis):

Table 1: Five cues and the binary criterion variable diagnosis for the first cases of the heartdisease data.

sex  age  thal    cp  ca  diagnosis
  1   63  fd      ta   0  FALSE
  1   67  normal  a    3  TRUE
  1   67  rd      a    2  TRUE
  1   37  normal  np   0  FALSE
  0   41  normal  aa   0  FALSE
  1   56  normal  aa   0  FALSE

Here’s how we could verbally describe an FFT by using the first three cues in conditional sentences:

in_words <- "If sex = 1, predict True.
If age < 45, predict False.
If thal = {fd, normal}, predict True.
Otherwise, predict False."

As we will see shortly, the FFTrees() function accepts such descriptions (assigned here to a character string in_words) as its my.tree argument, creates a corresponding FFT, and evaluates it on a corresponding dataset.

Verbally defining FFTs

Here are some instructions for manually specifying trees:

• Each node must start with the word “If” and should correspond to the form: If <CUE> <DIRECTION> <THRESHOLD>, predict <EXIT>.

• Numeric thresholds should be specified directly (without brackets), like age > 21.

• For categorical variables, factor thresholds must be specified within curly braces, like sex = {male}. For factors with sets of values, categories within a threshold should be separated by commas like eyecolor = {blue,brown}.

• To specify cue directions, standard logical comparisons =, !=, <, >= (etc.) are valid. For numeric cues, only use >, >=, <, or <=. For factors, only use = or !=.

• Positive exits are indicated by True, while negative exits are specified by False.

• The final node of an FFT is always bi-directional (i.e., has both a positive and a negative exit). The description of the final node always mentions its positive (True) exit first. The text Otherwise, predict EXIT that we have included in the example above is actually not necessary (and ignored).
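The sentence form described in these rules can be checked mechanically before a description is passed to FFTrees(). The following is a minimal sketch in base R, not the parser that FFTrees uses internally: the helper is_valid_node() and its regular expression are hypothetical and cover only the symbolic form (they do not enforce which directions are allowed for numeric versus categorical cues).

```r
# Hypothetical helper (not part of FFTrees): check whether a node sentence
# follows the form "If <CUE> <DIRECTION> <THRESHOLD>, predict <EXIT>."
is_valid_node <- function(sentence) {
  pattern <- paste0(
    "^If\\s+",                     # every node starts with "If"
    "[A-Za-z][A-Za-z0-9_.]*\\s*",  # cue name
    "(!=|<=|>=|=|<|>)\\s*",        # cue direction (symbolic form only)
    "(\\{[^{}]+\\}|-?[0-9.]+)",    # factor levels in braces, or a number
    ",\\s*predict\\s+",            # separator between condition and exit
    "(True|False)\\.?$"            # positive or negative exit
  )
  grepl(pattern, trimws(sentence))
}

nodes <- c("If sex = 1, predict True.",
           "If age < 45, predict False.",
           "If thal = {fd, normal}, predict True.",
           "Otherwise, predict False.")

is_valid_node(nodes)
#> [1]  TRUE  TRUE  TRUE FALSE
```

The last element is FALSE because the optional "Otherwise" line is not a node of the form above, which matches the remark that this line is ignored.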

Example

Now, let’s use our verbal description of an FFT (assigned to in_words above) as the my.tree argument of the FFTrees() function. This creates a corresponding FFT and applies it to the heartdisease data:

# Create FFTrees from a verbal FFT description (as my.tree):
my_fft <- FFTrees(formula = diagnosis ~ .,
                  data = heartdisease,
                  main = "My 1st FFT",
                  my.tree = in_words)

Running FFTrees() with the my.tree argument creates an object my_fft that contains one FFT. A verbal description of this tree can be printed by inwords(my_fft), but we want to print or plot the object to evaluate the tree’s performance on training or testing data. Let’s see how well our manually constructed FFT (my_fft) did on the training data:

# Inspect FFTrees object:
plot(my_fft, data = "train")

When manually constructing a tree, the resulting FFTrees object only contains a single FFT. Hence, the ROC plot (in the bottom right panel of Figure 1) cannot show a range of FFTs, but locates the constructed FFT in ROC space.

The formal definition of our new FFT is available from the FFTrees object my_fft:

# Get FFT definition(s):
my_fft$trees$definitions
#> # A tibble: 1 × 7
#>    tree nodes classes cues         directions thresholds     exits
#>   <int> <int> <chr>   <chr>        <chr>      <chr>          <chr>
#> 1     1     3 n;n;c   sex;age;thal =;>=;=     1;45;fd,normal 1;0;.5

Note that the 2nd node in this FFT (using the age cue) is predicting the noise outcome (i.e., a non-final exit value of 0 or FALSE, shown to the left). As our tree definitions always refer to the signal outcome (i.e., a non-final exit value of 1 or TRUE, shown to the right), the direction symbol of a left exit (i.e., the 2nd node in Figure 1: if age < 45, predict 0 or noise) must be flipped relative to its appearance in the tree definition (if age >= 45, predict 1 or signal). Thus, the plot and the formal definition describe the same FFT.
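The flipping rule can be written down as a small lookup table. This is an illustrative sketch in base R; flip_direction() is a hypothetical helper and not a function of the FFTrees package.

```r
# Hypothetical helper (not part of FFTrees): return the negation of a cue
# direction, as needed when a noise (0) exit is rewritten as a signal (1) rule.
flip_direction <- function(direction) {
  opposite <- c("="  = "!=", "!=" = "=",
                "<"  = ">=", ">=" = "<",
                ">"  = "<=", "<=" = ">")
  unname(opposite[direction])
}

# "If age < 45, predict False" corresponds to "if age >= 45, predict signal":
flip_direction("<")
#> [1] ">="

# "If cp != {a}, predict False" corresponds to "if cp = a, predict signal":
flip_direction("!=")
#> [1] "="
```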

As it turns out, the performance of our first FFT created from a verbal description is a mixed affair: The tree has a rather high sensitivity (of 91%), but its low specificity (of only 10%) allows for many false alarms. Consequently, its accuracy measures are only around baseline level.

Creating an alternative FFT

Let’s see if we can come up with a better FFT. The following example uses the cues thal, cp, and ca in the my.tree argument:

# Create a 2nd FFT from an alternative FFT description (as my.tree):
my_fft_2 <- FFTrees(formula = diagnosis ~.,
data = heartdisease,
main = "My 2nd FFT",
my.tree = "If thal = {rd,fd}, predict True.
If cp != {a}, predict False.
If ca > 1, predict True.
Otherwise, predict False.")

As FFTrees aims to interpret the my.tree argument to the best of its abilities, there is some flexibility in entering a verbal description of an FFT. For instance, we also could have described our desired FFT in more flowery terms:

# Create a 2nd FFT from an alternative FFT description (as my.tree):
my_fft_2 <- FFTrees(formula = diagnosis ~.,
data = heartdisease,
main = "My 2nd FFT",
my.tree = "If thal equals {rd,fd}, we shall say True.
If Cp differs from {a}, let's predict False.
If CA happens to exceed 1, we insist on True.
Else, give up and go away.") 

However, as the vocabulary of FFTrees is limited, it is safer to enter cue directions in their symbolic form (i.e., using =, <, <=, >, >=, or !=).[1] To verify that FFTrees interpreted our my.tree description correctly, let’s check whether inwords(my_fft_2) yields a description that corresponds to our intended FFT:

inwords(my_fft_2)
#> [1] "If thal = {rd,fd}, decide True."
#> [2] "If cp != {a}, decide False."
#> [3] "If ca > 1, decide True, otherwise, decide False."

As this seems (a more prosaic version of) what we wanted, let’s visualize the best training tree (to evaluate its performance) and briefly inspect its tree definition:

# Visualize FFT:
plot(my_fft_2)

# FFT definition:
my_fft_2$trees$definitions
#> # A tibble: 1 × 7
#>    tree nodes classes cues       directions thresholds exits
#>   <int> <int> <chr>   <chr>      <chr>      <chr>      <chr>
#> 1     1     3 c;c;n   thal;cp;ca =;=;>      rd,fd;a;1  1;0;.5
# Note the flipped direction value for 2nd cue (exit = '0'):
# 'if (cp  = a), predict 1' in the tree definition corresponds to
# 'if (cp != a), predict 0' in the my.tree description and plot.  

This alternative FFT nicely balances sensitivity and specificity and performs much better overall. Nevertheless, it is still far from perfect, so check whether you can create even better ones!
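Because a manually defined FFT is not fitted, it can be evaluated on separate training and test data without expecting a systematic advantage on the training set. The sketch below assumes that the heart.train and heart.test datasets are available from the FFTrees package and that test statistics are stored in $trees$stats$test alongside the training statistics; the object name my_fft_3 and its label are chosen here for illustration.

```r
library(FFTrees)

# The same verbal description as for the 2nd FFT:
fft_words <- "If thal = {rd,fd}, predict True.
If cp != {a}, predict False.
If ca > 1, predict True.
Otherwise, predict False."

# Evaluate the manual FFT on separate training and test data:
my_fft_3 <- FFTrees(formula = diagnosis ~ .,
                    data = heart.train,
                    data.test = heart.test,
                    main = "My 2nd FFT (train vs. test)",
                    my.tree = fft_words)

# Compare key statistics for both datasets:
my_fft_3$trees$stats$train[, c("sens", "spec", "bacc")]
my_fft_3$trees$stats$test[,  c("sens", "spec", "bacc")]
```

Any difference between the two rows reflects sampling variation between the datasets rather than overfitting, as no parameters of the tree were estimated from the training data.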

2. Using tree.definitions

More experienced users may want to define and evaluate more than one FFT at a time. To achieve this, the FFTrees() function allows providing sets of tree.definitions (as a data frame). However, as questions regarding specific trees usually arise late in an exploration of FFTs, the tree.definitions argument is mostly used in combination with an existing FFTrees object x. In this case, the parameters of x (e.g., regarding the formula, data, and goals to be used) are reused, but its tree definitions (stored in x$trees$definitions) are replaced by those in tree.definitions, and the object is re-evaluated for those FFTs.

Example

We illustrate a typical workflow by redefining some FFTs that were built in the Tutorial: FFTs for heart disease and evaluating them on the (full) heartdisease data.

First, we use our default algorithms to create an FFTrees object x:

# Create an FFTrees object x:
x <- FFTrees(formula = diagnosis ~ .,                       # criterion and (all) predictors
             data = heart.train,                            # training data
             data.test = heart.test,                        # testing data
             main = "Heart Disease 1",                      # initial label
             decision.labels = c("low risk", "high risk"),  # exit labels
             quiet = TRUE)                                  # hide user feedback

As we have seen in the Tutorial, evaluating this expression yields a set of 7 FFTs. Rather than evaluating them individually (by issuing print(x) or plot(x) commands to inspect specific trees), we can obtain both their definitions and their performance characteristics on a variety of measures either by running summary(x) or by inspecting corresponding parts of the FFTrees object. For instance, the following alternatives would both show the current definitions of the generated FFTs:

# Tree definitions of x:
# summary(x)$definitions   # from summary()
x$trees$definitions        # from FFTrees object x
#> # A tibble: 7 × 7
#>    tree nodes classes cues             directions thresholds          exits
#>   <int> <int> <chr>   <chr>            <chr>      <chr>               <chr>
#> 1     1     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           1;0;0.5
#> 2     2     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;0;1;0.5
#> 3     3     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;1;0.5
#> 4     4     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;0;0.5
#> 5     5     3 c;c;n   thal;cp;ca       =;=;>      rd,fd;a;0           0;0;0.5
#> 6     6     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 0;0;0;0.5
#> 7     7     4 c;c;n;c thal;cp;ca;slope =;=;>;=    rd,fd;a;0;flat,down 1;1;1;0.5

Each line in these tree definitions defines an FFT in the context of our current FFTrees object x (see the vignette on Creating FFTs with FFTrees() for help on interpreting tree definitions). As the “ifan” algorithm responsible for creating these trees yields a family of highly similar FFTs (as the FFTs vary only by their exits, and some truncate the last cue), we may want to examine alternative versions for these trees.

Modifying tree definitions

To demonstrate how to create and evaluate manual FFT definitions, we copy the existing tree definitions (as a data frame), select three FFTs (rows), and then create a 4th definition (with a different exit structure):

# 0. Copy and choose some existing FFT definitions:
tree_df <- x$trees$definitions    # get FFT definitions (as df)
tree_df <- tree_df[c(1, 3, 5), ]  # filter 3 particular FFTs

# 1. Add a tree with 1;1;0.5 exit structure (a "rake" tree with Signal bias):
tree_df[4, ] <- tree_df[1, ]        # initialize new FFT #4 (as copy of FFT #1)
tree_df$exits[4] <- c("1; 1; 0.5")  # modify exits of FFT #4

tree_df$tree <- 1:nrow(tree_df)     # adjust tree numbers
# tree_df

Moreover, let’s define four additional FFTs that reverse the order of the 1st and 2nd cues. As both cues are categorical (i.e., of class c) and have the same direction (i.e., =), we only need to reverse the thresholds (so that they correspond to the new cue order):

# 2. Change cue orders:
tree_df[5:8, ] <- tree_df[1:4, ]          # 4 new FFTs (as copies of existing ones)
tree_df$cues[5:8] <- "cp; thal; ca"       # modify order of cues
tree_df$thresholds[5:8] <- "a; rd,fd; 0"  # modify order of thresholds accordingly

tree_df$tree <- 1:nrow(tree_df)           # adjust tree numbers
# tree_df

The resulting data frame tree_df contains the definitions of eight FFTs. The first three are copies of trees in x, but the other five are new.
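When definitions are edited by hand, it is easy to leave the cues, thresholds, and exits of a row out of step. As a sanity check, a single definition can be unpacked into one row per node. The sketch below uses only base R; unpack_definition() is a hypothetical helper (not part of FFTrees), and the example row is typed in by hand with the values of the reordered rake tree rather than taken from an FFTrees object.

```r
# Hypothetical helper (not part of FFTrees): split one definition row
# into a data frame with one row per node.
unpack_definition <- function(def) {
  split_field <- function(field) {
    trimws(strsplit(field, ";", fixed = TRUE)[[1]])
  }
  nodes <- data.frame(
    class     = split_field(def$classes),
    cue       = split_field(def$cues),
    direction = split_field(def$directions),
    threshold = split_field(def$thresholds),
    exit      = split_field(def$exits),
    stringsAsFactors = FALSE
  )
  stopifnot(nrow(nodes) == def$nodes)  # node count must match the 'nodes' column
  nodes
}

# One definition, typed in by hand (values of the reordered rake tree):
def <- data.frame(tree = 8L, nodes = 3L, classes = "c;c;n",
                  cues = "cp; thal; ca", directions = "=;=;>",
                  thresholds = "a; rd,fd; 0", exits = "1; 1; 0.5",
                  stringsAsFactors = FALSE)

unpack_definition(def)
```

If the fields of a row contain different numbers of entries, data.frame() stops with an error, which flags the inconsistency before the definitions are evaluated.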

Evaluating tree.definitions

We can evaluate this set by running the FFTrees() function with the previous FFTrees object x (i.e., with its formula and data settings) and specifying tree_df in the tree.definitions argument:

# Create a modified FFTrees object y:
y <- FFTrees(object = x,                  # use previous FFTrees object x
             tree.definitions = tree_df,  # but with new tree definitions
             main = "Heart Disease 2"     # revised label
)

The resulting FFTrees object y contains the decisions and summary statistics of all eight FFTs for the data specified in x. Although it is unlikely that one of the newly created trees beats the automatically created FFTs, we find that reversing the order of the first cues has only minimal effects on training accuracy (as measured by bacc):

y$trees$definitions  # tree definitions
#> # A tibble: 8 × 7
#>    tree nodes classes cues         directions thresholds  exits
#>   <int> <int> <chr>   <chr>        <chr>      <chr>       <chr>
#> 1     1     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   1;0;0.5
#> 2     2     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 1;0;0.5
#> 3     3     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   0;1;0.5
#> 4     4     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 0;1;0.5
#> 5     5     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   1; 1; 0.5
#> 6     6     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 1; 1; 0.5
#> 7     7     3 c;c;n   thal;cp;ca   =;=;>      rd,fd;a;0   0;0;0.5
#> 8     8     3 c;c;n   cp; thal; ca =;=;>      a; rd,fd; 0 0;0;0.5
y$trees$stats$train  # training statistics
#> # A tibble: 8 × 20
#>    tree     n    hi    fa    mi    cr  sens  spec    far   ppv   npv dprime
#>   <int> <int> <int> <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
#> 1     1   150    54    18    12    66 0.818 0.786 0.214  0.75  0.846   1.69
#> 2     2   150    55    20    11    64 0.833 0.762 0.238  0.733 0.853   1.66
#> 3     3   150    44     7    22    77 0.667 0.917 0.0833 0.863 0.778   1.79
#> 4     4   150    44     7    22    77 0.667 0.917 0.0833 0.863 0.778   1.79
#> 5     5   150    63    42     3    42 0.955 0.5   0.5    0.6   0.933   1.66
#> 6     6   150    63    42     3    42 0.955 0.5   0.5    0.6   0.933   1.66
#> 7     7   150    28     2    38    82 0.424 0.976 0.0238 0.933 0.683   1.74
#> 8     8   150    28     2    38    82 0.424 0.976 0.0238 0.933 0.683   1.74
#> # … with 8 more variables: acc <dbl>, bacc <dbl>, wacc <dbl>, cost_dec <dbl>,
#> #   cost_cue <dbl>, cost <dbl>, pci <dbl>, mcu <dbl>

Note that the trees in y were sorted by their performance on the current goal (here bacc). For instance, the new rake tree with cue order cp; thal; ca and exits 1; 1; 0.5 is now FFT #6. When examining its performance on "test" data (i.e., for prediction):

# Print and plot FFT #6:
print(y, tree = 6, data = "test")
plot(y,  tree = 6, data = "test")

we see that it has a balanced accuracy bacc of 70%. More precisely, its bias for predicting disease (i.e., signal or True) yields near-perfect sensitivity (96%), but very poor specificity (44%).
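For reference, the relation between the frequency counts and these accuracy measures can be reproduced by hand. The sketch below uses the training counts of FFT #6 from the statistics table above (not the test counts, which are not listed here) and the standard definitions of sensitivity, specificity, and balanced accuracy.

```r
# Training counts for FFT #6 (copied from the statistics table above):
hi <- 63  # hits
fa <- 42  # false alarms
mi <- 3   # misses
cr <- 42  # correct rejections

sens <- hi / (hi + mi)     # sensitivity: hits among all true signal cases
spec <- cr / (cr + fa)     # specificity: correct rejections among all true noise cases
bacc <- (sens + spec) / 2  # balanced accuracy: mean of sensitivity and specificity

round(c(sens = sens, spec = spec, bacc = bacc), 3)
#>  sens  spec  bacc 
#> 0.955 0.500 0.727
```

The same averaging applies to the test values reported above: (0.96 + 0.44) / 2 = 0.70.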

If we wanted to change more aspects of x (e.g., use different data or goal settings), we could have created a new FFTrees object without supplying the previous object x, as long as the FFTs defined in tree.definitions match the settings of formula and data.
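A minimal sketch of this variant follows. It assumes that FFTrees() accepts formula, data, and tree.definitions together without an object argument, as the paragraph above describes. The object name z and its label are chosen here for illustration, and the definitions are taken unchanged from a freshly created x so that they are guaranteed to fit the data.

```r
library(FFTrees)

# Build FFTs with the default algorithm on the training data:
x <- FFTrees(formula = diagnosis ~ .,
             data = heart.train,
             data.test = heart.test)

# Evaluate the same definitions on the full dataset, without passing x:
z <- FFTrees(formula = diagnosis ~ .,
             data = heartdisease,
             tree.definitions = x$trees$definitions,
             main = "Heart Disease 3")

# Performance of all FFTs on the full heartdisease data:
z$trees$stats$train[, c("tree", "sens", "spec", "bacc")]
```

As z is created from scratch, none of the settings of x (such as its test data or decision labels) carry over; only the tree definitions are shared.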

Vignettes

Here is a complete list of the vignettes available in the FFTrees package:

    Vignette                          Description
    Main guide: FFTrees overview      An overview of the FFTrees package
1   Tutorial: FFTs for heart disease  An example of using FFTrees() to model heart disease diagnosis
2   Accuracy statistics               Definitions of accuracy statistics used throughout the package
3   Creating FFTs with FFTrees()      Details on the main FFTrees() function
4   Manually specifying FFTs          How to directly create FFTs without using the built-in algorithms
5   Visualizing FFTs                  Plotting FFTrees objects, from full trees to icon arrays
6   Examples of FFTs                  Examples of FFTs from different datasets contained in the package

1. Unambiguous my.tree descriptions must avoid using “is” and “is not” without additional qualifications (like “equal”, “different”, “larger”, “smaller”, etc.).