This tutorial on using the FFTrees package follows the examples presented in Phillips et al. (2017), which is freely available in HTML and PDF formats.
In the following, we explain how to use FFTrees to create, evaluate and visualize FFTs in four simple steps.
We can install FFTrees from CRAN using install.packages(). (We only need to do this once.)
# Install the package from CRAN:
install.packages("FFTrees")
To use the package, we first need to load it into our current R session. We load the package using library():
# Load the package:
library(FFTrees)
The FFTrees package contains several vignettes (like this one) that guide you through the package's functionality. To open the main guide, run FFTrees.guide():
# Open the main package guide:
FFTrees.guide()
In this example, we will create FFTs from a heart disease data set. The training data are in an object called heart.train, and the testing data are in an object called heart.test. For these data, we will predict diagnosis, a binary criterion that indicates whether each patient has or does not have heart disease (i.e., is at high-risk or low-risk).
To create an FFTrees object, we use the function FFTrees() with two main arguments:
- formula expects a formula indicating a binary criterion variable as a function of one or more predictor variable(s) to be considered for the tree. The shorthand formula = diagnosis ~ . means to include all predictor variables.
- data specifies the training data used to construct the FFTs (which must include the criterion variable).
Here is how we can construct our first FFTs:
# Create an FFTrees object:
heart.fft <- FFTrees(formula = diagnosis ~ .,                      # Criterion and (all) predictors
                     data = heart.train,                           # Training data
                     data.test = heart.test,                       # Testing data
                     main = "Heart Disease",                       # General label
                     decision.labels = c("Low-Risk", "High-Risk")  # Decision labels (False/True)
                     )
#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Using default 'cost.outcomes' = (0 1 1 0)
#> — Using default 'cost.cues' = (0 per cue)
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Aiming to create FFTs with 'ifan' algorithm (chasing 'bacc'):
#> Aiming to rank 13 cues (optimizing 'bacc'):
#> Successfully ranked 13 cues.
#> Successfully created 7 FFTs with 'ifan' algorithm.
#> Successfully defined 7 FFTs.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to rank FFTs by 'train' data:
#> Successfully ranked FFTs by 'train' data.
#> Aiming to apply FFTs to 'test' data:
#> Successfully applied FFTs to 'test' data.
#> Aiming to express FFTs in words:
#> Successfully expressed FFTs in words.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
Evaluating this expression runs code that examines the data, optimizes the threshold of each cue based on our current goals, and creates and evaluates 7 FFTs. The resulting FFTrees object, which contains the tree definitions, their decisions, and their performance statistics, is assigned to heart.fft.
FFTrees() accepts further optional arguments:
- algorithm: There are two different algorithms available to build FFTs: "ifan" (Phillips et al., 2017) and "dfan" (Phillips et al., 2017). The "max" and "zigzag" algorithms (Martignon et al., 2008) are no longer supported.
- max.levels: Changes the maximum number of levels that are allowed in the tree.
The following arguments apply when using the “ifan” or “dfan” algorithms for creating new FFTs:
- goal.chase: The goal.chase argument changes which statistic is maximized during tree construction (for the "ifan" and "dfan" algorithms). Possible values include "acc", "bacc", "wacc", "dprime", and "cost". The default is "wacc" with a sensitivity weight of 0.50 (which renders it identical to "bacc").
- goal: The goal argument changes which statistic is maximized when selecting trees after construction (for the "ifan" and "dfan" algorithms). Possible values include "acc", "bacc", "wacc", "dprime", and "cost".
- my.tree or tree.definitions: We can define a new tree from a verbal description (as a set of sentences), or manually specify sets of FFTs as a data frame (in appropriate format). See the Manually specifying FFTs vignette for details.
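As a brief illustration of these arguments, the following sketch builds a second set of FFTs with non-default settings. The object name heart.fft.alt and the particular values chosen here are assumptions made for illustration only; they are not recommendations from the original tutorial.

```r
library(FFTrees)

# Sketch: same heart disease data, but with non-default construction settings.
# (heart.fft.alt is a hypothetical object name.)
heart.fft.alt <- FFTrees(formula = diagnosis ~ .,
                         data = heart.train,
                         data.test = heart.test,
                         algorithm = "dfan",   # Use the "dfan" instead of the "ifan" algorithm
                         max.levels = 3,       # Allow at most 3 levels per tree
                         goal.chase = "acc",   # Statistic maximized during tree construction
                         goal = "acc")         # Statistic used to select the best tree

# Print the object to compare it with heart.fft:
heart.fft.alt
```

Comparing the printed summary with that of heart.fft shows whether the alternative settings change the selected tree or its performance.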
Now we can inspect and summarize the generated decision trees. We will start by printing the FFTrees object to return basic information to the console:
# Print an FFTrees object:
heart.fft
#> Heart Disease
#> FFTrees
#> - Trees: 7 fast-and-frugal trees predicting diagnosis
#> - Outcome costs: [hi = 0, fa = 1, mi = 1, cr = 0]
#>
#> FFT #1: Definition
#> [1] If thal = {rd,fd}, decide High-Risk.
#> [2] If cp != {a}, decide Low-Risk.
#> [3] If ca > 0, decide High-Risk, otherwise, decide Low-Risk.
#>
#> FFT #1: Training Accuracy
#> Training data: N = 150, Pos (+) = 66 (44%)
#>
#> | | True + | True - | Totals:
#> |----------|--------|--------|
#> | Decide + | hi 54 | fa 18 | 72
#> | Decide - | mi 12 | cr 66 | 78
#> |----------|--------|--------|
#> Totals: 66 84 N = 150
#>
#> acc = 80.0% ppv = 75.0% npv = 84.6%
#> bacc = 80.2% sens = 81.8% spec = 78.6%
#>
#> FFT #1: Training Speed, Frugality, and Cost
#> mcu = 1.74, pci = 0.87, E(cost) = 0.200
The output tells us several pieces of information:
- The tree with the highest weighted accuracy (wacc, here with a sensitivity weight of 0.5) is selected as the best tree.
- Here, the best tree, FFT #1, uses three cues: thal, cp, and ca.
- Several statistics summarize this tree's performance (shown here for the training data).
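To make the logic of FFT #1 concrete, here is a minimal base-R sketch that re-implements its three nodes by hand. This is not how the FFTrees package applies trees internally; the function name classify_fft1 and the four example patients are hypothetical and serve only to show that each node either exits with a decision or passes the case on to the next node.

```r
# Hypothetical patients (cue values are made up for illustration):
patients <- data.frame(
  thal = c("rd", "normal", "normal", "normal"),
  cp   = c("ta", "np", "a", "a"),
  ca   = c(0, 2, 1, 0),
  stringsAsFactors = FALSE
)

# Hand-written version of FFT #1 (hypothetical helper function):
classify_fft1 <- function(thal, cp, ca) {
  ifelse(thal %in% c("rd", "fd"), "High-Risk",   # Node 1: exit with High-Risk
  ifelse(cp != "a", "Low-Risk",                  # Node 2: exit with Low-Risk
  ifelse(ca > 0, "High-Risk", "Low-Risk")))      # Node 3: final node, both exits
}

patients$decision <- classify_fft1(patients$thal, patients$cp, patients$ca)
print(patients)
```

The first patient is classified at node 1, the second at node 2, and the last two only at the final node.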
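All statistics to evaluate each tree can be derived from a 2 x 2 confusion table that contains the frequencies of hits (hi), false alarms (fa), misses (mi), and correct rejections (cr).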
For definitions of all accuracy statistics, see the accuracy statistics vignette.
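As a worked example, the statistics printed for FFT #1 can be recomputed in base R from the four cell frequencies of its training confusion table (hi = 54, fa = 18, mi = 12, cr = 66). The variable names are chosen for illustration; the formulas are the standard definitions of these measures, with wacc computed as the sensitivity-weighted average of sens and spec.

```r
# Cell frequencies from the training confusion table of FFT #1:
hi <- 54  # hits
fa <- 18  # false alarms
mi <- 12  # misses
cr <- 66  # correct rejections
n  <- hi + fa + mi + cr

acc  <- (hi + cr) / n       # overall accuracy
sens <- hi / (hi + mi)      # sensitivity
spec <- cr / (cr + fa)      # specificity
ppv  <- hi / (hi + fa)      # positive predictive value
npv  <- cr / (cr + mi)      # negative predictive value
bacc <- (sens + spec) / 2   # balanced accuracy

# Weighted accuracy: with a sensitivity weight of 0.5, wacc equals bacc.
sens_w <- 0.5
wacc <- sens * sens_w + spec * (1 - sens_w)

# Express all statistics as percentages (rounded to one decimal):
round(100 * c(acc = acc, ppv = ppv, npv = npv,
              bacc = bacc, sens = sens, spec = spec, wacc = wacc), 1)
```

The rounded percentages agree with the values in the printed summary above (e.g., acc = 80.0%, bacc = 80.2%).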
Use plot() to visualize an FFT (from an FFTrees object):
# Plot predictions of the best FFT when applied to test data:
plot(heart.fft, # An FFTrees object
data = "test") # data to use (i.e., either "train" or "test")?
- tree: Which tree in the object should be plotted? To plot a tree other than the best fitting tree (FFT #1), just specify another tree as an integer (e.g., plot(heart.fft, tree = 2)).
- data: For which dataset should statistics be shown? Either data = "train" (the default, showing fitting or "Training" performance), or data = "test" (showing prediction or "Testing" performance).
- stats: Should accuracy statistics be shown with the tree? The stats argument has been deprecated. To show only the tree, without any performance statistics, use the argument what = "tree" instead.
# Plot only the tree, without accuracy statistics:
plot(heart.fft, what = "tree")
# plot(heart.fft, stats = FALSE) # The 'stats' argument has been deprecated.
- comp: Should statistics from competitive algorithms be shown in the ROC curve? To remove the performance statistics of competitive algorithms (e.g., regression, random forests), include the argument comp = FALSE.
- what: To show individual cue accuracies (in ROC space), include the argument what = "cues":
# Plot cue accuracies (for training data) in ROC space:
plot(heart.fft, what = "cues")
#> Plotting cue training statistics:
#> — Cue accuracies ranked by bacc
See the Plotting FFTrees vignette for details on plotting FFTs.
An FFTrees object contains many different outputs. To see them all, run names():
# Show the names of all of the outputs in heart.fft:
names(heart.fft)
#> [1] "criterion_name" "cue_names" "formula" "trees"
#> [5] "data" "params" "competition" "cues"
To predict classifications for a new dataset, use the standard predict() function. For example, here's how to predict the classifications for data in the heartdisease object (which actually is just a combination of heart.train and heart.test):
# Predict classifications for a new dataset:
predict(heart.fft,
newdata = heartdisease)
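The following sketch shows one way to compare such predictions with the true criterion values. It assumes that predict() returns one classification per row of newdata (its default behavior, as far as we know) and re-creates heart.fft so that the example is self-contained; the name pred is chosen for illustration.

```r
library(FFTrees)

# Re-create the FFTrees object from above:
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data = heart.train,
                     data.test = heart.test)

# Predict classifications (assumed: one value per row of newdata):
pred <- predict(heart.fft, newdata = heartdisease)

# Cross-tabulate predictions and true diagnoses:
table(predicted = pred, actual = heartdisease$diagnosis)

# Proportion of correct classifications:
mean(pred == heartdisease$diagnosis)
```

Because heartdisease combines the training and testing data, this proportion mixes fitting and prediction performance and should not be read as a pure out-of-sample estimate.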
To define a specific FFT and apply it to data, we can define a tree
by providing its verbal description to the my.tree
argument:
# Create an FFT manually (from description):
my.heart.fft <- FFTrees(formula = diagnosis ~ .,
                        data = heart.train,
                        data.test = heart.test,
                        main = "My Heart FFT",
                        my.tree = "If chol > 350, predict True.
                                   If cp != {a}, predict False.
                                   If age <= 35, predict False, otherwise, predict True.")
Running this code evaluates my.tree for the specified sets of data. A visualization of the resulting tree shows its performance summary (for the training data):
plot(my.heart.fft, data = "train")
The resulting tree is actually not too bad, although its first node is pretty useless (as it only classifies 3 cases, all as false alarms). Thus, omitting the first node will result in an even simpler FFT that cannot be worse. Feel free to verify this — and see the Manually specifying FFTs vignette for additional details on defining FFTs from verbal or abstract descriptions.
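One way to check this claim is to define the same tree without its first node and compare the two performance summaries. The sketch below reuses the verbal description from above; the object name my.heart.fft.2 and the plot title are hypothetical.

```r
library(FFTrees)

# Same FFT as above, but without the first node (chol > 350):
my.heart.fft.2 <- FFTrees(formula = diagnosis ~ .,
                          data = heart.train,
                          data.test = heart.test,
                          main = "My Heart FFT (2 nodes)",
                          my.tree = "If cp != {a}, predict False.
                                     If age <= 35, predict False, otherwise, predict True.")

# Show the performance of the shorter tree (for the training data):
plot(my.heart.fft.2, data = "train")
```

Comparing this plot with the one for my.heart.fft shows whether dropping the first node changes the accuracy statistics.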
Here is a complete list of the vignettes available in the FFTrees package:
| | Vignette | Description |
|---|---|---|
| | Main guide: FFTrees overview | An overview of the FFTrees package |
| 1 | Tutorial: FFTs for heart disease | An example of using FFTrees() to model heart disease diagnosis |
| 2 | Accuracy statistics | Definitions of accuracy statistics used throughout the package |
| 3 | Creating FFTs with FFTrees() | Details on the main FFTrees() function |
| 4 | Manually specifying FFTs | How to directly create FFTs without using the built-in algorithms |
| 5 | Visualizing FFTs | Plotting FFTrees objects, from full trees to icon arrays |
| 6 | Examples of FFTs | Examples of FFTs from different datasets contained in the package |