Banner

Abraham Lincoln once said, "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
Aunt Margaret used to say, "If you dream of a forest, you'd better learn how to plant a tree."
data.tree says, "No matter if you are a lumberjack or a tree hugger. I will be your sanding block, and I will be your seed."

Introduction

Trees

Trees are ubiquitous in mathematics, computer science, data sciences, finance, and in many other attributes. Trees are especially useful when we are facing hierarchical data. For example, trees are used:

  • in decision theory (cf. decision trees)
  • in machine learning (e.g. classification trees)
  • in finance, e.g. to classify financial instruments into asset classes
  • in routing algorithms
  • in computer science and programming (e.g. binary search trees, XML)
  • e.g. for family trees

For more details, see the applications vignette by typing vignette("applications", package = "data.tree")

Trees in R

Tree-like structures are already used in R. For example, environments can be seen as nodes in a tree. And CRAN provides numerous packages that deal with tree-like structures, especially in the area of decision theory. Yet, there is no general purpose hierarchical data structure that could be used as conveniently and generically as, say, data.frame.

As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames. But often, hierarchies don’t marry with tables, and various workarounds are usually required.

Trees in data.tree

This package offers an alternative. The data.tree package lets you create hierarchies, called data.tree structures. The building block of theses structures are Node objects. The package provides basic traversal, search, and sort operations, and an infrastructure for recursive tree programming. You can decorate Nodes with your own attributes and methods, so as to extend the package to your needs.

The package also provides convenience methods for neatly printing and plotting trees. It supports conversion from and to data.frames, lists, and other tree structures such as dendrogram, phylo objects from the ape package, igraph, and other packages.

Technically, data.tree structures are bi-directional, ordered trees. Bi-directional means that you can navigate from parent to children and vice versa. Ordered means that the sort order of the children of a parent node is well-defined.

data.tree basics

Definitions

  • data.tree structure: a tree, consisting of multiple Node objects. Often, the entry point to a data.tree structure is the root Node
  • Node: both a class and the basic building block of data.tree structures
  • attribute: an active, a field, or a method. **Not to be confused with standard R attributes, c.f. ?attr, which have a different meaning. Many methods and functions have an attribute arg, which can refer to a an active, a field or a method. For example, see ?Get
  • active (sometimes called property): a field on a Node that can be called like an attribute, but behaves like a function without arguments. For example: node$position
  • field: a named value on a Node, e.g. node$cost <- 2500
  • method: a function acting on an object (on a Node in this context). Many methods are available in OO style (e.g. node$Revert()) or in traditional style (Revert(node))
  • inheritance: in this context, inheritance refers to a situation in which a child Node inherits e.g. an attribute from one of its ancestors. For example, see ?Get, ?SetNodeStyle

Tree creation

There are different ways to create a data.tree structure. For example, you can create a tree programmatically, by conversion from other R objects, or from a file.

Create a tree programmatically

Let’s start by creating a tree programmatically. We do this by creating Node objects, and linking them together so as to define the parent-child relationships.

In this example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 1) is the company. On level 2, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:

library(data.tree)

acme <- Node$new("Acme Inc.")
  accounting <- acme$AddChild("Accounting")
    software <- accounting$AddChild("New Software")
    standards <- accounting$AddChild("New Accounting Standards")
  research <- acme$AddChild("Research")
    newProductLine <- research$AddChild("New Product Line")
    newLabs <- research$AddChild("New Labs")
  it <- acme$AddChild("IT")
    outsource <- it$AddChild("Outsource")
    agile <- it$AddChild("Go agile")
    goToR <- it$AddChild("Switch to R")

print(acme)
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

As you can see from the previous example, each Node is identified by its name, i.e. the argument you pass into the Node$new(name) constructor. The name needs to be unique among siblings, such that paths to Nodes are unambiguous.

Node inherits from R6 reference class. This has the following implications:

  1. You can call methods on a Node in OO style, e.g. acme$Get("name")
  2. Node exhibits reference semantics. Thus, multiple variables in R can point to the same Node, and modifying a Node will modify it for all referencing variables. In the above code example, both acme$IT and it reference the same object. This is different from the value semantics, which is much more widely used in R.

Create a tree from a data.frame

Creating a tree programmatically is useful especially in the context of algorithms. However, most times you will create a tree by conversion. This could be by conversion from a nested list-of-lists, by conversion from another R tree-structure (e.g. an ape phylo), or by conversion from a data.frame. For more details on all the options, type ?as.Node and refer to the See Also section.

One of the most common conversions is the one from a data.frame in table format. The following code illustrates this. We load the GNI2014 data from the treemap package. This data.frame is in table format, meaning that each row will represent a leaf in the data.tree structure:

library(treemap)
data(GNI2014)
head(GNI2014)
##   iso3          country     continent population    GNI
## 3  BMU          Bermuda North America      67837 106140
## 4  NOR           Norway        Europe    4676305 103630
## 5  QAT            Qatar          Asia     833285  92200
## 6  CHE      Switzerland        Europe    7604467  88120
## 7  MAC Macao SAR, China          Asia     559846  76270
## 8  LUX       Luxembourg        Europe     491775  75990

Let’s convert that into a data.tree structure! We start by defining a pathString. The pathString describes the hierarchy by defining a path from the root to each leaf. In this example, the hierarchy comes very naturally:

GNI2014$pathString <- paste("world", 
                            GNI2014$continent, 
                            GNI2014$country, 
                            sep = "/")

Once our pathString is defined, conversion to Node is very easy:

population <- as.Node(GNI2014)
print(population, "iso3", "population", "GNI", limit = 20)
##                                 levelName iso3 population    GNI
## 1  world                                               NA     NA
## 2   ¦--North America                                   NA     NA
## 3   ¦   ¦--Bermuda                         BMU      67837 106140
## 4   ¦   ¦--United States                   USA  313973000  55200
## 5   ¦   ¦--Canada                          CAN   33487208  51630
## 6   ¦   ¦--Bahamas, The                    BHS     309156  20980
## 7   ¦   ¦--Trinidad and Tobago             TTO    1310000  20070
## 8   ¦   ¦--Puerto Rico                     PRI    3971020  19310
## 9   ¦   ¦--Barbados                        BRB     284589  15310
## 10  ¦   ¦--St. Kitts and Nevis             KNA      40131  14920
## 11  ¦   ¦--Antigua and Barbuda             ATG      85632  13300
## 12  ¦   ¦--Panama                          PAN    3360474  11130
## 13  ¦   ¦--Costa Rica                      CRI    4253877  10120
## 14  ¦   ¦--Mexico                          MEX  111211789   9870
## 15  ¦   ¦--Grenada                         GRD      90739   7910
## 16  ¦   ¦--St. Lucia                       LCA     160267   7260
## 17  ¦   ¦--Dominica                        DMA      72660   6930
## 18  ¦   ¦--St. Vincent and the Grenadines  VCT     104574   6610
## 19  ¦   ¦--Dominican Republic              DOM    9650054   6040
## 20  ¦   °--... 7 nodes w/ 0 sub                        NA     NA
## 21  °--... 6 nodes w/ 171 sub                          NA     NA

This is a simple example, and more options are available. Type ?FromDataFrameTable for all the details.

Create a tree from a file

Often, trees are created from one of many file formats. When developing this package, We opted for a multi-step approach, meaning that you first import the file into one of the well-known R data structures. Then you convert these into a data.tree structure. For example, typical import patterns could be:

  • csv -> data.frame in table format (?read.csv) -> data.tree (?as.Node.data.frame)
  • Newick -> ape phylo (?ape::read.tree) -> data.tree (?as.Node.phylo )
  • csv -> data.frame in network format (?read.csv) -> data.tree (c.f. ?FromDataFrameNetwork)
  • yaml -> list of lists (?yaml::yaml.load) -> data.tree (?as.Node.list)
  • json -> list of lists (e.g. ?jsonlite::fromJSON) -> data.tree (?as.Node.list)

If you have a choice, we recommend you consider yaml format to store and share your hierarchies. It is concise, human-readable, and very easy to convert to a data.tree. An example is provided here for illustration. The data represents what platforms and OS versions a group of students use:

library(yaml)
yaml <- "
name: OS Students 2014/15
OS X:
  Yosemite:
    users: 16
  Leopard:
    users: 43
Linux:
  Debian:
    users: 27
  Ubuntu:
    users: 36
Windows:
  W7:
    users: 31
  W8:
    users: 32
  W10:
    users: 4
"

osList <- yaml.load(yaml)
osNode <- as.Node(osList)
print(osNode, "users")
##              levelName users
## 1  OS Students 2014/15    NA
## 2   ¦--OS X               NA
## 3   ¦   ¦--Yosemite       16
## 4   ¦   °--Leopard        43
## 5   ¦--Linux              NA
## 6   ¦   ¦--Debian         27
## 7   ¦   °--Ubuntu         36
## 8   °--Windows            NA
## 9       ¦--W7             31
## 10      ¦--W8             32
## 11      °--W10             4

In cases where your leaf elements have no attributes, you might want to interpret them as nodes, and not as attributes. In such cases, you can use interpretNullAsList = TRUE to convert these into Nodes (instead of attributes).

For example:

library(yaml)
yaml <- "
name: OS Students 2014/15
OS X:
  Yosemite:
  Leopard:
Linux:
  Debian:
  Ubuntu:
Windows:
  W7:
  W8:
  W10:
"

osList <- yaml.load(yaml)
osNode <- as.Node(osList, interpretNullAsList = TRUE)
osNode$printFormatters <- list(h = "\u2500" , v = "\u2502", l = "\u2514", j = "\u251C")
print(osNode, "users")
##              levelName users
## 1  OS Students 2014/15    NA
## 2  ├─OS X                 NA
## 3  │├─Yosemite            NA
## 4  │└─Leopard             NA
## 5  ├─Linux                NA
## 6  │├─Debian              NA
## 7  │└─Ubuntu              NA
## 8  └─Windows              NA
## 9  ├─W7                   NA
## 10 ├─W8                   NA
## 11 └─W10                  NA

Node methods

As seen above, a data.tree structure is composed of Node objects, and the entry point to a data.tree structure is always a Node, often the root Node of a tree.

There are different types of methods:

  • OO-style actives (sometimes called properties) on Nodes, such as e.g. Node$isRoot
  • OO-style methods on Nodes, such as e.g. Node$AddChild(name)
  • Classical R methods, such as e.g. Clone(node).

Actives Examples (aka Properties)

Actives look and feel like attributes, but they are dynamically evaluated. They are documented in the Node documentation, which is accessed by typing ?Node.

Remember our population example:

print(population, limit = 15)
##                        levelName
## 1  world                        
## 2   ¦--North America            
## 3   ¦   ¦--Bermuda              
## 4   ¦   ¦--United States        
## 5   ¦   ¦--Canada               
## 6   ¦   ¦--Bahamas, The         
## 7   ¦   ¦--Trinidad and Tobago  
## 8   ¦   ¦--Puerto Rico          
## 9   ¦   ¦--Barbados             
## 10  ¦   ¦--St. Kitts and Nevis  
## 11  ¦   ¦--Antigua and Barbuda  
## 12  ¦   ¦--Panama               
## 13  ¦   ¦--Costa Rica           
## 14  ¦   ¦--Mexico               
## 15  ¦   °--... 12 nodes w/ 0 sub
## 16  °--... 6 nodes w/ 176 sub
population$isRoot
## [1] TRUE
population$height
## [1] 3
population$count
## [1] 7
population$totalCount
## [1] 196
population$attributes
## character(0)
population$attributesAll
## [1] "GNI"        "continent"  "country"    "iso3"       "population"
population$averageBranchingFactor
## [1] 24.375

The naming convention of the package is that attributes and actives are lower case, whereas methods are upper / CamelCase. RStudio and other IDEs work well with data.tree. If you have a Node, simply type myNode$ + SPACE to get a list of available attributes, actives and methods.

OO-Style Methods Examples

Examples of OO-Style methods

You will find more information on these examples below.

Get will traverse the tree and collect specific values for the Nodes it traverses:

sum(population$Get("population", filterFun = isLeaf))
## [1] 6683146875

Prune traverses the tree and keeps only the subtrees for which the pruneFun returns TRUE.

Prune(population, pruneFun = function(x) !x$isLeaf || x$population > 1000000)
## [1] 39

Note that the Prune function has side-effects, as it acts on the original population object. The population sum is now smaller:

sum(population$Get("population", filterFun = isLeaf), na.rm = TRUE)
## [1] 6669737814

Traditional R Methods

popClone <- Clone(acme)

Traditional S3 generics are available especially for conversion:

as.data.frame(acme)
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

Though there is also a more specialised non-generic version:

ToDataFrameNetwork(acme)
##          from                       to
## 1   Acme Inc.               Accounting
## 2   Acme Inc.                 Research
## 3   Acme Inc.                       IT
## 4  Accounting             New Software
## 5  Accounting New Accounting Standards
## 6    Research         New Product Line
## 7    Research                 New Labs
## 8          IT                Outsource
## 9          IT                 Go agile
## 10         IT              Switch to R

Climbing a tree (tree navigation)

To climb a tree means to navigate to a specific Node in the data.tree structure.

Custom attributes

Just as with, say, a list, we can add any custom field to any Node in a data.tree structure. Let’s go back to our acme company:

acme
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

We now add costs and probabilities to the projects in each department:

acme$Accounting$`New Software`$cost <- 1000000
acme$Accounting$`New Accounting Standards`$cost <- 500000
acme$Research$`New Product Line`$cost <- 2000000
acme$Research$`New Labs`$cost <- 750000
acme$IT$Outsource$cost <- 400000
acme$IT$`Go agile`$cost <- 250000
acme$IT$`Switch to R`$cost <- 50000

acme$Accounting$`New Software`$p <- 0.5
acme$Accounting$`New Accounting Standards`$p <- 0.75
acme$Research$`New Product Line`$p <- 0.25
acme$Research$`New Labs`$p <- 0.9
acme$IT$Outsource$p <- 0.2
acme$IT$`Go agile`$p <- 0.05
acme$IT$`Switch to R`$p <- 1
print(acme, "cost", "p")
##                           levelName    cost    p
## 1  Acme Inc.                             NA   NA
## 2   ¦--Accounting                        NA   NA
## 3   ¦   ¦--New Software             1000000 0.50
## 4   ¦   °--New Accounting Standards  500000 0.75
## 5   ¦--Research                          NA   NA
## 6   ¦   ¦--New Product Line         2000000 0.25
## 7   ¦   °--New Labs                  750000 0.90
## 8   °--IT                                NA   NA
## 9       ¦--Outsource                 400000 0.20
## 10      ¦--Go agile                  250000 0.05
## 11      °--Switch to R                50000 1.00

Note that there is a list of reserved names you cannot use as Node attributes:

NODE_RESERVED_NAMES_CONST
##  [1] "AddChild"               "AddChildNode"           "AddSibling"            
##  [4] "AddSiblingNode"         "attributes"             "attributesAll"         
##  [7] "averageBranchingFactor" "children"               "Climb"                 
## [10] "Navigate"               "FindNode"               "clone"                 
## [13] "count"                  "Do"                     "fields"                
## [16] "fieldsAll"              "Get"                    "GetAttribute"          
## [19] "height"                 "initialize"             "isBinary"              
## [22] "isLeaf"                 "isRoot"                 "leafCount"             
## [25] "leaves"                 "level"                  "levelName"             
## [28] "name"                   "parent"                 "path"                  
## [31] "pathString"             "position"               "printFormatters"       
## [34] "Prune"                  "Revert"                 "RemoveAttribute"       
## [37] "RemoveChild"            "root"                   "Set"                   
## [40] "siblings"               "Sort"                   "totalCount"            
## [43] ".*"

Custom attributes in constructor

An alternative, often convenient way to assign custom attributes is in the constructor, or in the Node$AddChild method:

birds <- Node$new("Aves", vulgo = "Bird")
birds$AddChild("Neognathae", vulgo = "New Jaws", species = 10000)
birds$AddChild("Palaeognathae", vulgo = "Old Jaws", species = 60)
print(birds, "vulgo", "species")
##           levelName    vulgo species
## 1 Aves                  Bird      NA
## 2  ¦--Neognathae    New Jaws   10000
## 3  °--Palaeognathae Old Jaws      60

Custom attributes as function

Nothing stops you from setting a function as a field. This calculates a value dynamically, i.e. whenever a field is accessed in tree traversal. For example, you can add a new Node to your structure, and the function will reflect this. Think of this as a hierarchical spreadsheet, in which you can set formulas into cells.

Consider the following example:

birds$species <- function(self) sum(sapply(self$children, function(x) x$species))
print(birds, "species")
##           levelName species
## 1 Aves                10060
## 2  ¦--Neognathae      10000
## 3  °--Palaeognathae      60

data.tree maps the self argument to the Node at hand. Thus, you must name the argument self.

Now, let’s assume we discover a new species. Then, the species on the root adjusts dynamically:

birds$Palaeognathae$species <- 61
print(birds, "species")
##           levelName species
## 1 Aves                10061
## 2  ¦--Neognathae      10000
## 3  °--Palaeognathae      61

This, together with the Set method and recursion, becomes a very powerful tool, as we’ll see later.

Printing

Basic Printing

Basic printing is easy, as you surely have noted in the previous sections. print displays a tree in a tree-grid view. On the left, you have the hierarchy. Then you have a column per variable you want to print:

print(acme, "cost", "p")
##                           levelName    cost    p
## 1  Acme Inc.                             NA   NA
## 2   ¦--Accounting                        NA   NA
## 3   ¦   ¦--New Software             1000000 0.50
## 4   ¦   °--New Accounting Standards  500000 0.75
## 5   ¦--Research                          NA   NA
## 6   ¦   ¦--New Product Line         2000000 0.25
## 7   ¦   °--New Labs                  750000 0.90
## 8   °--IT                                NA   NA
## 9       ¦--Outsource                 400000 0.20
## 10      ¦--Go agile                  250000 0.05
## 11      °--Switch to R                50000 1.00

For more advanced printing, you have a few options.

Formatters

You can use formatters to output a variable in a certain way. You can use formatters in two ways:

  • You can set them on a Node using the SetFormat method. If you do this, then the formatter will be picked up as a default formatter whenever you print, Get, convert to data.frame, etc. Formatters can be set on any Node in a data.tree structure act on any descendant. So you can overwrite a formatter for a sub-tree.
  • You can add an explicit ad-hoc formatter to the Get method (see below). This will overwrite default formatters previously set via the SetFormat method. You can also set the formatter to identity to void a default formatter.

Setting a formatter using the SetFormat method:

SetFormat(acme, "p", formatFun = FormatPercent)
SetFormat(acme, "cost", formatFun = function(x) FormatFixedDecimal(x, digits = 2))
print(acme, "cost", "p")
##                           levelName       cost        p
## 1  Acme Inc.                                           
## 2   ¦--Accounting                                      
## 3   ¦   ¦--New Software             1000000.00  50.00 %
## 4   ¦   °--New Accounting Standards  500000.00  75.00 %
## 5   ¦--Research                                        
## 6   ¦   ¦--New Product Line         2000000.00  25.00 %
## 7   ¦   °--New Labs                  750000.00  90.00 %
## 8   °--IT                                              
## 9       ¦--Outsource                 400000.00  20.00 %
## 10      ¦--Go agile                  250000.00   5.00 %
## 11      °--Switch to R                50000.00 100.00 %

Printing using Get

Formatting with the Get method overwrites any formatters found along the path:

data.frame(cost = acme$Get("cost", format = function(x) FormatFixedDecimal(x, 2)),
           p = acme$Get("p", format = FormatPercent))
##                                cost        p
## Acme Inc.                                   
## Accounting                                  
## New Software             1000000.00  50.00 %
## New Accounting Standards  500000.00  75.00 %
## Research                                    
## New Product Line         2000000.00  25.00 %
## New Labs                  750000.00  90.00 %
## IT                                          
## Outsource                 400000.00  20.00 %
## Go agile                  250000.00   5.00 %
## Switch to R                50000.00 100.00 %

Plotting

plot

data.tree is mainly a data structure. As it is easy to convert data.tree structures to other formats, you have access to a large number of tools to plot a data.tree structure. For example, you can plot a data.tree structure as a dendrogram, as an ape tree, as a treeview, etc. Additionally, data.tree also provides its own plotting facility. It is built on GraphViz/DiagrammeR, and you can access these features via the plot and ToGraphViz functions. Note that DiagrammeR is not required to use data.tree, so plot only works if DiagrammeR is installed on your system. For example:

plot(acme)
acme
acme

Styling

Similar to formatters for printing, you can style your tree and store the styling directly in the tree, for later use:

SetGraphStyle(acme, rankdir = "TB")
SetEdgeStyle(acme, arrowhead = "vee", color = "grey35", penwidth = 2)
SetNodeStyle(acme, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow", 
            fontname = "helvetica", tooltip = GetDefaultTooltip)
SetNodeStyle(acme$IT, fillcolor = "LightBlue", penwidth = "5px")
plot(acme)
acme
acme

For details on the styling attributes, see https://graphviz.org/Documentation.php .

Note that, by default, most Node style attributes will be inherited. Though, for example, label will not be inherited. However, inheritance can be avoided for all style attributes, as for the Accounting node in the following example:

SetNodeStyle(acme$Accounting, inherit = FALSE, fillcolor = "Thistle", 
             fontcolor = "Firebrick", tooltip = "This is the accounting department")
plot(acme)