Trees are ubiquitous in mathematics, computer science, data sciences, finance, and in many other attributes. Trees are especially useful when we are facing hierarchical data. For example, trees are used:
For more details, see the applications vignette by typing
vignette("applications", package = "data.tree")
Tree-like structures are already used in R. For example, environments
can be seen as nodes in a tree. And CRAN provides numerous packages that
deal with tree-like structures, especially in the area of decision
theory. Yet, there is no general purpose hierarchical data structure
that could be used as conveniently and generically as, say,
data.frame
.
As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames. But often, hierarchies don’t marry with tables, and various workarounds are usually required.
data.tree
This package offers an alternative. The data.tree
package lets you create hierarchies, called data.tree
structures. The building block of theses structures are
Node
objects. The package provides basic traversal, search,
and sort operations, and an infrastructure for recursive tree
programming. You can decorate Nodes
with your own
attributes and methods, so as to extend the package to your needs.
The package also provides convenience methods for neatly printing and
plotting trees. It supports conversion from and to
data.frames
, lists
, and other tree structures
such as dendrogram
, phylo
objects from the ape
package, igraph
, and other packages.
Technically, data.tree
structures are bi-directional,
ordered trees. Bi-directional means that you can navigate from parent to
children and vice versa. Ordered means that the sort order of the
children of a parent node is well-defined.
data.tree
basicsdata.tree
structure: a tree,
consisting of multiple Node
objects. Often, the entry point
to a data.tree
structure is the root NodeNode
: both a class and the basic
building block of data.tree
structures?attr
,
which have a different meaning. Many methods and functions have an
attribute
arg, which can refer to a an active, a field or a
method. For example, see ?Get
Node
that can be called like an attribute, but behaves like
a function without arguments. For example:
node$position
Node
,
e.g. node$cost <- 2500
Node
in this context). Many methods are available in OO
style (e.g. node$Revert()
) or in traditional style
(Revert(node)
)Node
inherits e.g. an
attribute from one of its ancestors. For example, see ?Get
,
?SetNodeStyle
There are different ways to create a data.tree
structure. For example, you can create a tree
programmatically, by conversion from
other R objects, or from a file.
Let’s start by creating a tree programmatically. We do this by
creating Node
objects, and linking them together so as to
define the parent-child relationships.
In this example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 1) is the company. On level 2, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:
library(data.tree)
acme <- Node$new("Acme Inc.")
accounting <- acme$AddChild("Accounting")
software <- accounting$AddChild("New Software")
standards <- accounting$AddChild("New Accounting Standards")
research <- acme$AddChild("Research")
newProductLine <- research$AddChild("New Product Line")
newLabs <- research$AddChild("New Labs")
it <- acme$AddChild("IT")
outsource <- it$AddChild("Outsource")
agile <- it$AddChild("Go agile")
goToR <- it$AddChild("Switch to R")
print(acme)
## levelName
## 1 Acme Inc.
## 2 ¦--Accounting
## 3 ¦ ¦--New Software
## 4 ¦ °--New Accounting Standards
## 5 ¦--Research
## 6 ¦ ¦--New Product Line
## 7 ¦ °--New Labs
## 8 °--IT
## 9 ¦--Outsource
## 10 ¦--Go agile
## 11 °--Switch to R
As you can see from the previous example, each Node
is
identified by its name, i.e. the argument you pass into the
Node$new(name)
constructor. The name needs to be
unique among siblings, such that paths to Nodes
are unambiguous.
Node
inherits from R6
reference class. This
has the following implications:
Node
in OO style,
e.g. acme$Get("name")
Node
exhibits reference semantics. Thus,
multiple variables in R can point to the same Node
, and
modifying a Node
will modify it for all referencing
variables. In the above code example, both acme$IT
and
it
reference the same object. This is different from the
value semantics, which is much more widely used in R.data.frame
Creating a tree programmatically is useful especially in the context
of algorithms. However, most times you will create a tree by conversion.
This could be by conversion from a nested list-of-lists, by conversion
from another R tree-structure (e.g. an ape phylo
), or by
conversion from a data.frame
. For more details on all the
options, type ?as.Node
and refer to the See Also
section.
One of the most common conversions is the one from a
data.frame
in table format. The following code illustrates
this. We load the GNI2014 data from the treemap package. This
data.frame
is in table format, meaning that each row will
represent a leaf in the data.tree
structure:
library(treemap)
data(GNI2014)
head(GNI2014)
## iso3 country continent population GNI
## 3 BMU Bermuda North America 67837 106140
## 4 NOR Norway Europe 4676305 103630
## 5 QAT Qatar Asia 833285 92200
## 6 CHE Switzerland Europe 7604467 88120
## 7 MAC Macao SAR, China Asia 559846 76270
## 8 LUX Luxembourg Europe 491775 75990
Let’s convert that into a data.tree
structure! We start
by defining a pathString. The pathString describes the
hierarchy by defining a path from the root to each leaf. In this
example, the hierarchy comes very naturally:
GNI2014$pathString <- paste("world",
GNI2014$continent,
GNI2014$country,
sep = "/")
Once our pathString is defined, conversion to Node is very easy:
population <- as.Node(GNI2014)
print(population, "iso3", "population", "GNI", limit = 20)
## levelName iso3 population GNI
## 1 world NA NA
## 2 ¦--North America NA NA
## 3 ¦ ¦--Bermuda BMU 67837 106140
## 4 ¦ ¦--United States USA 313973000 55200
## 5 ¦ ¦--Canada CAN 33487208 51630
## 6 ¦ ¦--Bahamas, The BHS 309156 20980
## 7 ¦ ¦--Trinidad and Tobago TTO 1310000 20070
## 8 ¦ ¦--Puerto Rico PRI 3971020 19310
## 9 ¦ ¦--Barbados BRB 284589 15310
## 10 ¦ ¦--St. Kitts and Nevis KNA 40131 14920
## 11 ¦ ¦--Antigua and Barbuda ATG 85632 13300
## 12 ¦ ¦--Panama PAN 3360474 11130
## 13 ¦ ¦--Costa Rica CRI 4253877 10120
## 14 ¦ ¦--Mexico MEX 111211789 9870
## 15 ¦ ¦--Grenada GRD 90739 7910
## 16 ¦ ¦--St. Lucia LCA 160267 7260
## 17 ¦ ¦--Dominica DMA 72660 6930
## 18 ¦ ¦--St. Vincent and the Grenadines VCT 104574 6610
## 19 ¦ ¦--Dominican Republic DOM 9650054 6040
## 20 ¦ °--... 7 nodes w/ 0 sub NA NA
## 21 °--... 6 nodes w/ 171 sub NA NA
This is a simple example, and more options are available. Type
?FromDataFrameTable
for all the details.
Often, trees are created from one of many file formats. When
developing this package, We opted for a multi-step approach, meaning
that you first import the file into one of the well-known R data
structures. Then you convert these into a data.tree
structure. For example, typical import patterns could be:
?read.csv
) ->
data.tree (?as.Node.data.frame
)?ape::read.tree
) ->
data.tree (?as.Node.phylo
)?read.csv
)
-> data.tree (c.f. ?FromDataFrameNetwork
)?yaml::yaml.load
) ->
data.tree (?as.Node.list
)?jsonlite::fromJSON
)
-> data.tree (?as.Node.list
)If you have a choice, we recommend you consider yaml format to store and share your hierarchies. It is concise, human-readable, and very easy to convert to a data.tree. An example is provided here for illustration. The data represents what platforms and OS versions a group of students use:
library(yaml)
yaml <- "
name: OS Students 2014/15
OS X:
Yosemite:
users: 16
Leopard:
users: 43
Linux:
Debian:
users: 27
Ubuntu:
users: 36
Windows:
W7:
users: 31
W8:
users: 32
W10:
users: 4
"
osList <- yaml.load(yaml)
osNode <- as.Node(osList)
print(osNode, "users")
## levelName users
## 1 OS Students 2014/15 NA
## 2 ¦--OS X NA
## 3 ¦ ¦--Yosemite 16
## 4 ¦ °--Leopard 43
## 5 ¦--Linux NA
## 6 ¦ ¦--Debian 27
## 7 ¦ °--Ubuntu 36
## 8 °--Windows NA
## 9 ¦--W7 31
## 10 ¦--W8 32
## 11 °--W10 4
In cases where your leaf elements have no attributes, you might want
to interpret them as nodes, and not as attributes. In such cases, you
can use interpretNullAsList = TRUE
to convert these into
Nodes
(instead of attributes).
For example:
library(yaml)
yaml <- "
name: OS Students 2014/15
OS X:
Yosemite:
Leopard:
Linux:
Debian:
Ubuntu:
Windows:
W7:
W8:
W10:
"
osList <- yaml.load(yaml)
osNode <- as.Node(osList, interpretNullAsList = TRUE)
osNode$printFormatters <- list(h = "\u2500" , v = "\u2502", l = "\u2514", j = "\u251C")
print(osNode, "users")
## levelName users
## 1 OS Students 2014/15 NA
## 2 ├─OS X NA
## 3 │├─Yosemite NA
## 4 │└─Leopard NA
## 5 ├─Linux NA
## 6 │├─Debian NA
## 7 │└─Ubuntu NA
## 8 └─Windows NA
## 9 ├─W7 NA
## 10 ├─W8 NA
## 11 └─W10 NA
As seen above, a data.tree
structure is composed of
Node
objects, and the entry point to a
data.tree
structure is always a Node
, often
the root Node
of a tree.
There are different types of methods:
Nodes
, such as e.g. Node$isRoot
Nodes
, such as
e.g. Node$AddChild(name)
Clone(node)
.Actives look and feel like attributes, but they are dynamically
evaluated. They are documented in the Node
documentation,
which is accessed by typing ?Node
.
Remember our population example:
print(population, limit = 15)
## levelName
## 1 world
## 2 ¦--North America
## 3 ¦ ¦--Bermuda
## 4 ¦ ¦--United States
## 5 ¦ ¦--Canada
## 6 ¦ ¦--Bahamas, The
## 7 ¦ ¦--Trinidad and Tobago
## 8 ¦ ¦--Puerto Rico
## 9 ¦ ¦--Barbados
## 10 ¦ ¦--St. Kitts and Nevis
## 11 ¦ ¦--Antigua and Barbuda
## 12 ¦ ¦--Panama
## 13 ¦ ¦--Costa Rica
## 14 ¦ ¦--Mexico
## 15 ¦ °--... 12 nodes w/ 0 sub
## 16 °--... 6 nodes w/ 176 sub
population$isRoot
## [1] TRUE
population$height
## [1] 3
population$count
## [1] 7
population$totalCount
## [1] 196
population$attributes
## character(0)
population$attributesAll
## [1] "GNI" "continent" "country" "iso3" "population"
population$averageBranchingFactor
## [1] 24.375
The naming convention of the package is that attributes and actives
are lower case, whereas methods are upper / CamelCase. RStudio and other
IDEs work well with data.tree
. If you have a
Node
, simply type myNode$ + SPACE
to get a
list of available attributes, actives and methods.
Examples of OO-Style methods
You will find more information on these examples below.
Get will traverse the tree and collect specific values for the
Nodes
it traverses:
sum(population$Get("population", filterFun = isLeaf))
## [1] 6683146875
Prune traverses the tree and keeps only the subtrees for which the pruneFun returns TRUE.
Prune(population, pruneFun = function(x) !x$isLeaf || x$population > 1000000)
## [1] 39
Note that the Prune function has side-effects, as it acts on the original population object. The population sum is now smaller:
sum(population$Get("population", filterFun = isLeaf), na.rm = TRUE)
## [1] 6669737814
popClone <- Clone(acme)
Traditional S3 generics are available especially for conversion:
as.data.frame(acme)
## levelName
## 1 Acme Inc.
## 2 ¦--Accounting
## 3 ¦ ¦--New Software
## 4 ¦ °--New Accounting Standards
## 5 ¦--Research
## 6 ¦ ¦--New Product Line
## 7 ¦ °--New Labs
## 8 °--IT
## 9 ¦--Outsource
## 10 ¦--Go agile
## 11 °--Switch to R
Though there is also a more specialised non-generic version:
ToDataFrameNetwork(acme)
## from to
## 1 Acme Inc. Accounting
## 2 Acme Inc. Research
## 3 Acme Inc. IT
## 4 Accounting New Software
## 5 Accounting New Accounting Standards
## 6 Research New Product Line
## 7 Research New Labs
## 8 IT Outsource
## 9 IT Go agile
## 10 IT Switch to R
Just as with, say, a list
, we can add any custom field
to any Node
in a data.tree
structure. Let’s go
back to our acme company:
acme
## levelName
## 1 Acme Inc.
## 2 ¦--Accounting
## 3 ¦ ¦--New Software
## 4 ¦ °--New Accounting Standards
## 5 ¦--Research
## 6 ¦ ¦--New Product Line
## 7 ¦ °--New Labs
## 8 °--IT
## 9 ¦--Outsource
## 10 ¦--Go agile
## 11 °--Switch to R
We now add costs and probabilities to the projects in each department:
acme$Accounting$`New Software`$cost <- 1000000
acme$Accounting$`New Accounting Standards`$cost <- 500000
acme$Research$`New Product Line`$cost <- 2000000
acme$Research$`New Labs`$cost <- 750000
acme$IT$Outsource$cost <- 400000
acme$IT$`Go agile`$cost <- 250000
acme$IT$`Switch to R`$cost <- 50000
acme$Accounting$`New Software`$p <- 0.5
acme$Accounting$`New Accounting Standards`$p <- 0.75
acme$Research$`New Product Line`$p <- 0.25
acme$Research$`New Labs`$p <- 0.9
acme$IT$Outsource$p <- 0.2
acme$IT$`Go agile`$p <- 0.05
acme$IT$`Switch to R`$p <- 1
print(acme, "cost", "p")
## levelName cost p
## 1 Acme Inc. NA NA
## 2 ¦--Accounting NA NA
## 3 ¦ ¦--New Software 1000000 0.50
## 4 ¦ °--New Accounting Standards 500000 0.75
## 5 ¦--Research NA NA
## 6 ¦ ¦--New Product Line 2000000 0.25
## 7 ¦ °--New Labs 750000 0.90
## 8 °--IT NA NA
## 9 ¦--Outsource 400000 0.20
## 10 ¦--Go agile 250000 0.05
## 11 °--Switch to R 50000 1.00
Note that there is a list of reserved names you cannot use as
Node
attributes:
NODE_RESERVED_NAMES_CONST
## [1] "AddChild" "AddChildNode" "AddSibling"
## [4] "AddSiblingNode" "attributes" "attributesAll"
## [7] "averageBranchingFactor" "children" "Climb"
## [10] "Navigate" "FindNode" "clone"
## [13] "count" "Do" "fields"
## [16] "fieldsAll" "Get" "GetAttribute"
## [19] "height" "initialize" "isBinary"
## [22] "isLeaf" "isRoot" "leafCount"
## [25] "leaves" "level" "levelName"
## [28] "name" "parent" "path"
## [31] "pathString" "position" "printFormatters"
## [34] "Prune" "Revert" "RemoveAttribute"
## [37] "RemoveChild" "root" "Set"
## [40] "siblings" "Sort" "totalCount"
## [43] ".*"
An alternative, often convenient way to assign custom attributes is
in the constructor, or in the Node$AddChild
method:
birds <- Node$new("Aves", vulgo = "Bird")
birds$AddChild("Neognathae", vulgo = "New Jaws", species = 10000)
birds$AddChild("Palaeognathae", vulgo = "Old Jaws", species = 60)
print(birds, "vulgo", "species")
## levelName vulgo species
## 1 Aves Bird NA
## 2 ¦--Neognathae New Jaws 10000
## 3 °--Palaeognathae Old Jaws 60
Nothing stops you from setting a function as a field. This calculates
a value dynamically, i.e. whenever a field is accessed in tree
traversal. For example, you can add a new Node
to your
structure, and the function will reflect this. Think of this as a
hierarchical spreadsheet, in which you can set formulas into cells.
Consider the following example:
birds$species <- function(self) sum(sapply(self$children, function(x) x$species))
print(birds, "species")
## levelName species
## 1 Aves 10060
## 2 ¦--Neognathae 10000
## 3 °--Palaeognathae 60
data.tree maps the self
argument to the
Node
at hand. Thus, you must name the argument
self
.
Now, let’s assume we discover a new species. Then, the species on the root adjusts dynamically:
birds$Palaeognathae$species <- 61
print(birds, "species")
## levelName species
## 1 Aves 10061
## 2 ¦--Neognathae 10000
## 3 °--Palaeognathae 61
This, together with the Set
method and recursion,
becomes a very powerful tool, as we’ll see later.
Basic printing is easy, as you surely have noted in the previous
sections. print
displays a tree in a tree-grid view. On the
left, you have the hierarchy. Then you have a column per variable you
want to print:
print(acme, "cost", "p")
## levelName cost p
## 1 Acme Inc. NA NA
## 2 ¦--Accounting NA NA
## 3 ¦ ¦--New Software 1000000 0.50
## 4 ¦ °--New Accounting Standards 500000 0.75
## 5 ¦--Research NA NA
## 6 ¦ ¦--New Product Line 2000000 0.25
## 7 ¦ °--New Labs 750000 0.90
## 8 °--IT NA NA
## 9 ¦--Outsource 400000 0.20
## 10 ¦--Go agile 250000 0.05
## 11 °--Switch to R 50000 1.00
For more advanced printing, you have a few options.
You can use formatters to output a variable in a certain way. You can use formatters in two ways:
Node
using the
SetFormat
method. If you do this, then the formatter will
be picked up as a default formatter whenever you print
,
Get
, convert to data.frame
, etc. Formatters
can be set on any Node
in a data.tree
structure act on any descendant. So you can overwrite a formatter for a
sub-tree.Get
method (see below). This will overwrite default formatters previously
set via the SetFormat
method. You can also set the
formatter to identity
to void a default formatter.Setting a formatter using the SetFormat
method:
SetFormat(acme, "p", formatFun = FormatPercent)
SetFormat(acme, "cost", formatFun = function(x) FormatFixedDecimal(x, digits = 2))
print(acme, "cost", "p")
## levelName cost p
## 1 Acme Inc.
## 2 ¦--Accounting
## 3 ¦ ¦--New Software 1000000.00 50.00 %
## 4 ¦ °--New Accounting Standards 500000.00 75.00 %
## 5 ¦--Research
## 6 ¦ ¦--New Product Line 2000000.00 25.00 %
## 7 ¦ °--New Labs 750000.00 90.00 %
## 8 °--IT
## 9 ¦--Outsource 400000.00 20.00 %
## 10 ¦--Go agile 250000.00 5.00 %
## 11 °--Switch to R 50000.00 100.00 %
Get
Formatting with the Get
method overwrites any formatters
found along the path:
data.frame(cost = acme$Get("cost", format = function(x) FormatFixedDecimal(x, 2)),
p = acme$Get("p", format = FormatPercent))
## cost p
## Acme Inc.
## Accounting
## New Software 1000000.00 50.00 %
## New Accounting Standards 500000.00 75.00 %
## Research
## New Product Line 2000000.00 25.00 %
## New Labs 750000.00 90.00 %
## IT
## Outsource 400000.00 20.00 %
## Go agile 250000.00 5.00 %
## Switch to R 50000.00 100.00 %
plot
data.tree
is mainly a data structure. As it is easy to
convert data.tree
structures to other formats, you have
access to a large number of tools to plot a data.tree
structure. For example, you can plot a data.tree
structure
as a dendrogram, as an ape tree, as a treeview, etc. Additionally,
data.tree
also provides its own plotting facility. It is
built on GraphViz/DiagrammeR, and you can access these features via the
plot
and ToGraphViz
functions. Note that
DiagrammeR is not required to use data.tree, so plot
only
works if DiagrammeR is installed on your system. For example:
plot(acme)
Similar to formatters for printing, you can style your tree and store the styling directly in the tree, for later use:
SetGraphStyle(acme, rankdir = "TB")
SetEdgeStyle(acme, arrowhead = "vee", color = "grey35", penwidth = 2)
SetNodeStyle(acme, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow",
fontname = "helvetica", tooltip = GetDefaultTooltip)
SetNodeStyle(acme$IT, fillcolor = "LightBlue", penwidth = "5px")
plot(acme)
For details on the styling attributes, see https://graphviz.org/Documentation.php .
Note that, by default, most Node style attributes will be inherited.
Though, for example, label
will not be inherited. However,
inheritance can be avoided for all style attributes, as for the
Accounting node in the following example:
SetNodeStyle(acme$Accounting, inherit = FALSE, fillcolor = "Thistle",
fontcolor = "Firebrick", tooltip = "This is the accounting department")
plot(acme)