# Introduction to cytometree

cytometree is a package that implements a binary tree algorithm for the analysis of cytometry data.

Its core algorithm is based on the construction of a binary tree, whose nodes represents cell subpopulations.

## Binary tree construction

1. At each node, observed cells (or “events”) and markers are modeled by both a normal distribution (so unimodal), and a mixture of 2 normal distributions (so bimodal).

2. Splitting of the events at each node is done according to a normalized difference of AIC between the two distributional fit (unimodal or bimodal), allowing to pick the marker that best splits those data.

3. When AIC normalized differences $$D$$ are not significant anymore, the tree has been constructed, and the cells have been automatically gated (i.e. partitioned).

## Post-hoc annotation

Given the unsupervised nature of the binary tree, some of the available markers may not be used to find the different cell populations present in a given sample. To recover a complete annotation, we defined, as a post processing procedure, an annotation method which allows the user to distinguish two (or three) expression levels per marker.

# Examples of analysis with cytometree

## Diffuse large B-cell lymphoma dataset with 3 markers

In this example, we will use a diffuse large B-cell lymphoma dataset (from the FlowCAP-I challenge, with only 3 markers.

### Automatic gating step

First, we need to load the package cytometree and we can have a look at the data:

## [1] 5524    4
##   FL1 FL2 FL4 label
## 1 416 251 293     2
## 2 210  93 184     2
## 3 387 300 233     2
## 4 403 438 288     2
## 5 399 128 116     2
## 6 538 288 189     2

We have 3 markers measured FL1, FL2, FL4 as well as the label obtained from manual gating, and 5524 cells.

The function CytomeTree builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels attribute):

One can fiddle with the t threshold to change the desired depth of the tree.

The function plot_graph plots a graph representing this binary tree, showing which markers were used in which order to splits the events:

The function plot_nodes allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node:

### Post-processing annotation of automatically gated subpopulations

The function Annotation annotates the subpopulations partitioned by the binary tree:

##   FL1 FL2 FL4 leaves count   prop
## 1   0   1   0      1  4905 0.8879
## 2   0   0   1      2   590 0.1068
## 3   0   1   1      3    25 0.0045
## 4   1   1   1      4     4 0.0007

The post-processing annotation can be compared to the incomplete annotation obtained from the tree alone:

##   FL1 FL2 FL4 labels count   prop
## 1  NA  NA   0      1  4905 0.8879
## 2  NA   0   1      2   590 0.1068
## 3   0   1   1      3    25 0.0045
## 4   1   1   1      4     4 0.0007

This completed annotation eases the search for specific cell types of interest, where 1 represents a positive measurement of a marker, while 0 corresponds to its absence:

## [[1]]
## [[1]]$phenotype ## [1] "FL2-1" "FL4-1" ## ## [[1]]$proportion
## [1] 0.0052
##
## [[1]]$Mergedlabels ## [1] 3 4 ## ## [[1]]$Newlabel
## [1] 5
## [[1]]
## [[1]]$phenotype ## [1] "FL2-1" "FL4-0" ## ## [[1]]$proportion
## [1] 0.8879
##
## [[1]]$label ## [1] 1 ## ## ## [[2]] ## [[2]]$phenotype
## [1] "FL2-0" "FL4-1"
##
## [[2]]$proportion ## [1] 0.1068 ## ## [[2]]$label
## [1] 2

### Comparison between manual and automatic gating

Finally, we can compare the automatic gating obtained using cytometree to the original manual gating (which had 47 events inconsistently gated across different operators performing the manual gating on those very same data - so those 47 cells were identified as outliers in the reference label and were kept out of the evaluation criteria in the FlowCAP-I challenge):

## PBMC sample from the Human Immunology Project

In this example, we will use a dataset from the HIPC (Human Immunology Project Consortium program where a PBMC sample from patient 12828 was analyzed by Stanford.

### Automatic gating step

First, we need to load the package cytometree and we can have a look at the data:

## [1] 33992     7
##           CCR7       CD4   CD45RA      HLADR      CD38      CD8 label
## [1,] 1561.2429 1052.6527 3169.944  889.99200 1326.2701 2733.810     1
## [2,] 1362.2820  978.1219 2826.469 1810.43335 1294.4667 3054.555     1
## [3,] 1321.3787 1180.9497 2780.128  -57.02759  211.0515 2450.657     1
## [4,]  779.8231 1097.8448 3054.914  -67.70583  579.6187 1898.014     1
## [5,]  956.2596  942.8704 2612.188   60.35059 1776.3430 3262.799     1
## [6,] 1100.1621 1080.4724 2560.610  702.39624 1179.5389 2836.238     1

We have 6 markers measured CCR7, CD4, CD45RA, HLADR, CD38, CD8 as well as the label obtained from manual gating, and 33992 cell.

The function CytomeTree builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting labels attribute):

One can fiddle with the t threshold to change the desired depth of the tree.

The function plot_graph plots a graph representing this binary tree, showing which markers were used in which order to splits the events:

The function plot_nodes allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node:

### Post-processing annotation of automatically gated subpopulations

The function Annotation annotates the subpopulations partitioned by the binary tree:

##    CCR7 CD4 CD45RA HLADR CD38 CD8 leaves count   prop
## 5     1   1      1     0    1   0      5 12124 0.3567
## 4     1   1      0     0    0   0      4  8218 0.2418
## 3     1   0      1     0    1   1      3  4588 0.1350
## 8     0   1      0     0    0   0      8  4071 0.1198
## 1     0   0      0     0    0   1      1  3138 0.0923
## 2     0   0      1     0    0   1      2   830 0.0244
## 9     0   1      0     0    1   0      9   475 0.0140
## 6     0   0      0     1    1   1      6   193 0.0057
## 10    0   1      0     1    1   0     10   142 0.0042
## 7     1   0      0     1    1   1      7   102 0.0030
## 11    0   1      0     1    0   0     11    83 0.0024
## 12    0   1      1     1    0   0     12    28 0.0008

This completed annotation eases the search for specific cell types of interest, where 1 represents a positive measurement of a marker, while 0 corresponds to its absence:

## [[1]]
## [[1]]$phenotype ## [1] "CD8-0" "CD4-1" "CCR7-0" "CD45RA-1" ## ## [[1]]$proportion
## [1] 8e-04
##
## [[1]]$label ## [1] 12 ## ## ## [[2]] ## [[2]]$phenotype
## [1] "CD8-0"    "CD4-1"    "CCR7-1"   "CD45RA-1"
##
## [[2]]$proportion ## [1] 0.3567 ## ## [[2]]$label
## [1] 5
##
##
## [[3]]
## [[3]]$phenotype ## [1] "CD8-0" "CD4-1" "CCR7-1" "CD45RA-0" ## ## [[3]]$proportion
## [1] 0.2418
##
## [[3]]$label ## [1] 4 ## ## ## [[4]] ## [[4]]$phenotype
## [1] "CD8-0"    "CD4-1"    "CCR7-0"   "CD45RA-0"
##
## [[4]]$proportion ## [1] 0.1404 ## ## [[4]]$Mergedlabels
## [1]  8  9 10 11
##
## [[4]]\$Newlabel
## [1] 13

## Semi-supervised gating

It is possible to force the markers which will be used to gate the data first, using the force_first_markers argument of the CytomeTree() function. Once those forced markers are used, automatic behavior of the algorithm is resumed for the remaining markers.