ClusTorus is a package for clustering angular data, especially protein structure data. ClusTorus provides various clustering algorithms built on the conformal prediction framework, which can deal with outliers. The package offers several methods for fitting these algorithms, some of which are introduced below. It also provides simple tools for handling angular data, such as angular subtraction and computation of angular distances. The following sections briefly show how to use ClusTorus.
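For example, angular subtraction and angular distance must respect the wrap-around at \(2\pi\). The following base-R sketch only illustrates the idea with a hypothetical helper; ClusTorus ships its own angular-handling utilities, which should be preferred in practice.

# Hypothetical illustration of wrap-around (geodesic) distance on the circle;
# this is a conceptual sketch, not a ClusTorus function.
ang_dist_sketch <- function(a, b) {
  d <- abs(a - b) %% (2 * pi)   # raw difference, reduced mod 2*pi
  pmin(d, 2 * pi - d)           # take the shorter way around the circle
}
ang_dist_sketch(0.1, 2 * pi - 0.1)  # about 0.2, not about 6.08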

Data Loading and Handling

ClusTorus provides protein structure data of SARS-CoV-2, the virus that caused the COVID-19 pandemic. With this data, we introduce how to load data and how to use the data handling tools of ClusTorus. Since the SARS-CoV-2 data is originally extracted from the Protein Data Bank, it contains various information describing the protein structure. For our purpose, we will only use SARS_CoV_2$tbl, which contains only the angular information. Moreover, for simplicity, we will only use the backbone-chain angles, that is, the \(\phi\) and \(\psi\) columns.

library(ClusTorus)
library(tidyverse)

data <- SARS_CoV_2$tbl
# split the row-name identifiers into separate columns, then drop the unused one
data <- data.frame(data, id = rownames(data)) %>%
  separate(id, into = c(',', 'position', 'type', 'rest'))
data <- data %>% select(-",")

# keep only rows with type "B" and retain the phi and psi columns, dropping missing values
data <- data[(data$type == "B"), ]
data <- na.omit(data[, 1:2])
head(data)
#>                   phi        psi
#>   27.B.ALA  -52.38603  146.62861
#>   28.B.TYR -134.44609  152.58700
#>   29.B.THR -134.15780  175.49552
#>   30.B.ASN  -89.74427  117.60396
#>   31.B.SER  -71.24783  -15.14771
#>   32.B.PHE   51.81731 -117.41821

To use the functions of ClusTorus, we need to convert the data to the radian scale, specifically to \([0, 2\pi)\).

data <- data / 180 * pi
data <- on.torus(data)
ggplot(data = as.data.frame(data), mapping = aes(x = phi, y = psi)) + geom_point()

on.torus converts radian-scale data onto \([0, 2\pi)^d\), where \(d\) is the number of dimensions. In this case, \(d = 2\), so the data is mapped onto \([0, 2\pi)^2\).
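Note that mapping onto \([0, 2\pi)\) amounts to reduction modulo \(2\pi\) in each coordinate. A minimal base-R illustration with hypothetical example values:

theta <- c(-pi / 3, 5 * pi / 2)  # angles outside [0, 2*pi)
theta %% (2 * pi)                # both values now lie in [0, 2*pi)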

Clustering with Various Options

Now we are ready to apply clustering algorithms to the data. ClusTorus provides various options for clustering; here we demonstrate only a few of them: “kde”, “mixture - axis aligned”, and “kmeans - general”. We also need to choose the hyperparameters: the number of modes or ellipsoids \(J\) and the significance level \(\alpha\). Before choosing them, we first run the model fitting function icp.torus.score over a range of hyperparameter values.

set.seed(2021)

Jvec <- 3:35
l <- list()

# fit icp.torus objects (all three methods at once) for each J in Jvec
for (j in Jvec){
  l[[j]] <- icp.torus.score(as.matrix(data),
                            method = "all",
                            mixturefitmethod = "axis-aligned",
                            kmeansfitmethod = "general",
                            param = list(J = j, concentration = 25),
                            verbose = FALSE)
}

The list l contains the icp.torus objects, which consist of the fitted parameters for generating clusters, obtained by varying the hyperparameter \(J\). That is, each object holds the fitted ingredients for generating clusters for a given \(J\); once we specify a significance level, we obtain the clusters. But how do we generate optimal clusters? One natural criterion is that, for a given significance level, the hyperparameter yielding the cluster of minimum volume/area is optimal. For such volume/area-based hyperparameter searching, ClusTorus provides the function icp.torus.eval, which generates boolean matrices indicating whether pre-specified evaluation points are included in the clusters for the given significance level(s). Note that the result also changes as \(\alpha\) varies, so \(\alpha\) should in principle be chosen carefully under the chosen criterion. For simplicity, however, we assume that the significance level \(\alpha\) is given as \(0.1\). Then we can obtain the results from the pre-computed icp.torus objects as below:

alpha <- 0.1
out <- data.frame()

# for each J, count the grid points covered by each method's prediction set
for (j in Jvec){
  a <- icp.torus.eval(l[[j]], level = alpha, eval.point = grid.torus())
  mu_kde <- sum(a$Chat_kde[, 1])
  mu_e <- sum(a$Chat_e[, 1])
  mu_kmeans <- sum(a$Chat_kmeans[, 1])
  out <- rbind(out, data.frame(J = j, mu_kde = mu_kde, mu_e = mu_e, mu_kmeans = mu_kmeans))
}

head(out)
#>   J mu_kde mu_e mu_kmeans
#> 1 3   1619 2316      2239
#> 2 4   1372 2209      2024
#> 3 5   1601 2221      1645
#> 4 6   1483 1896      2253
#> 5 7   1799 1952      1853
#> 6 8   1578 1871      1703

We count the number of grid points included in the generated clusters. In this 2-dimensional case, grid.torus() generates 10,000 grid points by default, and icp.torus.eval produces matrices indicating whether each grid point is included in the clusters. Counting the included points therefore gives an approximate estimate of the volume of the clusters (see the sketch after the code below). mu_kde is the volume of the cluster generated by the option kde, mu_e by the option mixture with axis-aligned, and mu_kmeans by the option kmeans with general. Now, with the “minimum volume” criterion, we choose the optimal \(J\) for each method and generate the clusters.

J_kde <- out$J[which.min(out$mu_kde)]
J_e <- out$J[which.min(out$mu_e)]
J_kmeans <- out$J[which.min(out$mu_kmeans)]

icp.torus.kde <- l[[J_kde]]
icp.torus.mix <- l[[J_e]]
icp.torus.kmeans <- l[[J_kmeans]]

c_mix <- cluster.assign.torus(data, icp.torus.mix, level = alpha)
c_kmeans <- cluster.assign.torus(data, icp.torus.kmeans, level = alpha)
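As a side note, the raw grid counts above can be converted into an approximate area on \([0, 2\pi)^2\), since each count is out of the equally spaced grid points. A quick sketch, where the column area_kde is introduced only for illustration:

# approximate area of each kde-based prediction set on [0, 2*pi)^2
n_grid <- nrow(grid.torus())                      # 10,000 grid points by default in 2D
out$area_kde <- out$mu_kde / n_grid * (2 * pi)^2  # fraction of grid times total area
head(out$area_kde)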

If evaluating such volumes, or implementing your own hyperparameter-choosing criterion, is inconvenient, ClusTorus alternatively provides the function hyperparam.torus, which suggests both hyperparameters \(J\) and \(\alpha\). For the choice of \(J\), there are several options: “elbow”, “risk”, “AIC”, and “BIC”. The option “elbow” is the volume-based criterion described above. “risk” is based on the sum of the conformity scores evaluated under the inductive conformal prediction framework, that is, evaluated with the function icp.torus.score. “AIC” and “BIC” are the well-known information criteria proposed by Akaike (1974) and Schwarz (1978), respectively. These options are selected with the argument option. The function hyperparam.torus returns data.frame objects relating \(J\) to the evaluated criterion and \(\alpha\) to the number of resulting clusters, together with an icp.torus object based on the selected \(J\).

output <- hyperparam.torus(data, icp.torus.objects = l, option = "risk")
output$IC.results
#>     J criterion
#> 1   3  1730.001
#> 2   4  1682.979
#> 3   5  1743.655
#> 4   6  1547.127
#> 5   7  1644.920
#> 6   8  1538.558
#> 7   9  1571.729
#> 8  10  1497.837
#> 9  11  1473.851
#> 10 12  1441.767
#> 11 13  1420.352
#> 12 14  1534.647
#> 13 15  1465.857
#> 14 16  1450.854
#> 15 17  1465.153
#> 16 18  1536.920
#> 17 19  1578.592
#> 18 20  1520.175
#> 19 21  1596.795
#> 20 22  1516.367
#> 21 23  1483.848
#> 22 24  1528.707
#> 23 25  1608.701
#> 24 26  1392.156
#> 25 27  1352.146
#> 26 28  1463.455
#> 27 29  1786.932
#> 28 30  1409.073
#> 29 31  1678.029
#> 30 32  1958.643
#> 31 33  1637.330
#> 32 34  1551.388
#> 33 35  1603.736
output$alpha.results
#>          alpha ncluster
#> 1  0.002427184        1
#> 2  0.004854369        1
#> 3  0.007281553        1
#> 4  0.009708738        1
#> 5  0.012135922        1
#> 6  0.014563107        1
#> 7  0.016990291        1
#> 8  0.019417476        2
#> 9  0.021844660        2
#> 10 0.024271845        2
#> 11 0.026699029        2
#> 12 0.029126214        2
#> 13 0.031553398        2
#> 14 0.033980583        2
#> 15 0.036407767        2
#> 16 0.038834951        2
#> 17 0.041262136        2
#> 18 0.043689320        2
#> 19 0.046116505        3
#> 20 0.048543689        3
#> 21 0.050970874        3
#> 22 0.053398058        3
#> 23 0.055825243        3
#> 24 0.058252427        3
#> 25 0.060679612        3
#> 26 0.063106796        3
#> 27 0.065533981        3
#> 28 0.067961165        3
#> 29 0.070388350        3
#> 30 0.072815534        3
#> 31 0.075242718        3
#> 32 0.077669903        5
#> 33 0.080097087        5
#> 34 0.082524272        5
#> 35 0.084951456        5
#> 36 0.087378641        6
#> 37 0.089805825        6
#> 38 0.092233010        6
#> 39 0.094660194        6
#> 40 0.097087379        7
#> 41 0.099514563        7
#> 42 0.101941748        7
#> 43 0.104368932        7
#> 44 0.106796117        7
#> 45 0.109223301        7
#> 46 0.111650485        7
#> 47 0.114077670        7
#> 48 0.116504854        7
#> 49 0.118932039        7
#> 50 0.121359223        7
#> 51 0.123786408        9
#> 52 0.126213592        9
#> 53 0.128640777        9
#> 54 0.131067961        9
#> 55 0.133495146        9
#> 56 0.135922330        9
#> 57 0.138349515        9
#> 58 0.140776699        9
#> 59 0.143203883       10
#> 60 0.145631068       10
#> 61 0.148058252       10
output$optim$hyperparam
#>           J       alpha 
#> 27.00000000  0.06067961
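The selected hyperparameters can then be plugged back in to generate clusters. A minimal sketch, reusing the icp.torus objects fitted earlier in the list l for the selected \(J\) (the variable names are introduced here only for illustration):

J_opt <- output$optim$hyperparam["J"]          # 27 in this run
alpha_opt <- output$optim$hyperparam["alpha"]  # about 0.061 in this run
c_opt <- cluster.assign.torus(data, l[[J_opt]], level = alpha_opt)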

Generating Clusters and Visualization of Clustering Results

With cluster.assign.torus, we can generate the clusters for the options mixture and kmeans, and each data point is assigned a cluster label. Moreover, for the mixture and kmeans cases, we can directly inspect how the clusters are generated, as below:

c_mix$mixture$plot
#> [[1]]