# ktaucenters package: Robust and Efficient Clustering

## Introduction

This package implements a clustering algorithm similar to kmeans with two main advantages:

• The estimator is resistant to outliers: its results remain correct even when atypical values are present in the sample.

• The estimator is efficient: roughly speaking, if there are no outliers in the sample (all the data are good), the results will be similar to those obtained by a classical algorithm (kmeans).

The clustering procedure is carried out by minimizing an overall robust scale, the so-called tau scale (see Gonzalez, Yohai and Zamar 2019, arXiv:1906.08198).
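The intuition behind using a robust scale can be illustrated with univariate scale estimators in base R (this is only an illustration of the robust-versus-classical trade-off, not the package's internal tau scale):

```r
# Illustration: a robust scale estimate barely moves under contamination,
# while the classical standard deviation is heavily inflated.
set.seed(1)
clean <- rnorm(1000)             # clean sample, true scale = 1
dirty <- c(clean, rep(50, 100))  # add ~9% gross outliers

sd(clean)    # close to 1 on clean data
mad(clean)   # also close to 1: the robust estimator is efficient
sd(dirty)    # blown up by the outliers
mad(dirty)   # still close to 1: the robust estimator resists them
```

The tau scale used by the package combines this kind of resistance with high efficiency on clean data.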

## How to use the package ktaucenters

### Example 1: behavior when data are clean

First we load the package ktaucenters:

```r
rm(list = ls())
library(ktaucenters)
```

We generate synthetic data (three well-separated clusters) and apply both a classical algorithm (kmeans) and the robust ktaucenters. The code and results are shown below.

```r
# Generate synthetic data (three well-separated clusters)
set.seed(1)
Z <- rnorm(600)
mues <- rep(c(-4, 0, 4), 200)
X <- matrix(Z + mues, ncol = 2)

### Applying the ROBUST algorithm ####
ktau_output <- ktaucenters(X, K = 3, nstart = 10)
### Applying the classical algorithm ####
kmeans_output <- kmeans(X, centers = 3, nstart = 10)

### Plotting the estimated centers
plot(X, main = "Efficiency")
points(ktau_output$centers, pch = 19, col = 2, cex = 2)
points(kmeans_output$centers, pch = 17, col = 3, cex = 2)
legend(-6, 6, pch = c(19, 17), col = c(2, 3), cex = 1,
       legend = c("ktau centers", "kmeans centers"))
```

### Example 2: behavior in the presence of outliers

We contaminate the previous data by replacing 60 observations with outliers located in a bounding box that contains the clean data. Then we apply the kmeans and ktaucenters algorithms.

```r
# Generate 60 synthetic outliers (contamination level 20%)
X[sample(1:300, 60), ] <- matrix(runif(120, 2 * min(X), 2 * max(X)),
                                 ncol = 2, nrow = 60)

### Applying the ROBUST algorithm ####
ktau_output <- ktaucenters(X, K = 3, nstart = 10)
### Applying the classical algorithm ####
kmeans_output <- kmeans(X, centers = 3, nstart = 10)
```

Plotting the estimated centers:

```r
plot(X, main = "Robustness")
points(ktau_output$centers, pch = 19, col = 2, cex = 2)
points(kmeans_output$centers, pch = 17, col = 3, cex = 2)
legend(-10, 10, pch = c(19, 17), col = c(2, 3), cex = 1,
       legend = c("ktau centers", "kmeans centers"))
```

As can be observed in the figure, the kmeans centers were heavily influenced by the outliers, while the ktaucenters results are still reasonable.

### Example 3: Showing clusters and outliers detection procedure

Continuing from Example 2, for outlier recognition we can inspect `ktau_output$outliers`, which contains the indices of the observations that may be considered outliers. On the other hand, the cluster labels found by the algorithm are coded with integers between 1 and K (in this case K = 3); the variable `ktau_output$cluster` contains that information.

```r
plot(X, main = "Estimated clusters and outliers detection")
## Plotting clusters
for (j in 1:3) {
  points(X[ktau_output$cluster == j, ], col = j + 1)
}

## Plotting outliers
points(X[ktau_output$outliers, ], pch = 19, col = 1, cex = 1)
legend(7, 15, pch = c(1, 1, 1, 19), col = c(2, 3, 4, 1), cex = 1,
       legend = c("cluster 1", "cluster 2", "cluster 3", "detected \n outliers"),
       bg = "gray")
```

The final figure shows the estimated clusters and the detected outliers.
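The cutoff-based flagging idea can be mimicked in a simplified way with base R: assign points to clusters, compute each point's distance to its assigned center, and flag distances beyond a chi-squared quantile. This is only a rough sketch of the idea; the actual rule in `ktaucenters` relies on a robust scale and may differ in detail.

```r
# Simplified sketch of cutoff-based outlier flagging (illustration only;
# not ktaucenters' exact rule, which is based on the robust tau scale).
set.seed(1)
X_toy <- matrix(rnorm(600) + rep(c(-4, 0, 4), 200), ncol = 2)
fit <- kmeans(X_toy, centers = 3, nstart = 10)

# Euclidean distance from each point to its assigned center
d <- sqrt(rowSums((X_toy - fit$centers[fit$cluster, ])^2))

# Scale the distances with a median-based dispersion estimate, then flag
# points whose squared scaled distance exceeds the 0.95 chi-squared
# quantile with 2 degrees of freedom
s <- median(d) / sqrt(qchisq(0.5, df = 2))
flagged <- which((d / s)^2 > qchisq(0.95, df = 2))
```

On clean Gaussian data a rule like this flags roughly 5% of the points; on contaminated data the flagged set concentrates on the gross outliers.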

## Improved-ktaucenters

The ktaucenters algorithm works well under noisy data, but fails when clusters have different sizes, shapes and orientations. An algorithm suitable for this sort of data is `improvedktaucenters`. To show how this algorithm works we use the data set M5data from the package `tclust` (Robust Trimmed Clustering). The M5 data were generated by three bivariate normal distributions with different scales, one of which heavily overlaps with another. A 10% background noise is added uniformly.

### Usage

First we load the data and then run the `improvedktaucenters` function.

```r
## Load non-spherical data
library("tclust")
data("M5data")
X <- M5data[, 1:2]
true.clusters <- M5data[, 3]

# Run the function to estimate clusters
improved_output <- improvedktaucenters(X, K = 3, cutoff = 0.95)
```

We keep the results in the variable `improved_output`, which is a list containing the fields `outliers` and `cluster`, among others. For example, we can access the observations in the cluster labeled 2 by typing

`X[improved_output$cluster == 2, ]`.

If we want to know the values of outliers, type

`X[improved_output$outliers, ]`.

By using these commands, it is easy to recover the original clusters by means of the `improvedktaucenters` routine.
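Since `M5data` also carries the true labels (`true.clusters` above), the recovered partition can be checked against them with a simple cross-tabulation. The same pattern works for any clustering output; it is illustrated here with `kmeans` on synthetic data so that the snippet is self-contained (with the M5 data one would tabulate `true.clusters` against `improved_output$cluster` instead):

```r
# Cross-tabulating estimated vs. true labels (self-contained illustration
# with kmeans; the same idiom applies to improvedktaucenters output).
set.seed(1)
true_labels <- rep(1:3, each = 100)
X_toy <- cbind(rnorm(300, mean = c(-4, 0, 4)[true_labels]),
               rnorm(300))
fit <- kmeans(X_toy, centers = 3, nstart = 10)

# Rows: true labels; columns: estimated labels.
# Label switching is expected: each row should concentrate in one column.
table(true_labels, fit$cluster)
```

A nearly diagonal table (up to a permutation of the column labels) indicates that the original clusters were recovered.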