Rescaling Data

Greg Hunt

2019-03-11

In this vignette we’ll look at how to use rrscale to re-scale data and help discover latent effects.

First we are going to generate data that is the concatenation of two log-normal groups. We do this by taking the outer product between two i.i.d. log-normal vectors, creating group 1 as:

set.seed(919)
u1 = rlnorm(10)
v1 = rlnorm(10)
Y1 = u1%*%t(v1)

and similarly group 2, with an added offset of 0.5, as

u2 = rlnorm(10)
v2 = rlnorm(10)
Y2 = .5+u2%*%t(v2)

and then we stack the two groups row-wise to make the full data matrix, adding a small amount of log-normal noise:

Y_nn = rbind(Y1,Y2)
Y = Y_nn + array(rlnorm(prod(dim(Y_nn)),0,.05),dim(Y_nn))
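
As a quick check on this construction, the sketch below confirms that an outer product of two vectors is a rank-one matrix, and that adding a constant (as for group 2) raises the rank to at most two. The seed, the variable names and the helper num_rank are illustrative assumptions and are not part of rrscale.

```r
# Illustrative sketch: the seed and all names below are arbitrary.
set.seed(1)
u <- rlnorm(10)
v <- rlnorm(10)

M1 <- u %*% t(v)        # outer product, built the same way as Y1
M2 <- 0.5 + u %*% t(v)  # outer product plus a constant, as for Y2

# Hypothetical helper: count singular values that are not numerically zero.
num_rank <- function(M, tol = 1e-8) {
  d <- svd(M)$d
  sum(d > tol * d[1])
}

rank_M1 <- num_rank(M1)  # a single non-zero singular value
rank_M2 <- num_rank(M2)  # at most two: rank-one signal plus a rank-one constant
```

So before noise is added, each group is driven by a very low-dimensional latent structure.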

Notice that it's difficult to tell the groups apart:

library('reshape2')
library('ggplot2')
group = factor(rep(c(1,2),each=nrow(Y)/2))
levels(group) = c("group1","group2")
mY = melt(data.frame(Y,group),id.vars="group")
ggplot(data=mY,mapping=aes(x=value,color=group))+geom_histogram(bins=100)+geom_vline(data=aggregate(value~group,data=mY,mean),mapping=aes(xintercept=value,linetype=group),size=1.5)

Indeed, a t-test comparing the row means of the two groups shows no significant difference (p = 0.13):

t.test(rowMeans(Y)[group=="group1"],rowMeans(Y)[group=="group2"])
## 
##  Welch Two Sample t-test
## 
## data:  rowMeans(Y)[group == "group1"] and rowMeans(Y)[group == "group2"]
## t = -1.6241, df = 13.639, p-value = 0.1272
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.9623864  0.6916029
## sample estimates:
## mean of x mean of y 
##  2.930059  5.065451
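
For readers who want to see what the Welch test computes, here is a minimal sketch that reproduces the statistic, the degrees of freedom and the p-value by hand and compares them with t.test. It uses small simulated vectors rather than the row means of Y, and every variable name is an illustrative assumption.

```r
# Illustrative sketch on simulated data (not the vignette's Y).
set.seed(2)
x <- rnorm(10, mean = 0, sd = 1)
y <- rnorm(10, mean = 1, sd = 2)

# Squared standard errors of the two sample means.
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch statistic: difference in means over the unpooled standard error.
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df)

ref <- t.test(x, y)  # Welch test is the default (var.equal = FALSE)
```

The non-integer degrees of freedom in the output above come from this approximation.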

Let’s try this after transforming the data with rrscale:

library('rrscale')
scl = rrscale(Y,run_parallel=FALSE)

After running this we get an estimated transformation that helps recover the latent group effect. The element “T_name” tells us that the best transformation is a Box-Cox-like transformation:

scl$T_name
## [1] "box_cox_negative"

and the element “par_hat” gives the optimal value of this transformation's parameter:

scl$par_hat
## [1] -0.5206409
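
To give some intuition for a negative parameter, the sketch below implements the classical Box-Cox transformation for positive data. This is an assumption made for illustration: the exact definition behind rrscale's "box_cox_negative" may differ, and the function name box_cox and the rounded parameter value are hypothetical.

```r
# Illustrative sketch of the textbook Box-Cox transform; this is not
# necessarily the transformation rrscale applies internally.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))               # classical Box-Cox needs positive data
  if (abs(lambda) < 1e-8) {
    log(y)                            # limiting case as lambda goes to 0
  } else {
    (y^lambda - 1) / lambda
  }
}

lambda <- -0.52                       # rounded from par_hat, for illustration
y <- c(0.5, 1, 2, 10, 100)
z <- box_cox(y, lambda)
```

With a negative lambda the transform is still increasing, but it is bounded above by -1/lambda, so large values in the heavy right tail are pulled close together.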

We can grab the pre-computed RR transformation from the call to rrscale:

trans_Y = scl$RR

or we can use the returned “rr_fn” to calculate this transformation; the two are identical:

trans_Y2 = scl$rr_fn(Y)
all(trans_Y2==trans_Y,na.rm=TRUE)
## [1] TRUE
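
Exact equality holds here, but when comparing floating-point results more generally (for example, values recomputed along a different code path), a tolerance-based comparison is usually safer. The sketch below illustrates the difference with base R only; the vectors a and b are made-up values.

```r
# Illustrative sketch: two vectors that agree up to floating-point rounding.
a <- c(0.1 + 0.2, NA, 1)
b <- c(0.3, NA, 1)

exact_equal  <- all(a == b, na.rm = TRUE)  # exact comparison, NAs dropped
approx_equal <- isTRUE(all.equal(a, b))    # comparison within a small tolerance
```

Here exact_equal is FALSE because 0.1 + 0.2 is not exactly 0.3 in double precision, while approx_equal is TRUE.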

Notice that if we plot the transformed Y, the group difference is easier to see:

tmY = melt(data.frame(trans_Y,group),id.vars="group")
ggplot(data=tmY,mapping=aes(x=value,color=group))+geom_histogram(bins=100)+geom_vline(data=aggregate(value~group,data=tmY,mean),mapping=aes(xintercept=value,linetype=group),size=1.5)

Indeed, the t-test is now significant (p = 0.014):

t.test(rowMeans(trans_Y)[group=="group1"],rowMeans(trans_Y)[group=="group2"])
## 
##  Welch Two Sample t-test
## 
## data:  rowMeans(trans_Y)[group == "group1"] and rowMeans(trans_Y)[group == "group2"]
## t = -2.7412, df = 16.971, p-value = 0.01394
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.432490 -0.186342
## sample estimates:
##  mean of x  mean of y 
## -0.4046867  0.4047291

If we plot the first two PCs for the un-transformed and transformed data, we can see the group difference much better after transformation:

plot(svdc(Y)$u[,1:2],col=group)

plot(svdc(trans_Y)$u[,1:2],col=group)
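
For a matrix without missing values, left singular vectors like those plotted above can be obtained from base R's svd. The sketch below is an illustration on simulated data; it does not claim to reproduce svdc, whose handling of centering and missing values is not shown in this vignette.

```r
# Illustrative sketch on a simulated complete matrix (no missing values).
set.seed(3)
X <- matrix(rnorm(20 * 5), nrow = 20)

s  <- svd(X)
U2 <- s$u[, 1:2]                      # first two left singular vectors

# The left singular vectors are orthonormal.
gram <- crossprod(U2)

# Scaling them by the singular values gives uncentered PCA scores,
# which match prcomp(center = FALSE) up to the sign of each column.
scores <- s$u %*% diag(s$d)
pc     <- prcomp(X, center = FALSE, scale. = FALSE)
same_up_to_sign <- isTRUE(all.equal(abs(unname(pc$x)), abs(scores)))
```

Plotting U2 with points coloured by group, as in the calls above, shows how the rows are arranged along the two leading directions of variation.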

The plots above use the “svdc” function from the rrscale package, which calculates “completed” right and left singular vectors in the presence of missing values. We can also look at the canonical correlation between the group and the first two PCs for the transformed and untransformed data:

cancor(model.matrix(~1+group),svdc(Y)$u[,1:2])
## $cor
## [1] 0.7074969
## 
## $xcoef
##                  [,1]
## groupgroup2 0.4472136
## 
## $ycoef
##            [,1]      [,2]
## [1,] -0.3962615 1.6031726
## [2,]  0.9192846 0.7643497
## 
## $xcenter
## (Intercept) groupgroup2 
##         1.0         0.5 
## 
## $ycenter
## [1] -0.16540613 -0.08246504
cancor(model.matrix(~1+group),svdc(trans_Y)$u[,1:2])
## $cor
## [1] 0.9776598
## 
## $xcoef
##                  [,1]
## groupgroup2 0.4472136
## 
## $ycoef
##          [,1]       [,2]
## [1,] 0.731373 -0.6820391
## [2,] 0.856082  0.9079851
## 
## $xcenter
## (Intercept) groupgroup2 
##         1.0         0.5 
## 
## $ycenter
## [1] -0.001640214 -0.133762711

and we can see that it is much higher for the transformed data (0.98 versus 0.71), signifying that these principal components capture the latent group structure better.
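
One way to interpret this number: with a single group indicator, the canonical correlation equals the multiple correlation from regressing the indicator on the two PCs, that is, the square root of the regression R-squared. The sketch below checks this on simulated data; the variables g and P are illustrative stand-ins for the group indicator and the two PCs.

```r
# Illustrative sketch on simulated data (not the vignette's PCs).
set.seed(4)
g <- rep(c(0, 1), each = 10)            # 0/1 group indicator
P <- cbind(rnorm(20) + g, rnorm(20))    # two columns standing in for PCs

cc <- cancor(g, P)$cor                  # canonical correlation
r2 <- summary(lm(g ~ P))$r.squared      # R-squared of indicator on both columns

matches <- isTRUE(all.equal(cc, sqrt(r2)))
```

A canonical correlation near 1 therefore means the group label is almost perfectly predictable from a linear combination of the two PCs.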