Showing posts with label data transformation. Show all posts
Showing posts with label data transformation. Show all posts

Thursday, September 11, 2014

transpose a tab-delimited file in command line

Very often we need to transpose a tab-delimited file, e.g. rows --> columns and columns --> rows. For example, I have a SNP file like below, each row is SNP and each column is a sample:

$ cat SNP.txt
id Sam_01 Sam_02 Sam_03 Sam_04 Sam_05
Snp_01 2 0 2 0 2
Snp_02 0 1 1 2 2
Snp_03 1 0 1 0 1
Snp_04 0 1 2 2 2
Snp_05 1 1 2 1 1
Snp_06 2 2 2 1 1
Snp_07 1 1 2 2 0
Snp_08 1 0 1 0 1
Snp_09 2 1 2 2 0

I want to convert it to the following format:

id Snp_01 Snp_02 Snp_03 Snp_04 Snp_05 Snp_06 Snp_07 Snp_08 Snp_09
Sam_01 2 0 1 0 1 2 1 1 2 
Sam_02 0 1 0 1 1 2 1 0 1 
Sam_03 2 1 1 2 2 2 2 1 2 
Sam_04 0 2 0 2 1 1 2 0 2 
Sam_05 2 2 1 2 1 1 0 1 0

We can easily do this in R (e.g.. t(df)), but actually there are also a couple available tools in linux. Here are two I used:

1. rowsToCols from Jim Kent's utility
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/rowsToCols
cat SNP.txt | rowsToCols stdin stdout

2. datamash from GNU
cat SNP.txt | datamash transpose

btw, datamash is really a neat command with many functions, like your swiss-knife for small daily tasks for data scientist. Here is its example page on GNU:
http://www.gnu.org/software/datamash/examples/

Wednesday, May 02, 2012

data transformation: variance-stabilizing transformation

Feeling good to read this wiki article in the morning, now I understand a bit more why we usually use logarithm (e.g. log2(x)) to transform the data before we do regression.

The reason to do variance-stabilizing transformation is to limit/remove the relationship between mean and variance. One of very important assumptions of linear regression is the constant variance (also know as homoskedasticity), see the Assumptions section of Linear regression wiki (http://en.wikipedia.org/wiki/Linear_regression#Assumptions). Violation of the assumption will lead to "less precise parameter estimates and misleading inferential quantities such as standard errors" (from wiki). I've not fully understood this part yet. But nevertheless, wiki article has pointed out several ways to fix the problem, among which is variance-stabilizing transformation (VST), e.g. logarithm. I'd like to try the other way, such as Bayesian linear regression, in future study.

OK. Another key question is: how to choose a proper VST?

Here is a very nice one-page math: http://www.stat.ufl.edu/~winner/sta6207/transform.pdf

To be simple, if variance of X is function of mean, V(X)=g(u), where u is mean of variable X (e.g. u=E(X)), then the proper VST is:
For example, the Anscombe transform is to transform Poisson distribution (where u=v) to Gaussian distribution:

If V(X)=g(u)=u^2, then the integration of 1/x is log(x). 
That's what we usually used for microarray, RNAseq transformation. But do we test the V(X)=u^2 relationship?

We can test the homoskedasticity of residual by Breusch–Pagan testbptest {lmtest} is the R function. Here is the example output:
> library('lmtest')
> ## generate a regressor
> x <- rep(c(-1,1), 50)
> ## generate heteroskedastic and homoskedastic disturbances
> err1 <- rnorm(100, sd=rep(c(1,2), 50))
> err2 <- rnorm(100)
> ## generate a linear relationship
> y1 <- 1 + x + err1
> y2 <- 1 + x + err2
> ## perform Breusch-Pagan test
> bptest(y1 ~ x)
studentized Breusch-Pagan test
data:  y1 ~ x
BP = 15.867, df = 1, p-value = 6.795e-05
> bptest(y2 ~ x)
studentized Breusch-Pagan test
data:  y2 ~ x
BP = 7e-04, df = 1, p-value = 0.9784

y1 is significantly heteroskedastic, because the p-value is <0.05.