Thursday, September 11, 2014

Transpose a tab-delimited file on the command line

Very often we need to transpose a tab-delimited file, i.e. turn rows into columns and columns into rows. For example, I have a SNP file like the one below, where each row is a SNP and each column is a sample:

$ cat SNP.txt
id Sam_01 Sam_02 Sam_03 Sam_04 Sam_05
Snp_01 2 0 2 0 2
Snp_02 0 1 1 2 2
Snp_03 1 0 1 0 1
Snp_04 0 1 2 2 2
Snp_05 1 1 2 1 1
Snp_06 2 2 2 1 1
Snp_07 1 1 2 2 0
Snp_08 1 0 1 0 1
Snp_09 2 1 2 2 0

I want to convert it to the following format:

id Snp_01 Snp_02 Snp_03 Snp_04 Snp_05 Snp_06 Snp_07 Snp_08 Snp_09
Sam_01 2 0 1 0 1 2 1 1 2 
Sam_02 0 1 0 1 1 2 1 0 1 
Sam_03 2 1 1 2 2 2 2 1 2 
Sam_04 0 2 0 2 1 1 2 0 2 
Sam_05 2 2 1 2 1 1 0 1 0

We can easily do this in R (e.g. t(df)), but there are also a couple of tools available on Linux. Here are two I have used:

1. rowsToCols from Jim Kent's utilities
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/rowsToCols
chmod +x rowsToCols
cat SNP.txt | ./rowsToCols stdin stdout

2. datamash from GNU
cat SNP.txt | datamash transpose
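
If neither tool is installed, plain awk can do the same job. The sketch below is not part of rowsToCols or datamash; the function name transpose_tsv is made up for illustration. It assumes the input is strictly tab-delimited, that every row has the same number of fields, and that the whole table fits in memory.

```shell
# Hypothetical helper: transpose a tab-delimited table.
# Reads from stdin, or from the file(s) given as arguments.
# Assumes every row has the same number of fields; the table is held in memory.
transpose_tsv() {
    awk -F'\t' '
    {
        # Store each field under its (row, column) position.
        for (i = 1; i <= NF; i++) cell[NR, i] = $i
        if (NF > ncol) ncol = NF
    }
    END {
        # Each input column becomes one tab-joined output row.
        for (i = 1; i <= ncol; i++) {
            line = cell[1, i]
            for (j = 2; j <= NR; j++) line = line "\t" cell[j, i]
            print line
        }
    }' "$@"
}
```

It can be used the same way as the commands above, e.g. cat SNP.txt | transpose_tsv. For very large files the dedicated tools are likely the better choice, since this version keeps every cell in an awk array.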

By the way, datamash is a really neat command with many functions, a Swiss Army knife for a data scientist's small daily tasks. Here is its examples page on the GNU site:
http://www.gnu.org/software/datamash/examples/
