One Tip Per Day: format converter

Showing posts with label format converter. Show all posts

Tuesday, July 15, 2014

reshape2: convert table from wide to long format

I found this elegant note about reshape2 from Sean Anderson's blog:
http://seananderson.ca/2013/10/19/reshape.html

Basically,

reshape2 is based around two key functions: melt and cast:
melt takes wide-format data and melts it into long-format data.
cast takes long-format data and casts it into wide-format data.

For example, this is wide format:

> head(fpkm)

ID FPKM.SRR1069188 FPKM.SRR1070986 FPKM.SRR1071289

ENSG00000240361.1 1.00000000 1.000000 1.000000

ENSG00000186092.4 1.00000000 1.000000 1.000000

ENSG00000237613.2 1.00000000 1.000000 1.000000

ENSG00000239906.1 0.05888838 5.139312 5.055983

ENSG00000241860.1 1.20237363 1.160175 1.085992

ENSG00000222623.1 1.00000000 1.000000 1.000000

Using melt to change it into long format:

>require('reshape2')

>head(melt(fpkm))

No id variables; using all as measure variables

variable value

1 FPKM.SRR1069188 1.00000000

2 FPKM.SRR1069188 1.00000000

3 FPKM.SRR1069188 1.00000000

4 FPKM.SRR1069188 0.05888838

5 FPKM.SRR1069188 1.20237363

6 FPKM.SRR1069188 1.00000000

or, you can set the column name by

> head(melt(fpkm, variable.name = "Sample",value.name ="FPKM"))

No id variables; using all as measure variables

Sample FPKM

1 FPKM.SRR1069188 1.00000000

2 FPKM.SRR1069188 1.00000000

3 FPKM.SRR1069188 1.00000000

4 FPKM.SRR1069188 0.05888838

5 FPKM.SRR1069188 1.20237363

6 FPKM.SRR1069188 1.00000000

if you want, you can also keep some of columns as ID in the long format, for example, I want to keep the gene ID in the long format:

>head(melt(fpkm, variable.name = "Sample",value.name ="FPKM", id="ID"))

ID Sample FPKM

1 ENSG00000240361.1 FPKM.SRR1069188 1.00000000

2 ENSG00000186092.4 FPKM.SRR1069188 1.00000000

3 ENSG00000237613.2 FPKM.SRR1069188 1.00000000

4 ENSG00000239906.1 FPKM.SRR1069188 0.05888838

5 ENSG00000241860.1 FPKM.SRR1069188 1.20237363

6 ENSG00000222623.1 FPKM.SRR1069188 1.00000000

I will do the long-->wide example when I have a good case to show... :)

Update: see this post of the long-->wide conversion.

Monday, November 19, 2012

get intron, UTR, CDS from bed12 format

Input: gene annotation in BED12 format (If your input is GTF/GFF, you want to convert into bed12 ahead, see my another post for howto)
Output: intron/UTR/CDS in bed12 format (which can be further converted into bed6 via "bedtools bed12tobed6" command)

# Introns (if any) of gene annotation in BED12 format
cat annotation.bed | awk '{OFS="\t";split($11,a,","); split($12,b,","); A=""; B=""; for(i=1;i<length(a)-1;i++) {A=A""(b[i+1]-b[i]-a[i])",";B=B""(b[i]+a[i]-(b[1]+a[1]))",";} if($10>1) print $1,$2+a[1], $3-a[length(a)-1], $4,$5,$6,$2+a[1], $3-a[length(a)-1],$9,$10-1,A,B;}'

# 5´ UTR (if any) of gene annotation in BED12 format
cat annotation.bed | awk '{OFS="\t";split($11,a,","); split($12,b,","); A=""; B=""; if($7==$8) next; if($6=="+" && $2<$7) {for(i=1;i<length(a);i++) if(($2+b[i]+a[i])<=$7) {A=A""a[i]",";B=B""b[i]",";} else {A=A""($7-$2-b[i])",";B=B""b[i]","; break; } print $1,$2,$7,$4,$5,$6,$2,$7,$9,i,A,B;} if($6=="-" && $8<$3) {for(i=length(a)-1;i>0;i--) if(($2+b[i])>=$8) {A=a[i]","A;B=($2+b[i]-$8)","B;} else {A=($2+b[i]+a[i]-$8)","A;B=0","B; break; } print $1,$8,$3,$4,$5,$6,$8,$3,$9,length(a)-i,A,B;}}'
## Update: the code crossed above did not consider the special case that start codon is spliced (e.g. in intron within in the start codon).
awk '{OFS="\t";split($11,blockSizes,","); split($12,blockStarts,","); blockCount=$10;A=""; B=""; if($7==$8) next;N=0;if($6=="+" && $2<$7) {start=$2;end=$7; for(i=1;i<=blockCount;i++) if(($2+blockStarts[i]+blockSizes[i])<=$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=($2+blockStarts[i]+blockSizes[i]); N++;} else { if(($2+blockStarts[i])<$7) {A=A""($7-$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=$7;} break; } print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;} if($6=="-" && $8<$3) {start=$8;end=$3; for(i=blockCount;i>0;i--) if(($2+blockStarts[i])>=$8) {A=blockSizes[i]","A;B=($2+blockStarts[i]-$8)","B;start=($2+blockStarts[i]); N++;} else {if(($2+blockStarts[i]+blockSizes[i])>$8) {A=($2+blockStarts[i]+blockSizes[i]-$8)","A;B=0","B; N++; start=$8;} break; } print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;}}'

## Update (2017-Jan-28, see comment below)
awk '{OFS="\t";split($11,blockSizes,","); split($12,blockStarts,","); blockCount=$10;A=""; B=""; if($7==$8) next;N=0;if($6=="+" && $2<$7) {start=$2;end=$7; for(i=1;i<=blockCount;i++) if(($2+blockStarts[i]+blockSizes[i])<=$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=($2+blockStarts[i]+blockSizes[i]); N++;} else { if(($2+blockStarts[i])<$7) {A=A""($7-$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=$7;} break; } print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;} if($6=="-" && $8<$3) {start=$8;end=$3; for(i=1;i<=blockCount;i++) if(($2+blockStarts[i])>=$8) {if(start==0) {A=blockSizes[i];B=0; start=$2+blockStarts[i];} else {A=A","blockSizes[i];B=B","($2+blockStarts[i]-start);} N++;} else { if(($2+blockStarts[i]+blockSizes[i])>$8) { A=($2+blockStarts[i]+blockSizes[i]-$8);B=0; N++; start=$8;} if(($2+blockStarts[i]+blockSizes[i])==$8) start=0;} print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;}}'

# CDS (if any) of gene annotation in BED12 format
grep "protein_coding\.protein_coding" annotation.bed | awk '{OFS="\t";split($11,a,","); split($12,b,","); A=""; B=""; if($7==$8) next; j=0; for(i=1;i<length(a);i++) if(($2+b[i]+a[i])>$7 && ($2+b[i])<$8) {j++; start=$2+b[i]-$7; size=a[i]; if(($2+b[i])<=$7) {start=0;size=size-($7-($2+b[i]));} if(($2+a[i]+b[i])>=$8) {size=size-($2+a[i]+b[i]-$8);} A=A""size",";B=B""start",";} print $1,$7,$8,$4,$5,$6,$7,$8,$9,j,A,B;}'

### Note1: The thickStart and thickEnd in the bed12 format don't always indicate CDS. "When there is no thick part, thickStart and thickEnd are usually set to the chromStart position." So, we cannot use bed12 to infer CDS (or coding exons), esp. for lincRNA. The correct way is to grep all CDS lines from the GTF file, or directly using UCSC Table Browser to download coding exons.

### Note2: The "Polymorphic pseudogene" in GENCODE can also have a protein-coding transcript, even though the gene itself is classified as a pseudogene (YES, polymorphic pseudogenes are also pseudogenes, they "are coding gene that are pseudogenic due to the presence of a polymorphic premature stop codon in the reference genome" (http://www.genomebiology.com/2012/13/9/R51). For more details, see https://gencodegenes.wordpress.com/toolbox/.

# 3´ UTR (if any) of gene annotation in BED12 format
cat annotation.bed | awk '{OFS="\t";split($11,blockSizes,","); split($12,blockStarts,","); blockCount=$10;A=""; B=""; if($7==$8) next;N=0;if($6=="-" && $2<$7) {start=$2;end=$7; for(i=1;i<=blockCount;i++) if(($2+blockStarts[i]+blockSizes[i])<=$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=($2+blockStarts[i]+blockSizes[i]); N++;} else { if(($2+blockStarts[i])<$7) {A=A""($7-$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=$7;} break; } print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;} if($6=="+" && $8<$3) {start=$8;end=$3; for(i=blockCount;i>0;i--) if(($2+blockStarts[i])>=$8) {A=blockSizes[i]","A;B=($2+blockStarts[i]-$8)","B;start=($2+blockStarts[i]); N++;} else {if(($2+blockStarts[i]+blockSizes[i])>$8) {A=($2+blockStarts[i]+blockSizes[i]-$8)","A;B=0","B; N++; start=$8;} break; } print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;}}'

cat annotation.bed | awk '{OFS="\t";split($11,blockSizes,","); split($12,blockStarts,","); blockCount=$10;A=""; B=""; if($7==$8) next;N=0;if($6=="-" && $2<$7) {start=$2;end=$7; for(i=1;i<=blockCount;i++) if(($2+blockStarts[i]+blockSizes[i])<=$7) {A=A""blockSizes[i]",";B=B""blockStarts[i]","; end=($2+blockStarts[i]+blockSizes[i]); N++;} else { if(($2+blockStarts[i])<$7) {A=A""($7-$2-blockStarts[i])",";B=B""blockStarts[i]","; N++; end=$7;} break; } print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;} if($6=="+" && $8<$3) {start=$8;end=$3; for(i=1;i<=blockCount;i++) if(($2+blockStarts[i])>$8) { if(start==0) {A=blockSizes[i];B=0; start=$2+blockStarts[i];} else {A=A","blockSizes[i];B=B","($2+blockStarts[i]-start);} N++; } else { if(($2+blockStarts[i]+blockSizes[i])>$8) { A=($2+blockStarts[i]+blockSizes[i]-$8);B=0; N++; start=$8;} if(($2+blockStarts[i]+blockSizes[i])==$8) start=0;} print $1,start,end,$4,$5,$6,start,end,$9,N,A,B;}}'

Update: Thanks to Heather's comment below, when the thickEnd (end of CDS) is at the end of an exon (or thickStart at the first nt of an exon), it will generate an invalid BED12 format. It's fixed now. (2017-Jan-28)

BTW, the above codes are integrated into a neat script in Github: https://github.com/sterding/RNAseq/blob/master/bin/bed12toAnnotation.awk

Also, I fixed some bugs in gtf2bed (originally by Erik) and host here:
https://github.com/sterding/RNAseq/blob/master/bin/gtf2bed

Monday, October 01, 2012

code snip to decide Phred encoding of FASTQ file

It's a common task to check the Phred quality score encoding of fastq file. Sure, it's fairly easy to check this by eyes (according to the rul below), and also there might be already some tools for that purpose. The task is essentially a problem of converting ASCII character into its decimal format.

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
    (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

To be simple, I just wrote a line of code for that, by only looking the first lines of reads. Here it is:

head input.fastq | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding!";}'

To use it, you can also save it into a bash, like

#!/bin/sh

inputfile=$1

# Update: The following part for checking the file extension can be simplified (Thanks to the comment from Unknown)
~~[[ "$inputfile" =~ ".*\.(fq\.gz$|fastq\.gz$)" ]] && zcat $inputfile | head -n40 ...~~
~~[[ "$inputfile" =~ ".*\.(fq$|fastq$)" ]] && head -n40 $inputfile | ...~~

less $inputfile | head -n40 | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding!";}'

The only trick is the command od, which is to display/convert file content in a specific format. Here is the more detail for od (http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds4/od.htm).

Tuesday, September 25, 2012

Number formatting with thousand separator

in BASH : (Ref: http://stackoverflow.com/questions/9374868/number-formatting-in-bash-with-thousand-separator)

$ printf "%'.3f\n" 12345678.901
12,345,678.901
$ printf "%'d\n" 12345678
12,345,678

in AWK: (Ref: http://www.mkssoftware.com/docs/man1/awk.1.asp)

When flag contains a ' or " character and the TK_USE_CURRENT_LOCALE environment variable is set, a thousands separator is displayed. The digital grouping character (for example, a comma in the United States) as set by the Regional and Language Options control panel applet is used as the thousands separator. When flags contains ' or " and TK_USE_CURRENT_LOCALE is unset, no thousands separator is displayed.

For example, the following MKS KornShell commands:

export TK_USE_CURRENT_LOCALE=1
awk 'BEGIN { printf("%'\''10d\n",123456)}'

display:

123,456

while the MKS KornShell commands:

unset TK_USE_CURRENT_LOCALE
awk 'BEGIN { printf("%'\''10d\n",123456)}'

display:

Tuesday, August 14, 2012

convert bed to gtf, gtf to bed

bed --> gtf:

bedToGenePred input.bed stdout | genePredToGtf file stdin output.gtf

Here is an example from UCSC wiki (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format):

Some gene tracks are in a bed format in the database, perhaps with extra columns past the standard bed format. In this case, extract the standard bed columns, convert it to a genePred and then to a gtf. For example

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd from wgRna;" hg19 | bedToGenePred stdin stdout | genePredToGtf file stdin wgRna.gtf

gtf --> bed:

download gtf2bed.pl from Eric's site:

http://code.google.com/p/ea-utils/source/browse/trunk/clipper/gtf2bed

perl gtf2bed.pl input.gtf > output.bed

If simply converting gtf to bed6 format (no exons/intron info), the following bash line can work:

fgrep -w transcript gencode.v17.annotation.gtf | sed 's/[";]//g;' | awk '{OFS="\t"; print $1, $4-4,$5,$12,0,$7}'

or a bed9 format with additional info on gene/type/name

fgrep -w transcript gencode.v17.annotation.gtf | sed 's/[";]//g;' | awk '{OFS="\t"; print $1, $4-4,$5,$12,0,$7,$18,$14,$10}'

Also, I edited Eric's code a bit to output a more meaningful gene ID, e,g.

($transid) = $f[8]=~ /transcript_id "([^"]+)"/;
($geneid) = $f[8]=~ /gene_id "([^"]+)"/;
($gene_type) = $f[8]=~ /gene_type "([^"]+)"/;
($gene_name) = $f[8]=~ /gene_name "([^"]+)"/;
($trans_type) = $f[8]=~ /transcript_type "([^"]+)"/;
$id="${gene_name}__${geneid}__${transid}__${gene_type}.${trans_type}";

Pages