Monday, January 12, 2015

altColor in UCSC track setting

Many track types allow setting a color range that varies from color to altColor. For instance the CpG Island tracks use the altColor setting to display the weaker islands, while the stronger ones are rendered in color. If altColor is not specified, the system will use a color halfway between that specified in the color tag and white instead.

Be aware that wiggles with negative values are drawn in altColor not color as positive values are.

Using one line command as input for LSF bsub

In a simple case, you can use bsub command arguments to submit your command job to LSF cluster.

If you have a complicated script with many commands,  you can save into a lsf script (including the shell pathname in the first line) and then submit that script to LSF cluster, e.g.  bsub yourscript arguments

In your script, you wrote something like this:

myFirstArgument = $1

Here I found I can also use pipe to connect multiple commands into one line and simply quote them as one command and works in bsub. Here is an example:

bsub "echo -ne 'ab\tcss' | awk '{print \$2}'"

So far, I found I have to add "\" (backslash) to escape the special character, such as $ in awk. Wondering there might be a way in bsub options to set this.

Thursday, October 23, 2014

Hierarchical structure of UCSC Genome Browser track hub

UCSC Genome Browser tracks can be organized into groups by using the container multiWig, compositeTrack on, and superTrack on lines. Supertracks can contain composite tracks and container multiWigs, but not vice versa. With supertracks, composite tracks, and container multiWigs, children will inherit the settings from their parents, but can override their parent settings within their own stanzas.

Here is their hierarchical relationship:

superTrack on
|==== child tracks (bam, bigBed, bigWig or vcfTabix)
     |---- child track (bigWig)
     |==== view
         |---- child track with subGroups setting (bam, bigBed, bigWig or vcfTabix, but not mix)
         |---- child track with subGroups setting (bam, bigBed, bigWig or vcfTabix, but not mix) 

superTrack also allow to mix other supported types of hub tracks: bam, bigBed, bigWig or vcfTabix.

Monday, October 20, 2014

Google doc + Flow: an imperfect online reference management solution

Working on a manuscript simultaneously with your collaborators in Google doc is such a fun thing!

How about inserting references?

We usually download the manuscript into Word (*.doc) when it's nearly done and insert references in MS Word with external software such as Papers, EndNote etc. Most of the reference managers are not free. Now, this pain is over!

Solution: use Flow!

Flow supports to save the references from website to your Flow account and then from your Google doc Add-on, you can insert and format citations easily.

Here is the steps:
1. Set up a Flow account (you can also create a collection and share with your collaborator)
2. Import references from web sites (or other tools such as Mendeley etc.):
3. Install Google Add-on for Flow:
3. Done!

Friday, October 10, 2014

Memory issue for large alignment file when doing de novo assembly

Memory becomes a bottleneck when the sequencing file is bigger and bigger nowadays. This is specially an issue for de novo transcriptome assembly using RNA-seq data from species like human. For Trinity, "a typical configuration is a multi-core server with 256 GB to 1 TB of RAM". "Trinity partitions RNA-Seq data into many independent de Bruijn graphs, ideally one graph per expressed gene, and uses parallel computing to reconstruct transcripts from these graphs, including alternatively spliced isoforms." If Trinity contructs one de Bruijn graph per gene, I don't know why it still needs such a large memory. For Cufflinks (-g option), I already can see it consumes 40G memory for 1/10 of chr1 (given a 4G bam file). Cufflinks constructs a DAG in memory for the given alignment. Hopefully it's one DAG per chromosome, not a DAG for the whole genome. But even though, chr1 has 249 billion base pairs, it still requires a lot of memory... So sad!

Of course, if you have a machine with super large memory, this won't be a problem. But this is not the case usually. For me, most of our nodes have maximally 90G memory.

So, how to have it run?

Here is few tips I can think of:

1. split the alignment into different chromosome, e.g. chr1.bam, chr2,bam etc. and then call de novo on each of files.

2. down-sampling the bam file if it has high coverage. Here is a post about how to down-sample your bam file ( Basically, you can use "samtools view -s" or "sambamba view -s", or GATK's for downsampling.

3. You may also want to remove the PCR artifacts, by "samtools rmdup" or "sambamba markdup".

Wednesday, October 08, 2014

Cufflinks mask option (-M/--mask-file) works when ...

Obviously I am not the only one who had questions on the "-M/--mask-file" mask GTF option in Cufflinks:

And too bad that no one from the Texedo group ever threw a piece of clue!

Here are few tips I found necessary to share in order to have it work:

1. The mask GTF file should have all 9 fields in required format. For example, the strand column should be '+', '-', or '.', not anything else. GTF/GFF file can be extracted from GENCODE ( or downloaded from UCSC Table browser. It can be also converted from a bed file using Kent's bedToGenePred --> genePredToGtf. But be aware that that the bed file should have at least 6 columns (i.e. including strand column), otherwise the converted GTF file will have a "^@" in the strand column, which results in an invalid GTF.

For example, if you want to exclude all reads mapped to human mitochondrial genome,  you can use
echo "chrM 0 16571 mt 0 ." | bedToGenePred stdin stdout | genePredToGtf file stdin chrM.gtf

2. "-M" option also works for de novo assembly (cufflinks -g).

3. Using "-M" option should theoretically increase the FPKM value (comparing to no mask). So, if you observed opposite tread, there must be something wrong.

4. If you expect a lot of reads from the mask regions (e.g. chrM, rRNAs), you can substract the masked reads from your bam file before feeding to cufflinks, for example using "samtools view -L retained_region.bed".

Friday, September 26, 2014

"search and highlight" in linux command line

Is there a way to display a text file in command line, but highlight the matches?

Actually there is a pretty neat tip from jacksonh:
grep --color -E "test|$" yourfile
What we're doing here is matching against the $ pattern and the test pattern, obviously $doesn't have anything to colourize so only the test pattern gets color. The -E just turns on extended regex matching.

Thursday, September 25, 2014

vennpieR: combination of venn diagram and pie chart in R

I was wondering how to draw a venn diagram like pie chart in R, to show the distribution of my RNA-seq reads mapped onto different annotation regions (e.g. intergenic, intron, exons etc.). A google search returns several options, including the nice one from Xiaopeng's bam2x (see below). However, he told me it's not released yet. And it's javascript based.

Why not I just make one in R? 

Here is the design scratch:

And here is example code:

Here is output:

You can also use par(mfrow=c(n,m)) to put multiple venn pieagram in one figure. 

Wednesday, September 24, 2014

external variables for GNU Parallel command

I was trying to use an externally defined variable within the parallel command, but it's failed. For example,

$ echo -e "intergenic\nintrons\nexons\n5utr\n3utr" | parallel 'echo {} $ANNOTATION/{}.bed'

3utr /3utr.bed
5utr /5utr.bed
intergenic /intergenic.bed
introns /introns.bed
exons /exons.bed

$ANNOTATION is not correctly read. One workout I found is to export the variable before the parallel, e.g. 

export ANNOTATION=/reference/annotation

Thursday, September 18, 2014

Simple script to generate random size-matched background regions

Here is a simple script I wrote for generating a random background regions with matched size distribution, using bedtools random and a input.bed as background. Not fancy, but works!

# ===============================================================
# Script to generate a random size-matched background regions 
# Author: Xianjun
# Date: Jun 3, 2014
# Usage:
# input.bed > output.random.bed
# or
# cat input.bed | -
# ===============================================================

# other possible options can be the genome (e.g. hg19, mm9 etc.) or an inclusion or exclusion regions (e.g. exons)

# download hg.genome
mysql --user=genome -A -e "select chrom, size from hg19.chromInfo"  > hg19.genome

cat $bedfile | while read chr start end rest
    let l=$end-$start;
    bedtools random -g hg19.genome -n 1 -l $l

Thursday, September 11, 2014

transpose a tab-delimited file in command line

Very often we need to transpose a tab-delimited file, e.g. rows --> columns and columns --> rows. For example, I have a SNP file like below, each row is SNP and each column is a sample:

$ cat SNP.txt
id Sam_01 Sam_02 Sam_03 Sam_04 Sam_05
Snp_01 2 0 2 0 2
Snp_02 0 1 1 2 2
Snp_03 1 0 1 0 1
Snp_04 0 1 2 2 2
Snp_05 1 1 2 1 1
Snp_06 2 2 2 1 1
Snp_07 1 1 2 2 0
Snp_08 1 0 1 0 1
Snp_09 2 1 2 2 0

I want to convert it to the following format:

id Snp_01 Snp_02 Snp_03 Snp_04 Snp_05 Snp_06 Snp_07 Snp_08 Snp_09
Sam_01 2 0 1 0 1 2 1 1 2 
Sam_02 0 1 0 1 1 2 1 0 1 
Sam_03 2 1 1 2 2 2 2 1 2 
Sam_04 0 2 0 2 1 1 2 0 2 
Sam_05 2 2 1 2 1 1 0 1 0

We can easily do this in R (e.g.. t(df)), but actually there are also a couple available tools in linux. Here are two I used:

1. rowsToCols from Jim Kent's utility
cat SNP.txt | rowsToCols stdin stdout

2. datamash from GNU
cat SNP.txt | datamash transpose

btw, datamash is really a neat command with many functions, like your swiss-knife for small daily tasks for data scientist. Here is its example page on GNU:

Wednesday, September 03, 2014

a bigWigSummary bug

Write down a bigwigsummary bug I found today. It's found when I attempted to get the max value in a region using -type=max and dataPoints=1:

$ bigWigSummary -type=max chr12 54070173 54072173 1
$ bigWigSummary -type=max chr12 54070173 54072173 10
0.944904 1.02475 0.568405 0.741671 1.43119 1.08896 0.705965 0.542034 0.380971 0.591934

As you see, if I use dataPoints=1, the max value is 13.3672 and when I use dataPoints=10 the max value is 1.43119. So there must be something wrong, since the max value should not change no matter how many data points we check. Visualizing the bigwig in UCSC Genome Browser shows that 1.43119 is correct for this case. Interestingly, 13.3672 is indeed a peak summit, but not for this region, rather a region upstream. I don’t know why bigWigSummary take the summit from region outside. This only happened when dataPoints=1. 

I put the data below. You can download to test:

People reported similar error for bigwigSummary, for example

bigWigSummary outputs different values as bigWigAverageOverBed: UCSC team explained "The reason for this is that the summary levels have some rounding error and some border conditions when extracting data over relatively small regions." They suggested to use bigWigAverageOverBed if you want the highest level of accuracy. But bigWigAverageOverBed won't output the max and also it's nothing with high or low accuracy, but rather a bug. 

Tao Liu also reported another bug for bigwig when it's converted from wig in compressed manner (by default), and suggested to fix it by using -unc when converting wig to bigwig. I've tried to use -unc when converting from bedGraph to bigwig, the bug is still there. 

Still looking for workout, and also report to UCSC:!topic/genome/pWjcov-xQyQ

Update: One workout I found is, to use intersectBed and groupBy in bedtools on bedGraph file (rather than bigwig). Here is pseudocode:

intersectBed -a regions.bed -b signal.bedGraph -wo -sorted | groupBy -g 1,2,3 -c 7 -o max > regions.maxSignal.bed

Update2:  use the -minMax option from the latest version of bigWigAverageOverBed (from v304). See reply from UCSC group:
Thank you for contacting us. One of our engineers was able to reproduce this and says the bigWigSummary program uses the bigWig summary levels, which are lower resolution versions of the data, so when you ask for a single value in a particular range, the range that is used may include bases that are before and after the range.
If you want only the values exactly within the range, you can use (the latest version from v304) bigWigAverageOverBed like so:
echo "chr12 54070173 54072173 one" | bigWigAverageOverBed stdin stdout -minMax | cut -f 8

Wednesday, July 30, 2014

note on web server

Note: Below tips are based on CentOS (member of RedHat family, incl. Fedora). For other systems, like Debian/Ubuntu/Mint/SUSE/Mac OS, this may not work.

1. How to add new users?

$ sudo vi /etc/group 
and find line with 
and add new username behind. 

* for an independent system (unlike ours which is a virtual server attached to the main cluster), you may want to use useradd <username> to add new users. See detail here.

2. How to Enable Userdir (Public_html)?

$ sudo vi /etc/httpd/conf/httpd.conf 
and find the following part and 
<IfModule mod_userdir.c>
    #UserDir disabled
    UserDir enabled xxx yyy zzz
    UserDir public_html

then restart apache by
$ sudo /etc/rc.d/init.d/httpd restart

Now you should be able to see the file content:

If you got a 403 error, you should change the folder/file permission to 0755 and 0644, respectively. 

3. How to change permission to a folder and all of its subfolders and files?

To set the directories to 755 and set the files to 644, you can use the find command. For example:
To change all the directories to 755 (-rwxr-xr-x):
find ~/public_html/myfolder -type d -exec chmod 755 {} \;
To change all the files to 644 (-rw-r--r--):
find ~/public_html/myfolder -type f -exec chmod 644 {} \;

4. How to set password for Userdir (Public_html)?

Change to and/or specify directory to protect in /etc/httpd/conf/httpd.conf:
<Directory /home/xxx/public_html>
     AllowOverride All
<Directory /home/xxx/public_html>
     AllowOverride AuthConfig
Create a file /home/xxx/public_html/.htaccess in that director that looks something like this:
     AuthName "Add your login message here."
     AuthType Basic
     AuthUserFile /home/xxx/public_html/.htpasswd
     AuthGroupFile /dev/null
     require user name-of-user
In this case the "name-of-user" is the login name you wish to use for accessing the web site.

Create (or clobber if it already exists) the password file /home/xxx/public_html/.htpasswd using the program htpasswd:
htpasswd -c .htpasswd name-of-user
Add a new user to the existing password file:
htpasswd .htpasswd name-of-user
Password file protection, ownership attributes:
File privileges: chmod ug+rw .htpasswd
File ownership: chown apache.apache .htpasswd

More method on encrypting a site can be found here:

Unix vs. Linux

I'd love to share a nice article about the difference between Linux and Unix (Believe me, not everyone knows the difference):

From there (also from Wiki), I learned that
  • Linux is just a kernel while Unix is a complete operating system. 
  • Unix was originally developed by Dennis Ritchie (also the creator of the C programming language) and Ken Thompson from Bell lab around 1970s, while the Linux kernel was written by a Finnish CS student Linus Torvalds in 1991. 
  • A typical Linux distribution (or GNU/Linux as Free Software Foundation calls) comprises a Linux kernel, GNU tools and libraries, additional software, documentation, a window system, window manager, and a desktop environment. 
  • There are 600+ Linux distributions (or GNU/Linux), with 300+ are in active development. 
  • Some popular mainstream Linux distributions include Debian, Ubuntu, Linux Mint, Fedora, openSUSE, Arch Linux, and the commercial Red Hat Enterprise Linux and SUSE Linux Enterprise Server. Their full relationship/timeline can be referred here:
  • Android is also Linux-based, but does not include a command-line interface and programs made for typical Linux distributions.
  • Mac OS is not linux-based. Its version 10, Max OS X is actually a Unix operating system. 
  • Other popular Unix systems include: HP-UX, IBM AIX, Sun Solairs, IRIX. They are based on different kernels. See details here:
  • iOS is also based on Mac OS X, therefore it's also a Unix OS.

Tuesday, July 15, 2014

best note on reshape2

I found this elegant note about reshape2 from Sean Anderson's blog:


reshape2 is based around two key functions: melt and cast:
melt takes wide-format data and melts it into long-format data.
cast takes long-format data and casts it into wide-format data.

For example, this is wide format:

> head(fpkm)
ID                  FPKM.SRR1069188 FPKM.SRR1070986 FPKM.SRR1071289
ENSG00000240361.1      1.00000000        1.000000        1.000000
ENSG00000186092.4      1.00000000        1.000000        1.000000
ENSG00000237613.2      1.00000000        1.000000        1.000000
ENSG00000239906.1      0.05888838        5.139312        5.055983
ENSG00000241860.1      1.20237363        1.160175        1.085992
ENSG00000222623.1      1.00000000        1.000000        1.000000

Using melt to change it into long format:

No id variables; using all as measure variables
         variable      value
1 FPKM.SRR1069188 1.00000000
2 FPKM.SRR1069188 1.00000000
3 FPKM.SRR1069188 1.00000000
4 FPKM.SRR1069188 0.05888838
5 FPKM.SRR1069188 1.20237363
6 FPKM.SRR1069188 1.00000000

or, you can set the column name by

> head(melt(fpkm, = "Sample", ="FPKM"))
No id variables; using all as measure variables
           Sample       FPKM
1 FPKM.SRR1069188 1.00000000
2 FPKM.SRR1069188 1.00000000
3 FPKM.SRR1069188 1.00000000
4 FPKM.SRR1069188 0.05888838
5 FPKM.SRR1069188 1.20237363
6 FPKM.SRR1069188 1.00000000

if you want, you can also keep some of columns as ID in the long format, for example, I want to keep the gene ID in the long format:

>head(melt(fpkm, = "Sample", ="FPKM", id="ID"))
                 ID          Sample       FPKM
1 ENSG00000240361.1 FPKM.SRR1069188 1.00000000
2 ENSG00000186092.4 FPKM.SRR1069188 1.00000000
3 ENSG00000237613.2 FPKM.SRR1069188 1.00000000
4 ENSG00000239906.1 FPKM.SRR1069188 0.05888838
5 ENSG00000241860.1 FPKM.SRR1069188 1.20237363
6 ENSG00000222623.1 FPKM.SRR1069188 1.00000000

I will do the long-->wide example when I have a good case to show... :)

Thursday, May 29, 2014

samtools flagstat won't output the number of reads mapped

I have been using samtools flagstat to get statistics summary of my BAM/SAM file, like many tutorial suggested (e.g. Dave Tang's note).

But then I noticed that flagstat reports the number of mapping (or alignments/hits, whatever you like to call), but not the number of reads mapped. This can be very different if your alignment contains multiple mappers (and this is especially true if you use default setting or without setting "-g 1" in Tophat). To get the number of reads mapped (e.g. in order to get mappability), you can use the following command:

samtools view -cF 0x100 accepted_hits.bam

Note: Even with multiple mappers, only one hit has flag of 0x100 in the output SAM file. I've already explained this in one of my previous posts.

Tuesday, April 15, 2014

Trimmed mean and median in AWK

This can be easily done in R, but sometime you want to get it in scripting language like awk or perl in order to process the big data line by line.

Here is the code snip:

# for median
function median(v) 
    if (c % 2) return j[(c+1)/2]; 
    else return (j[c/2+1]+j[c/2])/2; 

# for trimmed mean (where p is the percentage of data to be trimmed)
function trimmedMean(v, p) 
    for(i=a+1;i<=(c-a);i++) s+=j[i];
    return s/(c-2*a); 
To use it, for example if we want to generate a merged bigwig track for a list of samples, we can take median value of all samples at each genomic position, here is it:

unionBedGraphs -i `ls *normalized.bedGraph` | awk 'function median(v) {c=asort(v,j); if (c % 2) return j[(c+1)/2]; else return (j[c/2+1]+j[c/2])/2.0; } {OFS="\t"; n=1; for(i=4;i<=NF;i++) S[n++]=$i; print $1,$2,$3, median(S)}' > AllSamples.median.normalized.bedGraph
bedGraphToBigWig AllSamples.median.normalized.bedGraph ChromInfo.txt

Tuesday, March 11, 2014

Confusing dbSNP ID

There have been enough confusion in ID or naming system in the popular biological databases. I mentioned one such example in an earlier post, where two genes (on totally different chromosomes) use the same symbol and the same gene has different Ensembl Gene ID. Other confusions include the Entrez ID vs. gene symbol vs. Ensembl ID. People who do GO analysis using tools like DAVID must have painful experience on this (btw, I don't understand why most GO analysis tools convert to Entrez ID at the end).

And now, I got another example of confusing ID system in biology or bioinformatics.

You expect each SNP should have their unique ID in the dbSNP database, right? But they are not. Look at this example:

chr1    13837   13838   -      rs7164031, rs79531918
chr1    13837   13838   +     rs200683566, rs28391190, rs28428499, rs71252448, rs79817774

At the same location, multiple SNPs ID are assigned, for both strands. 

Here is what I got from NCBI User service for the explanation:

This is expected for many reasons:
- short probes could have multiple mapping locations on the genome
- certain genes could have duplicate/pseudogene/paralogs
- variations found in repeat regions would be difficult to map to a unique 

An rsID represents a cluster of reported variations submitted to dbSNP.
In ideal situation, they can be mapped to a unique location in the genome. 
There are cases where such unique mapping is NOT attainable.

But I am still not clear of how a cluster of variants are defined. [UPDATE to add]

p.s. code snip to merge SNPs with the same location into single line:
grep single snp137.bed | sort -k1,1 -k2,2n -k6,6 | bedtools groupby -g 1,2,3,6 -c 4 -o collapse

Friday, January 31, 2014

Access Google Spreadsheet directly in bash and in R

Google Doc is a good way to share/manage documents between you and your colleagues, but sometime you want to directly access the data in terminal (e.g. bash) or in program (e.g. R), without downloading the data first.

For example, I have a Google Spreadsheet here:

I want to open it in R or select some of columns in bash. Here are the tips for that:

Step1: publish the tab you want to access to the web [howto], in format of CVS or TXT (which is a tab-delimited file actually)

Step2: copy the published URL. At the end, you will see "&output=txt" part for output format, and "&gid=0" to indicate which tab to access (if you have multiple tab, specify tab number here, which starts from 0).

Step3: To access the file externally. Done.

For example, to access the 1st and 3rd columns in bash:

wget --no-check-certificate -q -O - '' | cut -f1,3

In R,

covarianceTable=read.table(textConnection(getURL(covarianceTableURL)), header=T)

Monday, January 27, 2014

ssh to a server without asking password

I found this is the super easy way:

Step 1: Create public and private keys using ssh-key-gen on local-host
Step 2: Copy the public key to remote-host using ssh-copy-id
ssh-copy-id user@hostname.example.comor if you want to specify the public key file
ssh-copy-id -i ~/.ssh/

If you want to make the login even quicker, you can set an alias in the .bashrc, like
alias host="ssh"
Next time, you just need to type host, which will bring you directly to the host. Easy?

btw, if ssh-copy-id is not found for your system (e.g. default case in Mac OS), you can easily install it by using:
brew install ssh-copy-id