Tuesday, April 15, 2014

Trimmed mean and median in AWK

This can be easily done in R, but sometime you want to get it in scripting language like awk or perl in order to process the big data line by line.

Here is the code snip:

# for median
function median(v) 
    if (c % 2) return j[(c+1)/2]; 
    else return (j[c/2+1]+j[c/2])/2; 

# for trimmed mean (where p is the percentage of data to be trimmed)
function trimmedMean(v, p) 
    for(i=a+1;i<=(c-a);i++) s+=j[i];
    return s/(c-2*a); 
To use it, for example if we want to generate a merged bigwig track for a list of samples, we can take median value of all samples at each genomic position, here is it:

unionBedGraphs -i `ls *normalized.bedGraph` | awk 'function median(v) {c=asort(v,j); if (c % 2) return j[(c+1)/2]; else return (j[c/2+1]+j[c/2])/2.0; } {OFS="\t"; n=1; for(i=4;i<=NF;i++) S[n++]=$i; print $1,$2,$3, median(S)}' > AllSamples.median.normalized.bedGraph
bedGraphToBigWig AllSamples.median.normalized.bedGraph ChromInfo.txt

Tuesday, March 11, 2014

Confusing dbSNP ID

There have been enough confusion in ID or naming system in the popular biological databases. I mentioned one such example in an earlier post, where two genes (on totally different chromosomes) use the same symbol and the same gene has different Ensembl Gene ID. Other confusions include the Entrez ID vs. gene symbol vs. Ensembl ID. People who do GO analysis using tools like DAVID must have painful experience on this (btw, I don't understand why most GO analysis tools convert to Entrez ID at the end).

And now, I got another example of confusing ID system in biology or bioinformatics.

You expect each SNP should have their unique ID in the dbSNP database, right? But they are not. Look at this example:

chr1    13837   13838   -      rs7164031, rs79531918
chr1    13837   13838   +     rs200683566, rs28391190, rs28428499, rs71252448, rs79817774

At the same location, multiple SNPs ID are assigned, for both strands. 

Here is what I got from NCBI User service for the explanation:

This is expected for many reasons:
- short probes could have multiple mapping locations on the genome
- certain genes could have duplicate/pseudogene/paralogs
- variations found in repeat regions would be difficult to map to a unique 

An rsID represents a cluster of reported variations submitted to dbSNP.
In ideal situation, they can be mapped to a unique location in the genome. 
There are cases where such unique mapping is NOT attainable.

But I am still not clear of how a cluster of variants are defined. [UPDATE to add]

p.s. code snip to merge SNPs with the same location into single line:
grep single snp137.bed | sort -k1,1 -k2,2n -k6,6 | bedtools groupby -g 1,2,3,6 -c 4 -o collapse

Friday, January 31, 2014

Access Google Spreadsheet directly in bash and in R

Google Doc is a good way to share/manage documents between you and your colleagues, but sometime you want to directly access the data in terminal (e.g. bash) or in program (e.g. R), without downloading the data first.

For example, I have a Google Spreadsheet here:

I want to open it in R or select some of columns in bash. Here are the tips for that:

Step1: publish the tab you want to access to the web [howto], in format of CVS or TXT (which is a tab-delimited file actually)

Step2: copy the published URL. At the end, you will see "&output=txt" part for output format, and "&gid=0" to indicate which tab to access (if you have multiple tab, specify tab number here, which starts from 0).

Step3: To access the file externally. Done.

For example, to access the 1st and 3rd columns in bash:

wget --no-check-certificate -q -O - '' | cut -f1,3

In R,

covarianceTable=read.table(textConnection(getURL(covarianceTableURL)), header=T)

Monday, January 27, 2014

ssh to a server without asking password

I found this is the super easy way:

Step 1: Create public and private keys using ssh-key-gen on local-host
Step 2: Copy the public key to remote-host using ssh-copy-id
ssh-copy-id user@hostname.example.comor if you want to specify the public key file
ssh-copy-id -i ~/.ssh/

If you want to make the login even quicker, you can set an alias in the .bashrc, like
alias host="ssh"
Next time, you just need to type host, which will bring you directly to the host. Easy?

btw, if ssh-copy-id is not found for your system (e.g. default case in Mac OS), you can easily install it by using:
brew install ssh-copy-id

Install GNU in Mac OS

If you don't know what's GNU, check here ( and here (

Why Mac OS does not come with GNU? Here is what I extracted from Hong Xu's comment:
Because OS X is mainly BSD based -- the same reason why FreeBSD/OpenBSD/NetBSD does not use GNU tools by default. Another reason that Apple bundles many outdated GNU software (bash, gdb, etc.) is that the new GPLv3 doesn't allow Apple to do so, while GPLv2 is fine with this behavior. After many GNU projects upgraded to GPL v3, Apple won't be able to bundle them any more.
I found it's very easy to install GNU in Max OS just following Hong Xu's blog:

Also recommend the friendly tool of Homebrew:

Basically, here is step:

ruby -e "$(curl -fsSL"

brew install coreutils  # contain most of what you want
brew install binutils 
brew install diffutils 
brew install ed --default-names 
brew install findutils --default-names 
brew install gawk 
brew install gnu-indent --default-names 
brew install gnu-sed --default-names 
brew install gnu-tar --default-names 
brew install gnu-which --default-names 
brew install gnutls --default-names 
brew install grep --default-names 
brew install gzip 
brew install screen 
brew install watch 
brew install wdiff --with-gettext 
brew install wget

Friday, January 24, 2014

about UCSC Genome Browser track

Two points:

1. The custom track data may be compressed by any of the following programs: gzip (.gz), compress (.Z), or bzip2 (.bz2). But not for bigwig and bam.

2. In a track hub Db configuration file, up to 9 subgroup types can be defined for a composite, such as:

subGroup1 <gTag1> <gTitle1> <mTag1a=mTitle1a> [mTag1b=mTitle1b…]
subGroup2 <gTag2> <gTitle2> <mTag2a=mTitle2a> [mTag2b= mTitle2b…]
subGroup9 <gTag9> <gTitle9> <mTag9a=mTitle9a> [mTag9b= mTitle9b…]

But these is no such limitation (I guess so, not test yet) for the tag/title pairs in each subGroup. For example, ENCODE data trackDb put all TFs in one subGroup:

One question: How to share tracks but secure the data files in the track hub?

The current directory hierarchy for a hub is like:
myHub/ - directory containing track hub files

     hub.txt -  a short description of hub properties
     genomes.txt - list of genome assemblies included in the hub data
     hg19/ - directory of data for the hg19 (GRCh37) human assembly
          trackDb.txt - display properties for tracks in this directory
          dnase.html - description text for a DNase track 
          dnaseLiver.bigWig - wiggle plot of DNase in liver
          dnaseLiver.bigBed - regions of active DNase
          dnaseLung.bigWig - wiggle plot of DNase in lung
          dnaseLung.bigWig - regions of active DNase
          rnaSeq.html - description text for an RNAseq track
          rnaSeqLiver.bigWig - wiggle plot of RNAseq data in liver
          rnaSeqLiver.bigBed - intron/exon lists for liver
          rnaSeqLung.bigWig - wiggle plot of RNAseq data in lung
          rnaSeqLung.bigBed - intron/exon lists for lung
     hg18/ - directory of data for the hg18 (Build 36) human assembly
          trackDb.txt - display properties for tracks in this directory
          dnase.html - description text for a DNase track 
          dnaseLiver.bigWig - wiggle plot of DNase data in liver
          dnaseLiver.bigBed - regions of active DNase
          dnaseLung.bigWig - wiggle plot of DNase data in lung
          dnaseLung.bigWig - regions of active DNase
          rnaSeq.html - description text for an RNAseq track
          rnaSeqLiver.bigWig - wiggle plot of RNAseq data in liver
          rnaSeqLiver.bigBed - intron/exon lists for liver
          rnaSeqLung.bigWig - wiggle plot of RNAseq data in lung
          rnaSeqLung.bigBed - intron/exon lists for lung

The UCSC webpage also indicates that "unlisted hubs are in no way secure." But this is definitely a unsolved problem. Maybe the only solution is to set up your own local mirror?

Wednesday, January 15, 2014

How to see source code of a function/method in R?

If it's an internal function of R (e.g. from base package), just type the function name, like
> rowMeans
function (x, na.rm = FALSE, dims = 1L) 
    if ( 
        x <- as.matrix(x)
    if (!is.array(x) || length(dn <- dim(x)) < 2L) 
        stop("'x' must be an array of at least two dimensions")
    if (dims < 1L || dims > length(dn) - 1L) 
        stop("invalid 'dims'")
    p <- prod(dn[-(1L:dims)])
    dn <- dn[1L:dims]
    z <- if (is.complex(x)) 
        .Internal(rowMeans(Re(x), prod(dn), p, na.rm)) + (0+1i) * 
            .Internal(rowMeans(Im(x), prod(dn), p, na.rm))
    else .Internal(rowMeans(x, prod(dn), p, na.rm))
    if (length(dn) > 1L) {
        dim(z) <- dn
        dimnames(z) <- dimnames(x)[1L:dims]
    else names(z) <- dimnames(x)[[1L]]
<bytecode: 0x111390028>

<environment: namespace:base>

If it's a S4 function in a package, e.g. to see the code of plotMA in the DESeq2 package, first find the object it belongs to using showMethods(), like
> showMethods('plotMA')
Function: plotMA (package BiocGenerics)
    (inherited from: object="ANY")

then to get source code by getMethod()
> getMethod("plotMA","DESeqDataSet")
Method Definition:

function (object, ...) 
    .local <- function (object, lfcColname, pvalues, pvalCutoff = 0.1, 
        ylim, linecol = "#ff000080", pointcol = c("black", "red"), 
        xlab, ylab, log = "x", cex = 0.45, ...) 
        if (missing(xlab)) 
            xlab <- "mean of normalized counts"
        if (missing(ylab)) 
            ylab <- expression(log[2] ~ fold ~ change)
        if (!missing(pvalues)) {
            if (length(pvalues) != nrow(object)) {
                stop("length of pvalues should be equal to the number of rows of object")
        stopifnot(length(pointcol) == 2)
        if (!"results" %in% mcols(mcols(object))$type) {
            stop("first run DESeq() in order to produce an MA-plot")
        if (missing(lfcColname)) {
            lfcColname <- lastCoefName(object)
        if (length(lfcColname) != 1 | !is.character(lfcColname)) {
            stop("the argument 'lfcColname' should be a character vector of length 1")
        if (missing(pvalues)) {
            res <- results(object, name = lfcColname)
            pvalues <- res$padj
        x <- mcols(object)
        stopifnot(length(cex) == 1)
        col <- ifelse( | pvalues > pvalCutoff, 
            pointcol[1], pointcol[2])
        col = col[x$baseMean > 0]
        x = x[x$baseMean > 0, ]
        py = x[, lfcColname]
        if (missing(ylim)) 
            ylim = c(-1, 1) * quantile(abs(py[is.finite(py)]), 
                probs = 0.99) * 1.1
        plot(x$baseMean, pmax(ylim[1], pmin(ylim[2], py)), log = log, 
            pch = ifelse(py < ylim[1], 6, ifelse(py > ylim[2], 
                2, 20)), cex = cex, col = col, xlab = xlab, ylab = ylab, 
            ylim = ylim, ...)
        abline(h = 0, lwd = 4, col = linecol)
    .local(object, ...)
<environment: namespace:DESeq2>

target  "DESeqDataSet"
defined "DESeqDataSet"


Thursday, January 09, 2014

Is it an elephant or is it a shark?

I still remember when David first present elephant shark in a group meeting, I stopped him and asked naively: Excuse me, is it an elephant or is it a shark? :)

Of course, it's not an elephant. And now I know it's neither a shark.

Here is the today's Nature paper for the 'living fossil':

It says
"Our functional studies suggest that the lack of genes encoding secreted calcium-binding phosphoproteins in cartilaginous fishes explains the absence of bone in their endoskeleton. Furthermore, the adaptive immune system of cartilaginous fishes is unusual: it lacks the canonical CD4 co-receptor and most transcription factors, cytokines and cytokine receptors related to the CD4 lineage, despite the presence of polymorphic major histocompatibility complex classII molecules. It thus presents a new model for understanding the origin of adaptive immunity."

Here is the taxa from the paper:

And I could just find the one I made in my PhD thesis :)

Monday, December 09, 2013

process substitution

syntax error near unexpected token `('

I got the error message when I used command like:

program <(zcat myfile1.gz)<(zcat myfile2.gz)

in a bash script. It works well in my terminal, but when I submit the script to cluster and it always return the error message.

Finally I got the key from Mitch's blog:

It's called process substitution for piping the stdout of multiple commands. Process substitution is not a POSIX compliant feature and so it may have to be enabled via:

set +o posix

After I set the POSIX, it works well! It seems that POSIX is not enabled on the node where the job was submitted.

Friday, December 06, 2013

R / Bioconductor for High-Throughput Sequence Analysis

I would like to recommend a recent workshop material on R/Bioconductor from Marc Carlson et al.

PDF: IntermediateSequenceAnalysis2013.pdf
R script: IntermediateSequenceAnalysis2013.R

Here is the TOC list if you want to read more (highlight for suggested reading list):

Tuesday, December 03, 2013

Illumina HiSeq2000 adaptor and sequencing

I've referred the following material when making the figure:

Just one learning note:

If insert is not long enough (i.e. shorter than the read length), R1 will have contamination from Rd2 SP (e.g. AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC) and R2 will have contamination from reverse complementary of Rd1 SP (e.g. AGATCGGAAGAGCG)

So, basically you just need to check if the shared complementary part of Rd1 and Rd2 SP (which is AGATCGGAAGAGC) occurs in the reads. If yes, simply trim it and its following part (if any).

Note: if you don't understand the "shared complementary part", please refer my previous blog on Illumina adaptor. Here is the link:

Here is one solution of howto remove adaptor contamination:
1. save the complementary part into a fasta file, e.g. adaptor.fa
2. run fastq-mcf to remove adaptor
fastq-mcf -o filted -x 10 -l 15 -w 4 -u adaptor.fa input.fq.gz