## Sunday, August 16, 2015

### Using ANOVA to get correlation between categorical and continuous variables

How to calculate the correlation between categorical variables and continuous variables?

This is the question I was facing when attempting to check the correlation of PEER inferred factors vs. known covariates (e.g. batch).

One solution I found is, I can use ANOVA to calculate the R-square between categorical input and continuous output.

Here is my R code snip:

## correlation of inferred factors vs. known factors
# name PEER factors
colnames(factors)=paste0("PEER_top_factor_",1:bestK)
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# re-generate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))

library("plyr")
# ref: http://stackoverflow.com/a/11421267
xvars=covs2; yvars=as.data.frame(factors);
r2 <- laply(xvars, function(x) {
laply(yvars, function(y) {
summary.lm(aov(y~x))\$r.squared
})
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
laply(yvars, function(y) {
anova(lm(y~x))\$`Pr(>F)`[1]
})
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

require(pheatmap);
pheatmap(-log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")

I highlighted the core part in yellow color. As it shows, we can use aov() function in R to run ANOVA. Its result can be summarized with summary.lm() function, which show output like:

```> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
Min      1Q  Median      3Q     Max
-1.0710 -0.4180 -0.0060  0.2627  1.3690

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591```

R^2 and p-value are shown at the end of output.

Note: the summary.lm() object doesn't contain value of p-value directly. But we can compute p-value in command like:

`> F=summary.lm(results)\$fstatistic`
`> F=as.numeric(F)`
`> pf(F[1],F[2],F[3])`

Below table is a nice summary the methods applicable to corresponding data type.

 PREDICTOR VARIABLE (S) OUTCOME VARIABLE Categorical Continuous Categorical Chi Square, Log linear, Logistic t-test, ANOVA (Analysis of Varirance), Linear regression Continuous Logistic regression Linear regression,  Pearson correlation Mixture of Categorical and Continuous Logistic regression Linear regression, Analysis of Covariance
Ref:http://www.tulane.edu/~panda2/Analysis2/sidebar/stats.htm

Next thing I need to refresh my mind is how different in calculating the correlation using cor() and the above ANOVA method above.

I know the correlation coefficient r can be inferred from the coefficient and sd of two variables. For example, we know sd(x) and sd(y), then when regressing y~x, we got regression line e.g. y=b0 + b1x. Then we can calculate r as

r = b1 * SDx / SDy

When x and y are in standard normal distribution, e.g. u=0, sd=1, then r=b1.
https://people.richland.edu/james/ictcm/2004/weight.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/oneway.html

## Wednesday, August 12, 2015

### use getline to capture system command output in awk

I just learnt this today: awk also has its pipe (|) and getline, just like unix. If I want to call a system command in awk and capture its output (Note: system() won't work as it only return the exit status), I can use pipe the output to getline.

For example,

\$cat > test.txt
aa bb cc
11 22 33
44 55 cc

\$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd | getline a; print a}'
55

or

\$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; system(cmd);}'
55

Note that if only use cmd, it won't print out anything, because cmd itself won't switch to console (unlike system).

\$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd;}'

If you want to output the multiple lines and process them in awk, you can do

\$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | cut -f2 -d\" \""; while( (cmd | getline a) >0) print a;}'
55
bb

Reference: http://stackoverflow.com/questions/1960895/awk-assigning-system-commands-output-to-variable

## Wednesday, August 05, 2015

### Dopamine, L-DOPA, catecholamines, TH, Nuur1 etc.

catecholamines (儿茶酚胺) is the collection name for three neurotransmitters: epinephrine (adrenaline), norepinephrine (noradrenaline) and dopamine.

L-DOPA is its precursor. Its corresponding drug is called levodopa.

L-DOPA is produced from the amino acid L-tyrosine by the enzyme tyrosine hydroxylase (TH).

Tyrosine cannot pass the BBB (blood brain barrier), but L-DOPA can.

L-DOPA can be converted to dopamine, which can be further converted to epinephrine (adrenaline), norepinephrine (noradrenaline).

Dopaminergic neurons can be divided into 11 cell groups according to where they located, using histochemical fluorescence method. It includes A8-A16, Aaq and Telencephalic group. Among that, A9 is where the substantia nigra pars compacta (SNpc) refers.

Number of TH-positive neurons in SN declines with age, while a-synuclein level increases with age. This inverse relationship is also seen in the surviving DA neurons of PD patients.

And it shows that the down-regulation of TH is linked to a reduction of expression of transcription factor called Nurr1 (orphan nuclear receptor, encoded by gene NR4A2).

Nurr1 is shown to regulate a category of nuclear-encoded mitochondrial genes. (http://www.ncbi.nlm.nih.gov/pubmed/23341612)

## Tuesday, July 28, 2015

Lappalainen et al. Nature 2013 : Transcriptome and genome sequencing uncovers functional variation in humans, http://dx.doi.org/10.1038/nature12531

‘t Hoen et al. Nature Biotechnology 2013: Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories, http://dx.doi.org/10.1038/nbt.2702

1. how to detect sample outlier?
a. before alignment: distance of k-mer profile
b. after alignment: Spearman rank correlation between samples --> D-statistics (i.e.  the median correlation of one sample against all the other samples)
c. gender mismatch: XIST vs. chrY
d. ASE bias rate among heterozygous sites

2. eQTL
a. exon/gene quantification
b. filter out lowly expressed ones (e.g. 0 in >50% samples)
c. for each group, normalize with PEER, adding mean
c1. use subset (??e.g. chr20, or chr20- 22??) using K=????0,1,,3,,57,10,13,15,20 for each dataset
c2. run eQTL and number of genes (eGenes) for each K.
c3. get the optimal K = K(with most number of eQTL genes)
c4. run PEER on 20,000 exons to get covairantes for the final normalization
c5. final PEER normalization using all dataset, residual + mean as final quantification
d. transform the final quantification to standard normal distribution (by ?)
e. eQTL using Matrix-eQTL: linear regression of quantification ~ genotypes + genotype_covariates
3. Differential expression analysis
a. TMM normalization (from edgeR)
b. filter: genes with more than 5 counts per million in at least 1 sample were analyzed in pairwise population comparisons
c. tweeDEseq (good for large samples), significance: FDR < 0.05 and log2 fold change greater than 2

## Wednesday, June 24, 2015

### median filter in AWK

Here is what median filter does from wikipedia:
To demonstrate, using a window size of three with one entry immediately preceding and following each entry, a median filter will be applied to the following simple 1D signal:
x = [2 80 6 3]
So, the median filtered output signal y will be:
y[1] = Median[2 2 80] = 2
y[2] = Median[2 80 6] = Median[2 6 80] = 6
y[3] = Median[80 6 3] = Median[3 6 80] = 6
y[4] = Median[6 3 3] = Median[3 3 6] = 3
i.e. y = [2 6 6 3].
Note that, in the example above, because there is no entry preceding the first value, the first value is repeated, as with the last value, to obtain enough entries to fill the window. This is one way of handling missing window entries at the boundaries of the signal.
Here is my awk code to implement this:
#!/bin/awk -f
# awk script to filter noise by sliding a window and taking mean or median per window
# Authos: Xianjun Dong
# Date: 2015-06-23
# Usage: _filter.awk values.txt
# bigWigSummary input.bigwig chr10 101130293 101131543 104 | _filter.awk -vW=5 -vtype=median
BEGIN{
if(W=="") W=5;
if(type=="") type="median";
half=(W-1)/2;

{
for(i=1;i<=NF;i++) {
array[half+1]=\$i
for(k=1;k<=half;k++){
array[half+1-k]=(i<=k)?\$1:\$(i-k);
array[half+1+k]=((i+k)>NF)?\$NF:\$(i+k);
}
if(type=="median") {asort(array); x=array[half+1];}
if(type=="mean") {x=0; for(j=1;j<=W;j++) x+=array[j]; x=x/W;}
printf("%s\t", x);
}
print "";
}

## Friday, May 22, 2015

### Be cautious of using grep --colour=always

I noticed that grep --colour=always can embed additional code (for coloring, for example, ^[[01;31m^[[K    ^[[m^[[K) in your text, which could lead downstream confusion. Here is one such example:
https://stackoverflow.com/questions/18315380/invisible-0131-number-in-bash

\$ echo 1 2 3 | grep 2
1 2 3
\$ echo 1 2 3 | grep 2 | cat -v
1 2 3
\$ echo 1 2 3 | grep --colour=always 2
2 3
\$ echo 1 2 3 | grep --colour=always 2 | cat -v
1 ^[[01;31m^[[K2^[[m^[[K 3

See my answer for more detail.

The solution is to use `--colour=auto`or not use the `--colour` option at all.

\$ echo 1 2 3 | grep --colour=auto 2
2 3
\$ echo 1 2 3 | grep --colour=auto 2 | cat -v
1 2 3

Check your .bashrc to see if you set alias grep='grep --colour=always' like I did.

## Thursday, May 21, 2015

### Which version should I install? 32 bit or 64 bit? Ubuntu or CentOS?

I was posting two relevant articles on this topic:

Unix vs. Linux
(http://onetipperday.blogspot.com/2014/07/unix-vs-linux.html)
Which version of tools I should install? i386 vs. x86_64, and 32bit vs. 64bit kernel (http://onetipperday.blogspot.com/2013/02/software-vs-hardware-i386-vs-x8664-and.html)

However, this can still be confusing sometimes... for example, when you visit the SRA toolkit download page: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

OK. Here is the short answer.

1. How to tell my CPU architecture is 32 bit or 64 bit?

\$ uname -aIf the output is i386 or i586, it's 32 bit. If it's x86_64, it's 64 bit.

Ref: http://www.cyberciti.biz/faq/linux-how-to-find-if-processor-is-64-bit-or-not/

2. How to tell my OS (operation system) kernel is 32 bit or 64 bit?

Note that the OS kernel can be different from the above hardware architecture. Usually the OS changes version following the update of hardware, but the 32bit version can work in a 64bit machine (not the other way around), so it might not be consistent. The Macbook Pro I am using is running a 32-bit kernel on a 64-bit processor (see my previous post).

3. How to tell which OS I am using? CentOS or Ubuntu?

In my case, I used:
\$ ls -d /etc/[A-Za-z]*[_-][rv]e[lr]* | grep -v "lsb" | cut -d'/' -f3 | cut -d'-' -f1 | cut -d'_' -f1

it returns redhat, which is part of CentOS now.

## Friday, May 15, 2015

### How to correctly set color in the image() function?

Sometimes we want to make our own heatmap using image() function. I recently found it's tricky to set the color option there, as its manual has very little information on col:

I posted my question on BioStars. The short answer is: Unless the breaks is set, the range of Z is evenly cut into N intervals (where N = the length of color) and values in Z are assigned to the color of corresponding interval.

For example, when x=c(3,1,2,1) and col=c("blue","red",'green','yellow'), the minimal of x is assigned as the first color, and max to the last color. Any value between is calculated proportionally to a color. In this case, 2 is the the middle one, according to the principal that intervals are closed on the right and open on the left, it's assigned to "red". So, that's why we see the colors are yellow-->blue-->red-->blue.

```x=c(3,1,2,1)

image(1,1:length(x), matrix(x, nrow=1, ncol=length(x)), col=c("blue","red",'green','yellow'))```

In practice, unless we want to manually define the color break points, we just set the first and last color, it will automatically find colors for the values in Z.

collist<-c(0,1)
image(1:ncol(x),1:nrow(x), as.matrix(t(x)), col=collist, asp=1)

If we want to manually define the color break points, we need to

x=matrix(rnorm(100),nrow=10)*100
xmin=0; xmax=100;
x[x<xmin]=xmin; x[x>xmax]=xmax;
collist<-c("#053061","#2166AC","#4393C3","#92C5DE","#D1E5F0","#F7F7F7","#FDDBC7","#F4A582","#D6604D","#B2182B","#67001F")
ColorRamp<-colorRampPalette(collist)(10000)
ColorLevels<-seq(from=xmin, to=xmax, length=10000)
ColorRamp_ex <- ColorRamp[round(1+(min(x)-xmin)*10000/(xmax-xmin)) : round( (max(x)-xmin)*10000/(xmax-xmin) )]
par(mar=c(2,0,2,0), oma=c(3,3,3,3))
layout(matrix(seq(2),nrow=2,ncol=1),widths=c(1),heights=c(3,0.5))
image(t(as.matrix(x)), col=ColorRamp_ex, las=1, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
image(as.matrix(ColorLevels),col=ColorRamp, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
axis(1,seq(xmin,xmax,10),seq(xmin,xmax,10))

## Friday, May 08, 2015

### Tips and Tools you may need for working on BIG data

Nowadays everyone is talking about big data. As a genomic scientist, I could feel hungry of a collection of tools more specialized for the mediate-to-big data we deal everyday.

Here are some tips I found useful when getting, processing or visualizing large data set:

We can use wget to download the data to local disk. If it's large, we can download with other faster alternative, such as axel, aria2.

2. Process the data in parallel with hidden option in GNU commands

• If you have many many files to process, and they are independent, you can process them in a parallel manner. GNU has a command called parallel. Lindenbaum Pierre wrote a nice notebook for "GNU Parallel in Bioinformatics", worthy to read.
• Many commonly used commands also have a hidden option to run in a parallel way. For example, GNU sort command has --parallel=N to set it with multiple cores.
• You can set -F when doing grep -f on a large seed file. People also suggest to set export LC_ALL=C line to get X2 speed.

3. In R, there are several must-have tips for large data, e.g. data.table
• If using read.table(), set stringsAsFactors = F and colClass. See the example here
• use data.table, rather data.frame. Learn the difference online here.
• There is a nice View for how to process data in parallel in R: http://cran.r-project.org/web/views/HighPerformanceComputing.html, but I have not followed them practically. Hopefully there will be some easy tutorials there, or I become less procrastinated to learn some of them ... At least I can start with foreach()
4. How to open scatter plot with too many points in Illustrator?

This is really a problem for me as we usually have a figure with >30k dots (i.e. each dot is a gene). Even though they are highly overlapping each other, opening it in Illustrator is extremely slow. Here is a tip: http://tex.stackexchange.com/questions/39974/problem-with-a-very-heavy-eps-image-scatter-plot-too-heavy-as-eps
From that, probably a better idea is to "compress" the data before plotting it, such as merge the overlapped ones if they overlapped some %.
or this one:
http://stackoverflow.com/questions/18852395/writing-png-plots-into-a-pdf-file-in-r
or this one:
http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points

Still working on the post...

## Thursday, May 07, 2015

### A clarity on the Illumina TruSeq Small RNA prep kit manual

In the TruSeq® Small RNALibrary Prep Guide, below the Figure 1, there is a sentence: "The RNA 3' adapter is modified to target microRNAs and other small RNAs that have a 3' hydroxyl group resulting from enzymatic cleavage by Dicer or other RNA processing enzymes." It's right, but could be very misleading if you are not clear of the diverse picture of transcriptome (scroll down for more detail). I want to emphasize that the 3' hydroxyl group (and the 5'-phosphate group) is NOT specific to microRNAs or any small RNAs. And it doesn't necessarily result from enzymatic cleavage by Dicer. Sonic fragmentation can also break the full length mRNA (with 5'-cap and 3'-polyA) into truncated RNA pieces with 5'-phosphate and 3' hydroxyl free ends. I just called Illumina to confirm that the 3' and 5' ligation steps don't guarantee the selection of miRNAs (but rather any RNAs with 5'-phosphate and 3' hydroxyl ends, if more accurately). The last step of gel purification is the key to select (or enrich, if more accurately) miRNAs.

OK. Here is what I learned from my colleagues about the different RNA species in the trancriptome:

There are 4 species in the transcriptome, where the later 3 are intermediates of transcription (or half product of degradation).
• me7Gppp-------------------------3' (1)
•                 p------------- 3' (2)
•                 OH-------------3' (3)
•        ppp---------------------3' (4)
Only group(2) will ligate to 5’adaptor. The 3' end can also have different format, at least two:
• 5' ----------------- AAAAAA (1)
• 5' ---------- OH (2)
Also note there are two enzymes used to repair the 5' ends: CIP and TAP. CIP (Calf Intestinal alkaline phosphatase) can remove the 5’ phosphate group of DNA strand. The TAP (Tobacco acid pyrophosphatase) is to remove the 5' cap structure (or 5'-5' triphosphate linkage) and leave a mono-phosphate at the 5' end. So, applying first CIP and then TAP will convert the above (2) and (4) to (3), then convert (1) to (2). That's one way to do capture the 5' cap structure, same purpose as CAGE (but CAGE use the type IIs restriction enzyme MmeI and type III restriction enzyme EcoP15I)

## Friday, March 20, 2015

### grep -wfF list.txt input.txt

If you are just greping the list from a file, and your list are store in a file, let's say, list.txt, then you can always do grep -wf list.txt input.txt

When list.txt is huge, "-F" will be much faster.

Extracted from https://www.biostars.org/p/134753/#134871

## Wednesday, March 11, 2015

### X11 connection error in Mac

Typically, I log into my remote server/cluster via "ssh -X" and from there launch R program for plotting. But it always shows an error as

unable to open connection to X11 display ''

after a while, when you want to call functions such as plot().

This is very annoying. So that I have to exit the server and re-login again.

Does this sound familiar for you?

Here is the solution I found via website below:
http://b.kl3in.com/2012/01/x11-display-forwarding-fails-after-some-time/

Two ways to solve this:

1. add the following line to the Mac client’s /etc/ssh_config:
ForwardX11Timeout 596h
2. use “ssh -Y <remote system>”, instead of -X, as it may not trigger the untrusted auth timeout.

## Thursday, March 05, 2015

### How to extract the gap region in human genome?

Just notice that I should avoid the gap region, esp. when we generate a random background as your null distribution using tools such as bedtools shuffle.

Short answer: go below UCSC Table Browser link and choose to save as a bed file

As below table shown, 8.28% of hg19 assembly are simply gap.

Gap (gap) Summary Statistics
 item count 457 item bases 239,845,127 (8.28%) item total 239,845,127 (8.28%) smallest item 47 average item 524,825 biggest item 30,000,000

## Thursday, February 26, 2015

### reshape: from long to wide format

This is to continue on the topic of using the melt/cast functions in reshape to convert between long and wide format of data frame. Here is the example I found helpful in generating covariate table required for PEER (or Matrix_eQTL) analysis:

Here is my original covariate table:

Let's say we need to convert the categorical variables such as condition, cellType, batch, replicate, readLength, sex into indicators (Note: this is required by most regression programs like PEER or Matrix-eQTL, since for example the batch 5 does not match it's higher than batch 1, unlike the age or PMI). So, we need to convert this long format into wide format. Here is my R code for that:

library(reshape2)
categorical_varaibles = c("batch", "sex", "readsLength", "condition", "cellType", "replicate");
for(x in categorical_varaibles) {cvrt = cbind(cvrt, value=1); cvrt[,x]=paste0(x,cvrt[,x]); cvrt = dcast(cvrt, as.formula(paste0("... ~ ", x)), fill=0);}

Here is output:

## Monday, January 12, 2015

### altColor in UCSC track setting

Many track types allow setting a color range that varies from color to altColor. For instance the CpG Island tracks use the altColor setting to display the weaker islands, while the stronger ones are rendered in color. If altColor is not specified, the system will use a color halfway between that specified in the color tag and white instead.

Be aware that wiggles with negative values are drawn in altColor not color as positive values are.

### Using one line command as input for LSF bsub

In a simple case, you can use bsub command arguments to submit your command job to LSF cluster.

If you have a complicated script with many commands,  you can save into a lsf script (including the shell pathname in the first line) and then submit that script to LSF cluster, e.g.  bsub yourscript arguments

In your script, you wrote something like this:

#!/bin/bash
myFirstArgument = \$1

Here I found I can also use pipe to connect multiple commands into one line and simply quote them as one command and works in bsub. Here is an example:

bsub "echo -ne 'ab\tcss' | awk '{print \\$2}'"

So far, I found I have to add "\" (backslash) to escape the special character, such as \$ in awk. Wondering there might be a way in bsub options to set this.

## Thursday, October 23, 2014

### Hierarchical structure of UCSC Genome Browser track hub

UCSC Genome Browser tracks can be organized into groups by using the container multiWig, compositeTrack on, and superTrack on lines. Supertracks can contain composite tracks and container multiWigs, but not vice versa. With supertracks, composite tracks, and container multiWigs, children will inherit the settings from their parents, but can override their parent settings within their own stanzas.

Here is their hierarchical relationship:

superTrack on
|==== child tracks (bam, bigBed, bigWig or vcfTabix)
|---- child track (bigWig)
|==== view
|---- child track with subGroups setting (bam, bigBed, bigWig or vcfTabix, but not mix)
|---- child track with subGroups setting (bam, bigBed, bigWig or vcfTabix, but not mix)

superTrack also allow to mix other supported types of hub tracks: bam, bigBed, bigWig or vcfTabix.

## Monday, October 20, 2014

### Google doc + Flow: an imperfect online reference management solution

Working on a manuscript simultaneously with your collaborators in Google doc is such a fun thing!

We usually download the manuscript into Word (*.doc) when it's nearly done and insert references in MS Word with external software such as Papers, EndNote etc. Most of the reference managers are not free. Now, this pain is over!

Solution: use Flow!

Flow supports to save the references from website to your Flow account and then from your Google doc Add-on, you can insert and format citations easily.

Here is the steps:
1. Set up a Flow account (you can also create a collection and share with your collaborator)
2. Import references from web sites (or other tools such as Mendeley etc.): http://help.flow.proquest.com/customer/portal/articles/1087119-save-articles-directly-from-the-web-using-save-to-flow
3. Done!

## Friday, October 10, 2014

### Memory issue for large alignment file when doing de novo assembly

Memory becomes a bottleneck when the sequencing file is bigger and bigger nowadays. This is specially an issue for de novo transcriptome assembly using RNA-seq data from species like human. For Trinity, "a typical configuration is a multi-core server with 256 GB to 1 TB of RAM". "Trinity partitions RNA-Seq data into many independent de Bruijn graphs, ideally one graph per expressed gene, and uses parallel computing to reconstruct transcripts from these graphs, including alternatively spliced isoforms." If Trinity contructs one de Bruijn graph per gene, I don't know why it still needs such a large memory. For Cufflinks (-g option), I already can see it consumes 40G memory for 1/10 of chr1 (given a 4G bam file). Cufflinks constructs a DAG in memory for the given alignment. Hopefully it's one DAG per chromosome, not a DAG for the whole genome. But even though, chr1 has 249 billion base pairs, it still requires a lot of memory... So sad!

Of course, if you have a machine with super large memory, this won't be a problem. But this is not the case usually. For me, most of our nodes have maximally 90G memory.

So, how to have it run?

Here is few tips I can think of:

1. split the alignment into different chromosome, e.g. chr1.bam, chr2,bam etc. and then call de novo on each of files.

2. down-sampling the bam file if it has high coverage. Here is a post about how to down-sample your bam file (https://www.biostars.org/p/4332/). Basically, you can use "samtools view -s" or "sambamba view -s", or GATK's randomSampleFromStream.pl for downsampling.

3. You may also want to remove the PCR artifacts, by "samtools rmdup" or "sambamba markdup".

## Wednesday, October 08, 2014

https://www.biostars.org/p/110289/

And too bad that no one from the Texedo group ever threw a piece of clue!

Here are few tips I found necessary to share in order to have it work:

1. The mask GTF file should have all 9 fields in required format. For example, the strand column should be '+', '-', or '.', not anything else. GTF/GFF file can be extracted from GENCODE (http://www.gencodegenes.org/) or downloaded from UCSC Table browser. It can be also converted from a bed file using Kent's bedToGenePred --> genePredToGtf. But be aware that that the bed file should have at least 6 columns (i.e. including strand column), otherwise the converted GTF file will have a "^@" in the strand column, which results in an invalid GTF.

For example, if you want to exclude all reads mapped to human mitochondrial genome,  you can use
echo "chrM 0 16571 mt 0 ." | bedToGenePred stdin stdout | genePredToGtf file stdin chrM.gtf

2. "-M" option also works for de novo assembly (cufflinks -g).

3. Using "-M" option should theoretically increase the FPKM value (comparing to no mask). So, if you observed opposite tread, there must be something wrong.

4. If you expect a lot of reads from the mask regions (e.g. chrM, rRNAs), you can substract the masked reads from your bam file before feeding to cufflinks, for example using "samtools view -L retained_region.bed".