Thursday, May 17, 2012

simple way to get reads length distribution of FASTQ files


  • Using perl: 
cat input.fq | perl -ne '$s=<>;<>;<>;chomp($s);print length($s)."\n";' > input.readslength.txt
  • Using awk:
cat input.fq | awk '{if(NR%4==2) print length($1)}' > input.readslength.txt
  • if zipped file, using:
 zcat input.fq.gz | ...
  • get length statistics:
sort input.readslength.txt | uniq -c

textHistogram


So, one line code for all input fastq files would be:

find *.fq.gz -not -name \*raw\* -printf "zcat %p | awk '{if(NR%%4==2) print length(\$1)}' | textHistogram -maxBinCount=59 stdin \n" | sh


Note that you have to use double % to escape the % character for printf formatting control, just like in C. (Thanks for zencuke's answer here)

You will get something like this:

RNAseq.20E_library.result_primary.clean.fa
large values truncated: need 35 bins or larger binSize than 1
Maximum value 53.000000
 18 ********* 27730
 19 ******************* 58997
 20 ************************************************************ 186919
 21 ************************ 74536
 22 ************** 45171
 23 ************* 39107
 24 *************** 45560
 25 *********************** 70452
 26 ************************************************* 154030
 27 ************************************************************ 187704
 28 ************************** 81198
 29 ***** 17016
 30 ** 5341
 31 * 2439
 32  0
 33  1
 34  0
 35  0
 36  0
 37  0
 38  0
 39  1
<minVal or >= 40  173

RNAseq.Day_13_library.result_primary.clean.fa
large values truncated: need 35 bins or larger binSize than 1
Maximum value 53.000000
 18 ***** 22570
 19 ********* 40335
 20 ***************************** 127999
 21 ********************* 95179
 22 ****************** 79808
 23 ********* 39596
 24 ******* 29423
 25 ************* 55438
 26 ************************************* 164868
 27 ************************************************************ 265722
 28 *************************** 120353
 29 ********* 38625
 30 *** 14684
 31 * 5214
 32  0
 33  0
 34  1
<minVal or >= 35  140

No comments:

Post a Comment