One Tip Per Day: simple way to get reads length distribution of FASTQ files

Thursday, May 17, 2012

simple way to get reads length distribution of FASTQ files

Using perl:

cat input.fq | perl -ne '$s=<>;<>;<>;chomp($s);print length($s)."\n";' > input.readslength.txt

Using awk:

cat input.fq | awk '{if(NR%4==2) print length($1)}' > input.readslength.txt

if zipped file, using:

zcat input.fq.gz | ...

get length statistics:

sort input.readslength.txt | uniq -c

or using fancy code from Jim Kent's utility:

textHistogram

So, one line code for all input fastq files would be:

find *.fq.gz -not -name \*raw\* -printf "zcat %p | awk '{if(NR%%4==2) print length(\$1)}' | textHistogram -maxBinCount=59 stdin \n" | sh

Note that you have to use double % to escape the % character for printf formatting control, just like in C. (Thanks for zencuke's answer here)

You will get something like this:

RNAseq.20E_library.result_primary.clean.fa

large values truncated: need 35 bins or larger binSize than 1

Maximum value 53.000000

18 ********* 27730

19 ******************* 58997

20 ************************************************************ 186919

21 ************************ 74536

22 ************** 45171

23 ************* 39107

24 *************** 45560

25 *********************** 70452

26 ************************************************* 154030

27 ************************************************************ 187704

28 ************************** 81198

29 ***** 17016

30 ** 5341

31 * 2439

32 0

33 1

34 0

35 0

36 0

37 0

38 0

39 1

<minVal or >= 40 173

RNAseq.Day_13_library.result_primary.clean.fa

large values truncated: need 35 bins or larger binSize than 1

Maximum value 53.000000

18 ***** 22570

19 ********* 40335

20 ***************************** 127999

21 ********************* 95179

22 ****************** 79808

23 ********* 39596

24 ******* 29423

25 ************* 55438

26 ************************************* 164868

27 ************************************************************ 265722

28 *************************** 120353

29 ********* 38625

30 *** 14684

31 * 5214

32 0

33 0

34 1

<minVal or >= 35 140

One Tip Per Day

Pages

Thursday, May 17, 2012

simple way to get reads length distribution of FASTQ files

No comments:

Post a Comment