Wednesday, May 09, 2012

Split a file at a specific line

I was facing a task: extracting the FASTA sequences from a GFF (or GFF3) file. I just noticed there is a BioPerl script for that (http://www.bioperl.org/wiki/Getting_Fasta_sequences_from_a_GFF). But why not use simple bash/awk instead? I like one-line coding :)

Here it is:
awk '/^>/,/############/' scaffold.gff

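The syntax is awk '/from/,/to/' filename, which prints the lines from one matching "from" through the next line matching "to", inclusive. From my tests, the behavior was not obvious when there are multiple "from" and/or "to" lines: awk prints every such block in turn, and if a "from" is never followed by a "to", it prints through the end of the file. So, be careful!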
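To see how the range pattern treats repeated markers, here is a small sketch. The file name sample.txt and its contents are made up for illustration; they are not part of the GFF task.

```shell
# Build a tiny sample (hypothetical file) with two "from" lines and two "to" lines
printf '%s\n' a from b to c from d to e > sample.txt

# Prints each from..to block, inclusive; the lines a, c and e fall outside and are skipped
awk '/from/,/to/' sample.txt

# With no closing "to", the range runs through the end of the input
printf '%s\n' a from b | awk '/from/,/to/'
```

The first command prints both blocks one after the other, which is why the range form only fits when the end marker reliably closes the part you want.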

Another option is:
awk '/^>/ {p=1}; p==1 {print}' scaffold.gff
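A quick way to check the flag version is to run it on a toy file. The sketch below assumes a GFF3-style layout where the FASTA part sits at the end of the file; toy.gff and its contents are invented for the test.

```shell
# A toy GFF3 file (made-up content): annotation lines first, then the embedded FASTA
printf '##gff-version 3\n' > toy.gff
printf 'scaffold0001\tdemo\tgene\t1\t8\t.\t+\t.\tID=gene1\n' >> toy.gff
printf '##FASTA\n>scaffold0001\nACGTACGT\n' >> toy.gff

# p is set at the first header line and never reset,
# so everything from that line to the end of the file is printed
awk '/^>/ {p=1}; p==1 {print}' toy.gff
```

Unlike the range pattern, this needs no end marker, so it still works when the FASTA block runs to the last line of the file.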

So, the final code to concatenate all the extracted FASTA sequences is:

find . -name 'scaffold*.gff' -exec awk '/^>/ {p=1}; p==1 {print}' {} \; >> Manduca_gff_files_version_1.scaffold.fa &

If you care about the order of the output (e.g. scaffold0001, scaffold0002, scaffold0003, ...), you have to sort the result of find first, because find does not sort:
for i in $(find . -name 'scaffold*.gff' | sort); do awk '/^>/ {p=1}; p==1 {print}' "$i" >> Manduca_gff_files_version_1.scaffold.fa; done
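To convince yourself that the sorted version keeps the order, you can try it in a scratch directory first. This is only a sketch: the two toy files and the output name all.fa are made up, and it reads the file list line by line, so it assumes file names without newlines.

```shell
# Scratch directory with two toy GFF files, created in reverse order on purpose
tmp=$(mktemp -d)
printf '%s\n' '##gff-version 3' '##FASTA' '>scaffold0002' 'GGCC' > "$tmp/scaffold0002.gff"
printf '%s\n' '##gff-version 3' '##FASTA' '>scaffold0001' 'ACGT' > "$tmp/scaffold0001.gff"

# Sort the file list, then run awk once per file so the flag p starts fresh each time
find "$tmp" -name 'scaffold*.gff' | sort | while IFS= read -r f; do
    awk '/^>/ {p=1}; p==1 {print}' "$f"
done > "$tmp/all.fa"

# The header lines should come out in sorted order
grep '^>' "$tmp/all.fa"
```

Counting the headers with grep -c '^>' and comparing against the number of input files is a cheap extra check that no scaffold was dropped.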

References:
  1. http://www.unix.com/shell-programming-scripting/6959-split-file-specified-string.html
  2. http://www.bioperl.org/wiki/Getting_Fasta_sequences_from_a_GFF

