Thursday, September 07, 2017

try to be consistent

Worthy to note the lesson I learned today: try to be consistent when sorting with different chromosome order.

I got the same error message as the post: https://github.com/arq5x/bedtools/issues/109

In short, the chromosome order in my sorted BAM file header is like this chr10, chr11, chr12... chr19, chr1, chr20, ... chr22, chr2, chr3 ....chrM, chrX, chrY. The order is determined by the aligner which generates @SQ headers based on how the reference sequences are arranged in the FASTA file from which you built its aligner indices. That means, the order in @SQ is same as the order in your genome.fa.fai (index) file.


Here is my BAM header:

$ samtools view $i -H
@HD VN:1.4 SO:coordinate
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr1 LN:249250621
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chrM LN:16571
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566

Here is my genome.fa.fai used by aligner (let's call it A.genome.fa.fai):

$cat A.genome.fa.fai
chr10 135534747 7 50 51
chr11 135006516 138245456 50 51
chr12 133851895 275952110 50 51
chr13 115169878 412481050 50 51
chr14 107349540 529954333 50 51
chr15 102531392 639450871 50 51
chr16 90354753 744032898 50 51
chr17 81195210 836194754 50 51
chr18 78077248 919013876 50 51
chr19 59128983 998652676 50 51
chr1 249250621 1058964245 50 51
chr20 63025520 1313199886 50 51
chr21 48129895 1377485924 50 51
chr22 51304566 1426578424 50 51
chr2 243199373 1478909088 50 51
chr3 198022430 1726972455 50 51
chr4 191154276 1928955340 50 51
chr5 180915260 2123932708 50 51
chr6 171115067 2308466280 50 51
chr7 159138663 2483003655 50 51
chr8 146364022 2645325098 50 51
chr9 141213431 2794616407 50 51
chrM 16571 2938654113 50 51
chrX 155270560 2938671022 50 51
chrY 59373566 3097047000 50 51

I don't know how people decided the order. The index file in my another machine is somehow different:

$cat B.genome.fa.fai
chrM 16571 6 50 51
chr1 249250621 16915 50 51
chr2 243199373 254252555 50 51
chr3 198022430 502315922 50 51
chr4 191154276 704298807 50 51
chr5 180915260 899276175 50 51
chr6 171115067 1083809747 50 51
chr7 159138663 1258347122 50 51
chr8 146364022 1420668565 50 51
chr9 141213431 1569959874 50 51
chr10 135534747 1713997581 50 51
chr11 135006516 1852243030 50 51
chr12 133851895 1989949684 50 51
chr13 115169878 2126478624 50 51
chr14 107349540 2243951907 50 51
chr15 102531392 2353448445 50 51
chr16 90354753 2458030472 50 51
chr17 81195210 2550192328 50 51
chr18 78077248 2633011450 50 51
chr19 59128983 2712650250 50 51
chr20 63025520 2772961820 50 51
chr21 48129895 2837247858 50 51
chr22 51304566 2886340358 50 51
chrX 155270560 2938671022 50 51
chrY 59373566 3097047000 50 51

As you can see, both orders are different from the one if you use `sort -k1,1 -k2,2n`:

$ sort -k1,1 B.genome.fa.fai
chr1 249250621 16915 50 51
chr10 135534747 1713997581 50 51
chr11 135006516 1852243030 50 51
chr12 133851895 1989949684 50 51
chr13 115169878 2126478624 50 51
chr14 107349540 2243951907 50 51
chr15 102531392 2353448445 50 51
chr16 90354753 2458030472 50 51
chr17 81195210 2550192328 50 51
chr18 78077248 2633011450 50 51
chr19 59128983 2712650250 50 51
chr2 243199373 254252555 50 51
chr20 63025520 2772961820 50 51
chr21 48129895 2837247858 50 51
chr22 51304566 2886340358 50 51
chr3 198022430 502315922 50 51
chr4 191154276 704298807 50 51
chr5 180915260 899276175 50 51
chr6 171115067 1083809747 50 51
chr7 159138663 1258347122 50 51
chr8 146364022 1420668565 50 51
chr9 141213431 1569959874 50 51
chrM 16571 6 50 51
chrX 155270560 2938671022 50 51
chrY 59373566 3097047000 50 51

That's where the error comes from when using '--sorted' option in bedtools commands, e.g. coverage. A solution they provided is to re-sort the input bed using -g or -faidx option by providing the A.genome.fa.fai file. Here is a test:

$ sortBed -faidx A.genome.fa.fai -i <(awk '{OFS="\t"; print $1,1,$2;}' B.genome.fa.fai)
chr10 1 135534747
chr11 1 135006516
chr12 1 133851895
chr13 1 115169878
chr14 1 107349540
chr15 1 102531392
chr16 1 90354753
chr17 1 81195210
chr18 1 78077248
chr19 1 59128983
chr1 1 249250621
chr20 1 63025520
chr21 1 48129895
chr22 1 51304566
chr2 1 243199373
chr3 1 198022430
chr4 1 191154276
chr5 1 180915260
chr6 1 171115067
chr7 1 159138663
chr8 1 146364022
chr9 1 141213431
chrM 1 16571
chrX 1 155270560
chrY 1 59373566

Note that there is a version sort in sort command, e.g. sort -k1,1V, which will generate the natural sort of numbers within text. We should try to be consistent of using this order for future work.

$ sort -k1,1V genome.fa.fai
chr1 249250621 1058964245 50 51
chr2 243199373 1478909088 50 51
chr3 198022430 1726972455 50 51
chr4 191154276 1928955340 50 51
chr5 180915260 2123932708 50 51
chr6 171115067 2308466280 50 51
chr7 159138663 2483003655 50 51
chr8 146364022 2645325098 50 51
chr9 141213431 2794616407 50 51
chr10 135534747 7 50 51
chr11 135006516 138245456 50 51
chr12 133851895 275952110 50 51
chr13 115169878 412481050 50 51
chr14 107349540 529954333 50 51
chr15 102531392 639450871 50 51
chr16 90354753 744032898 50 51
chr17 81195210 836194754 50 51
chr18 78077248 919013876 50 51
chr19 59128983 998652676 50 51
chr20 63025520 1313199886 50 51
chr21 48129895 1377485924 50 51
chr22 51304566 1426578424 50 51
chrM 16571 2938654113 50 51
chrX 155270560 2938671022 50 51
chrY 59373566 3097047000 50 51

No comments:

Post a Comment