Blast is a classical method for finding homolog sequences. In one of my recent projects, I need to find the orthologous sequences of Manduca sexta (a moth) protein in other relative species. Since I want to get orthologs for all Manduca's proteins, which is a long list, a straightforward way is to do blast.
First, you need to download/install blast. I use the ncbi-blast-2.2.26+-src.tar.gz downloaded from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Next, you need to download the blast database. Here is the list: ftp://ftp.ncbi.nlm.nih.gov/blast/db/. I used the nr.*tar.gz (non-redundant protein sequence database). If you want to use a smaller and specific one, you can choose for example pdbaa.*tar.gz.
Once you installed the program and database, you can run the executable file called "blastall" in the bin folder. Here are its options:
blastall 2.2.26 arguments:
-p Program Name [String]
-d Database [String]
default = nr
-i Query File [File In]
default = stdin
## alignment
-F Filter query sequence (DUST with blastn, SEG with others) [String]
default = T
-Q Query Genetic code to use [Integer]
default = 1
-D DB Genetic code (for tblast[nx] only) [Integer]
default = 1
-W Word size, default if zero (blastn 11, megablast 28, all others 3) [Integer]
default = 0
-z Effective length of the database (use zero for the real size) [Real]
default = 0
-Y Effective length of the search space (use zero for the real size) [Real]
default = 0
-S Query strands to search against database (for blast[nx], and tblastx)
3 is both, 1 is top, 2 is bottom [Integer]
default = 3
-M Matrix [String]
default = BLOSUM62
## scoring
-G Cost to open a gap (-1 invokes default behavior) [Integer]
default = -1
-E Cost to extend a gap (-1 invokes default behavior) [Integer]
default = -1
-X X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
default = 0
-q Penalty for a nucleotide mismatch (blastn only) [Integer]
default = -3
-r Reward for a nucleotide match (blastn only) [Integer]
default = 1
-f Threshold for extending hits, default if zero
blastp 11, blastn 0, blastx 12, tblastn 13
tblastx 13, megablast 0 [Real]
default = 0
-g Perform gapped alignment (not available with tblastx) [T/F]
default = T
-y X dropoff value for ungapped extensions in bits (0.0 invokes default behavior)
blastn 20, megablast 10, all others 7 [Real]
default = 0.0
-Z X dropoff value for final gapped alignment in bits (0.0 invokes default behavior)
blastn/megablast 100, tblastx 0, all others 25 [Integer]
default = 0
-w Frame shift penalty (OOF algorithm for blastx) [Integer]
default = 0
-t Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments. (0 invokes default behavior; a negative value disables linking.) [Integer]
default = 0
-C Use composition-based score adjustments for blastp or tblastn:
As first character:
D or d: default (equivalent to T)
0 or F or f: no composition-based statistics
2 or T or t: Composition-based score adjustments as in Bioinformatics 21:902-911,
1: Composition-based statistics as in NAR 29:2994-3005, 2001
2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911,
2005, unconditionally
For programs other than tblastn, must either be absent or be D, F or 0.
As second character, if first character is equivalent to 1, 2, or 3:
U or u: unified p-value combining alignment p-value and compositional p-value in round 1 only
[String]
default = D
## display
-I Show GI's in deflines [T/F]
default = F
-v Number of database sequences to show one-line descriptions for (V) [Integer]
default = 500
-b Number of database sequence to show alignments for (B) [Integer]
default = 250
-K Number of best hits from a region to keep. Off by default.
If used a value of 100 is recommended. Very high values of -v or -b is also suggested [Integer]
default = 0
-P 0 for multiple hit, 1 for single hit (does not apply to blastn) [Integer]
default = 0
## output
-O SeqAlign file [File Out] Optional
-J Believe the query defline [T/F]
default = F
-T Produce HTML output [T/F]
default = F
-e Expectation value (E) [Real]
default = 10.0
-m alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 tabular with comment lines
10 ASN, text
11 ASN, binary [Integer]
default = 0
range from 0 to 11
-o BLAST report Output File [File Out] Optional
default = stdout
## performance
-a Number of processors to use [Integer]
default = 1
-l Restrict search of database to list of GI's [String] Optional
-U Use lower case filtering of FASTA sequence [T/F] Optional
-R PSI-TBLASTN checkpoint file [File In] Optional
-n MegaBlast search [T/F]
default = F
-L Location on query sequence [String] Optional
-A Multiple Hits window size, default if zero (blastn/megablast 0, all others 40 [Integer]
default = 0
-B Number of concatenated queries, for blastn and tblastn [Integer] Optional
default = 0
-V Force use of the legacy BLAST engine [T/F] Optional
default = F
-s Compute locally optimal Smith-Waterman alignments (This option is only
available for gapped tblastn.) [T/F]
default = F
I highlight the important ones with read.
Here is the command I finally used to blast protein sequences:
blastall -p blastp -d ~/nearline/blast/nr/nr -i ~/nearline/genomes/Manduca_sexta/Jun2012/Msex05162011.genome.except-5small-scaf.maker-2.25.proteins.OGS_June2012.fasta -o Msex05162011.genome.except-5small-scaf.maker-2.25.proteins.OGS_June2012.fasta.blastp.top15.xml -e 1 -b 15 -m 7 -a 8
No comments:
Post a Comment