My learning notes for R, Unix, Perl, statistics, tools/resources, biology etc. everything about Bioinformatics
Monday, June 08, 2020
Be careful of "sort -k1,1 -k2,2n -u"
The right way should be "sort -k1,1 -k2,2n -k3,3n -u" or "sort -k1,1 -k2,2n | sort -u"
Monday, March 26, 2018
Brace expansion
So, what's the wildcard to match all the ten brain tissues? E067-E074, E081, and E082
Here is the trick:
wget http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E0{{67..74},81,81}_15_coreMarks_segments.bed
There are two types of brace expansion in Bash:
1. String list: {string1,string2,...,stringN}
2. Range list: {<START>..<END>}
You can combine them or nest them, for example
{{67..74},81,81} ==> 67,68,69,70,71,72,73,74,81,82
{A..C}{0..2} ==> A0, B0, C0, A1, B1, C1, A2, B2, C2
{{A..C},{0..2}} ==> A, B, C, 0, 1, 2
In the new Bash 4.0, you can even do more, for example
- padding them, e.g. {0001..5} ==> 0001 0002 0003 0004 0005
- specify an increment using ranges, e.g. {1..10..2} ==> 1 3 5 7 9; {a..z..3} ==> a d g j m p s v y
- Using a prefix: 0{1..5} ==> 01 02 03 04 05
- or postfix: ---{A..E}--- ==> ---A--- ---B--- ---C--- ---D--- ---E---
Anyone understand the fun of code below:
function braceify {
[[ $1 == +([[:digit:]]) ]] || return
typeset -a a
read -ra a < <(factor "$1")
eval "echo $(printf '{$(printf ,%%.s {1..%s})}' "${a[@]:1}")"
}
printf 'eval printf "$arg"%s' "$(braceify 1000000)"
See explain here: http://wiki.bash-hackers.org/syntax/expansion/brace
Thursday, January 21, 2016
Continuous tab in indentation could cause problem
When I tested the following code in terminal and got an error at the 2nd echo. It asked to list all files in the current folder, e.g. “Display all 148 possibilities? (y or n)”, rather than moving forwards. I cannot tell any error in the code.
for i in 1 2 3 4 5
do
echo $i
for j in eRNA promoter exon random
do
echo $j
done
done
I figure it out when I display all the white characters, as shown below:
Do you see the difference?
The “working” code, which I copied from the Linux document (http://www.tldp.org/LDP/abs/html/nestedloops.html) has different characters for the nested indentation (-> .. means a tab and two spaces) than the two tabs in the “not working” part (which I typed in Rstudio). I realized that it’s because the tab-autocompletion feature in Linux. Pressing tab a second time will tell the shell to provide a list of matching files.
To avoid the problem, simply click on the “Insert spaces for tab” option in your editor. I used Rstudio, here is where you can find the option:

Monday, January 12, 2015
Using one line command as input for LSF bsub
If you have a complicated script with many commands, you can save into a lsf script (including the shell pathname in the first line) and then submit that script to LSF cluster, e.g. bsub yourscript arguments
In your script, you wrote something like this:
#!/bin/bash
myFirstArgument = $1
Here I found I can also use pipe to connect multiple commands into one line and simply quote them as one command and works in bsub. Here is an example:
Thursday, September 18, 2014
Simple script to generate random size-matched background regions
# ===============================================================
# Script to generate a random size-matched background regions
# Author: Xianjun
# Date: Jun 3, 2014
# Usage:
# toGenerateRandomRegions.sh input.bed > output.random.bed
# or
# cat input.bed | toGenerateRandomRegions.sh -
# ===============================================================
bedfile=$1
# other possible options can be the genome (e.g. hg19, mm9 etc.) or an inclusion or exclusion regions (e.g. exons)
# download hg.genome
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
cat $bedfile | while read chr start end rest
do
let l=$end-$start;
bedtools random -g hg19.genome -n 1 -l $l
done
Wednesday, May 29, 2013
something I learned from Bo
1. parallel single-thread commands using parafly.
Parafly is one useful program in the Trinity package (a smart RNAseq assembly program). Here is example of how to use it:
echo "mycommand1 > output1" > $paraFile
echo "mycommand2 > output2" >> $paraFile
echo "mycommand3 > output3" >> $paraFile
...
ParaFly -c $paraFile -CPU $CPU
2. resuming pipeline by tracking the status of each step
[ ! -f .status.${STEP}.stepname ] && \
mycommand1 && \
mycommand2 && \
mycommand3 && \
touch .status.${STEP}.stepname
STEP=$((STEP+1))
Using a step number and a status file to record the success of each step. Only if all commands are successfully run, a new status file will be generated. This ideal was also used in Trinity.
Thursday, May 23, 2013
File test operators
-e
file exists
-a
file exists (same as -e, but deprecated)
-f
file is a regular file (not a directory or device file)
-d
file is a directory
-h (or -L)
file is a symbolic link
Thursday, August 30, 2012
arguments array for bash script
#!/bin/bash
args=("$@")
for ((i=0; i < $#; i++))
{
echo "argument $((i+1)): ${args[$i]}"
}
args=("$@") is superior to the sometimes seen args=($@) since the former will not create separate array elements if any of the args have spaces in them (so actually this is a pretty cool way to iterate through a directory with filenames containing spaces)Note that the array index starts at zero, but the command line args start at one, which is why the test in the loop is 'i < $#' not 'i <= $#' and the argument index is offset by 1 in the output.
Ref: http://forums.fedoraforum.org/showthread.php?t=220184
Tuesday, August 28, 2012
How to import functions in a bash script?
. $HOME/prefs.cfg
$include $HOME/prefs.cfg
source $HOME/prefs.cfg
Tuesday, August 21, 2012
Internal Variables for BASH
- $BASH
- path of bash binary
- $BASH_ENV
- $BASH_SUBSHELL
- $BASHPID
- $BASH_VERSINFO[n]
- $BASH_VERSION
- $CDPATH
- $DIRSTACK
- $EDITOR
- $EUID
- $FUNCNAME
- $GLOBIGNORE
- $GROUPS
- $HOME
- $HOSTNAME
- $HOSTTYPE
- $IFS
- $IGNOREEOF
- $LC_COLLATE
- $LC_CTYPE
- $LINENO
- $MACHTYPE
- $OLDPWD
- $OSTYPE
- $PATH
- $PIPESTATUS
- $PPID
- $PROMPT_COMMAND
- $PS1
- $PS2
- $PS3
- $PS4
- $PWD
- $REPLY
- $SECONDS
- $SHELLOPTS
- $SHLVL
- $TMOUT
- $UID
- $0, $1, $2, etc.
- $#
- $*
- $@
- $-
- $!
- ${COMMAND1} &
- $_
- $?
- $$
http://tldp.org/LDP/abs/html/parameter-substitution.html
In general, all chapters in the ABS book are recommended:
http://tldp.org/LDP/abs/html/index.html
Thursday, October 28, 2010
Running R on cluster in a job_array way
All of the above applies to well-behaved, interactive programs. However, sometimes you need to use R to analyze your data. In order to do this, you have to hardcode file names into the R script, because these scripts are not interactive. This is a royal pain. However, there is a solution that makes use of HERE documents in bash. HERE documents also exist in perl, and an online tutorial for them in bash is at http://www.tldp.org/LDP/abs/html/here-docs.html. The short of it is that a HERE document can represent a skeleton document at the end of a shell script. Let’s concoct an example. You have 100 data files, labeled data.1 to data.10. Each file contains a single column of numbers, and you want to do some calculation for each of them, using R. Let’s use a HERE document:
#!/bin/sh
#$ -t 1-10
WORKDIR=/Users/jl566/testing
INFILE=$WORKDIR/data.$SGE_TASK_ID
OUTFILE=$WORKDIR/data.$SGE_TASK_ID.out
# See comment below about paths to R
PATHTOR=/common/bin
if [ -e $OUTFILE ]
then
rm -f $OUTFILE
fi
# Below, the phrase "EOF" marks the beginning and end of the HERE document.
# Basically, what’s going on is that we’re running R, and suppressing all of
# it’s output to STDOUT, and then redirecting whatever’s between the EOF words
# as an R script, and using variable substitution to act on the desired files.
$PATHTOR/R --quiet --no-save > /dev/null << EOF
x<-read.table("$INFILE")
write(mean(x\$V1),"$OUTFILE")
EOF
So now you can use the cluster to analyze your data – just write the R script within the HERE document, and go from there. As I’ve only just figured this out, some caveats are necessary. If anyone experiments and figures out something neat, let me know. Be aware of the following:
1. In my limited experience, indenting is important for HERE documents. In particular, it seems that the beginning and end (i.e. both lines containing the term EOF in the above example), must be aligned with the left-hand edge of the buffer (i.e. not indented at all). So, if you use a HERE document in a conditional or control statement, be mindful of this.
2. In the mean command, I escaped the dollar sign with a backslash. In my limited experiments, both mean(x\$V1) and mean(x$V1) seem to work. However, escaping the dollar sign for the read.table command prevents the variable substitution from occurring in the shell, causing R to fail, because the input file named $INFILE cannot be found. In other words, escaping in that context causes the HERE doc to pass $INFILE as a string literal to R, rather than the value stored in the shell variable.
3. This is more useful than just array jobs on an SGE system. If you know bash well enough, you can write a shell script that takes a load of arguments, and processes them with a HERE document. This solves a major limitation with R scripts themselves. You can do the same in perl, too, on your workstation, but you must use a shell language on the cluster.
- Sent using Google Toolbar"
Monday, October 11, 2010
Linux tip: Bash parameters and parameter expansions
${PARAMETER#WORD} The shell expands WORD as in filename expansion and removes the shortest matching pattern, if any, from the beginning of the expanded value of PARAMETER. Using '@' or '$' results in the pattern removal for each parameter in the list.
${PARAMETER##WORD} Results in removal of the longest matching pattern from the beginning rather than the shortest.
${PARAMETER%WORD} The shell expands WORD as in filename expansion and removes the shortest matching pattern, if any, from the end of the expanded value of PARAMETER. Using '@' or '$' results in the pattern removal for each parameter in the list.
${PARAMETER%%WORD} Results in removal of the longest matching pattern from the end rather than the shortest.
${PARAMETER/PATTERN/STRING} The shell expands PATTERN as in filename expansion and replaces the longest matching pattern, if any, in the expanded value of PARAMETER. To match patterns at the beginning of the expanded value of PARAMETER, prefix PATTERN with # or prefix it with % if the match should be done at the end. If STRING is empty, the trailing / may be omitted and the matches are deleted. Using '@' or '$' results in the pattern substitution for each parameter in the list.
${PARAMETER//PATTERN/STRING} Performs the substitution for all matches instead of just the first.
Listing 11 shows some basic usage of the pattern matching expansions.
Listing 11. Pattern matching examples
[ian@pinguino ~]$ x='a1 b1 c2 d2'
[ian@pinguino ~]$ echo ${x#*1}
b1 c2 d2
[ian@pinguino ~]$ echo ${x##*1}
c2 d2
[ian@pinguino ~]$ echo ${x%1*}
a1 b
[ian@pinguino ~]$ echo ${x%%1*}
a
[ian@pinguino ~]$ echo ${x/1/3}
a3 b1 c2 d2
[ian@pinguino ~]$ echo ${x//1/3}
a3 b3 c2 d2
[ian@pinguino ~]$ echo ${x//?1/z3}
z3 z3 c2 d2
- Sent using Google Toolbar"
parameters and options
$0 : the program itself
$1 : the first parameter
$2 : the second parameter
...
A string enclosed in single or double quotes will be passed as a single parameter, and the quotes will be stripped.
$# : the number of total parameters, not including the program itself
$@ : the array of parameters. Double-quoted "$@" is equal to "$1", "$2", "$3".... This is the most common usage. But there are other variations, like $*, "$*", $@, or "$@". They can be explained distinctly. For more detail, see examples here.
For options, using
getopt optstring optname
where optstring is a double-quote string of option letters, like ":p:q". A colon (:) after an option letter indicates that the option
requires a value; The leading colon tells getopts to be silent and suppress the
normal error messages.
The second parameter, optname, is the name of a variable which will receive the name of the option found. If an option is expected to have a value, the value, if present, will be placed in the variable OPTARG. In silent mode, either of the following two error conditions may occur.
- If an unrecognized option is found, then optname will contain a ? and OPTARG will contain the unknown option.
- If an options that requires a value is found but the value is not, then optname will contain a : and OPTARG will contain the name of the option whose argument is missing.
For example, getopt ":p:q" optnames
will require an option (-p) with value, and -q without having value.
Example:
#!/bin/bash
echo "OPTIND starts at $OPTIND"
while getopts ":pq:" optname
do
case "$optname" in
"p")
echo "Option $optname is specified, with value"
;;
"q")
echo "Option $optname has value $OPTARG"
;;
"?")
echo "Unknown option $OPTARG"
;;
":")
echo "No argument value for option $OPTARG"
;;
*)
# Should not occur
echo "Unknown error while processing options"
;;
esac
echo "OPTIND is now $OPTIND"
done
file/folder existence
[ parameter FILE ]
OR
test parameter FILE
Where parameter can be any one of the following:
- -e: Returns true value if file exists
- -f: Return true value if file exists and regular file
- -r: Return true value if file exists and is readable
- -w: Return true value if file exists and is writable
- -x: Return true value if file exists and is executable
- -d: Return true value if exists and is a directory
Find out if file /etc/passwd file exists or not:
$ [ -f /etc/passwd ] && echo "File exists" || echo "File does not exists"
# Make the directory for the job ID you are running if it does not exist
[ -d $HOME/scratch/jobid_$JOB_ID ] || mkdir -p $HOME/scratch/jobid_$JOB_ID
You can use conditional expressions in a shell script: #!/bin/bashSee if a directory exists or not with NOT operator:
FILE=$1
if [ -f $FILE ];
then
echo "File $FILE exists"
else
echo "File $FILE does not exists"
fi
[ ! -d directory ]
OR
! test directory