Thursday, October 28, 2010

Running R on a cluster as a job array

Simple-Job-Array-Howto - GridWiki: "Example: R Scripts with Grid Engine Job Arrays

All of the above applies to well-behaved, interactive programs. However, sometimes you need to use R to analyze your data. In order to do this, you have to hardcode file names into the R script, because these scripts are not interactive. This is a royal pain. However, there is a solution that makes use of HERE documents in bash. HERE documents also exist in perl, and an online tutorial for them in bash is at http://www.tldp.org/LDP/abs/html/here-docs.html. The short of it is that a HERE document lets you embed a skeleton document inside a shell script and have the shell fill in its variables before passing it to a program. Let’s concoct an example. You have 10 data files, labeled data.1 to data.10. Each file contains a single column of numbers, and you want to do some calculation for each of them, using R. Let’s use a HERE document:
#!/bin/sh
#$ -t 1-10
WORKDIR=/Users/jl566/testing
INFILE=$WORKDIR/data.$SGE_TASK_ID
OUTFILE=$WORKDIR/data.$SGE_TASK_ID.out
# See comment below about paths to R
PATHTOR=/common/bin
if [ -e $OUTFILE ]
then
rm -f $OUTFILE
fi
# Below, the phrase "EOF" marks the beginning and end of the HERE document.
# Basically, what’s going on is that we’re running R, suppressing all of
# its output to STDOUT, and feeding it whatever’s between the EOF words
# as an R script, using variable substitution to act on the desired files.
$PATHTOR/R --quiet --no-save > /dev/null << EOF
x<-read.table("$INFILE")
write(mean(x\$V1),"$OUTFILE")
EOF
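
To see which files each array task would touch before submitting anything, a minimal sketch along these lines can be run on a workstation. It assumes the same data.N naming as above. The function name task_files and the loop over IDs 1 to 3 are made up for illustration, and SGE_TASK_ID is set by hand here rather than by Grid Engine.

```shell
#!/bin/sh
# Dry run: mimic what Grid Engine does for "#$ -t 1-10" by setting
# SGE_TASK_ID by hand. WORKDIR is the example directory from the script above.
WORKDIR=/Users/jl566/testing

# task_files is a hypothetical helper: it prints the input and output file
# that the task with the given ID would use.
task_files() {
    SGE_TASK_ID=$1
    INFILE=$WORKDIR/data.$SGE_TASK_ID
    OUTFILE=$WORKDIR/data.$SGE_TASK_ID.out
    echo "$INFILE -> $OUTFILE"
}

# Only the first three task IDs are shown to keep the output short.
for id in 1 2 3
do
    task_files "$id"
done
```

Each line of output pairs one input file with the output file that the matching task would write, which makes naming mistakes easy to spot before the job goes to the queue.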

So now you can use the cluster to analyze your data – just write the R script within the HERE document, and go from there. As I’ve only just figured this out, some caveats are necessary. If anyone experiments and figures out something neat, let me know. Be aware of the following:

1. In my limited experience, indenting is important for HERE documents. In particular, the closing delimiter (the final EOF line in the above example) must start at the left-hand edge of the buffer (i.e. not be indented at all); the line that opens the HERE document can be indented like any other command. So, if you use a HERE document in a conditional or control statement, be mindful of this.
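2. In the mean command, I escaped the dollar sign with a backslash so that the shell passes x$V1 through to R unchanged. In my limited experiments, both mean(x\$V1) and mean(x$V1) seem to work, but they are not equivalent: unescaped, the shell replaces $V1 with the value of the shell variable V1 (normally unset, so empty) and R receives mean(x), so the escaped form is the safer one. Conversely, escaping the dollar sign in the read.table command prevents the variable substitution from occurring in the shell, causing R to fail, because the input file named $INFILE cannot be found. In other words, escaping in that context causes the HERE doc to pass $INFILE as a string literal to R, rather than the value stored in the shell variable.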
3. This is more useful than just array jobs on an SGE system. If you know bash well enough, you can write a shell script that takes a load of arguments, and processes them with a HERE document. This solves a major limitation with R scripts themselves. You can do the same in perl, too, on your workstation, but you must use a shell language on the cluster.
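
The escaping behaviour described in caveat 2 can be checked without R by sending the HERE document to cat, which simply prints what R would have received. This is only a sketch: the path in INFILE and the function names render_expanded and render_literal are made up for illustration, and V1 is deliberately unset.

```shell
#!/bin/sh
# Print what R would receive, using cat in place of R.
# INFILE is a made-up path for illustration; V1 is deliberately unset.
INFILE=/tmp/data.1
unset V1

# Unquoted delimiter: the shell expands $INFILE and the unescaped $V1,
# while the escaped \$V1 is passed through as a literal $V1.
render_expanded() {
cat << EOF
x<-read.table("$INFILE")
mean(x\$V1)
mean(x$V1)
EOF
}

# Quoted delimiter: nothing is expanded, so $INFILE reaches R literally.
render_literal() {
cat << 'EOF'
x<-read.table("$INFILE")
EOF
}

render_expanded
render_literal
```

Swapping cat back for the R command line turns the dry run into the real thing, so this is a cheap way to debug a HERE document before submitting a job.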

"

5 comments:

  1. Two steps to submitting an R BATCH job to the cluster
    Using the Sun Grid Engine to submit an R BATCH job to the cluster is very simple. You would normally do something like

    R CMD BATCH mycommands.R &

    where mycommands.R is a file of R commands that you want to run. Remember, you CANNOT do this on usher.

    1. First you need to create a new file that will call R. Let's call this new file batch.sh. You need to put the batch.sh file in the same directory as your mycommands.R file.

    To run an R BATCH job on the cluster using the mycommands.R file, your batch.sh file would look something like this:

    #!/bin/bash

    R CMD BATCH mycommands.R

    The technical name for this file is "shell script". Knowing this might help you communicate with the system administrator.
    2. Once you've written your short batch.sh file you can submit it to the cluster via the command

    qsub -cwd batch.sh

    The -cwd switch tells the cluster to execute the batch.sh script in the current working directory (otherwise, it will run in your home directory, which is probably not what you want).

    That's all you have to do! One thing to note:

    Note that you don't have to put the & at the end of the line. qsub automatically sends your job into the background so that you can do other things.
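
As a rough sketch of the two steps above, the following prints the contents of batch.sh for a given R file and prints the qsub command instead of running it, so it can be tried on a machine without Grid Engine. The function names batch_script_text and rout_name are made up for illustration. For an input file ending in .R, R CMD BATCH writes its output to a file with that ending replaced by .Rout, which is what rout_name computes.

```shell
#!/bin/sh
# batch_script_text (hypothetical helper) prints the wrapper script that
# qsub would run for the R file given as its first argument.
batch_script_text() {
    printf '#!/bin/bash\n\nR CMD BATCH %s\n' "$1"
}

# rout_name (hypothetical helper) prints the name of the file where
# R CMD BATCH puts its output for an input file ending in .R.
rout_name() {
    echo "${1%.R}.Rout"
}

# Show the wrapper; redirect this to batch.sh to create the real file.
batch_script_text mycommands.R

# Dry run: print the submission command rather than calling qsub.
echo "To submit: qsub -cwd batch.sh"
echo "Output will appear in: $(rout_name mycommands.R)"
```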

  2. I recently solved this problem when running an R program from Linux. The basic
    idea is to use environment variables and Sys.getenv (see ?Sys.getenv).

    This is a general approach to calling R from a script file.

    Here is part of my code, with explanation:

    # assign some values to the arguments
    dvar=5
    categorical=3
    ....

    # export them as env. var.
    export dvar mod_min mod_max tree_no cp categorical

    # run your R program in batch mode (R CMD BATCH takes the input file
    # as an argument, not on standard input)
    R CMD BATCH $HOME/r/batch/mk_trees.r

    The following is part of my mk_trees.r:
    #load arguments
    cargs<-Sys.getenv(c('dvar','mod_min','mod_max','tree_no','cp','categorical'))
    dvar<-as.numeric(cargs[1])
    mod_min<-as.numeric(cargs[2])
    mod_max<-as.numeric(cargs[3])
    tree_no<-as.numeric(cargs[4])
    cp<-as.numeric(cargs[5])
    cc<-strsplit(cargs[6], "\n")
    categorical<-as.numeric(cc[[1]])
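
The export step matters because a child process only inherits exported variables. A small sketch of that, with a child shell standing in for R: the variable not_exported and the function child_view are made up for illustration, and the values are the ones from the comment above.

```shell
#!/bin/sh
dvar=5            # exported below, so child processes see it
categorical=3     # exported below
not_exported=99   # hypothetical variable that is never exported

export dvar categorical

# child_view runs a child shell, standing in for R. The child prints what it
# finds in its own environment, much as Sys.getenv would report it.
child_view() {
    sh -c 'echo "dvar=$dvar categorical=$categorical not_exported=$not_exported"'
}

child_view
```

The unexported variable comes back empty in the child. In R, Sys.getenv returns an empty string for such a variable, and as.numeric of an empty string is NA, so a forgotten export tends to show up as NA values rather than as an error.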

  3. On Fri, 25 Feb 2005, Brahm, David wrote:

    > I have never understood the difference between
    >> R CMD BATCH --vanilla --slave myScript.R outFile.txt
    > and
    >> R --vanilla --slave < myScript.R > outFile.txt

    [On Unix-alikes]

    Mainly cosmetics, e.g. the first adds timings and options(echo=TRUE) and
    generates a suitably named outfile. [Also sets --restore --save which
    you then override.]

    It used to set --gui=none to avoid the overhead of loading the X11 module.

    More importantly, the first redirects stderr and yours does not, but see
    my reply to the OP's other thread for a closer equivalent.

    [...]

    --
    Brian D. Ripley, ripley@stat...
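
The stderr point can be illustrated in plain shell. The reply referred to above is not reproduced here, so this is only a guess at the general idea: adding 2>&1 after the output redirection sends stderr to the same place as stdout. The function noisy is a made-up stand-in for an R run.

```shell
#!/bin/sh
# noisy is a hypothetical stand-in for an R run: it writes one line to
# stdout and one line to stderr.
noisy() {
    echo "result"
    echo "warning" >&2
}

# Capturing stdout only, as "... < myScript.R > outFile.txt" does.
# stderr is discarded here just to keep the demonstration quiet.
stdout_only=$(noisy 2>/dev/null)

# Capturing stdout and stderr together, as
# "... < myScript.R > outFile.txt 2>&1" would.
both=$(noisy 2>&1)

echo "stdout only:"
echo "$stdout_only"
echo "stdout and stderr:"
echo "$both"
```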

  4. On Thu, 23-Nov-2006 at 02:44PM +0000, Prof Brian Ripley wrote:

    |> Try this:
    |>
    |> gannet% cat month.R
    |> x <- commandArgs()
    |> print(x[length(x)])
    |>
    |> gannet% R --slave --args January < month.R
    |> [1] "January"

  5. Rmpi provides an MPI interface for R [Rmpi documentation].
    The package snow (Simple Network of Workstations) implements a simple mechanism for using a workstation cluster for "embarrassingly parallel" computations in R. [snow documentation]

    Users who wish to use Rmpi and SNOW will need to add the following line to the .cshrc or .bashrc file located in the user's home directory. For .cshrc

    setenv PATH /usr/local/openmpi/bin:/usr/local/lam/bin:$PATH

    or, for .bashrc

    PATH=/usr/local/openmpi/bin:/usr/local/lam/bin:$PATH
    export PATH

    Restart biowulf session. This only needs to be done once.

    Sample Rmpi batch script:

    #!/bin/bash
    #PBS -j oe

    cd $PBS_O_WORKDIR

    # Get OpenMPI in our PATH. openmpi_ipath and openmpi_ib
    # can also be used if running over those interconnects.

    `which mpirun` -n 1 -machinefile $PBS_NODEFILE R --vanilla > myrmpi.out <<EOF

    library(snow)
    cl <- makeCluster($np, type = "MPI")
    clusterCall(cl, function() Sys.info()[c("nodename","machine")])
    clusterCall(cl, runif, $np)
    stopCluster(cl)
    mpi.quit()
    EOF


    The above script could be submitted with:

    qsub -v np=4 -l nodes=2 myscript.bat

    Note that it is entirely up to the user to run the appropriate number of processes for the nodes requested. In the example above, the $np variable is set to 4 and exported via the qsub command, and this variable is used in the script to run 4 snow processes on 2 dual-cpu nodes. Note: myrmpi.out contains the results from the finished job.
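
Because $np is filled in by the shell before R ever sees the script, a forgotten -v np=... leaves an empty argument in makeCluster. A minimal sketch of a guard against that, with cat standing in for mpirun and R so it runs anywhere: the function name render_snow_script and the error message are made up for illustration, and np is set by hand rather than by qsub.

```shell
#!/bin/sh
# render_snow_script (hypothetical helper) prints the R code that the HERE
# document would feed to R, with cat standing in for mpirun and R.
render_snow_script() {
    # Stop early with a message if np was not passed in.
    : "${np:?np must be set, for example with qsub -v np=4}"
cat << EOF
library(snow)
cl <- makeCluster($np, type = "MPI")
clusterCall(cl, runif, $np)
stopCluster(cl)
EOF
}

# Set by hand here; under PBS it would arrive through qsub -v np=4.
np=4
render_snow_script
```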
