Running Batch Jobs


The queueing system used on the E10ks is the Load Sharing Facility (LSF),
version 3.2.3, written by Platform Computing Corporation.

The queue structure currently in place on the E10k cluster is:
 
Queue Name   Purpose
hpcm         Parallel batch jobs on magenta (64 processors)
hpct         Parallel batch jobs on teal (32 processors)
short        Quick, serial batch jobs that need few resources
normal       Moderately long serial batch jobs

Batch Job Submission

To submit a batch job, use the "bsub" command.  "bsub" has many options (use "man bsub" to find out more), but all you need to begin with are '-q' to specify the queue name, '-I' if you want an interactive parallel batch job, and '-n' to specify the minimum and maximum number of processors necessary to run the job (given as "min,max").

Here is an example using the hpct queue to run an MPI program called "monte":

        % bsub -o m3.o -e m3.e -n 1,32 -q hpct ./monte
        Job <425> is submitted to queue <hpct>.
 
The files m3.o and m3.e will contain stdout and stderr respectively.
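
If you submit similar jobs often, the command line can be assembled by a small shell function. The following is only a sketch: the function name "build_bsub" and the SUBMIT variable are hypothetical helpers, not part of LSF, and only the bsub options described above (-o, -e, -n, -q) are used. By default it prints the command rather than running it, so it can be checked before anything is submitted.

```shell
#!/bin/sh
# build_bsub (hypothetical helper, not part of LSF):
# print a bsub command line for a parallel batch job.
#   $1 = queue name, $2 = minimum processors, $3 = maximum processors,
#   $4 = base name for the output files, remaining args = program to run
build_bsub() {
    queue=$1
    min=$2
    max=$3
    name=$4
    shift 4
    # stdout goes to <name>.o and stderr to <name>.e
    echo "bsub -o $name.o -e $name.e -n $min,$max -q $queue $*"
}

cmd=$(build_bsub hpct 1 32 m3 ./monte)
echo "$cmd"

# Only submit when explicitly asked to (SUBMIT is an assumed variable name).
if [ "${SUBMIT:-0}" = "1" ]; then
    $cmd
fi
```

Run as is, the script prints the same bsub line as in the example above; setting SUBMIT=1 in the environment makes it submit the job as well.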

Documentation and the LSF user's manual can be found at the following URL:
http://www.rahul.net/chord/pg30/users-title.html

Again, the queue design may change, and users are encouraged to check the E10k web page frequently for the latest topology and queue structure.