SGE / Batch-queuing system

From PERFORM Wiki

Utilizing PERFORM’s queue is the best way to quickly process large amounts of data. When you submit a job to the queue, it is assigned to a cluster computer that has the resources (RAM and CPU) available to process it. The compute nodes (perf-hpc01 through perf-hpc12) will process jobs first, followed by workstation nodes.

Additionally, graphically intensive programs (such as MATLAB’s desktop environment / GUI) should rarely be used on the cluster, since you will encounter latency (lag). These programs should only be used for testing, creating scripts, or getting instant feedback. For maximum performance, write scripts that do the calculations in the background without loading the graphical libraries. Alternatively, you could develop a script with these programs on your own computer and then copy it over to the cluster to process your data.


Submitting Jobs to the queue:

A job can be submitted to the cluster in either of two ways:

  1. From within a script:
    #!/bin/bash
    # Example script: resampleimage.sh
    qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample <<END

    mincresample /path/to/image/file.mnc -like /path/to/image/template.mnc out.mnc

    END

    Where:
    -o logs/mnc_out.txt is the file that receives the output of the job. Make sure that the logs folder exists before submitting
    -q all.q is the queue that you are submitting the job to
    -N mncresample is the name displayed in the queue
    mincresample … is the command(s) you want executed
    -V exports your current environment variables (loaded paths & modules) to the job
    -j y merges the error stream and the output stream into the log file
    -cwd runs the job from the current working directory
    • Note that you have to make the script executable afterwards (e.g. chmod 755 resampleimage.sh) and load any modules that the commands rely on before executing the script.

  2. From the command line: qsub -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample ./resampleimage2.sh
    Where ./resampleimage2.sh is the name of any script that you want to execute. In this example, the resampleimage2.sh script would only contain the mincresample line and not the qsub and END lines.
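
The preparation steps mentioned above (creating the logs folder, making the script executable, then calling qsub) could be wrapped in a small helper. This is only a minimal sketch: the function name submit_job and the DRY_RUN variable are hypothetical, and the qsub flags are simply the ones used in the examples on this page.

```shell
#!/bin/bash
# Hypothetical helper: prepare the working directory, then submit a script.
# Usage: submit_job <job-name> <script>
# Set DRY_RUN=1 to print the qsub command instead of running it.
submit_job() {
    name="$1"
    script="$2"
    mkdir -p logs            # the folder given to -o must already exist
    chmod 755 "$script"      # the submitted script must be executable
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"           # print the command rather than executing it
    fi
    $run qsub -j y -o "logs/${name}.txt" -V -cwd -q all.q -N "$name" "$script"
}
```

For example, DRY_RUN=1 submit_job mncresample ./resampleimage2.sh prints the command that would be submitted, which is a convenient way to check the flags before using the queue.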


Advanced Flags:

If your job needs a specific number of cores or amount of RAM, you can request it. Note that the job won’t start until a machine that can satisfy your entire request becomes available.

Add the following flags to your qsub command:

-pe smp <num_cores> reserves <num_cores> cores, where <num_cores> is a number from 1 to 32
-l h_vmem=<num>G reserves <num> GB of RAM, e.g. -l h_vmem=12G reserves 12 GB
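
As an illustration, the two flags can be combined with the submission command from the previous section. The sketch below is hedged: resource_flags is a hypothetical helper name, the 1-32 check simply mirrors the range stated above, and the final command is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: build the resource flags described above.
# Usage: resource_flags <num_cores> <ram_gb>
resource_flags() {
    cores="$1"
    ram_gb="$2"
    # This page gives 1 to 32 as the valid range for -pe smp.
    if [ "$cores" -lt 1 ] || [ "$cores" -gt 32 ]; then
        echo "num_cores must be between 1 and 32" >&2
        return 1
    fi
    echo "-pe smp ${cores} -l h_vmem=${ram_gb}G"
}

# Print a submission that requests 4 cores and 12 GB of RAM.
# $flags is left unquoted on purpose so that it splits into separate flags.
# Remove the leading "echo" to actually submit the job.
flags="$(resource_flags 4 12)" &&
    echo qsub $flags -j y -o logs/mnc_out.txt -V -cwd -q all.q -N mncresample ./resampleimage2.sh
```

Checking the range before submitting avoids queuing a request that, per the note above, would wait for a machine that can satisfy it.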


Running MATLAB scripts:

If you are going to run a MATLAB script (.m file) with qsub, you need to launch MATLAB first so that it can read the .m file. Otherwise, qsub won't know what to do with your script. Here is a sample command:

echo matlab -nojvm -nodisplay -nosplash -nodesktop -r \"run ./m.m\" | qsub -j y -o logs/mat.txt -V -cwd -q matlab.q -N matlabJob

where m.m is the MATLAB script that you are running. Notice that the command is similar to option 2 under "Submitting Jobs to the queue", with the difference that you pipe in the command you want to run (i.e. echo matlab -nojvm -nodisplay -nosplash -nodesktop -r \"run ./m.m\" | ) instead of naming a script. It is important to specify the -nodesktop flag for MATLAB, since jobs on the cluster run without MATLAB's graphical desktop. Also, this command is run from the same directory as the m.m file; to make it work from any directory, replace ./ with the path to your .m file.
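
Replacing ./ with the full path can be automated. The following is a minimal sketch, not a tested site recipe: matlab_cmd is a hypothetical helper name, and it assumes the path to the .m file contains no spaces.

```shell
#!/bin/bash
# Hypothetical helper: print the MATLAB command line for a given .m file,
# using its absolute path so the job does not depend on the submit directory.
# Assumes the path contains no spaces.
matlab_cmd() {
    script_dir="$(cd "$(dirname "$1")" && pwd)"   # absolute folder of the .m file
    script_file="$(basename "$1")"
    echo "matlab -nojvm -nodisplay -nosplash -nodesktop -r \"run ${script_dir}/${script_file}\""
}

# Usage (from a directory that contains a logs folder):
#   matlab_cmd /path/to/m.m | qsub -j y -o logs/mat.txt -V -cwd -q matlab.q -N matlabJob
```

The helper only prints the command; piping its output into qsub, as in the usage comment, mirrors the echo-based sample above.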


Monitoring a job on the queue:

If you want to see the progress of your job in the queue you can use the following commands:

qstat to get a quick summary of your job status - either pending ( qw ), running on a machine ( r ), or in the error state ( Eqw )
qstat -f to see your jobs in relation to each machine on the cluster (a more detailed version of the qstat overview). You can also see the status of the cluster with it.
qstat -f -u \* to see every job running on the cluster (this can be helpful to see if the queue is being used by others, and whether you should expect to wait a while).
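
For scripted monitoring, the state column of qstat can be read directly. The sketch below rests on an assumption that should be checked on this cluster: that plain qstat uses the default Grid Engine layout, with the job-ID in the first column and the state in the fifth. The helper names job_state and wait_for_job are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the state code (qw, r, Eqw, ...) of one job.
# Assumes the default qstat layout: job-ID in column 1, state in column 5.
# Prints nothing if the job is not listed (for example, once it has finished).
job_state() {
    qstat | awk -v id="$1" '$1 == id { print $5; exit }'
}

# Hypothetical helper: block until the job no longer appears in qstat.
# Usage: wait_for_job <job-ID> [poll-interval-in-seconds]
wait_for_job() {
    while [ -n "$(job_state "$1")" ]; do
        sleep "${2:-30}"
    done
}
```

A job that is stuck in the error state stays listed, so wait_for_job would not return for it; checking job_state first is advisable.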


Deleting a job on the queue:

qdel <job-ID> To delete a specific job
qdel -u <username> To delete all of your jobs in the queue


Job status / Error checking:

qstat -j <job-ID> To check the status of a job

This can be useful when the job is in the error state ( Eqw ), since the output includes the reason for the error.
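
Because the output of qstat -j is long, it can help to filter it. This sketch assumes that failures are reported on lines beginning with "error reason", which is typical for Grid Engine but has not been confirmed for this cluster; job_errors is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print only the error lines from a job's details.
# Assumes "qstat -j" reports failures on lines starting with "error reason".
# Usage: job_errors <job-ID>
job_errors() {
    qstat -j "$1" 2>/dev/null | grep "^error reason"
}
```

If nothing is printed, the job has no recorded error, or the assumption about the output format does not hold here.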


Reserving a CPU core:

If you are working on the cluster but don’t want to submit a job to the queue, you can reserve a CPU core with the following command:

qrsh

The qrsh command assigns you a machine at random.