Instructions for the use of ISyE's computational Linux cluster:
Initial requirements
You need an ISyE Unix Account.
Logging in
The central 64-bit compute server is isye-compute01.isye.gatech.edu (or just "compute01" if you are on ISyE's subnets).
You should be able to access compute01 via ssh using your usual ISyE Unix account and password. Your Unix home directory will be mounted and accessible.
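For example, from a terminal on your own machine (replace myusername with your own ISyE Unix username):

ssh myusername@isye-compute01.isye.gatech.edu

If you are on ISyE's subnets, "ssh myusername@compute01" works as well.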
Compute cluster hierarchy
The central server, compute01, is called the "head node". When compute jobs are ready to be run, the head node assigns execution of them to subordinate servers which are called "compute nodes".
Users only need to log into the head node. The compute nodes do not support direct logins; compute jobs must be submitted to them through the Condor batch queue scheduler.
Development
Development for 64-bit Linux compute jobs should be done on compute01, and short interactive test runs can be performed there. You should test your program "locally" on the head node before submitting it to Condor. By "local" we mean executing the program in your login shell on compute01.
You want to make sure your code starts and runs as you expect for at least a few minutes locally on compute01, before you submit it to Condor. By testing the program locally first, you will save yourself and everyone a lot of headaches. Condor works great once your code is ready to go, but it's a very frustrating way to debug!
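For example, if your program were the "foo" executable used in the submit file example below, a quick local check on compute01 might look like this (the program name and arguments are just illustrations):

./foo 10 20 30

Once it starts cleanly and produces the output you expect, it is ready to submit to Condor.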
Long-running compute jobs and large batches of unattended short-running jobs are not allowed on the head node.
GUI development interfaces can be exported to the local X Windows desktop via ssh X forwarding (the ssh -X option).
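For example (the application launched here is only an illustration):

ssh -X myusername@isye-compute01.isye.gatech.edu
matlab &

Any X-based program started this way will display on your local desktop.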
Condor commands
Condor's executables are located in /opt/condor/bin, which should be in your default command path on compute01, with man pages available for most important commands.
Generally the most useful Condor commands are:
condor_submit -- submit jobs to the Condor batch scheduler.
condor_status -- show overall cluster status.
condor_q -- show the current job queue and the status of jobs.
condor_rm -- remove jobs from the Condor queue.
condor_status -master -- view all compute nodes and their resources.
condor_status -compact -constraint 'TotalGpus > 0' -af Machine TotalGpus CUDADeviceName -- view compute nodes with CUDA GPUs.
condor_qedit <job-id> RequestMemory 3078 -- edit an attribute of an existing job (here, its memory request).
condor_rm 4.2 -- remove only the job with job ID 4.2.
condor_q -analyze 107.5 -- analyze a particular job (here, job 107.5).
Job submission
For most compute jobs, you will need to create a condor submit description file (a ".cmd" file) and then run "condor_submit filename.cmd".
An example .cmd file is listed below. Other examples can be found in /opt/condor/examples, and the syntax for .cmd files is documented in the condor_submit man page.
In the simplest cases, you can submit a job to condor by running "consub filename". This will work for a simple program that takes no options and uses no input or output files (so that, if run locally, its output would go to the screen). consub will create a simple filename.cmd file, copy the executable to a compute node, run the job, and then copy the output back to filename.out.
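For example, for a self-contained program named myprog (a hypothetical name) in the current directory:

consub myprog

When the job finishes, its screen output will be in myprog.out.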
File accessibility on the compute nodes
You can run jobs directly from your Unix home directory tree, and all your files will be accessible to the compute nodes.
If you need more space than currently allocated, request it from IT.
It is possible to transfer files to the nodes, and then back to compute01 upon completion (should_transfer_files). On dedicated compute clusters like ours this is generally not the desired method. Your home directory is mounted on the compute nodes just as it is on compute01 or castle, so there is really no need to transfer files.
Log files
The Condor log file should be written to /tmp, rather than to your home directory tree. Any user can create a directory in /tmp. Typically, just create /tmp/<username> (using your own Unix username) and use that as the output path for the "log" directive (see the example below).
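For example, from your login shell on compute01:

mkdir -p /tmp/$USER

This creates a directory named after your Unix account that you can point the "log" directive at.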
Condor universe
You must specify a "universe" to run a condor job in. For basic compute jobs written in any language, add a line to use the "vanilla" universe. If you do not explicitly specify the universe, it will default to the "standard" universe, which will not work without special compile-time options.
Example .cmd file
universe = vanilla
executable = foo
output = foo.out
error = foo.err
log = /tmp/<username>/foo.log
arguments = 10 20 30
queue
This command file should be called "foo.cmd" and placed in the same global scratch space subdirectory as the program "foo". You will then type "condor_submit foo.cmd" to run this compute job. Normal screen output will go to "foo.out" and stderr will go to "foo.err". The program expects three command-line arguments, which are "10", "20", and "30".
Note that the log file is placed in /tmp! This is very important! Log files written to NFS-mounted home or scratch directories will often break condor due to NFS locking problems. So, once again, create a directory for yourself in /tmp, and write your log files there.
Any other input and output files generated by the program will be accessed normally, as long as the path is accessible to the program (normally, this means in your home or scratch directory).
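A typical session for this example, run from the directory containing "foo" and "foo.cmd", might look like this (assuming you have already created your /tmp/<username> directory as described above):

condor_submit foo.cmd
condor_q
cat foo.out

condor_q lets you watch the job's progress through the queue, and foo.out will contain the program's screen output once the job completes.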
In many cases you may also need to run jobs that won't work correctly without your login environment variables (such as PATH, LD_LIBRARY_PATH, or software-specific environment variables). Since these won't be available by default on the compute nodes, you need to explicitly add them via the submit file using this line:
getenv = true
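For example, an abbreviated version of the submit file above with the login environment passed through (same hypothetical program name as before):

universe = vanilla
getenv = true
executable = foo
arguments = 10 20 30
queue

The getenv line can appear anywhere before the queue statement.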
Notification
The cluster will not normally notify you of job completion. If you'd like notification, you should add one of the following lines depending on the level of notification you would like:
notification = error
notification = complete
notification = always
The default email address for these notifications is the ISyE email address of the user submitting the jobs. If you'd like mail sent to an alternate address, add the following line:
notify_user = my.email@domain.com
Please note: If you are creating more than 10 or so jobs, please disable all notifications except errors. Realize that if you start 100 jobs with notifications turned on, you're going to get 100 emails in your inbox.
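For example, to be emailed only when a job has an error, at an alternate address (the address below is just a placeholder), add these two lines to the submit file:

notification = error
notify_user = my.email@domain.com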
Examples in /opt/examples on isye-compute01.isye.gatech.edu
There is a set of simple jobs in /opt/examples on isye-compute01.isye.gatech.edu. Use them as templates for your submit files, but please be careful not to submit from those directories. You should be able to copy them to your home directory.
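For example, to copy one of them (here, the basic example listed below) into your home directory and work from there:

cp -r /opt/examples/basic_condor_example ~/
cd ~/basic_condor_example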
This list does not include every type of job possible on our system. If you want to do something not included in this list, send an email to our helpdesk (help[at]isye.gatech.edu) and we will work with you on a solution. Currently there are:
- basic_condor_example
- matlab_condor_example
- c++ without checkpointing
- c++ with checkpointing (in progress)
- R_condor_example
- R_mpi_example
More on Condor
For further documentation, please see the Condor User's Manual.