Condor supports high-speed distributed parallel computing using OpenMPI. MPI jobs require use of the dedicated queue on hooke, and can't be submitted from wren.
MPI jobs run on the Boyle nodes: 16 eight-core machines connected by a low-latency InfiniBand interconnect. OpenMPI jobs automatically use the fastest available transport for message passing between processes.
Condor handles MPI jobs via a front-end wrapper script called ompiscript (full path: /usr/local/bin/ompiscript). The wrapper is submitted as the executable, and the submitter's MPI executable becomes the first command line argument. The wrapper automates the usual tasks associated with starting an MPI job: it builds a hosts file, runs the mpirun command, and sets up passwordless ssh access between the nodes.
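In effect, a submit file that names ompiscript as the executable and passes "exe1 argv1 argv2" as its arguments ends up running something along the lines of the command below (a rough sketch for illustration only; the exact options are internal to ompiscript and may differ):

mpirun -np <machine_count> --hostfile <generated hosts file> exe1 argv1 argv2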
It remains crucial, of course, that you make sure your MPI code runs correctly from the shell before submitting it to Condor.
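For example, a quick sanity check with a small process count (assuming your program is called exe1, as in Example 1 below, and that mpirun is available on the machine you test from):

mpirun -np 4 ./exe1 argv1 argv2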
Submit file examples follow.
Example 1: Running on any available nodes
The generic form of MPI submission runs on any available CPUs (currently this means the Boyle nodes only). Contact IT first if you intend to submit an MPI job to more than 32 CPUs.
universe = parallel
executable = /usr/local/bin/ompiscript
arguments = exe1 argv1 argv2 ....
getenv = True        ## needed for your env to be present on execute nodes
output = exe1.out    ## stdout
error = exe1.err     ## stderr
machine_count = 32
queue
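Save the description to a file (the name is arbitrary; exe1.sub is used here purely as an example) and submit and monitor it in the usual way:

condor_submit exe1.sub
condor_q        ## check the job's progress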
If you want a separate output file for each process, you can use the $(NODE) macro, which expands to each process's node number (0 up to machine_count - 1):
output = out/exe1.out.$(NODE)
error = out/exe1.err.$(NODE)
Example 2: Running a job on a specific node
A job run on 8 CPUs can be kept on a single (eight-core) node, which gives slightly faster interprocess communication. Find a node with no jobs currently running on it and name that node in the submit file (see the requirements line shown after the example below).
universe = parallel
executable = /usr/local/bin/ompiscript
arguments = exe2 argv1 argv2 ....
getenv = True
output = exe2.out
error = exe2.err
machine_count = 8
queue
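As written, the file above does not yet pin the job to a particular node. One way to do so (a sketch; the hostname below is a placeholder you must replace with the full hostname of the idle Boyle node you chose) is to add a requirements expression, and condor_status will show which nodes are currently free:

requirements = (Machine == "<full hostname of the chosen node>")

condor_status        ## lists the execute nodes and their current state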