Why do my jobs keep restarting in Condor? | H. Milton Stewart School of Industrial and Systems Engineering

Jobs in the normal Condor queues (those on Wren) are subject to preemption. When a job is preempted, it is vacated from the node and restarted elsewhere.

Jobs are usually stopped and restarted on Condor for one of four reasons:

Someone logs in locally to the execute machine. Many of our Condor compute nodes were bought by individual faculty members for their own research, and a direct shell login by one of their users will cause jobs to vacate.
You're running your job on the boyles and someone submits an MPI job. Although we have turned off priority preemption, parallel universe jobs still preempt other jobs. The time to vacate is set at 5 calendar days. If your job does not complete within 5 calendar days, it will be vacated from the nodes in deference to the parallel universe job.
The job ran out of memory. At first, the process is killed and the job is restarted on another machine with more available memory. If its memory usage keeps rising, the job will eventually just remain idle.
The node died while your job was running on it. This usually happens because of memory errors, which may be caused by your job or by another user's.
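
To get a rough idea of how often a job has been vacated, one option is to scan the job's user log (the file named by `log` in the submit file). The Python sketch below is illustrative only. It assumes the plain-text log format in which each event starts with a three-digit code and the job ID in parentheses, with `001` marking a start and `004` an eviction. The function name `count_restarts` and the sample log are hypothetical, made up for this example. The counts show that a job was restarted, but not which of the reasons above applied.

```python
import re
from collections import Counter

# Assumed user-log event header, for example:
#   004 (1234.000.000) 07/10 14:30:00 Job was evicted.
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")

# Assumed event codes: 001 = job executing, 004 = job evicted.
EXECUTE, EVICTED = "001", "004"


def count_restarts(log_text):
    """Return {job_id: (starts, evictions)} for each job seen in the log text."""
    starts, evictions = Counter(), Counter()
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if not match:
            continue  # skip detail lines and "..." separators
        code, cluster, proc = match.groups()
        job_id = f"{int(cluster)}.{int(proc)}"
        if code == EXECUTE:
            starts[job_id] += 1
        elif code == EVICTED:
            evictions[job_id] += 1
    jobs = set(starts) | set(evictions)
    return {job: (starts[job], evictions[job]) for job in sorted(jobs)}


# Hypothetical sample log, invented for illustration only.
SAMPLE_LOG = """\
000 (1234.000.000) 07/10 11:58:02 Job submitted from host: <10.0.0.1:9618>
...
001 (1234.000.000) 07/10 12:00:00 Job executing on host: <10.0.0.2:9618>
...
004 (1234.000.000) 07/10 14:30:00 Job was evicted.
\t(0) Job was not checkpointed.
...
001 (1234.000.000) 07/10 14:35:10 Job executing on host: <10.0.0.3:9618>
...
"""

if __name__ == "__main__":
    for job, (n_start, n_evict) in count_restarts(SAMPLE_LOG).items():
        print(f"job {job}: started {n_start} time(s), evicted {n_evict} time(s)")
```

A job that shows more starts than you expect, or any evictions at all, has been vacated and restarted at least once. To use this on a real log, read the file's contents and pass the text to `count_restarts`.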
