Checkpointing a python Condor job | H. Milton Stewart School of Industrial and Systems Engineering

What is checkpointing

What happens when a machine in our Condor pool, that is in the middle of running your python script, fails?

Well, as far as Condor is concerned, when it recognizes the machine is down, it finds out the necessary information to restart your job on another computer node. It, by default, doesn't care about the data your job has generated so far. This process is referred to as eviction.

Thats where checkpointing comes in! If your job is configured to use checkpointing then when condor restarts your job on the next available node, the data generated so far (or since the last checkpoint) is saved. Great!

There is a catch, though. Native Condor checkpointing is only supported if you use certain programming languages. Information on native checkpointing and the languages supported for it can be found here or in the man page of condor_compile:

http://research.cs.wisc.edu/htcondor/manual/v8.1/condor_compile.html

You no doubt noticed that python is not on that list. Native Condor checkpointing won't help you if your heart is set on using python, and lets face it, most of the hearts here at ISYE are set on python.

Sample Script

Keeping your generated data after Condor has to restart your job on a functioning machine is not impossible.

Using a few techniques and a little extra effort you can avoid losing your data if Condor has to move your job even if you want to keep using python and c'mon who doesn't??

Description of the script and techniques used

If you read my matlab article (the other popular language here at ISYE) on the same subject you'll recognize most of this.

This code will randomly generate two numbers between 1 and 300.
Calculate the mean of these two numbers and store them in a variable.
Once 500 means have been calculated the average of these means are calculated.

We will keep track of the time and every 5 minutes we will store the means in a file as well as the number of means calculated and every time the script is started it will read this file and populate these variables if need be. This will allow the job to pick up if it was interrupted due to eviction.

Modules and techniques

Aside from some basic modules we will use the following modules as well

numpy: We will use numpy to generate our random numbers and calculate our means
time: We will use time to keep track of the time the script is running so we can save our progress every 5 minutes
cPickle: We will use cpickle to store our generated data. Don't confuse cPickle with regular pickle. In short, cPickle is faster than pickle. The following link explains:
https://wiki.python.org/moin/UsingPickle

While you're there make sure to read the section on "Flying Pickle Alert." In short, Don't trust pickles from strangers.
shutil: We will use shutil to write our loop progress to file in order to start it over on eviction.

#! /opt/python/bin/python

import sys
import time
from time import gmtime, strftime
import shutil
import os
import numpy as np
import cPickle as pickle

#First we check for a checkpoint file.  If it exists we import the following variables.  If  the file is not found the script assumes
#this is the first attempt and running it and starts over.
#Variables stored in the chk_py.txt file:
#n_start = integer variable that keeps track of the iteration of the loop
#data_file = string variable that lists the name of the pickle file we are storing our data in
#chk_pt_count = an integar variable that holds the number of times the script has been saved. This is optional and
#was added here for display only.

try:
    fp = open("chk_py.txt")
except:
    print("no chkpt file.  starting anew...\n")
    n_start =  0
    data_file = "sampleMeans.pic"
    chk_pt_count = 0
    sampleMeans = []
    pass
else:

    n_start = int(fp.readline())
    data_file = str(fp.readline())
    data_file = data_file.rstrip("\n")
    with open(data_file, 'r') as l:
        sampleMeans = pickle.load(l)
    chk_pt_count = int(fp.readline())
    fp.close()

r_range_from = 1
n_samples = 500
r_range_to = 300

#here we set how much time should pass before we create a checkpoint.  In this case its 5 minutes
interval_time = 5
chkptTime = int(time.time())/60 + interval_time

#Here we grab n_start which we got from our checkpoint file (chk_pt.txt) and set to the current iteration of our loop.
k = n_start
for k in range(k, n_samples):
    a = np.random.random_integers(r_range_from,r_range_to)
    b = np.random.random_integers(r_range_from,r_range_to)
    c=np.mean(np.array([a,b]))
#Not necessary for functionality.  Purely for demostration purpose you shouldn't add the first two lines and never add the third.
# It doesn't take our cluster any time at all to calculate 500 means.  time.sleep() is used to help demo our checkpoint procedure.
    print(str(k) + '\n')
    print(str(c) + '\n')
    time.sleep(2)

# Here is our checkpoint process.  
##First we check out current time against our chkptTime calculated earlier (5 minutes) and if current time is greater we save our progress.

    if time.time()/60 > chkptTime:
       print("Starting checkpoint functions\n")
       chkptTime += interval_time
       n_start = k
       try:
          fp = open("chk_py.txt", "w")
       except:
          pass
       else:
           chk_pt_count += 1
           fp.write(str(n_start) +'\n')
           with open(data_file, 'w') as i:
               pickle.dump(sampleMeans, i)
           fp.write(str(data_file) + '\n')
           fp.write(str(chk_pt_count) + '\n')
           print("Checkpoint has been done " + str(chk_pt_count))
           fp.close()

##And finally once 500 samples have been collected we calculate the overall mean here
overallMean = mean(sampleMeans)
print(str(overallMean) + '\n')

H. Milton Stewart School of Industrial and Systems Engineering

College of Engineering

Search

What is checkpointing

Sample Script

Description of the script and techniques used

Search

H. Milton Stewart School of Industrial and Systems Engineering