Running MATLAB on our Condor cluster can be a great benefit. But what if the computer Condor chooses to run your code becomes unavailable before the job finishes? There is good news and bad news. The good news is that Condor will automatically move your job to another available computer. The bad news is that any data generated by your job since it started on the first node will have to be generated again. That's where checkpointing comes in.
What is checkpointing?
Checkpointing is a way to save the progress of jobs running on nodes so that, when a node becomes unavailable and the cluster finds another node to run your code on, the job does not have to begin at the start of the code. It can start again approximately where it left off.
Condor can automatically checkpoint programs written in many languages, but MATLAB is not one of them. That does not mean we're out of luck.
With a few pieces of MATLAB code we can create our own checkpoints, which achieve basically the same result.
Here are some basic skills we use, along with the MATLAB documentation:
- MAT-files: Files used by MATLAB to store variables (data) outside the process running the script, in case you want to pass the information to another script or, in our case, to the next run of the script when Condor moves the job to another node.
- For loops: We compute a mean in each iteration, collect the results, and eventually generate the overall mean of all iterations.
- etime: We use etime to keep track of the elapsed time and checkpoint our progress every 5 minutes.
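Each of these building blocks can be tried on its own before they are combined. The following is a minimal sketch, not part of the tutorial script: the file name demo_chk.mat, the loop bound 45, and the 0.1 second pause are assumptions chosen only for illustration.

```matlab
% Save two variables to a MAT-file, clear them, and load them back.
k_start = 42;
chk_pt  = 3;
save('demo_chk.mat', 'k_start', 'chk_pt');   % demo_chk.mat is a hypothetical name
clear k_start chk_pt
load('demo_chk.mat');                        % k_start and chk_pt exist again

% Measure elapsed time in minutes with clock and etime.
start_time = clock;
pause(0.1);                                  % stand-in for real work
working_time = etime(clock, start_time) / 60;

% Resume a for loop from the restored index instead of from 1.
total = 0;
for k = k_start:45
    total = total + k;
end

% Remove the scratch file.
delete('demo_chk.mat');
```

After load, the workspace looks the same as it did before clear, which is what lets a restarted job pick up its loop index and data.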
    % This is the init section. If chk_p.mat and our desired product.mat exist,
    % load the variables stored in them so the for loop can resume from where
    % it was interrupted.
    if exist('chk_p.mat', 'file') && exist('product.mat', 'file')
        load('chk_p.mat');
        load('product.mat');
    % If these files don't exist, the script starts from the beginning.
    else
        k_start = 1;
        chk_pt = 1;
    end

    % Regardless of where the script starts, these variables are constant.
    npoints = 50;
    nsamples = 500000000;

    % These two variables control when a checkpoint occurs. intv_time is the
    % interval between checkpoints, in minutes. start_time records when the
    % current interval began.
    intv_time = 5;
    start_time = clock;

    % As simple as this is, it is the only 'real work' being done here, but
    % it can be as complex as you like.
    for k = k_start:nsamples
        iterationString = ['Iteration #', int2str(k)];
        currentData = rand(npoints, 1);
        sampleMean(k) = mean(currentData);

        % Compare the start time against the current clock using etime.
        % If working_time is greater than intv_time, create a checkpoint.
        % The display strings are for tutorial purposes and do not need to
        % be included in your work.
        working_time = etime(clock, start_time) / 60;
        if working_time > intv_time
            disp('-----')
            iterationString = ['Pausing iterations at iter #', int2str(k), ' and saving product in case of eviction'];
            disp(iterationString)
            chk_pt_string = ['This is checkpoint number #', int2str(chk_pt)];
            disp(chk_pt_string)
            disp('-----')
            start_time = clock;
            chk_pt = chk_pt + 1;
            k_start = k;

            % Save the product data we are running the code to produce, then
            % the for loop position, in different files. The product is saved
            % first so the saved position never gets ahead of the saved data.
            % The chk_p variables should be discarded after the job restarts.
            save('product.mat', 'sampleMean');
            save('chk_p.mat', 'k_start', 'chk_pt');
        end
    end

    % Create the final product once the loop finishes, then clean up the
    % temporary files product.mat and chk_p.mat.
    resultMean = mean(sampleMean)
    save('finalProduct.mat', 'resultMean');
    delete('product.mat', 'chk_p.mat');
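One caveat: if the job is evicted while save is still writing, a checkpoint file could be left incomplete. A common workaround is to write to a temporary file and then rename it with movefile, since a rename is typically much quicker than a write. The sketch below is an optional variation, not part of the script above; the _tmp file names and the sample values are assumptions made for illustration.

```matlab
% Hypothetical state to checkpoint (stand-ins for the script's variables).
k_start = 10;
chk_pt = 2;
sampleMean = rand(1, 10);

% Write both checkpoints to temporary files first.
save('product_tmp.mat', 'sampleMean');
save('chk_p_tmp.mat', 'k_start', 'chk_pt');

% Rename the temporary files into place, overwriting older checkpoints.
% The product data is moved before the loop position, so the saved
% position never points past the saved data.
movefile('product_tmp.mat', 'product.mat', 'f');
movefile('chk_p_tmp.mat', 'chk_p.mat', 'f');
```

If the job is interrupted before the renames happen, the previous product.mat and chk_p.mat are still intact, and the restart falls back to the older checkpoint.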