In an effort to improve the efficiency and stability of our file server, I've been auditing the file I/O (input/output) that our condor jobs have generated over the last several months. I've run across a few common patterns that, if addressed, can not only improve the efficiency and stability of our file services but, in at least some cases, significantly improve the performance of the condor jobs themselves.
The condor submit file
These are some options in the condor submit files I've looked at so far.
When reading these examples, keep in mind that none of them should put a terrible burden on the file server on its own. But if you have 100 jobs running alongside 500 other jobs, and all of them are doing some of these things, it adds up and starts to affect performance for everyone.
The log file is a useful tool for me to figure out what went wrong with your job when you hit an error you can't diagnose yourself. It does not record any useful debugging information about your code or executables. Therefore, it's best practice to put your log file in the /tmp folder on wren. Condor can also write to the same log file for all of your jobs, so it's not necessary to name the log file after your condor job ID like you would for your output and error files. For example:
log = $ENV(HOME)/$(Cluster).$(Process).log
is not needed and generates unnecessary file I/O. Instead, write your log entry like so:
log = /tmp/yourname.log
The error file contains what you would see on the screen if you ran your program on the command line and something went wrong. It is extremely useful while you are developing your code, and even when you begin to run test jobs from condor, but once you've worked out most of the bugs it's safe to turn this feature off. If you don't, condor will create an error file for every job regardless of whether it produces any errors. You can disable it by pointing it to /dev/null like so:
error = /dev/null
I saved the output file for last because it has the most potential for improvement, not just for file server stability but also for job performance.
We all do it. When we're developing our code, we insert a print statement like:
"results so far: 350"
in a loop, or
"function soandso returned 'this'"
These statements don't produce any information your research or experiments need in the end, but they help you figure out what is going on as you develop the code. This is a widely used and very necessary practice when you are writing or debugging. But once you are done and start submitting to condor in bulk (say 400 iterations), condor has to write these bits of text to the output file over and over again, which can put a real burden on the file server. So, once you are done developing or debugging, remember to comment them out or delete them altogether.
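If you'd rather not delete your debug statements outright, one option is to put them behind a switch so they stay silent in bulk runs. Here is a minimal sketch in Python, assuming your job code is Python. The JOB_DEBUG environment variable and the debug_enabled, debug and run helpers are names I made up for illustration; condor does not set or know about any of them.

```python
import os
import sys


def debug_enabled(environ=os.environ):
    """Return True only when the hypothetical JOB_DEBUG variable is set to "1"."""
    return environ.get("JOB_DEBUG", "0") == "1"


def debug(message, enabled=None, stream=sys.stdout):
    """Print a progress message only when debugging is switched on."""
    if enabled is None:
        enabled = debug_enabled()
    if enabled:
        print(message, file=stream)


def run(iterations, enabled=None, stream=sys.stdout):
    """Stand-in for real job code: sums a range and reports progress."""
    total = 0
    for i in range(iterations):
        total += i
        # Written to the output file only when debugging is on.
        debug(f"results so far: {total}", enabled, stream)
    return total
```

With JOB_DEBUG unset, run(400) prints nothing, so condor has nothing to write to the output file. While you are still testing, you can set JOB_DEBUG=1 in your shell before running the code by hand to get the messages back.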
Are you even using the output file?
Most of the time, once a job has been developed and is producing data on the cluster in bulk, the code writes its data to a CSV or text file, an SQL database, etc. The information printed to the output file is largely not needed anyway. If that is the case, you can treat the output file the same way we treated the error file above. Point it to /dev/null like so:
output = /dev/null
I'd like to show how converting CSV or text file output to SQL statements can improve your condor job performance, but I haven't worked out how to show that in a general way. I am willing to work with you individually if you want to contact me.