Problem:

I need to run a long job on one of the UNIX systems. What do I need to know to do this most efficiently?
 
 

Solution:

Some jobs take so long to run that sitting in front of the terminal waiting for the results is not practical. This document is intended to help users run large jobs in the "background" in a way that does not negatively impact other users of the computers.

Most users in ISyE run the Korn Shell, thus this document assumes such use. There are a few steps to the process of running large/long jobs.

  • Starting the job in the "Background" so that it will continue to run after the user logs off of the computer.
  • Setting the running priority so the job will not negatively impact other users' "interactive" use of the same computer.
  • Monitoring the job to make sure it is running properly and is not out of control.

The text below uses examples based on the execution of a job named myprog which requires an argument of myargs. myprog can be any program or script and myargs can be any argument or list of arguments.

Running in the Background

There are three standard ways to put a job into the Background.

  • Use of the nohup command
  • Use of the bg command
  • Use of an &

Using nohup

The nohup command is the preferred method for running a job in the background on any UNIX system. It is the best method for allowing the job to continue running in the background even after the user logs off of the system.

Usage: nohup myprog myargs&

The trailing & starts the program in the "background" and returns the user to a normal command prompt, while nohup makes the program ignore the hangup signal that is sent at logout. This allows the user to log off of the computer without having the job terminate.

The nohup command will also capture any output that would otherwise display on the screen (STDOUT and STDERR) and append it to a file called nohup.out, located in the directory in which the command was executed. If that directory is not writable, nohup.out is created in the user's home directory instead.

On the Korn Shell, when the user runs the exit command to log off, the following warning message will be displayed:

You have jobs running

Typing exit a second time will successfully log off the user. Despite the warning message, the nohup'ed job will continue to run. See the manual page for nohup for more information (man nohup).

Under some circumstances nohup can behave oddly: it may hang ssh (secure shell) connections when exiting, or it may allow the job to be killed when the user logs out. If you experience this problem, the workaround is to "double nohup" the job.

Create a script (here I name it myscript) that will run the job. It should contain something like:

#!/bin/ksh
nohup myprog myargs&

Make the script executable:

chmod +x myscript

Then run the script:

nohup ./myscript&

Then log out.

exit
exit

Using bg

Jobs can be placed in the background by running them interactively:

myprog myargs

Then suspend the program with Ctrl-Z (^Z) and execute the "bg" command:

bg

This takes the most recently suspended program and starts it running again, but now in the background.

This does NOT allow the user to log out, and thus should never be used for running long jobs; use nohup in such circumstances. For more information on "bg", see the manual page (man bg). For information on how to take a background process and make it the foreground process once again, see the manual page for the "fg" command.

Using &

Jobs can be pushed directly into the background as follows:

myprog myargs&

The end result is identical to using Ctrl-Z followed by the bg command. This does NOT allow the user to log out and thus should never be used for running long jobs. Use nohup in such circumstances.

Setting Job Execution Priorities

Any jobs run in the background for an extended period of time (anything over twenty minutes as a guideline) must have their run priority changed to avoid causing other users problems.

The renice command is used to set the priority for a process.

First, the process ID number must be found:

ps -e|grep myprog

Will give output such as:

23452 pts/9 0:00 myprog

where 23452 (in this example) is the process ID number. The job can be "niced" with:

renice 10 23452
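The lookup step above can be sketched as a small shell function, which saves reading the process ID off the screen by hand. The function name pids_of is made up for this example, and the sketch assumes an awk that accepts the -v option (nawk on older Solaris systems).

```shell
# Read "ps" output on standard input and print the process ID (first column)
# of every line whose command name (last column) is exactly $1.
pids_of() {
    awk -v name="$1" '$NF == name { print $1 }'
}

# Demonstration on the sample line shown above.
printf '%s\n' '23452 pts/9 0:00 myprog' | pids_of myprog    # prints 23452
```

On a live system the function would be fed from ps instead, for example ps -u "$LOGNAME" | pids_of myprog, and each process ID it prints could then be passed to renice.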

A nice value of 10 is typical for a background process. It will allow interactive users to function properly while still giving a large portion of the CPU compute cycles to myprog.

Any background jobs not run at a nice value of 10 or higher on general access ISyE UNIX machines are subject to termination when a complaint is received. So, it is very important to use the nice command. It is the nice thing to do; hence its name.

For more information on nicing programs see the manual pages (man) for nice and renice.
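As a rough sketch, a job can also be started at the lower priority from the outset by combining nice with nohup, so that no separate renice step is needed. Here sleep 5 is only a stand-in for myprog myargs, the log file name is made up, and the job is assumed to start from the default nice value of 0.

```shell
# "sleep 5" is a stand-in for "myprog myargs"; substitute the real command.
# nice -n 10 adds 10 to the nice value the job would otherwise inherit.
nohup nice -n 10 sleep 5 > myprog.log 2>&1 &
job_pid=$!

# Give the job a moment to start, then show its process ID, nice value,
# and command name to confirm the setting.
sleep 1
ps -o pid,nice,comm -p "$job_pid"
```

Because nohup and nice each replace themselves with the command they are given, the process ID in $! is that of the job itself.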

Monitoring a Background Job

The easiest way to keep an eye on a background process is with the top command. The output looks like the following:

$ top

last pid: 25120;  load averages:  2.53,  1.92,  1.43          16:40:48
59 processes: 56 sleeping, 2 running, 1 on cpu
CPU states: 0.0% idle, 99.8% user, 0.2% kernel, 0.0% iowait, 0.0% swap
Memory: 512M real, 200M free, 49M swap in use, 1347M swap free

  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
23452 myuser     1  40   10 2008K 1608K run   108:10 74.08% myprog
25079 root       1  30    0 2256K 1640K run     0:02 10.29% patchadd
25009 root       1  58    0 2816K 2064K cpu     0:00  0.57% top
24993 root       1  48    0  312K  312K sleep   0:00  0.04% sh
23336 myuser     8  59    0 8968K 7560K sleep   0:01  0.02% dtwm
24991 root       1  38    0 1784K 1296K sleep   0:00  0.02% in.rlogind
  178 root       1  48    0 2184K 1624K sleep   0:00  0.00% inetd
  203 root       1  59    0 4680K 2920K sleep  12:26  0.00% egd.pl
  188 root       5  58    0 3624K 2568K sleep   2:06  0.00% automountd
  229 root       1  48    0 1696K  816K sleep   1:37  0.00% prngd
  248 root       1  58    0 2688K 1352K sleep   1:24  0.00% sshd
  224 root      10  53    0 3832K 3368K sleep   0:10  0.00% nscd
23006 root       1  59    0  125M   30M sleep   0:04  0.00% Xsun
  199 root       9  58    0 4288K 2280K sleep   0:01  0.00% syslogd
    1 root       1  58    0  816K  384K sleep   0:01  0.00% init

Notice that PID 23452 is myprog, running with a nice value of 10 and receiving 74% of the CPU time. Since there are no other users on the system, the percentage is very high; it would be more evenly split if another user were on the machine. myprog is using 2008K, or about 2 megabytes, of memory.

The important things to look for are: that the runtime is in the range of what you expect (to make sure the job is not hung or working incorrectly); that the memory use of the program is reasonable and what you expect; and that the amount of memory available on the system is not much less than the maximum amount your program will require.
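When only one job is of interest, the same checks can be sketched with ps instead of the full top display. The function name job_status is made up for this example, sleep 10 is only a stand-in for myprog myargs, and the sketch assumes a ps that accepts the standard -o and -p options.

```shell
# "sleep 10" is a stand-in for "myprog myargs"; substitute the real command.
nohup sleep 10 > /dev/null 2>&1 &
job_pid=$!

# Print a status report for process $1: process ID, nice value, elapsed time,
# accumulated CPU time, virtual memory size in kilobytes, and command name.
# ps normally returns a non-zero status when no such process exists.
job_status() {
    ps -o pid,nice,etime,time,vsz,comm -p "$1"
}

# Take up to three samples, one second apart, stopping early if the job ends.
samples=0
while [ "$samples" -lt 3 ] && job_status "$job_pid"; do
    samples=$((samples + 1))
    sleep 1
done
```

For a real job the samples would be taken minutes apart rather than seconds. A job whose CPU time stops growing while its elapsed time keeps increasing may be hung.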

On the Memory: line, the first number is the real memory size of the machine and will not change. The second number is the amount of this "real" memory that is free or available to new or growing programs. If this amount is much less than what your program will need, then it should be run on a different computer.

The third number is the amount of hard disk space that is being used as though it were real memory. If this number is comparable to the amount of real memory (it is not in this case, as 49M is much less than 512M), then the computer is overloaded with large jobs and another system should be used instead. Finally, the last number is the amount of hard disk space that is allocated to be used as memory but is not actively in use by the system.

The one critical factor is that the user must know the approximate amount of memory that their job will require. Then, this must be compared to the "free memory" plus the "swap free" number (the last number on the Memory: line).

If the amount of memory needed is more than ~70% of the sum of the free swap and free memory, then the job should not be run on the given system. For this machine the calculation is:

200MB + 1347MB = 1547MB
1547MB * 0.7 = 1082.9MB

If a job will require more than about 1000MB (1GB), then the system above would be a poor choice. Keep in mind that the machine runs roughly 100 times slower when using swap (hard disk) for memory. To get decent performance from the computer system, the active memory required by the job should be no more than 120% of the free memory amount: in this case, 240MB.
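The two rules of thumb above can be sketched as a short calculation. The variable and function names are made up for this example, and the numbers are those from the Memory: line of the sample top output. Shell arithmetic works in whole numbers, so 1082.9MB is truncated to 1082MB.

```shell
# Values from the Memory: line of the sample top output, in megabytes.
free_mem=200
swap_free=1347

# Upper limit: 70% of free memory plus free swap.
max_job=$(( (free_mem + swap_free) * 70 / 100 ))

# Limit for decent performance: 120% of free memory.
good_job=$(( free_mem * 120 / 100 ))

# Classify a job by the amount of memory it needs, in megabytes.
check_fit() {
    if [ "$1" -gt "$max_job" ]; then
        echo "too large: use another machine"
    elif [ "$1" -gt "$good_job" ]; then
        echo "will run, but expect heavy swapping"
    else
        echo "should run well"
    fi
}

echo "upper limit: ${max_job}MB, performance limit: ${good_job}MB"
check_fit 150     # prints: should run well
check_fit 800     # prints: will run, but expect heavy swapping
check_fit 1200    # prints: too large: use another machine
```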