---
layout: default
title: Better Visibility with Progress of Jobs, Job Information, and Job Control
author: Bob Freeman
---
Previous: Take baby steps: Run Your Script Code from the Terminal
Objectives
- Direct the scheduler to send you email notifications for your jobs
- Create separate log files for output and error streams
- Modify job submissions to provide unique output files for each job
Up to this point, we've focused on switching away from the GUI to batch, restructuring our code to be more flexible with inputs and outputs, and learning some additional LSF commands for getting job information and details.
But we still haven't attained the same nirvana as running scripts in a GUI -- being able to see progress, know when things have finished, and troubleshoot with little effort. In this section, we'll handle these missing pieces to make it easier to check job status and to troubleshoot.
When you run a script interactively, you yourself know when it starts and (mostly) when it ends; this is not the case with batch. Just because you've submitted the job doesn't mean that it starts immediately. Additionally, unless you know the precise run time, you also don't know when it ends. LSF's options for bsub can help.
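Without notifications, the alternative is to poll the scheduler yourself with one of the job-information commands we've already used. As a quick reminder (the job ID 12345 below is just a placeholder for whatever bsub reports back):
# check on a job by its ID (12345 is a placeholder); the STAT column shows
# PEND, RUN, DONE, or EXIT
bjobs 12345
# or view the long-format record, including submit and start times
bjobs -l 12345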
- Ensure that your program for fetching and reading work emails is running and visible.
- Submit a job for processing one of the files with the additional option -B for notification at the beginning of the job:
# use the module load if not continuing from previous sections
module load python/3.8.5
# submit our job with a begin notification: -q short selects the queue,
# -M 1000 -hl requests and enforces the memory limit, -We 15 gives an
# estimated run time of 15 minutes, and -B sends an email when the job begins
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
Within a few minutes, you should receive an email notice that the job has started.
- If you wish to send the job start notice to a different email address, use the -u emailaddress option:
# submit our job with begin notifications to a different email destination
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B -u myemail@address \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
You'll notice that we are getting job reports, but these reports contain both the summary info of the run and the status output. Let's change this:
- Let's separate the job report from the job output with the -N option, as the status output should be kept separate:
# submit our job with begin notifications and separate job reports/output
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B -N \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
You should now get a job report through the email without the output, and the output should be saved separately.
If we look in the directory from which we are submitting the job (our working directory), you'll likely see a number of files with the jobID in their names. LSF captures all output, both regular (STDOUT) and errors (STDERR), and puts it into a file conveniently named after the JOBID of the job. This is somewhat helpful, but when working on multiple projects, jobID-titled files are not that informative.
- Direct LSF to write the job output to a named file, instead of a jobID-titled file:
# submit our job with begin notifications and separate job reports/output
mkdir -p logs
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B -N \
  -o logs/process_data.out \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
Now that we've named the output file, we can watch for its presence; and once it appears, we can use tools like tail to watch the end of the file and easily see the status and progress of our run.
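For instance, assuming the cluster writes the -o file while the job is still running (as suggested above), a minimal example of following the log as it grows:
# follow the output log as it grows; press Ctrl-C to stop watching
tail -f logs/process_data.out
# or just peek at the last 20 lines
tail -n 20 logs/process_data.out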
Once the job has completed, if we look at the end of one of the shorter files using the tail command, tail logs/process_data.out, we can see both the regular status info and the error message that caused the script to fail. In this simple example, that's likely not a problem. But what if you need the output file to follow a strict format? And wouldn't it be easier to catch and log errors separately? Easy peasy.
- Direct LSF to separate our log streams into output and error logs:
# submit our job with begin notifications and separate job reports/output
mkdir -p logs
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B -N \
  -o logs/process_data.out -e logs/process_data.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
When the job aborts (due to the error), one can see that the error log file has a non-zero length, indicating a problem. One can use cat, head, tail, or any text editor to review the file.
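A quick way to spot a failed run without opening anything is to test whether the error log is non-empty; a small shell sketch, assuming the logs/ layout used above:
# list any non-empty error logs in the logs directory
find logs -name '*.err' -size +0 -print
# or test a single file: -s is true only if the file exists and has content
[ -s logs/process_data.err ] && echo "errors were logged -- check logs/process_data.err"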
Now that we have some flexibility and visibility with our jobs and progress, we have one last item to tackle. Let's run the above code again.
- Resubmit the last job to the cluster:
# submit our job with begin notifications and separate job reports/output
mkdir -p logs
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B -N \
  -o logs/process_data.out -e logs/process_data.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
Once we have the notice that the job is running, let's inspect our logs directory and log files:
ls -al logs/
Hmmmm... We only have one process_data.out and one process_data.err file, even though we've now run it twice. Well, that's not good, as now I don't have a record of what I've done or the results from different iterations/executions of the script.
If we remember back to when we first started running batch jobs, output files were generated with the JOBID. We can do that again with our custom filenames by including a placeholder with our output names:
- Submit a new job with the %J placeholder in the -o and -e options:
# submit our job with begin notifications and separate job reports/output
mkdir -p logs
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B -N \
  -o logs/process_data_%J.out -e logs/process_data_%J.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
- Look at your logs directory once the job has started to verify the presence of these JOBID-stamped files.
ls -alt logs/
The -t option sorts the directory listing by modification time, so the most recently changed files appear first (reverse-chronological order).
- Submit the job again using the same code as in Step 2:
# submit our job with begin notifications and separate job reports/output
mkdir -p logs
bsub -q short \
  -M 1000 -hl \
  -We 15 \
  -B -N \
  -o logs/process_data_%J.out -e logs/process_data_%J.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
- Again, verify that a new set of log files is being generated in the logs/ directory after this job has started running:
ls -alt logs/
We can even narrow our results by using a specific pattern for our listing:
ls -alt logs/process_data_*.{err,out}
The _* part matches only the files that follow the _JOBID naming format, and the {err,out} suffix further restricts the listing to files ending in either of the two extensions. (The shell expands the wildcard and the braces first, then passes only the matching filenames to ls.)
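If you're curious what the shell actually hands to ls, you can echo the same pattern; the filenames printed will depend on the jobs you have run:
# print the expanded file list without running ls
echo logs/process_data_*.{err,out}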
Now that we've expanded our set of tools and capabilities for running jobs, getting notifications, and creating log files to understand the status and progress of our jobs, we can look towards running our code flexibly, with few or many copies running at once. We'll see how to handle this next.
Next: Scaling up Work