---
layout: default
title: Better Visibility with Progress of Jobs, Job Information, and Job Control
author: Bob Freeman
---

Previous: Take baby steps: Run Your Script Code from the Terminal


Better Visibility with Progress of Jobs, Job Information, and Job Control

Objectives

  • Direct the scheduler to send you email notifications for your jobs
  • Create separate log files for output and error streams
  • Modify job submissions to provide unique output files for each job

Up to this point, we've focused on switching away from the GUI to batch, restructuring our code to be more flexible with inputs and outputs, and learning some additional LSF commands for getting job information and details.

But we still haven't attained the same nirvana as running scripts in a GUI -- being able to see progress, know when things have finished, and troubleshoot with little effort. In this section, we'll address these missing pieces to gain greater transparency and make checking job status and troubleshooting easier.

Get Email Notifications on Job Starts/Ends

Unlike running scripts interactively, where you yourself know when your script starts and (mostly) when it ends, with batch you don't. Just because you've submitted the job doesn't mean that it starts immediately. Additionally, unless you know the precise run time, you also don't know when it ends. LSF's options for bsub can help.

  1. Ensure that your program for fetching and reading work emails is running and visible.

  2. Submit a job for processing one of the files with the additional option -B for notifying at the beginning of a job:

# use the module load if not continuing from previous sections
module load python/3.8.5

# submit our job with notifications
bsub -q short \
     -M 1000 -hl \
     -We 15 \
     -B \
  python code/process_data.py --input data/file1.bad --output data/out1.bad

Within a few minutes, you should receive an email notice that the job has started.
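
While you wait for the start notice, you can also ask the scheduler directly; bjobs, one of the job-information commands mentioned earlier, reports the current state of your jobs:

# list your current jobs and their state (e.g. PEND, RUN, DONE, EXIT)
bjobs

# wide output that includes the full job command, handy when running several jobs
bjobs -w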

  3. If you wish to send the job start notice to a different email address, use the -u emailaddress option:
# submit our job with begin notifications to a different email destination
bsub -q short \
     -M 1000 -hl \
     -We 15 \
  -B -u myemail@address \
  python code/process_data.py --input data/file1.bad --output data/out1.bad

You'll notice that we are getting job reports, but these reports contain both the summary info of the run and the status output. Let's change this:

  4. Let's separate the job report from the job output with the -N option, as the status output should be kept separate:
# submit our job with begin notifications and separate job reports/output
bsub -q short \
     -M 1000 -hl \
     -We 15 \
  -B -N \
  python code/process_data.py --input data/file1.bad --output data/out1.bad

You should now get a job report through the email without the output, and the output should be saved separately.

Create Separate Log Files for Output and Error Streams

If we look in the directory from which we are submitting the job (our working directory), you'll likely see a number of files with the jobID in their names. LSF captures all output, both regular (STDOUT) and errors (STDERR), and puts it into a file conveniently named after the JOBID of the job. Although this is somewhat helpful, when working on multiple projects, jobID-titled files are not very informative.

  1. Direct LSF to write the job output to a named file, instead of a jobID-titled file:
# submit our job with begin notifications and a named output log file
mkdir -p logs

bsub -q short \
     -M 1000 -hl \
     -We 15 \
     -B -N \
     -o logs/process_data.out \
  python code/process_data.py --input data/file1.bad --output data/out1.bad

Now that we've named the output file, we know where to look for it; and once it appears, we can use tools like tail to watch the end of the file and see what is happening, making it easy to check the status and progress of our run.
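
For example, once the output log exists, you can follow it in real time while the job runs (Ctrl-C stops watching without affecting the job):

# stream new lines of the output log as they are written
tail -f logs/process_data.out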

Once the job has completed, if we look at the end of the file using the tail command, tail logs/process_data.out, we can see both the regular status info and the error message that caused the script to fail. In this simple example, that's likely not a problem. But what if you need the output file to follow a strict format? And wouldn't it be easier to catch & log errors separately? Easy peasy.

  2. Direct LSF to separate our log streams into output and error logs:
# submit our job with begin notifications and separate output and error log files
mkdir -p logs

bsub -q short \
     -M 1000 -hl \
     -We 15 \
     -B -N \
     -o logs/process_data.out -e logs/process_data.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad

When the job aborts (due to the error), you can see that the error log file has a non-zero length, indicating a problem. You can use cat, head, tail, or any text editor to review the file.
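
As a quick sketch of this check, using the log names from above:

# list the log files with their sizes; a non-zero .err file signals a problem
ls -l logs/

# print the error log only if it is non-empty (-s tests for non-zero size)
[ -s logs/process_data.err ] && cat logs/process_data.err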

Create Uniquely-named Log Files for Each Job Run

Now that we have some flexibility and visibility with our jobs and progress, we have one last item to tackle. Let's run the above code again.

  1. Resubmit the last job to the cluster:
# resubmit our job with begin notifications and separate output and error log files
mkdir -p logs

bsub -q short \
     -M 1000 -hl \
     -We 15 \
     -B -N \
     -o logs/process_data.out -e logs/process_data.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad

Once we have the notice that the job is running, let's inspect our logs directory and log files:

ls -al logs/

Hmmmm... We only have one process_data.out and one process_data.err file, even though we've now run the job twice. Well, that's not good, as now we don't have a record of what we've done or the results from the different iterations/executions of the script.

If we remember back to when we first started running batch jobs, output files were generated with the JOBID. We can do that again with our custom filenames by including a placeholder in our output names:

  2. Submit a new job with the %J placeholder in the -o and -e options:
# submit our job with begin notifications and JOBID-stamped output and error log files
mkdir -p logs

bsub -q short \
     -M 1000 -hl \
     -We 15 \
     -B -N \
     -o logs/process_data_%J.out -e logs/process_data_%J.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
  3. Look at your logs directory once the job has started to verify the presence of these JOBID-stamped files:
ls -alt logs/

The -t option shows the directory listing in reverse-chronological order.
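
If you'd like to keep an eye on the directory while jobs start and finish, the watch utility (if available on your cluster's login node) will repeat the listing for you:

# refresh the listing every 5 seconds; press Ctrl-C to exit
watch -n 5 'ls -alt logs/'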

  4. Submit the job again using the same code as in Step 2:
# resubmit our job with begin notifications and JOBID-stamped output and error log files
mkdir -p logs

bsub -q short \
     -M 1000 -hl \
     -We 15 \
     -B -N \
     -o logs/process_data_%J.out -e logs/process_data_%J.err \
  python code/process_data.py --input data/file1.bad --output data/out1.bad
  5. Again, verify that a new set of log files is being generated in the logs/ directory after this job has started running:
ls -alt logs/

We can even narrow our results by using a specific pattern for our listing:

ls -alt logs/process_data_*.{err,out}

The _* wildcard matches the _JOBID portion of the filenames, and the {err,out} brace expansion further restricts the listing to files ending in either of those two suffixes.
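
With JOBID-stamped logs in place, you can also quickly locate the runs that hit problems. A small sketch with standard shell tools (the search keyword is just an example; adjust it to whatever your script reports):

# list only the non-empty error logs, i.e. the runs that reported problems
find logs -name 'process_data_*.err' -size +0 -print

# or search all error logs for a keyword and print the matching filenames
grep -l 'Error' logs/process_data_*.err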

Now that we've expanded our set of tools and capabilities for running jobs, getting notifications, and creating log files to understand the status and progress of our jobs, we can look towards running our code flexibly, with few or many copies running at once. We'll see how to handle this next.


Next: Scaling up Work