layout | title | author |
---|---|---|
default | Set Up Your Script's Inputs and Outputs | Bob Freeman |
Previous: Take baby steps: Run Your Script Code from the Terminal
Objectives
- Learn how to abstract your code to work on any input file
- Learn how to have your code create unique output files
- Learn how to look at job history details
Now that we have the code running correctly in batch with one input file, we will make some tweaks so that we can ask the code to process any input file. And similarly, we will change the code so that a different output file is made for each input file. And finally, once our code has completed or if there are any problems, we'll want to understand at multiple levels what may have been at fault.
One of the hallmarks of saving time, automating work, and scaling one's analyses is making one's scripts as flexible as possible. Hard-coded paths to input and output files are incredibly limiting, especially since the code has to be edited for every different file. Abstracting the code to accept parameters, options, or command-line arguments opens the possibility of changing the behavior of one's code with a simple switch (flag).
You've seen this before:
program --verbose --input myfile --output myotherfile
or something similar. The options (--flags) change the behavior of the program, and items like myfile and myotherfile are arguments or flag options (see XXX). We'll now do this with our own program, first the quick and dirty way, and afterwards the more flexible way.
- Open the script in your favorite editor.
- Comment out line 9 (our in_filepath statement). After this line, insert the following:
#in_filepath = "data/file1.txt"
# check for program name & input file
if len(sys.argv) != 2:
    print('Error: incorrect number of arguments')
    print()
    print('Usage: {} inputfile'.format(sys.argv[0]))
    print()
    sys.exit(1)
else:
    #OK, we have the right #. Grab the input filename
    in_filepath = sys.argv[1]
The final code files should reflect the contents of process_data-M2v1.py. If you are having trouble getting the code file written correctly, issue the command:
cp code/process_data-M2v1.py code/process_data.py
- Save your file, and now let's test out our code. Since we're doing some CPU-intensive work, let's grab a bash session on a compute node to offload work there:
module load python/3.8.5
bsub -q short_int -Is \
-We 15 \
-M 1000 -hl \
/bin/bash
- After the shell session starts, try running the script as shown:
python code/process_data.py
python code/process_data.py onefile twofile
python code/process_data.py somefile
python code/process_data.py data/file1.txt
You can see that the first two runs generate the usage note, while the 3rd one throws a different error altogether (file not found). Only the last one works correctly. Feel free to cancel the program by pressing Ctrl-C.
What have we done? Well, sys.argv holds the arguments (options) passed to your script. [0] is the program name, and each option after that is [1], [2], etc. We call these positional parameters, or positional arguments, as their ordinal position matters.
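To make the indexing concrete, here is a minimal sketch that applies the same length check to an explicit list rather than the real sys.argv. The helper name get_input_path and the sample lists are illustrative assumptions, not part of the lesson's script.

```python
def get_input_path(argv):
    """Return the input filename from an argv-style list, or None if the count is wrong.

    argv[0] is the program name, so exactly one user argument means len(argv) == 2.
    (Hypothetical helper for illustration; the lesson's script checks sys.argv inline.)
    """
    if len(argv) != 2:
        return None
    return argv[1]


# Simulated command lines, split the way the shell would split them
print(get_input_path(["code/process_data.py"]))                        # no argument
print(get_input_path(["code/process_data.py", "onefile", "twofile"]))  # too many
print(get_input_path(["code/process_data.py", "data/file1.txt"]))      # exactly one
```

Only the last call returns a filename. Note that the check says nothing about whether the file exists, which is why the somefile run above fails later with a different error.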
Positional parameters are very limited, and unless someone knows this information a priori, it can be confusing, frustrating, or disastrous if anyone (including you) gets things wrong. A better approach is to use the built-in package argparse, which provides a richer --key value or --flag approach to naming and parsing options.
- In your code file, remove lines 11-20, the parsing section we just put in, and replace it with the following:
# get inputs: either file name or data from STDIN
import argparse
parser = argparse.ArgumentParser(description='Process and sum values in input text files.')
parser.add_argument('--input', '-i', required=True,
                    help='Filename of data to parse')
args = parser.parse_args()
if not(args.input):
    parser.error("Error: An input filename must be provided")
else:
    in_filepath = args.input
The final code files should reflect the contents of process_data-M2v2.py. If you are having trouble getting the code file written correctly, issue the command:
cp code/process_data-M2v2.py code/process_data.py
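6a. Save our script file. Before running, we need to ensure that argparse is available. It has been part of the Python standard library since versions 2.7 and 3.2, so with the python/3.8.5 module loaded earlier there should be nothing to install. On an older interpreter, check first and install it locally only if the import fails.
python -c "import argparse"
# if above reports an ImportError, then...
pip install --user argparse
# if no error, continue...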
6b. Now try running the code with different options...
$ python code/process_data.py
$ python code/process_data.py myfile
$ python code/process_data.py --help
$ python code/process_data.py --input
$ python code/process_data.py --input somefile
$ python code/process_data.py --input data/file1.txt
As you can see, the code is much more self-explanatory, and the input/error handling is much more robust and user-friendly. Even better, your program can now process any input file that you specify on the command line!
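The behavior seen in these runs can also be explored without the shell. The sketch below rebuilds the same parser and hands parse_args() an explicit list in place of the command line; wrapping the parser in a build_parser function is an assumption made for illustration, not how the lesson's script is organized.

```python
import argparse


def build_parser():
    # Same option as in the lesson: a required --input / -i filename
    parser = argparse.ArgumentParser(
        description='Process and sum values in input text files.')
    parser.add_argument('--input', '-i', required=True,
                        help='Filename of data to parse')
    return parser


parser = build_parser()

# parse_args() accepts an explicit list, which stands in for the command line
args = parser.parse_args(['--input', 'data/file1.txt'])
print(args.input)

# The short form -i fills the same attribute
args = parser.parse_args(['-i', 'data/file1.txt'])
print(args.input)

# A missing required option makes argparse print usage and raise SystemExit
try:
    parser.parse_args([])
except SystemExit as exc:
    print('exit status:', exc.code)
```

The non-zero exit status matters for batch work, because it is what lets the scheduler record the run as failed rather than done.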
- Now that we have our code running, let's send off a batch job to process one of our input files:
bsub -q short \
-We 15 \
-M 1000 -hl \
python code/process_data.py --input data/file1.txt
# wait for a few seconds, and then enter the following to ensure it's off and running...
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1607989 rfreema RUN short_int rhrcscli1 rhrcsnod13 /bin/bash Aug 11 11:46
1607990 rfreema PEND short rhrcsnod13 *file1.txt Aug 11 11:47
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1607989 rfreema RUN short_int rhrcscli1 rhrcsnod13 /bin/bash Aug 11 11:46
1607990 rfreema RUN short rhrcsnod13 rhrcsnod15 *file1.txt Aug 11 11:47
You can see our first job is our shell, where we are running and testing on the compute node. The 2nd job is our submitted one. The first time I ran bjobs, it had not yet been dispatched by the scheduler. After a few more seconds, it was on its way.
If all is well, your program will finish in 5 to 10 minutes, and the output1.txt file will be waiting for you.
We've created great flexibility in our processing script by allowing any input file, but our output file is still limited to what we've hard-coded. Let's change that as well. Naturally, let's add an --output filename option. We could also follow the standard behavior that if this option is not provided, the output is written to the screen, or STDOUT. But we're already writing our status info (Line X: runningsum) there, so it would get confusing. One additional change would be to write this status to the error stream (STDERR), so that the output and error streams can be sent to different destinations. But this is more complex than needed for this toy example, so let's simply require the --output filename key/value pair:
- In your script file, add the additional lines between the ##new ##end new markers, and don't forget to comment out the out_filepath =... statement:
# get inputs: either file name or data from STDIN
import argparse
parser = argparse.ArgumentParser(description='Process and sum values in input text files.')
parser.add_argument('--input', '-i', required=True,
                    help='Filename of data to parse.')
#
## new lines
parser.add_argument('--output', '-o', required=True,
                    help='Filename of output for summed data.')
## end new
args = parser.parse_args()
if not(args.input):
    parser.error("Error: An input filename must be provided.")
else:
    in_filepath = args.input
#
## new lines
if not(args.output):
    parser.error("Error: An output filename must be provided.")
else:
    out_filepath = args.output
#out_filepath = "data/outfile1.txt"
## end new
# and don't forget to comment out...
#out_filepath = "data/outfile1.txt"
The final code files should reflect the contents of process_data-M2v3.py. If you are having trouble getting the code file written correctly, issue the command:
cp code/process_data-M2v3.py code/process_data.py
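For reference, the alternative we decided against (status on STDERR, results on STDOUT) might look like the following sketch. The running_sums generator is only a stand-in for the lesson's processing loop, and the function names and sample values are assumptions made for illustration.

```python
import sys


def running_sums(values):
    """Yield the running sum after each value (stand-in for the real processing)."""
    total = 0
    for value in values:
        total += value
        yield total


def process(values, out=sys.stdout, status=sys.stderr):
    for line_number, total in enumerate(running_sums(values), start=1):
        # Status messages go to the error stream...
        print('Line {}: {}'.format(line_number, total), file=status)
        # ...while the actual results go to the output stream
        print(total, file=out)


process([1, 2, 3])
```

Saved under a hypothetical name such as sketch.py, this could be run as python sketch.py > results.txt 2> status.log so that the shell sends each stream to its own file.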
- Save your script file, and try running the file with different options:
$ python code/process_data.py
$ python code/process_data.py --help
$ python code/process_data.py --input data/file1.txt
$ python code/process_data.py --input data/file1.txt --option
$ python code/process_data.py --input data/file1.txt --output
$ # note that the filename is slightly different than what we've had before
$ python code/process_data.py --input data/file1.txt --output data/out1.txt
You should receive a number of different messages, from errors + usage info to full-on help text, to getting the script to run completely. Feel free to press Ctrl-C to terminate the script.
- Since our program is running once again, let's send off two jobs to the batch queues to get the results and to do a little sleuthing:
# good file
bsub -q short \
-We 15 \
-M 1000 -hl \
python code/process_data.py --input data/file1.txt --output data/out1.txt
# not-good file
bsub -q short \
-We 15 \
-M 1000 -hl \
python code/process_data.py --input data/file1.bad --output data/out1.bad
# wait for a few seconds, and then enter the following to ensure it's off and running...
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1607989 rfreema RUN short_int rhrcscli1 rhrcsnod13 /bin/bash Aug 11 11:46
1607990 rfreema RUN short rhrcsnod13 rhrcsnod15 *file1.txt Aug 11 11:47
1607991 rfreema RUN short rhrcsnod13 rhrcsnod15 */out1.txt Aug 11 11:51
1607992 rfreema RUN short rhrcsnod13 rhrcsnod15 */out1.bad Aug 11 11:51
The last two jobs are the ones we just submitted. If all is well, your programs will finish in 5 to 10 minutes, and the named files out1.txt and out1.bad will be waiting for you with your results.
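Because both input and output are now command-line options, the two submissions above differ only in their filenames. The sketch below assembles the same bsub command line for any input/output pair and prints it without running it; the helper name bsub_command is hypothetical, and only the flags already used in this lesson appear.

```python
import shlex


def bsub_command(in_path, out_path):
    """Assemble the lesson's bsub invocation for one input/output pair (not executed)."""
    return ['bsub', '-q', 'short', '-We', '15', '-M', '1000', '-hl',
            'python', 'code/process_data.py',
            '--input', in_path, '--output', out_path]


# The two jobs submitted above, expressed as (input, output) pairs
pairs = [('data/file1.txt', 'data/out1.txt'),
         ('data/file1.bad', 'data/out1.bad')]

for in_path, out_path in pairs:
    # shlex.join (Python 3.8+) produces a shell-safe command string
    print(shlex.join(bsub_command(in_path, out_path)))
```

Printing the commands first is a cheap way to review them before anything is submitted to the scheduler.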
Of course, nothing ever goes as perfectly as one expects. Or perhaps it does, most of the time. Since we have our output files, we can sanity check our results to ensure we're getting the well-formed data that we would expect.
- For this analysis, each output file should contain the same number of lines as its input file, minus one. So let's check:
wc -l data/file1.txt
wc -l data/out*.*
OK, we have a problem, as the line counts are incorrect.
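The same sanity check can be scripted so that it can be repeated after every run. This is a minimal sketch: the function names are hypothetical, and the temporary files merely stand in for data/file1.txt and the out files.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a text file (close to `wc -l` for newline-terminated files)."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def output_is_complete(in_path, out_path):
    # Expectation from the lesson: the output has one line fewer than the input
    return count_lines(out_path) == count_lines(in_path) - 1


with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, 'in.txt')
    good_path = os.path.join(tmp, 'good.txt')
    short_path = os.path.join(tmp, 'short.txt')

    with open(in_path, 'w') as handle:
        handle.write('1\n2\n3\n4\n')   # 4 input lines
    with open(good_path, 'w') as handle:
        handle.write('3\n6\n10\n')     # 3 output lines: as expected
    with open(short_path, 'w') as handle:
        handle.write('3\n')            # truncated output

    print(output_is_complete(in_path, good_path))
    print(output_is_complete(in_path, short_path))
```

A check like this only tells us that something is wrong, not why, which is where the scheduler's job history comes in.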
- Let's check the end of each file:
tail data/out*.*
OK, these look relatively normal.
- Since we ran in batch, let's see if the scheduler caught anything:
# since the jobs have finished, we have to use bhist
bhist
So, of the three batch jobs that we ran, it looks like two ran for the full duration, while the third (last) one finished noticeably sooner. That is likely a failure.
- Let's look at the details for the last two, the good (longer) one and the bad (shorter) one:
# longer/good one first
bhist -l JOBID
# shorter/bad one next
bhist -l JOBID
OK, scanning down the details, we see that the longer one completed successfully, but the shorter one did not and exited with an error. Unfortunately, we cannot really see what the error is, so this is something that we're going to have to work on.
Next: Better Visibility with Progress of Jobs, Job Information, and Job Control