---
layout: default
title: Set Up Your Script's Inputs and Outputs
author: Bob Freeman
---

Previous: Take baby steps: Run Your Script Code from the Terminal


Set Up Your Script's Inputs and Outputs

Objectives

  • Learn how to abstract your code to work on any input file
  • Learn how to have your code create unique output files
  • Learn how to look at job history details

Now that we have the code running correctly in batch with one input file, we will make a few tweaks so that we can ask the code to process any input file. Similarly, we will change the code so that a different output file is created for each input file. Finally, once our code has completed, or if there are any problems, we'll want to understand at multiple levels what may have gone wrong.

Process Any Input File

One of the hallmarks of saving time, automating work, and scaling one's analyses is making one's scripts as flexible as possible. Hard-coded paths for input and output files are incredibly limiting, especially since one has to edit the code for every new file. Abstracting the code to accept parameters, options, or command-line arguments opens the possibility of changing the behavior of one's code with a simple switch (flag).

You've seen this before:

program --verbose --input myfile --output myotherfile

or something similar. The options (--flags) change the behavior of the program, and items like myfile and myotherfile are arguments or flag options (see XXX). We'll now do this with our own program: first the quick-and-dirty way, and then the more flexible way.

  1. Open the script in your favorite editor.

  2. Comment out line 9 (our in_filepath statement). After this line, insert the following:

    #in_filepath = "data/file1.txt"
    # check for program name & input file (this requires `import sys` at the top of the script)
    if len(sys.argv) != 2:
        print('Error: incorrect number of arguments')
        print()
        print('Usage: {} inputfile'.format(sys.argv[0]))
        print()
        sys.exit(1)
    else:
        # OK, we have the right number of arguments. Grab the input filename
        in_filepath = sys.argv[1]

The final code file should reflect the contents of process_data-M2v1.py. If you are having trouble getting the code file written correctly, issue the command:

cp code/process_data-M2v1.py code/process_data.py

  3. Save your file, and now let's test our code. Since we're doing some CPU-intensive work, let's grab a bash session on a compute node to offload the work there:
module load python/3.8.5
bsub -q short_int -Is \
     -We 15 \
     -M 1000 -hl \
     /bin/bash
  4. After the shell session starts, try running the script as shown:
python code/process_data.py

python code/process_data.py onefile twofile

python code/process_data.py somefile

python code/process_data.py data/file1.txt

You can see that the first two runs generate the usage note, while the third throws a different error altogether (file not found). Only the last one runs correctly. Feel free to cancel the program by pressing Ctrl-C.
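
For instance, the first run (with no arguments) should print something along these lines, courtesy of the check we just added:

    Error: incorrect number of arguments

    Usage: code/process_data.py inputfile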

What have we done? Well, sys.argv holds the arguments (options) passed to your script: [0] is the program name, and each option after that is [1], [2], etc. We call these positional parameters, or positional arguments, as their ordinal position matters.
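
To see how this works in isolation, here is a tiny hypothetical demo script (not part of our project) that simply echoes its positional arguments:

    # show_args.py -- a hypothetical demo script, not part of process_data.py
    import sys

    # sys.argv[0] is always the program name; the rest follow in order
    print('program name:', sys.argv[0])
    for i, arg in enumerate(sys.argv[1:], start=1):
        print('argument [{}]: {}'.format(i, arg))

Running python show_args.py a b c would print the program name followed by arguments [1], [2], and [3].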

Positional parameters are very limited, and unless someone knows this info a priori, it can be confusing, frustrating, or disastrous if anyone (including you) gets things wrong. A better approach is to use the built-in argparse module, which provides a richer --key value or --flag approach to naming and parsing options.

  5. In your code file, remove lines 11-20 (the parsing section we just put in) and replace them with the following:
    
    # get inputs: either file name or data from STDIN
    import argparse
    parser = argparse.ArgumentParser(description='Process and sum values in input text files.')
    parser.add_argument('--input', '-i', required=True,
                        help='Filename of data to parse')
    args = parser.parse_args()

    # with required=True argparse enforces this already, but we double-check defensively
    if not args.input:
        parser.error("Error: An input filename must be provided")
    else:
        in_filepath = args.input

The final code file should reflect the contents of process_data-M2v2.py. If you are having trouble getting the code file written correctly, issue the command:

cp code/process_data-M2v2.py code/process_data.py

6a. Save your script file. Before running, note that argparse is part of Python's standard library (it has shipped with every Python since 3.2), so with the python/3.8.5 module loaded there is nothing extra to install. You can verify that it imports cleanly:

python -c "import argparse"

# if this prints no error, continue...

6b. Now try running the code with different options...


$ python code/process_data.py

$ python code/process_data.py myfile

$ python code/process_data.py --help

$ python code/process_data.py --input

$ python code/process_data.py --input somefile

$ python code/process_data.py --input data/file1.txt

As you can see, the code is much more self-explanatory, and the input/error handling is much more robust and user-friendly. Even better, your program can now process any input file that you specify on the command line!
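
For reference, the --help run should produce output close to the following; argparse generates it automatically from our add_argument() calls (the exact wording varies slightly between Python versions):

    usage: process_data.py [-h] --input INPUT

    Process and sum values in input text files.

    optional arguments:
      -h, --help            show this help message and exit
      --input INPUT, -i INPUT
                            Filename of data to parse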

  7. Now that we have our code running, let's send off a batch job to process one of our input files:
bsub -q short \
     -We 15 \
     -M 1000 -hl \
     python code/process_data.py --input data/file1.txt

# wait for a few seconds, and then enter the following to ensure it's off and running...

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1607989    rfreema RUN   short_int  rhrcscli1   rhrcsnod13  /bin/bash  Aug 11 11:46
1607990    rfreema PEND  short      rhrcsnod13              *file1.txt Aug 11 11:47

$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1607989    rfreema RUN   short_int  rhrcscli1   rhrcsnod13  /bin/bash  Aug 11 11:46
1607990    rfreema RUN   short      rhrcsnod13  rhrcsnod15  *file1.txt Aug 11 11:47

You can see that our first job is our shell -- us running and testing on the compute node. The second job is the one we just submitted. The first time I ran bjobs, it hadn't yet been dispatched by the scheduler. After a few more seconds, it was on its way.

If all is well, your program will finish in 5 to 10 minutes, and the data/outfile1.txt file will be waiting for you.

Create Unique Output Files

We've created great flexibility in our processing script by allowing any input file. But our output file is still limited to what we've hard-coded. Let's change that as well by adding an --output filename option. We could also follow the standard convention that, if this option is not provided, the output is written to the screen (STDOUT). But we're already writing our status info (Line X: runningsum) there, so it would get confusing. One alternative would be to write this status to the error stream (STDERR) instead, so that the output and error streams can be separated, as sketched below. But this is more complex than needed for this toy example, so let's simply require the --output filename key/value pair.
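
For the curious, sending the status messages to STDERR would be a one-line change. This is just a sketch, where line_number and runningsum stand in for whatever variable names the script actually uses:

    import sys

    # write status to STDERR so STDOUT stays free for the real output
    print('Line {}: {}'.format(line_number, runningsum), file=sys.stderr)

With that change, python code/process_data.py ... > results.txt would capture only the data, while the status messages still appear on the terminal.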

  1. In your script file, add the additional lines between the ## new lines and ## end new markers, and don't forget to comment out the out_filepath = ... statement:
    # get inputs: either file name or data from STDIN
    import argparse
    parser = argparse.ArgumentParser(description='Process and sum values in input text files.')
    parser.add_argument('--input', '-i', required=True,
                        help='Filename of data to parse.')
    #
    ## new lines
    parser.add_argument('--output', '-o', required=True,
                        help='Filename of output for summed data.')
    ## end new

    args = parser.parse_args()

    if not args.input:
        parser.error("Error: An input filename must be provided.")
    else:
        in_filepath = args.input
    #
    ## new lines
    if not args.output:
        parser.error("Error: An output filename must be provided.")
    else:
        out_filepath = args.output
    ## end new

    # and don't forget to comment out the old hard-coded path...
    #out_filepath = "data/outfile1.txt"

The final code file should reflect the contents of process_data-M2v3.py. If you are having trouble getting the code file written correctly, issue the command:

cp code/process_data-M2v3.py code/process_data.py

  2. Save your script file, and try running it with different options:
$ python code/process_data.py

$ python code/process_data.py --help

$ python code/process_data.py --input data/file1.txt

$ python code/process_data.py --input data/file1.txt --option

$ python code/process_data.py --input data/file1.txt --output 

$ # note that the filename is slightly different than what we've had before
$ python code/process_data.py --input data/file1.txt --output data/out1.txt

You should receive a number of different messages, from errors with usage info, to full-on help text, to the script running to completion. Feel free to press Ctrl-C to terminate the script.

  3. Since our program is working once again, let's send off two jobs to the batch queues to get our results and to do a little sleuthing:

# good file
bsub -q short \
     -We 15 \
     -M 1000 -hl \
     python code/process_data.py --input data/file1.txt --output data/out1.txt

# not-good file
bsub -q short \
     -We 15 \
     -M 1000 -hl \
     python code/process_data.py --input data/file1.bad --output data/out1.bad

# wait for a few seconds, and then enter the following to ensure they're off and running...
$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1607989    rfreema RUN   short_int  rhrcscli1   rhrcsnod13  /bin/bash  Aug 11 11:46
1607990    rfreema RUN   short      rhrcsnod13  rhrcsnod15  *file1.txt Aug 11 11:47
1607991    rfreema RUN   short      rhrcsnod13  rhrcsnod15  */out1.txt Aug 11 11:51
1607992    rfreema RUN   short      rhrcsnod13  rhrcsnod15  */out1.bad Aug 11 11:51

The last two jobs are the ones we just submitted. If all is well, your programs will finish in 5 to 10 minutes, and the named files out1.txt and out1.bad will be waiting for you with your results.

Looking at Job History Details and Sleuthing

Of course, nothing ever goes as perfectly as one expects. Or perhaps it does, most of the time. Since we have our output files, we can sanity check our results to ensure we're getting the well-formed data that we would expect.

  1. For this analysis, each output file should contain one line fewer than its input file. So let's check:
wc -l data/file1.txt

wc -l data/out*.*

OK, we have a problem, as the line counts are incorrect.

  2. Let's check the end of each file:
tail data/out*.*

OK, these look relatively normal.

  3. Since we ran in batch, let's see if the scheduler caught anything:
# since the jobs have finished, we have to use bhist
bhist

So, of the three jobs we ran, two ran for the expected length of time, while the third (the last one) ran noticeably shorter. Likely a failure.
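
By default, bhist reports how long each job spent in each state (in seconds), with a header along these lines:

    JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL

Comparing the RUN (or TOTAL) column across the three jobs is what reveals the short runtime of the failed one.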

  4. Let's look at the details for the last two jobs, the good (longer) one and the bad (shorter) one:
# longer/good one first
bhist -l JOBID

# shorter/bad one next
bhist -l JOBID

OK, scanning down the details, we see that the longer job completed successfully, but the shorter one did not and exited with an error. Unfortunately, we cannot really see what the error was, so this is something that we're going to have to work on.
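
One way to capture that error output on a future run (assuming your cluster's LSF configuration permits it) is to have bsub write the job's STDOUT and STDERR to files, where LSF replaces %J with the job ID. The logs/ directory here is hypothetical; create it first, or write to the current directory instead:

    bsub -q short \
         -We 15 \
         -M 1000 -hl \
         -o logs/job.%J.out \
         -e logs/job.%J.err \
         python code/process_data.py --input data/file1.bad --output data/out1.bad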


Next: Better Visibility with Progress of Jobs, Job Information, and Job Control