This repository has been archived by the owner on Apr 9, 2022. It is now read-only.

Restarting a killed job #2

Open
spacocha opened this issue Apr 19, 2017 · 19 comments

@spacocha

If dbotu3 is killed while it's running, there should be a way to recover so that you don't have to start over from scratch. Any ideas about how to make this happen?

@swo
Owner

swo commented Apr 19, 2017

Ideally, all the relevant information is included in the log file. We could read that in, find which sequences have been merged, and which sequences have been fixed as their own OTUs. Then the algorithm just proceeds from that point. Does that sound like what you were hoping for?

Did the job get killed because of time or space constraints?

@spacocha
Author

Time constraints. However, see my other issue about the huge log files. Is there any way to not put the whole list of sequences in each OTU, but instead just record which representative OTU each sequence is associated with? Then you can read in the log file, get that information, and rebuild the relationships from there.

@swo
Owner

swo commented Apr 20, 2017

I have a few more clarifying questions before I dive in.

First, do you have any memory problems? The current approach requires reading in the whole fasta and sequence table. I've thought of doing some fancy footwork to read in and process the data record-by-record. I'm wondering if this update should happen at the same time.
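
For illustration, the record-by-record idea would look roughly like this (a minimal sketch with a made-up filename; the print is just a stand-in for the real per-record work):

# Sketch: stream FASTA records one at a time instead of reading the
# whole file into memory.
def fasta_records(path):
    header, chunks = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

for seq_id, seq in fasta_records('seqs.fa'):
    print(seq_id, len(seq))  # stand-in for the real per-record processing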

re: this and log files (#3). I'm weighing two ways to do this. In both cases, I think having a separate debug log is the way to go.

Way 1: Generate a log file that has a header with information about the run (parameter values, input/output filenames) and is continuously updated with information about new and merged OTUs. Each input sequence would create a log entry like "seqA is new OTU" or "seqB was merged into seqA".

Way 2: At regular intervals (say, every 1000 input sequences), pickle an object that carries information about the parameter values, the input data, and the current algorithm progress. (This is basically the DBCaller object without the input/output filehandles.) Separately, the program will spit out some information about its progress (maybe sending that info to stderr).
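
A rough sketch of what way 2 could look like (caller, records, and checkpoint_state() are hypothetical stand-ins, not actual dbotu internals):

import pickle
import sys

CHECKPOINT_EVERY = 1000  # e.g. checkpoint every 1000 input sequences

def run_with_checkpoints(caller, records, path='checkpoint.pkl'):
    # `caller` holds the algorithm state; `records` iterates over the
    # input sequences. Both names are hypothetical.
    for i, record in enumerate(records, start=1):
        caller.process_record(record)  # hypothetical per-sequence step
        if i % CHECKPOINT_EVERY == 0:
            # pickle the state minus open filehandles (which can't be
            # pickled); checkpoint_state() is another hypothetical helper
            with open(path, 'wb') as f:
                pickle.dump(caller.checkpoint_state(), f)
            sys.stderr.write('processed {} records\n'.format(i))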

In way 1, it is easy to create the log and it is human-readable. Cons are that the log could still become large, that restarting a run would require re-reading all the data, and that it will be more work to write the code to restart the run.

In way 2, the pickled object is a self-contained snapshot of the run's state. You don't need any external files to restart the run. Also, the size of the pickled object is bounded: it shouldn't exceed the combined size of the input and final output. However, making that object will be more of a pain, and if you are running dbotu as a job on a cluster, you will probably capture stderr to a file anyway, which means you are effectively creating the log file from way 1.

I think in #3 you suggested that the log file be basically like the membership file (one line per OTU; representative sequence ID is the first field, member sequence IDs are the rest of the fields). This saves space, but it's problematic to update on disk on the fly: you'd basically be writing the current membership file at every step in the algorithm. You also lose the temporal aspect: you don't necessarily know which sequence was processed last.

I'm thinking to go with way 1. Questions? Thoughts?

@swo
Owner

swo commented Apr 21, 2017

Now I'm leaning toward way 3, which is hackier but keeps the original code clean.

In way 3, you read the log file, create an ad hoc sequence table where the member sequences have already been merged into their corresponding OTUs, and then spit out a suggested command to re-run dbOTU using that ad hoc table. Something like:

$ ./restart_dbotu.py data/output/log.txt ad_hoc_table.tsv
The ad hoc sequence count table was written to ad_hoc_table.tsv
Run this command to start again:

dbotu.py ad_hoc_table.tsv /path/to/seq.fa --dist 0.1 --abund 10.0 --pval 0.0005 --log /path/to/log.txt --debug /path/to/debug_log.txt

NB: This command will overwrite your original log (and debug) files.
If you want to keep the originals, move them or change the filenames in this command.

The only trick here is that, in theory, sequence A could have been named an OTU, then sequence B could have been genetically similar to A but distributionally different from it and so made into its own OTU, and then a bunch of sequences could have been merged into A and/or B such that, when the algorithm is restarted, A and B would at that point get merged. This seems unlikely, so I might just put a warning in the restart_dbotu.py output that says, "Make sure that sequences A and B were selected as OTUs when you restarted dbOTU." If this ever doesn't work, then I can dive deeper.
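
Very roughly, the ad hoc table construction could look like this (it assumes a hypothetical one-merge-per-line log format and made-up filenames, not the actual log format):

import pandas as pd

# Hypothetical log format: one "member_id<TAB>otu_id" line per merge.
merges = {}
with open('log.txt') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) == 2:
            member, otu = fields
            merges[member] = otu

# Count table with sequence IDs in the first column
table = pd.read_csv('input_table.tsv', sep='\t', index_col=0)
# Map each sequence to its representative (itself if it was never merged),
# then collapse the counts into one row per OTU
reps = [merges.get(seq_id, seq_id) for seq_id in table.index]
ad_hoc = table.groupby(reps).sum()
ad_hoc.to_csv('ad_hoc_table.tsv', sep='\t')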

@spacocha
Author

spacocha commented Apr 21, 2017

Making a restart script would be fine as well, although I'm not sure I follow what will happen with the ad hoc table. Is the table going to contain only OTU representative sequences, dropping any that were merged by that point? If so, that's a problem, because you couldn't figure out from this abbreviated table which sequences were assigned to which OTU. Also, you will still have to go through the comparison of all those OTUs again, which is redundant and, as you said, might result in a different answer. I think the resulting output files should be identical whether the script is restarted or not. As long as that happens, it should be fine.

@swo
Owner

swo commented Apr 21, 2017

Word. This does seem like a weird way to go about it. I reverted to way 1 and implemented the restart script on a branch. Have a look, maybe try it out. If it looks good, I'll merge it into the main repo.

@spacocha
Author

Great! Our cluster is going down for maintenance soon, so that will give me the opportunity to try this new restart command. I'll let you know how it goes.

@spacocha
Author

So I finally got around to checking out the restart function. I ran this:
python ~/work/lib/dbotu3/dbotu.py unique.f0.mat unique.fa -o unique.output.txt -l unique.output.log -m unique.membership -a 0

After 4 days, this process was killed, so I restarted it with:
python ~/work/lib/dbotu3/scripts/restart_dbotu.py unique.output.log

But I got this error:
Traceback (most recent call last):
  File "/home-4/[email protected]/work/lib/dbotu3/scripts/restart_dbotu.py", line 98, in <module>
    restart_dbotu_run(args.log)
  File "/home-4/[email protected]/work/lib/dbotu3/scripts/restart_dbotu.py", line 76, in restart_dbotu_run
    raise RuntimeError('progress log item "{}" is not a string or 2-item list'.format(elt))
RuntimeError: progress log item "None" is not a string or 2-item list

Any ideas?

@spacocha
Author

It seems like the log file is overwritten when I restart. Does it lose information, depending on when/how it dies?

@swo
Owner

swo commented May 30, 2017 via email

@spacocha
Author

spacocha commented May 30, 2017

Well, I can't give you the original log file for the above error, but I have another job which ended with what seems to be a similar error:

Traceback (most recent call last):
  File "/home-4/[email protected]/work/lib/dbotu3/scripts/restart_dbotu.py", line 98, in <module>
    restart_dbotu_run(args.log)
  File "/home-4/[email protected]/work/lib/dbotu3/scripts/restart_dbotu.py", line 15, in restart_dbotu_run
    header, progress = yaml.load_all(f)
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/__init__.py", line 83, in load_all
    yield loader.get_data()
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/constructor.py", line 33, in get_data
    return self.construct_document(self.get_node())
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/composer.py", line 27, in get_node
    return self.compose_document()
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/composer.py", line 82, in compose_node
    node = self.compose_sequence_node(anchor)
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/composer.py", line 110, in compose_sequence_node
    while not self.check_event(SequenceEndEvent):
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/parser.py", line 384, in parse_block_sequence_entry
    if not self.check_token(BlockEntryToken, BlockEndToken):
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/scanner.py", line 248, in fetch_more_tokens
    return self.fetch_double()
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/scanner.py", line 652, in fetch_double
    self.fetch_flow_scalar(style='"')
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/scanner.py", line 663, in fetch_flow_scalar
    self.tokens.append(self.scan_flow_scalar(style))
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/scanner.py", line 1154, in scan_flow_scalar
    chunks.extend(self.scan_flow_scalar_spaces(double, start_mark))
  File "/home-4/[email protected]/.local/lib/python2.7/site-packages/yaml/scanner.py", line 1241, in scan_flow_scalar_spaces
    "found unexpected end of stream", self.get_mark())
yaml.scanner.ScannerError: while scanning a quoted scalar
  in "/scratch/groups/sprehei1/data/Mystic_lake_time_series_dir/raw_data/130823Alm_dir/dbotu3/unique.output.log", line 650838, column 16
found unexpected end of stream
  in "/scratch/groups/sprehei1/data/Mystic_lake_time_series_dir/raw_data/130823Alm_dir/dbotu3/unique.output.log", line 650838, column 24

The end of that log file looks like this:

[[email protected]@compute0078 dbotu3]$ tail unique.output.log
- ["seq99057", "seq2128"]
- ["seq99056", "seq2629"]
- ["seq99055", "seq4308"]
- ["seq99054", "seq442"]
- "seq99053"
- ["seq99052", "seq133"]
- ["seq99050", "seq186"]
- ["seq99048", "seq2994"]
- ["seq99047", "seq791"]
- ["seq99046", "seq4244[[email protected]@compute0078 dbotu3]$

dbotu3 likely died before it finished printing the last line of the log here. Can the restart script deal with partially printed lines (especially at the end)?

@swo
Owner

swo commented Jun 2, 2017

Good news and bad news.

The good news: This is the problem I was guessing had happened, and I made a solution. The restart branch now has a new-and-improved restart script, which I think should work.

The bad news:

  • I previously used YAML for the log file, but (as you saw) the YAML parser doesn't do a good job of telling you what went wrong with your input file. In light of this, I switched to a simplified, tab-separated format. Unfortunately this means you'll need to start those jobs from scratch (or manually convert the YAML log files into the new-style log files).
  • I don't think I can write code that's smart enough to understand whether a final line has been truncated. I could require that every line in the log file end with a newline, but this is tough to do with Python's regular file-reading functions (one possible check is sketched after this list). If I don't require that, then a final line "merge seqX into seqY" could be interpreted as complete (merge into seqY) or incomplete (if, say, seqYZ is in the data too).
  • So I'm thinking of asking the user to manually edit weird log files to avoid this kind of ambiguity (or, at least, asking a human to resolve it).
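
For what it's worth, here is one way to at least detect that the final write was cut off (a sketch, not what the script currently does; it catches a missing trailing newline but not the seqY vs. seqYZ ambiguity above):

def last_line_complete(path):
    with open(path, 'rb') as f:
        f.seek(0, 2)             # jump to the end of the file
        if f.tell() == 0:
            return True          # empty file: nothing to drop
        f.seek(-1, 2)            # inspect the final byte
        return f.read(1) == b'\n'

if not last_line_complete('unique.output.log'):
    raise RuntimeError('log looks truncated; check or remove its final line')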

Thoughts on this?

@spacocha
Author

spacocha commented Jun 2, 2017

Asking a human to intervene doesn't seem like the best thing to do; users could potentially mess it up more. Can't you just write something as simple as "end" or ";" at the end of the line to indicate whether the line is complete? Then just look for that, and if it isn't there, don't use any info from that line.

@swo
Owner

swo commented Jun 2, 2017

Interesting. I think your idea is that, if the only kind of log error that happens is these truncated lines, people choose well-behaved sequence IDs, and folks only use the restart script for this purpose, then it would be a bad idea to ask them to fiddle with that file.

My concerns are:

  • Using a record terminator like ; (or any particular character) would require telling users that they can't use that character in their sequence IDs, and then checking that they've not done it. (E.g., if someone named their sequences seq1;size=123 as per usearch then this parsing would fail.) These kinds of gotchas drive me crazy, and I don't want to foist them on someone.
  • I'm nervous about a restart script that does both smart parsing (i.e., inferring when to drop a line) and the actual restarting. I worry that adding this kind of smart parsing will lead to some super-confusing error down the line, when someone expects a bad log file to throw an error (rather than silently applying certain kinds of [potentially unwanted] fixes).

My compromise: I added a --drop_last option to the script (on a new push to the restart branch). This should deal with your use case. Check out the --help message and let me know what's un/clear.
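
Roughly the behavior I have in mind (a sketch with made-up details; the actual script may differ):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('log', help='progress log from the interrupted run')
parser.add_argument('--drop_last', action='store_true',
                    help='discard the final log line, e.g. if it was truncated when the job died')
args = parser.parse_args()

with open(args.log) as f:
    lines = f.read().splitlines()

if args.drop_last and lines:
    lines = lines[:-1]  # the user explicitly asked to drop the last line

# ... rebuild the run state from `lines` (not shown here)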

As you'll see, I also updated the log file backup behavior so that it doesn't require any manual pushing of y/n keys.

Let me know what you think!

@swo
Owner

swo commented Jun 16, 2017

Sarah reported that:

I tried a small test (I think on the command line) and it seems to be ok. But then, when I started over with a big analysis (I ran it for 100 hours), nothing happened in the end and everything was lost.

My first guess is that this is because of output flushing: your computer flushes output (i.e., actually writes Python's print output to disk) more often than the cluster does. I propose two tests:

  1. Submit a job that would take a long time to run, but set the wall time very short. If my theory is right, then when the job dies, the log file will be empty.
  2. Do the same thing, but this time using the code on the new restart branch. It's the same as the old code, except that I added flush=True to the log print calls (see the sketch below this list). This should ensure that every line of the log file is immediately written to disk (perhaps at a mild performance cost). If my theory is right, the log file should now exist.
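
A toy illustration of the difference (not dbotu's actual logging code):

# With flush=True, every printed line goes straight to disk instead of
# sitting in a buffer that is lost if the job is killed.
with open('progress.log', 'w') as log:
    for i in range(10):
        print('processed item', i, file=log, flush=True)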

If this works, I'll put that flush=True onto the main branch.

@spacocha
Author

spacocha commented Jun 19, 2017

That didn't seem to work. I got this error:

Traceback (most recent call last):
  File "/home-4/[email protected]/work/lib/dbotu3/dbotu.py", line 397, in <module>
    call_otus(args.table, args.fasta, args.output, args.dist, args.abund, args.pval, log=args.log, membership=args.membership, debug=args.debug)
  File "/home-4/[email protected]/work/lib/dbotu3/dbotu.py", line 355, in call_otus
    caller._print_progress_log('---')
  File "/home-4/[email protected]/work/lib/dbotu3/dbotu.py", line 158, in _print_progress_log
    print(*fields, sep='\t', file=self.progress_log, flush=True)
TypeError: 'flush' is an invalid keyword argument for this function

@swo
Owner

swo commented Jun 19, 2017

My bad. (Turns out that importing print_function gives you the version of the print function that doesn't have the flush= keyword.)
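
The usual workaround looks something like this (a sketch; not necessarily the exact change I pushed):

from __future__ import print_function

def print_flushed(fh, *fields):
    # Python 2's print_function has no flush= keyword (flush= was only
    # added to print() in Python 3.3), so flush the filehandle explicitly.
    print(*fields, sep='\t', file=fh)
    fh.flush()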

I made a fix; give that a shot.

@spacocha
Author

Looks like it works, both for the main program and when running through the restart command.

@swo
Owner

swo commented Jun 19, 2017

Great. And let me know how it works when a job gets killed.
