Restarting a killed job #2
Ideally, all the relevant information is included in the log file. We could read that in, find which sequences have been merged, and which sequences have been fixed as their own OTUs. Then the algorithm just proceeds from that point. Does that sound like what you were hoping for? Did the job get killed because of time or space constraints? |
Time constraints. However, see my other issue with the huge log files. Is there any way you could avoid putting the whole list of sequences in each OTU and instead just record which representative OTU each sequence is associated with? Then you could read in the log file, get that information, and rebuild the relationships from there. |
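For illustration, here is a minimal sketch of the compact log format proposed above: one tab-separated line per input sequence giving its representative OTU. The function names and the tab-separated layout are assumptions for this example, not dbotu3's actual log format.

```python
import csv
from collections import defaultdict

def write_assignment(log_handle, seq_id, otu_id):
    """Append one 'sequence -> representative OTU' record and flush it immediately."""
    log_handle.write(f"{seq_id}\t{otu_id}\n")
    log_handle.flush()

def read_assignments(log_path):
    """Rebuild OTU membership from the compact log: {otu_id: [member seq_ids]}."""
    membership = defaultdict(list)
    with open(log_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 2:
                continue  # skip malformed or truncated lines
            seq_id, otu_id = row
            membership[otu_id].append(seq_id)
    return dict(membership)
```

A sequence that founds its own OTU would simply be logged with itself as the representative, so the same reader recovers both new and merged OTUs.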
I have a few more clarifying questions before I dive in. First, do you have any memory problems? The current approach requires reading in the whole fasta and sequence table. I've thought of doing some fancy footwork to read in and process the data record-by-record, and I'm wondering if this update should happen at the same time.

re: this and log files (#3), I'm weighing two ways to do this. In both cases, I think having a separate debug log is the way to go.

Way 1: Generate a log file that has a header with information about the run (parameter values, input/output filenames) and is continuously updated with information about new and merged OTUs. Each input sequence would create a log entry like "seqA is new OTU" or "seqB was merged into seqA".

Way 2: At regular intervals (say, every 1000 input sequences), pickle an object that carries information about the parameter values, the input data, and the current algorithm progress. (This is basically the

In way 1, it is easy to create the log, and it is human-readable. The cons are that the log could still become large, that restarting a run would require re-reading all the data, and that it would be more work to write the code to restart the run.

In way 2, the pickled object is a self-contained snapshot of the run: you don't need any external files to restart it, and its size is bounded (it shouldn't exceed the combined size of the input and final output). However, making that object will be more of a pain, and if you are running dbotu as a job on a cluster, you will probably capture stderr to a file anyway, which means you are effectively creating the log file from way 1.

I think in #3 you suggested that the log file be basically like the membership file (one line per OTU; the representative sequence ID is the first field, and the member sequence IDs are the rest of the fields). This saves space, but it's problematic to update on disk on the fly: you'd basically be rewriting the current membership file at every step of the algorithm. You also lose the temporal aspect: you don't necessarily know which sequence was processed last.

I'm thinking to go with way 1. Questions? Thoughts? |
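To make the comparison concrete, here is a rough sketch of what way 2's checkpointing could look like with pickle. The `state` object and its `step` method are hypothetical placeholders, not part of dbotu3.

```python
import pickle

CHECKPOINT_EVERY = 1000  # pickle the run state every N processed sequences

def save_checkpoint(state, path="dbotu_checkpoint.pkl"):
    """Serialize the whole run state (parameters, data, progress) to one file."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path="dbotu_checkpoint.pkl"):
    """Restore a previously saved run state to resume from it."""
    with open(path, "rb") as f:
        return pickle.load(f)

def process(records, state):
    """Process sequences one by one, checkpointing at a fixed interval."""
    for i, record in enumerate(records, start=1):
        state.step(record)  # hypothetical per-sequence update of the algorithm state
        if i % CHECKPOINT_EVERY == 0:
            save_checkpoint(state)
```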
Now I'm leaning toward way 3, which is hackier but maintains the cleanness of the original code. In way 3, you read the log file, create an ad hoc sequence table where the member sequences have already been merged into their corresponding OTUs, and then spit out a suggested command to re-run dbOTU using that ad hoc table. Something like:
The only trick here is that, in theory, sequence A could have been named an OTU, then sequence B could have been genetically similar but distribution-different from A and made into an OTU, then a bunch of sequences were merged into A and/or B such that, when the algorithm is restarted, A and B at that point would be merged. This seems unlikely, so I might just put in a warning in the |
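A sketch of the "ad hoc table" idea: collapse the counts of already-merged member sequences into their representative OTUs before re-running. The pandas layout and the `membership` mapping are assumptions for illustration, not dbotu3's actual file formats.

```python
import pandas as pd

def collapse_table(seq_table: pd.DataFrame, membership: dict) -> pd.DataFrame:
    """
    seq_table: counts with one row per sequence ID.
    membership: {otu_id: [member seq_ids]} recovered from the log.
    Returns a table where each member's counts are added to its OTU's row.
    """
    rep_of = {seq: otu for otu, members in membership.items() for seq in members}
    # Sequences not mentioned in the log have not been processed yet; keep them as-is.
    groups = seq_table.index.map(lambda s: rep_of.get(s, s))
    return seq_table.groupby(groups).sum()
```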
Making a restart script would be fine as well, although I'm not sure I follow what will happen with the "ad hoc" table. Is the table only going to contain OTU representative sequences and drop any that were merged by that point? If so, that's difficult, because you couldn't figure out which sequences were assigned to which OTU from this abbreviated table. Also, you will still have to go through the comparison of all those OTUs again, which is redundant and, as you said, might result in a different answer. I think the resulting output files should be identical whether the script is restarted or not. As long as that happens, it should be fine. |
Word. This does seem like a weird way to go about it. I reverted to way 1 and implemented the restart script on a branch. Have a look, maybe try it out. If it looks good, I'll merge it into the main repo. |
Great! Our cluster is going down for maintenance soon, so that will give me the opportunity to try this new restart command. I'll let you know how it goes. |
So I finally got around to checking out the restart function. I ran this:

After 4 days, this process was killed, so I restarted it with:

But I got this error:

Any ideas? |
It seems like the log file is overwritten when I restart. Does it lose information, depending on when/how it dies? |
I think I know what's going on, but I'll need to see the log file to debug this. Maybe you can do a short run that gets killed (wall time of 10 minutes or something?).

I suppose I should also write something so that, if the parsing fails, the original log file doesn't get overwritten. I'll look into that. |
Well, I can't give you the original log file for the above error, but I have another job that ended with what seems to be a similar error:

The end of that log file looks like this:

dbotu3 likely died before it finished printing the last line of the log here. Can the restart script deal with partially printed lines (especially at the end)? |
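One way a restart parser could tolerate a file that was cut off mid-line is to drop the final line whenever the file does not end with a newline. This is a sketch of that idea, not the actual dbotu3 code:

```python
def read_complete_lines(log_path):
    """Return log lines, dropping the last one if the file was cut off mid-line."""
    with open(log_path) as f:
        text = f.read()
    lines = text.splitlines()
    if lines and not text.endswith("\n"):
        lines = lines[:-1]  # the final line was never fully written
    return lines
```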
Good news and bad news. The good news: this is the problem I was guessing had happened, and I made a solution. The bad news:
Thoughts on this? |
It doesn't seem like the best idea to ask a human to intervene: users could potentially mess it up more. Can't you just write something as simple as "end" or ";" at the end of the line to indicate whether the line is complete? Then just look for that, and if it isn't there, don't use any info from that line. |
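A sketch of the end-of-record sentinel idea described above: append a terminator to every record and ignore any line that lacks it. The sentinel character and record format are made up for illustration.

```python
SENTINEL = ";"  # marks a fully written record

def write_record(handle, text):
    """Write one record, terminated by the sentinel, and flush it to disk."""
    handle.write(text + SENTINEL + "\n")
    handle.flush()

def complete_records(log_path):
    """Yield only records whose terminator made it to disk."""
    with open(log_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.endswith(SENTINEL):
                yield line[:-len(SENTINEL)]
```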
Interesting. I think your idea is that if the only kind of log error that happens is these truncated lines, people choose well-behaved sequence IDs, and folks are only using the restart script for this purpose, then it would be a bad idea to ask them to fiddle with that file. My concerns are:
My compromise: I added a

As you'll see, I also updated the log file backup behavior so that it doesn't require manually pressing any y/n keys. Let me know what you think! |
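For reference, a non-interactive backup along the lines described above could be as simple as renaming any existing log with a timestamp instead of prompting for y/n. This is an illustrative sketch, not the actual implementation:

```python
import os
import shutil
import time

def backup_existing_log(log_path):
    """If a log already exists, rename it aside rather than asking the user."""
    if os.path.exists(log_path):
        stamp = time.strftime("%Y%m%d-%H%M%S")
        shutil.move(log_path, f"{log_path}.bak.{stamp}")
```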
Sarah reported that:
My first guess is that this is because of output flushing: your computer flushes output (i.e., actually writes Python's print commands to disk) more often than the cluster does (there is a flushing sketch after this comment). I propose two tests:
If this works, I'll put that |
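A sketch of the flushing idea from the previous comment: force each record through Python's buffer and the OS cache as it is written, so that a killed job leaves at most one truncated line. This is illustrative only, not dbotu3's actual logging code.

```python
import os

def log_record(handle, text):
    """Write a record and push it through Python's buffer and the OS cache."""
    handle.write(text + "\n")
    handle.flush()             # flush Python's internal buffer
    os.fsync(handle.fileno())  # ask the OS to commit it to disk
```

An equivalent shortcut for simple scripts is `print(text, file=handle, flush=True)`, which flushes Python's buffer but does not force an fsync.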
That didn't seem to work. I got this error:
|
My bad. Turns out I made a fix; give that a shot. |
Looks like it works for both the main program and while running through the restart command. |
Great. And let me know how it works when a job gets killed. |
If dbotu3 is killed while it's running, there should be a way to recover so that you don't have to start over from scratch. Any ideas about how to make this happen?