Faster transform-gisaid #88
Conversation
To process the gisaid data from Aug 20 on my laptop, the old script took about 30 minutes and multiple GB of RAM. This new script takes about 60 seconds and about 400 MB of RAM.
This looks great, @ttung ! Thanks so much for all the work you've done here -- this is exactly what we desperately needed in the ncov-ingest pipeline. I just ran this script on a smaller subset of the GISAID data (200 rows), and the output metadata TSV, once sorted, matched the existing transform's output exactly.
I left a couple of minor inline comments and have a couple of higher-level suggestions, one of which should come as no surprise to you.
- Can we output a sorted metadata TSV using the existing sorting logic in `transform-gisaid`? This is crucial for our internal alerting system, which notifies nCoV build maintainers of incoming metadata. I believe we'll also need to sort the additional info TSV file. @trvrb mentioned having a slight preference for the ability to also sort the output sequences FASTA file, but this is not a strict requirement.
- Can we send the printed warnings from your new `transform-gisaid` to Slack? The nCoV build maintainers are keen to know which annotations fail to apply (and, as we discussed previously, are currently working to systematically clean up missing/trailing tabs). I imagine we can easily pipe the stdout output to our `notify-slack` script in the `ingest-gisaid` file.
I hope I've addressed everything from our previous discussions on Slack. Please let me know if you have any questions, old or new. Thanks again!
I believe you've taken the right approach here, skipping invalid annotations and emitting a warning instead. PR #82, once implemented, should contain all the missing/surplus column logic for this repo.
Updated. It now sorts metadata.tsv all the time. There is a new
You could tee stdout to a file, and then pipe the warning lines to Slack.
@emmahodcroft suggested that we accept the three-column (i.e., missing the trailing tab) rows. We could:
Just as an FYI - I'm running this on WSL and just comparing to a file run with the old script. However, the new script is printing all 'Windows' line endings, while the old script has 'Unix' line endings. I figure we probably want it to always use Unix line endings, or else (if this is because I'm running on WSL - which seems kinda weird, but ok?) this might throw off any future
Edit: actually this is a larger problem than I thought - I think it's making the diffs to detect the changes not work?
Hi @ttung Thanks for the breakdown of the annotations! I'd definitely go for 3 or 4; I think warning would be a big improvement here, particularly if we can pipe to Slack somehow (this is beyond my expertise, but maybe they can also be written to a file that gets sent to Slack, like the other files). Then we can fix these as they crop up, which will ensure intended behaviour!
At the moment we do process 'blanking' entries (those that are
I was concerned that it might make doing testing hard if there were lots of entries missing a trailing tab, but it seems there aren't too many, so maybe this isn't a concern.
I think the right solution is to use
So, I think it would be better for us to standardise this in the script - we don't want files with different file endings ending up in places like `nextmeta` from one day to the next, depending on what system the script runs. @ivan-aksamentov suggests adding a simple `line_terminator` line would take care of this, so we always have Unix-style endings and can rely on these being consistent. Could we add that? (See commit here)
Would the runtime environment for things like `nextmeta` really vary day to day? That seems like a brittle setup.
My concern for forcing a unix line separator is that someone who opens up the output in their local text editor will find the output unusable. How about we add an option for forcing the line terminator? Ordinary users wouldn't encounter this, but if you had an automated flow for `nextmeta`, that could run with the option.
They shouldn't, but at the moment they do - since we are all running locally while we can't run online like usual! This isn't our usual flow, but it's the back-up, so it's good if it can be consistent too. We do have to fall back on it sometimes. However, an option would be fine, and within the ncov-ingest script we can call this; anyone could modify their own fork or call the script individually.
With Ivan's change (to ensure consistent file endings) I am doing a check of this PR against yesterday's run with the 'old' script (using the same GISAID download and 'comparison' files from the day before) - it's looking really good! Apart from the expected differences due to the current different handling of missing 'trailing tabs' (just a few South Africa entries), I unfortunately can't compare
However, this is a great outcome for testing (IMO) - looking really promising :)
added
Once again, thank you so much! My last request -- could you please edit your commit message to reflect reality, now that we are outputting sorted metadata and additional info TSVs?
@nCoV team -- P.S. I also just re-ran the new edit checklist TODO item for myself.
Processing in transform-gisaid is represented as a pipeline of steps. Each step either manipulates the data (`Transform`) or filters the data (`Filter`). At the start of each pipeline is a `DataSource`.

To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:

1. The new script does _not_, by default, sort the sequences before outputting. It does perform the same deduplication process, however. To sort the sequence data, use the option `--sorted-fasta`.
2. The new script interprets errors in `source-data/gisaid_annotations.tsv` differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed.

Additionally, there is the option `--output-unix-newline`, which forces all output (fasta files and metadata csv files) to use unix newlines.

Fixes #77
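The pipeline design described in the commit message could be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual API: the class and method names (`Step`, `apply`, `run`) and the two example steps are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, Optional

class Step(ABC):
    """One stage of the pipeline."""
    @abstractmethod
    def apply(self, record: dict) -> Optional[dict]:
        """Return the (possibly modified) record, or None to drop it."""

class Transform(Step, ABC):
    """A step that manipulates a record and passes it on."""

class Filter(Step, ABC):
    """A step that either passes a record through unchanged or drops it."""

class StripStrainWhitespace(Transform):
    def apply(self, record):
        record["strain"] = record["strain"].strip()
        return record

class RequireDate(Filter):
    def apply(self, record):
        return record if record.get("date") else None

def run(source: Iterable[dict], steps: list) -> Iterator[dict]:
    """Feed each record from the DataSource through every step in order."""
    for record in source:
        for step in steps:
            record = step.apply(record)
            if record is None:
                break  # a Filter dropped this record
        else:
            yield record
```

Streaming records one at a time through the chain, rather than materialising the whole dataset, is plausibly what accounts for the drop from multiple GB of RAM to a few hundred MB.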
Updated!
these were raising a flag in #88
I've run this again today against the 'old' pipeline, with Eli's changes to
(Worth noting I'm running this locally in Ivan's script, which doesn't include Slack or any auto upload/download, but should at least be a good test of
Awesome! Let's plan to deploy this Monday morning. I'll add the bit where we pipe the output warnings to Slack.
Superseded by #90
@ttung thank you so much for adding this feature! In some local testing run by the Nextstrain team, we found that running
Edit: Specifically, we saw this in the `additional_info.tsv`.
Description of proposed changes
Processing in transform-gisaid is represented as a pipeline of steps. Each step either manipulates the data (`Transform`) or filters the data (`Filter`). At the start of each pipeline is a `DataSource`.
Additionally, there is the option `--output-unix-newline`, which forces all output (fasta files and metadata csv files) to use unix newlines.
Related issue(s)
Fixes #77
Testing
To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:
1. The new script does _not_, by default, sort the sequences before outputting. It does perform the same deduplication process, however. To sort the sequence data, use the option `--sorted-fasta`.
2. The new script interprets errors in `source-data/gisaid_annotations.tsv` differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed.
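The four-column check on `gisaid_annotations.tsv` rows amounts to something like the following sketch (hypothetical: the real script's warning wording, column names, and parsing details may differ):

```python
import sys

def parse_annotations(lines):
    """Yield 4-tuples from annotation TSV lines, skipping bad rows.

    Any row that does not have exactly four tab-separated columns is
    ignored, with a warning printed so maintainers can clean it up
    (e.g. by forwarding the warnings to Slack).
    """
    for lineno, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 4:
            print(f"WARNING: skipping annotation line {lineno}: "
                  f"expected 4 columns, got {len(columns)}", file=sys.stderr)
            continue
        yield tuple(columns)
```

Rows missing the trailing tab show up here as three-column rows, which is why they are currently skipped with a warning rather than applied.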