Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster transform-gisaid #88

Closed
wants to merge 1 commit into from
Closed

Faster transform-gisaid #88

wants to merge 1 commit into from

Conversation

ttung
Copy link
Contributor

@ttung ttung commented Aug 24, 2020

Description of proposed changes

Processing in transform-gisaid is represented as a pipeline of steps. Each step either manipulates the data (Transform) or filters the data (Filter). At the start of each pipeline is a DataSource.

To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:

  1. The new script does not, by default, sort the sequences before outputting. It does perform the same deduplication process however. To sort the sequence data, use the option --sorted-fasta.
  2. The new script interprets errors in source-data/gisaid_annotations.tsv differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed.

Additionally, there is the option --output-unix-newline, which forces all output (fasta files and metadata csv files) to use unix newlines.

Related issue(s)

Fixes #77

Testing

To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:

  1. The new script does not sort the sequences before outputting. It does perform the same deduplication process however.
  2. The new script interprets errors in source-data/gisaid_annotations.tsv differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed.

@ttung
Copy link
Contributor Author

ttung commented Aug 24, 2020

To process the gisaid data from Aug.20 on my laptop, the old script took about 30 minutes and multiple GB of RAM. This new script takes about 60 seconds and about 400MB of RAM.

Copy link
Contributor

@kairstenfay kairstenfay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, @ttung ! Thanks so much for all the work you've done here -- this is exactly what we desperately needed in the ncov-ingest pipeline. I just ran this script on a smaller subset of the GISAID data (200 rows), and the output metadata TSV, once sorted, matched the existing transform's output exactly.

I left a couple of minor inline comments and have a couple of higher level suggestions, one which should come as no surprise to you.

  1. Can we output sorted a sorted metadata TSV using the existing sorting logic in transform-gisaid? This is crucial for our internal alerting system which notifies nCoV build maintainers of incoming metadata. I believe we'll also need to sort the additional info TSV file. @trvrb mentioned having a slight preference for the ability to also sort the output sequences FASTA file, but this is not a strict requirement.

  2. Can we send the printed warnings from your new transform-gisaid to Slack? The nCoV build maintainers are keen to know which annotations fail to apply (and, as we discussed previously, are currently working to systematically clean up missing/trailing tabs). I imagine we can easily pipe the stdout output to our notify-slack script in the ingest-gisaid file.

I hope I've addressed everything from our previous discussions on Slack. Please let me know if you have any questions, old or new. Thanks again!

lib/utils/transformpipeline/transforms.py Outdated Show resolved Hide resolved
bin/transform-gisaid Outdated Show resolved Hide resolved
@kairstenfay
Copy link
Contributor

2\. The new script interprets errors in `source-data/gisaid_annotations.tsv` differently than the old script.  It ignores all rows that do not have four columns.  As a result, some annotations are not processed.

I believe you've taken the right approach here, skipping invalid annotations and emitting a warning instead. PR #82, once implemented, should contain all the missing/surplus column logic for this repo.

@kairstenfay kairstenfay removed the request for review from eharkins August 27, 2020 01:15
@ttung
Copy link
Contributor Author

ttung commented Aug 27, 2020

Can we output sorted a sorted metadata TSV using the existing sorting logic in transform-gisaid? This is crucial for our internal alerting system which notifies nCoV build maintainers of incoming metadata. I believe we'll also need to sort the additional info TSV file. @trvrb mentioned having a slight preference for the ability to also sort the output sequences FASTA file, but this is not a strict requirement.

Updated. It now sorts metadata.tsv all the time. There is a new --sorted-fasta option that produces a sorted FASTA file, along with a corresponding Scary Message™ that it takes a lot of memory. It is still a lot more efficient than the previous solution, though, taking ~2.7GB of RAM to process a 2.7GB gisaid.ndjson file. This would probably fit in a github action machine for the time being.

Can we send the printed warnings from your new transform-gisaid to Slack? The nCoV build maintainers are keen to know which annotations fail to apply (and, as we discussed previously, are currently working to systematically clean up missing/trailing tabs). I imagine we can easily pipe the stdout output to our notify-slack script in the ingest-gisaid file.

You could tee stdout to a file, and then pipe the warning lines to slack.

@ttung ttung requested a review from kairstenfay August 27, 2020 04:48
@ttung
Copy link
Contributor Author

ttung commented Aug 27, 2020

I believe you've taken the right approach here, skipping invalid annotations and emitting a warning instead. PR #82, once implemented, should contain all the missing/surplus column logic for this repo.

@emmahodcroft suggested that we accept the three column (i.e., missing the trailing tab) rows.

We could:

  1. continue skipping the three column rows.
  2. accept the three column rows silently.
  3. accept the three column rows but warn about them.
  4. accept the three column rows, but warn if they never get applied. (handles the situation where the gisaid_epi_isl is not where it's supposed to be).

@emmahodcroft
Copy link
Member

emmahodcroft commented Aug 27, 2020

Just as an FYI - I'm running this on WSL and just comparing to a file run with the old script. However, the new script is printing all 'Windows' line endings, the old script has 'Unix' line endings. I figure we probably want it to always use Unix file endings or else (if this is because I'm running on WSL - which seems kinda weird, but ok?) this might throw off any future diffs that we wish to do, if I've run one file and someone else runs another... (also for people using nextmeta)

Edit: actually this is a larger problem than I thought - I think it's making the diffs to detect the changes not work?

@emmahodcroft
Copy link
Member

Hi @ttung Thanks for the breakdown of the annotations!

I'd definitely go for 3 or 4, I think warning would be a big improvement here, particularly if we can pipe to slack somehow (this is beyond my expertise but maybe they can be also written to a file that gets sent to slack, like the other files). Then we can fix these as they crop up, which will ensure intended behaviour!

At the moment we do process 'blanking' entries (those that are strain EPI_ISL key with no trailing tab). I think as long as we have a warning (for anything with just 3 columns), I don't have too strong feelings whether we also apply this or not - we should be able to fix it so that it is applied if we have a warning.

I was concerned that it might make doing testing hard if there were lots of entries missing a trailing tab, but it seems there aren't too many, so maybe this isn't a concern.

@ttung
Copy link
Contributor Author

ttung commented Aug 27, 2020

Just as an FYI - I'm running this on WSL and just comparing to a file run with the old script. However, the new script is printing all 'Windows' line endings, the old script has 'Unix' line endings. I figure we probably want it to always use Unix file endings or else (if this is because I'm running on WSL - which seems kinda weird, but ok?) this might throw off any future diffs that we wish to do, if I've run one file and someone else runs another... (also for people using nextmeta)

Edit: actually this is a larger problem than I thought - I think it's making the diffs to detect the changes not work?

I think the right solution is to use diff --strip-trailing-cr if you want to compare. The script should generate the correct line endings for the platform it is on.

@ttung ttung requested a review from kairstenfay August 27, 2020 17:09
@emmahodcroft
Copy link
Member

So, I think it would be better for us to standardise this in the script - we don't want files with different file endings ending up in places like nextmeta from one day to the next, depending on what system the script runs.

@ivan-aksamentov suggests adding a simple line_terminator line would take care of this, so we always have Unix-style endings and can rely on these being consistent. Could we add that? (See commit here )

@ttung
Copy link
Contributor Author

ttung commented Aug 27, 2020

So, I think it would be better for us to standardise this in the script - we don't want files with different file endings ending up in places like nextmeta from one day to the next, depending on what system the script runs.

Would the runtime environment for things like nextmeta really vary day to day? That seems like a brittle setup.

@ivan-aksamentov suggests adding a simple line_terminator line would take care of this, so we always have Unix-style endings and can rely on these being consistent. Could we add that? (See commit here )

My concern for forcing a unix line separator is that someone who opens up the output in their local text editor will find the output unusable. How about we add an option for forcing the line terminator? Ordinary users wouldn't encounter this, but if you had an automated flow for nextmeta, that could run with the option.

@emmahodcroft
Copy link
Member

emmahodcroft commented Aug 27, 2020 via email

@emmahodcroft
Copy link
Member

emmahodcroft commented Aug 27, 2020

With Ivan's change (to ensure consistent file endings) I am doing a check of this PR against yesterday's run with the 'old' script (using same GISAID download and 'comparison' files from the day before) - it's looking really good! Apart from the expected differences due to current different handling of missing 'trailing tabs' (just a few South Africa entries), metadata.tsv, metadata-additions.tsv, metadata-changes.txt, and additional-info-changes.txt are all the same!

I can't unfortunately compare location-hierarchy-additions.tsv as I messed up yesterday's file (yesterday).

However, this is a great outcome for testing (IMO) - looking really promising :)

@ttung
Copy link
Contributor Author

ttung commented Aug 27, 2020

added --output-unix-newline option which enforces unix newlines for all files (csv files and fasta files).

Copy link
Contributor

@kairstenfay kairstenfay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again, thank you so much! My last request -- could you please edit your commit message to reflect reality, now that we are outputting sorted metadata and additional info TSVs?

@kairstenfay
Copy link
Contributor

kairstenfay commented Aug 27, 2020

@ nCoV team -- P.S. I also just re-ran the new transform-gisaid script, and I found no differences in the output files between this one and the version on master. I tested with the first 2000 rows of gisaid.ndjson.

edit checklist TODO item for myself.

  • pipe stdout output to Slack to notify of "bad" annotations

Processing in transform-gisaid is represented as a pipeline of steps.  Each step either manipulates the data (`Transform`) or filters the data (`Filter`).  At the start of each pipeline is a `DataSource`.

To test this, I processed the same gisaid dataset with both the old script and the new script.  The key differences in the output are:

1. The new script does _not_, by default, sort the sequences before outputting.  It does perform the same deduplication process however.  To sort the sequence data, use the option `--sorted-fasta`.
2. The new script interprets errors in `source-data/gisaid_annotations.tsv` differently than the old script.  It ignores all rows that do not have four columns.  As a result, some annotations are not processed.

Additionally, there is the option `--output-unix-newline`, which forces all output (fasta files and metadata csv files) to use unix newlines.

Fixes #77
@ttung
Copy link
Contributor Author

ttung commented Aug 27, 2020

Once again, thank you so much! My last request -- could you please edit your commit message to reflect reality, now that we are outputting sorted metadata and additional info TSVs?

Updated!

eharkins added a commit that referenced this pull request Aug 28, 2020
these were raising a flag in
#88
@emmahodcroft
Copy link
Member

emmahodcroft commented Aug 28, 2020

I've run this again today against the 'old' pipeline, with Eli's changes to gisaid_annotations to take care of the 'bad' entries and with the update to standardise line endings - and I'm getting totally identical files!

(Worth noting I'm running this locally in Ivan's script which doesn't include Slack or any auto upload-download, but should at least be a good test of transform itself and the things that depend on that output.)

@kairstenfay
Copy link
Contributor

Awesome! Let's plan to deploy this Monday morning. I'll add the bit where we pipe the output warnings to Slack.

@kairstenfay
Copy link
Contributor

Superseded by #90

@kairstenfay
Copy link
Contributor

kairstenfay commented Aug 31, 2020

added --output-unix-newline option which enforces unix newlines for all files (csv files and fasta files).

@ttung thank you so much for adding this feature! In some local testing run by the Nextstrain team, we found that running transform-gisaid without --output-unix-newline on both Mac and Ubuntu produced Windows line endings. We wanted to alert you of this because we believe that was not the intended functionality.

edit Specifically, we saw this in the additional_info.tsv.
I'm pushing up a PR to fix this, and I'd love your review on it.

@ttung ttung mentioned this pull request Sep 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rewrite transform-gisaid to perform a streaming transform
3 participants