Faster transform-gisaid #88

ttung · 2020-08-24T18:34:40Z

Description of proposed changes

Processing in transform-gisaid is represented as a pipeline of steps. Each step either manipulates the data (Transform) or filters the data (Filter). At the start of each pipeline is a DataSource.

To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:

The new script does not, by default, sort the sequences before outputting. It does perform the same deduplication process however. To sort the sequence data, use the option --sorted-fasta.
The new script interprets errors in source-data/gisaid_annotations.tsv differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed.

Additionally, there is the option --output-unix-newline, which forces all output (fasta files and metadata csv files) to use unix newlines.

Related issue(s)

Fixes #77

Testing

To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are:

The new script does not sort the sequences before outputting. It does perform the same deduplication process however.
The new script interprets errors in source-data/gisaid_annotations.tsv differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed.

ttung · 2020-08-24T18:36:04Z

To process the gisaid data from Aug.20 on my laptop, the old script took about 30 minutes and multiple GB of RAM. This new script takes about 60 seconds and about 400MB of RAM.

kairstenfay

This looks great, @ttung ! Thanks so much for all the work you've done here -- this is exactly what we desperately needed in the ncov-ingest pipeline. I just ran this script on a smaller subset of the GISAID data (200 rows), and the output metadata TSV, once sorted, matched the existing transform's output exactly.

I left a couple of minor inline comments and have a couple of higher level suggestions, one which should come as no surprise to you.

Can we output sorted a sorted metadata TSV using the existing sorting logic in transform-gisaid? This is crucial for our internal alerting system which notifies nCoV build maintainers of incoming metadata. I believe we'll also need to sort the additional info TSV file. @trvrb mentioned having a slight preference for the ability to also sort the output sequences FASTA file, but this is not a strict requirement.
Can we send the printed warnings from your new transform-gisaid to Slack? The nCoV build maintainers are keen to know which annotations fail to apply (and, as we discussed previously, are currently working to systematically clean up missing/trailing tabs). I imagine we can easily pipe the stdout output to our notify-slack script in the ingest-gisaid file.

I hope I've addressed everything from our previous discussions on Slack. Please let me know if you have any questions, old or new. Thanks again!

lib/utils/transformpipeline/transforms.py

bin/transform-gisaid

kairstenfay · 2020-08-27T01:14:25Z

2\. The new script interprets errors in `source-data/gisaid_annotations.tsv` differently than the old script.  It ignores all rows that do not have four columns.  As a result, some annotations are not processed.

I believe you've taken the right approach here, skipping invalid annotations and emitting a warning instead. PR #82, once implemented, should contain all the missing/surplus column logic for this repo.

ttung · 2020-08-27T04:48:42Z

Can we output sorted a sorted metadata TSV using the existing sorting logic in transform-gisaid? This is crucial for our internal alerting system which notifies nCoV build maintainers of incoming metadata. I believe we'll also need to sort the additional info TSV file. @trvrb mentioned having a slight preference for the ability to also sort the output sequences FASTA file, but this is not a strict requirement.

Updated. It now sorts metadata.tsv all the time. There is a new --sorted-fasta option that produces a sorted FASTA file, along with a corresponding Scary Message™ that it takes a lot of memory. It is still a lot more efficient than the previous solution, though, taking ~2.7GB of RAM to process a 2.7GB gisaid.ndjson file. This would probably fit in a github action machine for the time being.

Can we send the printed warnings from your new transform-gisaid to Slack? The nCoV build maintainers are keen to know which annotations fail to apply (and, as we discussed previously, are currently working to systematically clean up missing/trailing tabs). I imagine we can easily pipe the stdout output to our notify-slack script in the ingest-gisaid file.

You could tee stdout to a file, and then pipe the warning lines to slack.

ttung · 2020-08-27T04:52:03Z

I believe you've taken the right approach here, skipping invalid annotations and emitting a warning instead. PR #82, once implemented, should contain all the missing/surplus column logic for this repo.

@emmahodcroft suggested that we accept the three column (i.e., missing the trailing tab) rows.

We could:

continue skipping the three column rows.
accept the three column rows silently.
accept the three column rows but warn about them.
accept the three column rows, but warn if they never get applied. (handles the situation where the gisaid_epi_isl is not where it's supposed to be).

lib/utils/transformpipeline/transforms.py

emmahodcroft · 2020-08-27T16:12:27Z

Just as an FYI - I'm running this on WSL and just comparing to a file run with the old script. However, the new script is printing all 'Windows' line endings, the old script has 'Unix' line endings. I figure we probably want it to always use Unix file endings or else (if this is because I'm running on WSL - which seems kinda weird, but ok?) this might throw off any future diffs that we wish to do, if I've run one file and someone else runs another... (also for people using nextmeta)

Edit: actually this is a larger problem than I thought - I think it's making the diffs to detect the changes not work?

emmahodcroft · 2020-08-27T17:07:29Z

Hi @ttung Thanks for the breakdown of the annotations!

I'd definitely go for 3 or 4, I think warning would be a big improvement here, particularly if we can pipe to slack somehow (this is beyond my expertise but maybe they can be also written to a file that gets sent to slack, like the other files). Then we can fix these as they crop up, which will ensure intended behaviour!

At the moment we do process 'blanking' entries (those that are strain EPI_ISL key with no trailing tab). I think as long as we have a warning (for anything with just 3 columns), I don't have too strong feelings whether we also apply this or not - we should be able to fix it so that it is applied if we have a warning.

I was concerned that it might make doing testing hard if there were lots of entries missing a trailing tab, but it seems there aren't too many, so maybe this isn't a concern.

ttung · 2020-08-27T17:07:41Z

Just as an FYI - I'm running this on WSL and just comparing to a file run with the old script. However, the new script is printing all 'Windows' line endings, the old script has 'Unix' line endings. I figure we probably want it to always use Unix file endings or else (if this is because I'm running on WSL - which seems kinda weird, but ok?) this might throw off any future diffs that we wish to do, if I've run one file and someone else runs another... (also for people using nextmeta)

Edit: actually this is a larger problem than I thought - I think it's making the diffs to detect the changes not work?

I think the right solution is to use diff --strip-trailing-cr if you want to compare. The script should generate the correct line endings for the platform it is on.

emmahodcroft · 2020-08-27T17:59:31Z

So, I think it would be better for us to standardise this in the script - we don't want files with different file endings ending up in places like nextmeta from one day to the next, depending on what system the script runs.

@ivan-aksamentov suggests adding a simple line_terminator line would take care of this, so we always have Unix-style endings and can rely on these being consistent. Could we add that? (See commit here )

ttung · 2020-08-27T18:16:25Z

So, I think it would be better for us to standardise this in the script - we don't want files with different file endings ending up in places like nextmeta from one day to the next, depending on what system the script runs.

Would the runtime environment for things like nextmeta really vary day to day? That seems like a brittle setup.

@ivan-aksamentov suggests adding a simple line_terminator line would take care of this, so we always have Unix-style endings and can rely on these being consistent. Could we add that? (See commit here )

My concern for forcing a unix line separator is that someone who opens up the output in their local text editor will find the output unusable. How about we add an option for forcing the line terminator? Ordinary users wouldn't encounter this, but if you had an automated flow for nextmeta, that could run with the option.

emmahodcroft · 2020-08-27T18:50:11Z

They shouldn't but at the moment they do - since we are all running locally while we can't run online like usual! This isn't our usual flow but it's the back up so it's good if it can be consistent too. We do have to fall back on it sometimes. However an option would be fine and within the ncov-ingest script we can call this, anyone could modify their own fork or call the script individually

…

On Thu, 27 Aug 2020, 20:16 Tony Tung, ***@***.***> wrote: So, I think it would be better for us to standardise this in the script - we don't want files with different file endings ending up in places like nextmeta from one day to the next, depending on what system the script runs. Would the runtime environment for things like nextmeta really vary day to day? That seems like a brittle setup. @ivan-aksamentov <https://github.com/ivan-aksamentov> suggests adding a simple line_terminator line would take care of this, so we always have Unix-style endings and can rely on these being consistent. Could we add that? (See commit here ) My concern for forcing a unix line separator is that someone who opens up the output in their local text editor will find the output unusable. How about we add an option for forcing the line terminator? Ordinary users wouldn't encounter this, but if you had an automated flow for nextmeta, that could run with the option. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#88 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADNA54S3R5GU2ZVC6T6ZOJDSC2PITANCNFSM4QJZLHJA> .

emmahodcroft · 2020-08-27T20:04:49Z

With Ivan's change (to ensure consistent file endings) I am doing a check of this PR against yesterday's run with the 'old' script (using same GISAID download and 'comparison' files from the day before) - it's looking really good! Apart from the expected differences due to current different handling of missing 'trailing tabs' (just a few South Africa entries), metadata.tsv, metadata-additions.tsv, metadata-changes.txt, and additional-info-changes.txt are all the same!

I can't unfortunately compare location-hierarchy-additions.tsv as I messed up yesterday's file (yesterday).

However, this is a great outcome for testing (IMO) - looking really promising :)

ttung · 2020-08-27T20:44:11Z

added --output-unix-newline option which enforces unix newlines for all files (csv files and fasta files).

kairstenfay

Once again, thank you so much! My last request -- could you please edit your commit message to reflect reality, now that we are outputting sorted metadata and additional info TSVs?

kairstenfay · 2020-08-27T23:06:58Z

@ nCoV team -- P.S. I also just re-ran the new transform-gisaid script, and I found no differences in the output files between this one and the version on master. I tested with the first 2000 rows of gisaid.ndjson.

edit checklist TODO item for myself.

pipe stdout output to Slack to notify of "bad" annotations

Processing in transform-gisaid is represented as a pipeline of steps. Each step either manipulates the data (`Transform`) or filters the data (`Filter`). At the start of each pipeline is a `DataSource`. To test this, I processed the same gisaid dataset with both the old script and the new script. The key differences in the output are: 1. The new script does _not_, by default, sort the sequences before outputting. It does perform the same deduplication process however. To sort the sequence data, use the option `--sorted-fasta`. 2. The new script interprets errors in `source-data/gisaid_annotations.tsv` differently than the old script. It ignores all rows that do not have four columns. As a result, some annotations are not processed. Additionally, there is the option `--output-unix-newline`, which forces all output (fasta files and metadata csv files) to use unix newlines. Fixes #77

ttung · 2020-08-27T23:16:49Z

Once again, thank you so much! My last request -- could you please edit your commit message to reflect reality, now that we are outputting sorted metadata and additional info TSVs?

Updated!

these were raising a flag in #88

emmahodcroft · 2020-08-28T15:46:13Z

I've run this again today against the 'old' pipeline, with Eli's changes to gisaid_annotations to take care of the 'bad' entries and with the update to standardise line endings - and I'm getting totally identical files!

(Worth noting I'm running this locally in Ivan's script which doesn't include Slack or any auto upload-download, but should at least be a good test of transform itself and the things that depend on that output.)

kairstenfay · 2020-08-28T17:30:29Z

Awesome! Let's plan to deploy this Monday morning. I'll add the bit where we pipe the output warnings to Slack.

kairstenfay · 2020-08-29T00:02:55Z

Superseded by #90

kairstenfay · 2020-08-31T19:20:17Z

added --output-unix-newline option which enforces unix newlines for all files (csv files and fasta files).

@ttung thank you so much for adding this feature! In some local testing run by the Nextstrain team, we found that running transform-gisaid without --output-unix-newline on both Mac and Ubuntu produced Windows line endings. We wanted to alert you of this because we believe that was not the intended functionality.

edit Specifically, we saw this in the additional_info.tsv.
I'm pushing up a PR to fix this, and I'd love your review on it.

kairstenfay requested review from kairstenfay, joverlee521, tsibley and eharkins August 24, 2020 20:25

kairstenfay suggested changes Aug 27, 2020

View reviewed changes

lib/utils/transformpipeline/transforms.py Outdated Show resolved Hide resolved

bin/transform-gisaid Outdated Show resolved Hide resolved

kairstenfay removed the request for review from eharkins August 27, 2020 01:15

ttung requested a review from kairstenfay August 27, 2020 04:48

kairstenfay reviewed Aug 27, 2020

View reviewed changes

lib/utils/transformpipeline/transforms.py Outdated Show resolved Hide resolved

ttung requested a review from kairstenfay August 27, 2020 17:09

kairstenfay reviewed Aug 27, 2020

View reviewed changes

kairstenfay approved these changes Aug 27, 2020

View reviewed changes

eharkins added a commit that referenced this pull request Aug 28, 2020

fix gisaid annotations;

26b66b2

these were raising a flag in #88

kairstenfay mentioned this pull request Aug 28, 2020

Faster transform-gisaid #90

Merged

kairstenfay closed this Aug 29, 2020

ttung mentioned this pull request Sep 17, 2020

output unused annotations #98

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster transform-gisaid #88

Faster transform-gisaid #88

ttung commented Aug 24, 2020 •

edited

Loading

ttung commented Aug 24, 2020

kairstenfay left a comment •

edited

Loading

kairstenfay commented Aug 27, 2020

ttung commented Aug 27, 2020 •

edited

Loading

ttung commented Aug 27, 2020

emmahodcroft commented Aug 27, 2020 •

edited

Loading

emmahodcroft commented Aug 27, 2020

ttung commented Aug 27, 2020

emmahodcroft commented Aug 27, 2020

ttung commented Aug 27, 2020

emmahodcroft commented Aug 27, 2020 via email

emmahodcroft commented Aug 27, 2020 •

edited

Loading

ttung commented Aug 27, 2020

kairstenfay left a comment

kairstenfay commented Aug 27, 2020 •

edited

Loading

ttung commented Aug 27, 2020

emmahodcroft commented Aug 28, 2020 •

edited

Loading

kairstenfay commented Aug 28, 2020

kairstenfay commented Aug 29, 2020

kairstenfay commented Aug 31, 2020 •

edited

Loading

Faster transform-gisaid #88

Faster transform-gisaid #88

Conversation

ttung commented Aug 24, 2020 • edited Loading

Description of proposed changes

Related issue(s)

Testing

ttung commented Aug 24, 2020

kairstenfay left a comment • edited Loading

Choose a reason for hiding this comment

kairstenfay commented Aug 27, 2020

ttung commented Aug 27, 2020 • edited Loading

ttung commented Aug 27, 2020

emmahodcroft commented Aug 27, 2020 • edited Loading

emmahodcroft commented Aug 27, 2020

ttung commented Aug 27, 2020

emmahodcroft commented Aug 27, 2020

ttung commented Aug 27, 2020

emmahodcroft commented Aug 27, 2020 via email

emmahodcroft commented Aug 27, 2020 • edited Loading

ttung commented Aug 27, 2020

kairstenfay left a comment

Choose a reason for hiding this comment

kairstenfay commented Aug 27, 2020 • edited Loading

ttung commented Aug 27, 2020

emmahodcroft commented Aug 28, 2020 • edited Loading

kairstenfay commented Aug 28, 2020

kairstenfay commented Aug 29, 2020

kairstenfay commented Aug 31, 2020 • edited Loading

ttung commented Aug 24, 2020 •

edited

Loading

kairstenfay left a comment •

edited

Loading

ttung commented Aug 27, 2020 •

edited

Loading

emmahodcroft commented Aug 27, 2020 •

edited

Loading

emmahodcroft commented Aug 27, 2020 •

edited

Loading

kairstenfay commented Aug 27, 2020 •

edited

Loading

emmahodcroft commented Aug 28, 2020 •

edited

Loading

kairstenfay commented Aug 31, 2020 •

edited

Loading