wip versioned resources docs
jameshadfield committed Jan 4, 2024
1 parent 87a0c76 commit 5bf6e4b
Showing 5 changed files with 108 additions and 1 deletion.
104 changes: 104 additions & 0 deletions src/guides/versions.rst
@@ -0,0 +1,104 @@
=================================
Viewing previous versions of data
=================================


.. contents:: Sections in this document
   :local:
   :depth: 2

TODO
====
* Change URLs to nextstrain.org once it's in production
* Reduce size of screenshot png
* Fill in TKTK sections

Overview
========

Analyses are a snapshot in time, and for most of our `core Nextstrain datasets
<https://nextstrain.org/pathogens>`__ we update this snapshot frequently, often
even daily. When you view a dataset such as the seasonal influenza build
`flu/seasonal/h3n2/ha/6y <https://dev.nextstrain.org/flu/seasonal/h3n2/ha/6y>`__
you can see in the header of the page that it was updated sometime in the
last week or so. Because we update this dataset every week, we have a large
archive of past versions which we can view. To view the snapshot from midway
through 2023, we could load `flu/seasonal/h3n2/ha/6y@2023-07-01
<https://dev.nextstrain.org/flu/seasonal/h3n2/ha/6y@2023-07-01>`__, which
loads the latest snapshot available on July 1, 2023. In this case that is a
dataset updated on June 30.

In general, appending a ``@YYYY-MM-DD`` string to a Nextstrain core dataset URL
will load the dataset that was the latest available on that date.
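
As a minimal sketch of this URL convention, the Python helper below builds a
versioned URL from a dataset path and a date. The names ``versioned_url`` and
``BASE_URL`` are hypothetical and used only for illustration; they are not part
of any Nextstrain tool.

.. code-block:: python

   from datetime import date

   # Assumption: the development host used in the examples in this guide.
   BASE_URL = "https://dev.nextstrain.org"


   def versioned_url(dataset_path: str, day: date, base_url: str = BASE_URL) -> str:
       """Return the URL for the snapshot of a core dataset available on `day`."""
       path = dataset_path.strip("/")
       # date.isoformat() yields the YYYY-MM-DD form expected after the "@".
       return f"{base_url}/{path}@{day.isoformat()}"


   print(versioned_url("flu/seasonal/h3n2/ha/6y", date(2023, 7, 1)))

Using a ``date`` object rather than a raw string means an impossible date such
as February 30 is rejected before the URL is built.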

.. note::

   This functionality was introduced in 2024 and is currently available only
   for core Nextstrain datasets. There is not yet a way to list or visualise
   all the available datasets, but this is in the works.


Tanglegrams to compare changes
------------------------------

Tanglegrams let us view two different versions of the same
dataset side by side. Using the example above, we can compare the latest dataset
with the one from the middle of 2023 via the URL
`flu/seasonal/h3n2/ha/6y:flu/seasonal/h3n2/ha/6y@2023-07-01
<https://dev.nextstrain.org/flu/seasonal/h3n2/ha/6y:flu/seasonal/h3n2/ha/6y@2023-07-01>`__.
Here is a screenshot taken in early January 2024, which shows the expansion of
clade 2a.3a.1 over the preceding six months:

.. image:: ../images/tanglegram-h3n2.png
   :alt: Tanglegram of flu/seasonal/h3n2/ha/6y:flu/seasonal/h3n2/ha/6y@2023-07-01

Over time, the data shown at this URL will change as we update the dataset,
but by versioning both datasets we can preserve this particular view of the data:
`flu/seasonal/h3n2/ha/6y@2024-01-03:flu/seasonal/h3n2/ha/6y@2023-07-01
<https://dev.nextstrain.org/flu/seasonal/h3n2/ha/6y@2024-01-03:flu/seasonal/h3n2/ha/6y@2023-07-01>`__.
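
The tanglegram URLs in this section follow one pattern: two dataset paths, each
optionally versioned, joined by a colon. The sketch below shows that pattern in
Python. The helpers ``with_version`` and ``tanglegram_url`` are hypothetical
names chosen for illustration.

.. code-block:: python

   from typing import Optional

   # Assumption: the development host used in the examples in this guide.
   BASE_URL = "https://dev.nextstrain.org"


   def with_version(dataset_path: str, version: Optional[str] = None) -> str:
       """Append an optional @YYYY-MM-DD version to a dataset path."""
       path = dataset_path.strip("/")
       return f"{path}@{version}" if version else path


   def tanglegram_url(first: str, second: str, base_url: str = BASE_URL) -> str:
       """Join two (optionally versioned) dataset paths with ':'."""
       return f"{base_url}/{first}:{second}"


   dataset = "flu/seasonal/h3n2/ha/6y"

   # Latest dataset against the mid-2023 snapshot: the left side changes over time.
   drifting = tanglegram_url(with_version(dataset), with_version(dataset, "2023-07-01"))

   # Both sides pinned to a date, so the comparison stays fixed.
   pinned = tanglegram_url(
       with_version(dataset, "2024-01-03"), with_version(dataset, "2023-07-01")
   )

   print(drifting)
   print(pinned)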



Details for dataset maintainers
===============================

This section is more technical and aimed primarily at those managing datasets.


S3 Delete Markers
-----------------
Our core datasets are all stored in a versioned S3 bucket, which is how we are
able to provide this functionality. When files are "deleted" from a versioned
bucket, the normal behaviour is to preserve the file but add a `delete marker
<https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html>`__.
When looking back at versions over time, we interpret the intended behaviour of
a delete marker as removing the then-latest file from history, so that file
won't be available via any ``@YYYY-MM-DD`` value.

.. image:: ../images/delete-markers.png
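
To illustrate this interpretation of delete markers, here is a minimal Python
sketch. It is not the nextstrain.org implementation: the ``Version`` class and
``visible_history`` function are hypothetical, and the handling of consecutive
delete markers is an assumption made for this example.

.. code-block:: python

   from dataclasses import dataclass
   from datetime import datetime, timezone
   from typing import List


   @dataclass
   class Version:
       last_modified: datetime
       is_delete_marker: bool
       version_id: str  # hypothetical identifier, for illustration only


   def visible_history(versions: List[Version]) -> List[Version]:
       """Drop delete markers and the file version each marker "deleted"."""
       kept: List[Version] = []
       previous_was_file = False
       for v in sorted(versions, key=lambda v: v.last_modified):
           if v.is_delete_marker:
               # Assumption: a marker removes only the file version directly
               # before it; a marker following another marker removes nothing.
               if previous_was_file:
                   kept.pop()
               previous_was_file = False
           else:
               kept.append(v)
               previous_was_file = True
       return kept


   def utc(year: int, month: int, day: int) -> datetime:
       return datetime(year, month, day, tzinfo=timezone.utc)


   history = [
       Version(utc(2023, 6, 1), False, "a"),
       Version(utc(2023, 6, 8), False, "b"),
       Version(utc(2023, 6, 9), True, "marker"),
       Version(utc(2023, 6, 15), False, "c"),
   ]

   # Version "b" was the latest file when the marker was added, so it is dropped.
   print([v.version_id for v in visible_history(history)])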


How far back does this go?
--------------------------
Around August 2018, although this depends on the dataset. TKTK.
The earliest version found so far is `flu/seasonal/h3n2/ha/3y@2018-08-01
<https://dev.nextstrain.org/flu/seasonal/h3n2/ha/3y@2018-08-01>`__.

What if the URL changed over time?
----------------------------------

We don't currently track this, but it could be implemented if and when we want to do so. TKTK

SARS-CoV-2 datestamped datasets
-------------------------------

TKTK

Multiple datasets uploaded on the same day
------------------------------------------

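Days are defined in UTC. If several versions of a dataset were uploaded on the
same day, only the latest is used and the earlier ones are ignored. TKTK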
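
The following Python sketch illustrates these two rules: uploads are grouped by
UTC day, only the latest upload per day is kept, and a requested date resolves
to the newest day on or before it. The function names ``latest_per_utc_day``
and ``resolve`` are hypothetical, and the sketch assumes timezone-aware
timestamps. It is not the production logic.

.. code-block:: python

   from datetime import date, datetime, timedelta, timezone
   from typing import Dict, List, Optional


   def latest_per_utc_day(uploads: List[datetime]) -> Dict[date, datetime]:
       """Map each UTC day to the latest upload made on that day."""
       by_day: Dict[date, datetime] = {}
       for ts in uploads:
           # Convert to UTC before taking the calendar day.
           day = ts.astimezone(timezone.utc).date()
           if day not in by_day or ts > by_day[day]:
               by_day[day] = ts
       return by_day


   def resolve(uploads: List[datetime], requested: date) -> Optional[datetime]:
       """Return the upload used for @requested, or None if nothing is old enough."""
       by_day = latest_per_utc_day(uploads)
       candidates = [d for d in by_day if d <= requested]
       return by_day[max(candidates)] if candidates else None


   uploads = [
       datetime(2023, 6, 30, 9, 0, tzinfo=timezone.utc),
       datetime(2023, 6, 30, 17, 30, tzinfo=timezone.utc),
       # 22:00 on July 1 at UTC-5 is 03:00 on July 2 in UTC.
       datetime(2023, 7, 1, 22, 0, tzinfo=timezone(timedelta(hours=-5))),
   ]

   print(resolve(uploads, date(2023, 7, 1)))

In this example the request for July 1 resolves to the 17:30 upload from
June 30: the 09:00 upload from the same day is ignored, and the third upload
falls on July 2 once converted to UTC.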


Sidecar files
-------------
Sidecar files must be uploaded on the same day as the main dataset file. TKTK


Binary file added src/images/delete-markers.png
Binary file added src/images/tanglegram-h3n2.png
1 change: 1 addition & 0 deletions src/index.rst
@@ -70,6 +70,7 @@ team and other Nextstrain users provide assistance. For private inquiries,
Communicating scientific insights <guides/communicate/index>
Managing an installation <guides/manage-installation>
Contributing <guides/contribute/index>
Viewing previous versions <guides/versions>

.. toctree::
:maxdepth: 1
4 changes: 3 additions & 1 deletion src/learn/about.rst
@@ -25,7 +25,9 @@ snapshots of evolving pathogen populations such as `SARS-CoV-2
We use interactive visualizations to enable exploration of curated datasets and
analyses which are continually updated when new genomes are available. This
offers a powerful pathogen surveillance tool to virologists, epidemiologists,
public health officials, and community scientists.
public health officials, and community scientists. In many cases, old versions of
these analyses can be easily accessed; see :doc:`viewing previous versions
</guides/versions>` for more.

.. rubric:: Open-source software
