Text Product Refactoring #113

Open
timtrice opened this issue Mar 24, 2018 · 3 comments
Labels: High Priority, Technical Debt
Milestone: 0.2.1

Comments

timtrice commented Mar 24, 2018

Currently, rrricanes scrapes the National Hurricane Center's front-end website for tropical cyclone advisory data. Because of this setup, users are not able to download a specific advisory or a set of advisories within a given time period, among other limitations.

For example, if I wanted to download only the advisories for Hurricane Harvey in a given 72-hour period, I would not be able to. I would need to access a list of all tropical cyclones for that period, pass the storm's name to another function that would scrape that storm's archive page for the product, and then wait for all text products to be pulled, parsed, and reformatted into a tidy format.

This can be a time-consuming task. It is particularly noticeable when building the monthly releases for rrricanesdata.

The individual text files do exist on the NHC's FTP server. It is assumed these are posted in real time, but this cannot be guaranteed (modified dates appear to match the issue date, but the times are the same for all products, 1900 UTC).

There are two locations for these text products, depending on the storm being accessed. As of this writing (2018-03-24), all storms from 2016 and earlier are in the archive directory (ftp://ftp.nhc.noaa.gov/atcf/archive/; see the MESSAGES subdirectory). This directory does not contain storms for the 2017 season. Those are located in the following directories:

  • Public Advisory (ftp://ftp.nhc.noaa.gov/atcf/pub/)

  • Forecast Advisory (ftp://ftp.nhc.noaa.gov/atcf/mar/)

  • Storm Discussion (ftp://ftp.nhc.noaa.gov/atcf/dis/)

  • Wind Speed Probabilities (ftp://ftp.nhc.noaa.gov/atcf/wndprb/)

  • Tropical Cyclone Update (can't find)

  • GIS (ftp://ftp.nhc.noaa.gov/atcf/gis/)

A list of the "current year's" storms can also be found in the index subdirectory (ftp://ftp.nhc.noaa.gov/atcf/index/).

The most recent position of each storm can be found in the adv subdirectory (ftp://ftp.nhc.noaa.gov/atcf/adv/).
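
For reference, a single text product can be read straight from the FTP server once its path is known. A minimal sketch in R (the file name below is an assumption for illustration, not a verified path):

# Hypothetical example: read one public advisory directly over FTP.
# The exact file name is assumed for illustration only.
adv_url <- "ftp://ftp.nhc.noaa.gov/atcf/pub/al092017.public.025"
adv_text <- readLines(adv_url, warn = FALSE)
head(adv_text)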

I want to make accessing the FTP server the default, with a fallback to the NHC's front-end website. I do not want to create new functions to handle this. So, perhaps add a parameter users can pass if they explicitly want the front-end. Or, hit the FTP site and then, if the product does not exist, revert to the HTML website.
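
A rough sketch of that fallback idea (the helpers `get_product_ftp` and `get_product_html` are hypothetical placeholders, not existing rrricanes functions):

# Hypothetical sketch of FTP-first access with a fallback to the
# front-end website; helper names are placeholders only.
get_product <- function(stormid, product, source = c("ftp", "html")) {
  source <- match.arg(source)
  if (source == "html")
    return(get_product_html(stormid, product))
  tryCatch(
    get_product_ftp(stormid, product),
    error = function(e) {
      warning("Product not found on the FTP server; using the NHC website.")
      get_product_html(stormid, product)
    }
  )
}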

Note: FTP links apparently do not work on GitHub under standard markdown, nor as anchor elements.

@timtrice added the High Priority and Technical Debt labels Mar 24, 2018
@timtrice added this to the 0.2.1 milestone Mar 24, 2018
timtrice commented:

FTP ATCF: ftp://ftp.nhc.noaa.gov/atcf/

ATCF Notice: ftp://ftp.nhc.noaa.gov/atcf/NOTICE

ATCF README: ftp://ftp.nhc.noaa.gov/atcf/README

ATCF TROPICAL CYCLONE DATABASE Manual: https://www.nrlmry.navy.mil/atcf_web/docs/database/new/database.html

timtrice commented:

The FTP server seems very disorganized, and there is a chance the structure may change, which would break any functionality that depends on it.

At this time I'm going to leave the current handling of text products as is (will still clean up code, add comments, etc.).

I will add FTP handling as a new set of functions, perhaps adding "ftp" into the function names. For example, "get_fstadv" would have an FTP counterpart, "get_ftp_fstadv".

This seems to be the best method for now: it ensures previous code works as expected and also gives users a second option to obtain even more data.
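
Roughly along these lines (a hypothetical sketch of the naming idea only; it leans on `get_ftp_storm_data`, which is introduced in the commits below):

# Hypothetical sketch: an FTP counterpart that mirrors get_fstadv.
get_ftp_fstadv <- function(stormid) {
  get_ftp_storm_data(stormid, products = "fstadv")
}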

timtrice added a commit that referenced this issue Dec 24, 2018
Per issues #113, #114 and #115, added some alternate handling of
obtaining storm data.

`get_storm_list` - Retrieves a listing of all cyclones in a master
"database" on the NHC's FTP server. This master database lists all
known storms and includes some INVEST and GENESIS systems though no
advisories are issued for these.

This function should help users quickly find a storm by year, name,
strength and such. It is much faster than the current usage of
`get_storms` and should be the preferred method.

NOTE this data is incomplete; some variables (such as ending datetime)
are NA and other variables are just the status of the system at the
end of its lifespan, not the maximum status achieved. It should only
be used to list known cyclones.

`get_ftp_storm_data` is comparable to `get_storm_data` with the
exception that it does not take a vector of links but, rather, a key
(`stormid`). These are the unique identifiers for every tropical
cyclone.

The function will take the `stormid` and `products`, access the FTP
server and scrape the requested data. It then returns a dataframe.

NOTE one product request should be passed at a time. And, it is
encouraged that one key be passed at a time. Currently, there are no
time restrictions (as exist with `get_storm_data`). This is because
most cyclones will not have more than 80 text statements per
product (the requested limit per the NHC is 80 requests per 10
seconds). This should become a TODO but I'm not sure yet how I want
to handle this.

Function `get_ftp_dirs` is a helper function that will retrieve a
list of contents from an FTP directory.

Documentation has also been added.

Adding a vignette should be another TODO but I will wait until I have
more time to test the functionality and timing aspects.
timtrice commented:

Added some alternate handling of storm data.

get_storm_list - Retrieves a listing of all cyclones in a master "database" on the NHC's FTP server. This master database lists all known storms and includes some INVEST and GENESIS systems though no advisories are issued for these.

This function should help users quickly find a storm by year, name, strength and such. It is much faster than the current usage of get_storms and should be the preferred method.

NOTE this data is incomplete; some variables (such as ending datetime) are NA and other variables are just the status of the system at the end of its lifespan, not the maximum status achieved. It should only be used to list known cyclones.

get_ftp_storm_data is comparable to get_storm_data with the exception that it does not take a vector of links but, rather, a key (stormid). These are the unique identifiers for every tropical cyclone.

The function will take the stormid and products, access the FTP server and scrape the requested data. It then returns a dataframe.

NOTE one product request should be passed at a time. And, it is encouraged that one key be passed at a time. Currently, there are no time restrictions (as exist with get_storm_data). This is because most cyclones will not have more than 80 text statements per product (the requested limit per the NHC is 80 requests per 10 seconds). This should become a TODO but I'm not sure yet how I want to handle this.

Function get_ftp_dirs is a helper function that will retrieve a list of contents from an FTP directory.

Documentation has also been added.

Adding a vignette should be another TODO but I will wait until I have more time to test the functionality and timing aspects.

Examples

#' Load a list of all storms in the ftp's `storm_list` page
storm_list <- get_storm_list()
# Structure of the returned dataframe (dplyr::glimpse output):
Observations: 2,578
Variables: 21
$ STORM_NAME  <chr> "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNN...
$ RE          <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "...
$ X           <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L"...
$ R2          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R3          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R4          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R5          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ CY          <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6...
$ YYYY        <int> 1851, 1851, 1851, 1851, 1851, 1851, 1852, 1852, 1852, 1852, 1852, 1853, 1853, 1853, 1853, 1853, 1853, 1...
$ TY          <chr> "HU", "HU", "TS", "HU", "TS", "TS", "HU", "HU", "HU", "HU", "HU", "TS", "TS", "HU", "HU", "TS", "HU", "...
$ I           <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"...
$ YYY1MMDDHH  <dttm> 1851-06-25 00:00:00, 1851-07-05 12:00:00, 1851-07-10 12:00:00, 1851-08-16 00:00:00, 1851-09-13 00:00:0...
$ YYY2MMDDHH  <dttm> 1851-06-28 00:00:00, 1851-07-05 12:00:00, 1851-07-10 12:00:00, 1851-08-27 18:00:00, 1851-09-16 18:00:0...
$ SIZE        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ GENESIS_NUM <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PAR1        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PAR2        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PRIORITY    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ STORM_STATE <chr> "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARC...
$ WT_NUMBER   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ STORMID     <chr> "AL011851", "AL021851", "AL031851", "AL041851", "AL051851", "AL061851", "AL011852", "AL021852", "AL0318...
# Return a dataframe of all fstadv products issued for the respective cyclones
AL092017 <- get_ftp_storm_data("AL092017", products = "fstadv")
AL142018 <- get_ftp_storm_data("AL142018", products = "fstadv")
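
get_ftp_dirs is not shown above; a minimal sketch of what a directory-listing helper along those lines might look like, assuming RCurl (the options used here are illustrative, not taken from the package source):

# Sketch of an FTP directory-listing helper; assumes RCurl.
library(RCurl)

get_ftp_dirs_sketch <- function(url) {
  # dirlistonly returns file/directory names only; some FTP servers
  # require disabling extended passive mode.
  listing <- getURL(url, dirlistonly = TRUE, ftp.use.epsv = FALSE)
  strsplit(listing, "\r?\n")[[1]]
}

get_ftp_dirs_sketch("ftp://ftp.nhc.noaa.gov/atcf/pub/")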

timtrice added a commit that referenced this issue Dec 24, 2018
timtrice added a commit that referenced this issue Dec 25, 2018
Expanded `get_ftp_storm_data` to read msg.zip files contained in
archives earlier than 1998. This opens up many years prior to 1998.
However, the format of the text products, particularly
the forecast/advisory and probabilities products, changed after 1997,
so the regex must be modified to accommodate older storms.
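Something like the following could be how a msg.zip archive is pulled apart, though the URL and file layout here are assumptions for illustration; rrricanes may handle it differently:

# Hypothetical sketch: download a pre-1998 msg.zip archive and read the
# text products inside it. The URL below is an assumed example path.
zip_url <- "ftp://ftp.nhc.noaa.gov/atcf/archive/MESSAGES/1995/msg.zip"
tmp <- tempfile(fileext = ".zip")
download.file(zip_url, tmp, mode = "wb", quiet = TRUE)
msg_files <- unzip(tmp, list = TRUE)$Name
texts <- lapply(msg_files, function(f) readLines(unz(tmp, f), warn = FALSE))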
timtrice added a commit that referenced this issue Jan 4, 2019
- `filter_products` maps through `links`, matches a pattern and
  returns a list. This is unnecessary. The patterns have been modified
  where needed to character vectors of length 1, and `grep` is used
  within each filter function to return matches.
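For instance, a filter function along the lines described in the commit above could now be as simple as this sketch (the pattern is illustrative, not the package's actual regex):

# Illustrative sketch of a grep-based filter; the pattern is an
# assumption, not the actual regex used in rrricanes.
filter_fstadv <- function(links) {
  grep("fstadv", links, value = TRUE)
}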
timtrice added a commit that referenced this issue Jan 4, 2019
- Core functions to obtain data have been modified to vectorize data
  rather than loop through individually as was done previously. This
  should enhance download speeds significantly.

- Moved all dplyr::progress_estimated calls to `get_url_contents`.
  Here, if the group of links is larger than 80, they are split into
  groups and cycled through with the use of the `download_text`
  function. A 10-second delay is applied if necessary, and a progress
  bar is shown if the option is enabled.

  Otherwise, all links are downloaded ASAP (a rough sketch of this
  chunking follows after this commit message).

- Implemented quality checks on download status within `download_text`.
  If the status returned from a URL is not successful, a warning is
  displayed with the offending link. The bad results are removed from
  the vector and processing continues.

With these changes, the product functions (`fstadv`, `public`, etc.)
must be rewritten, I believe (though I have not tested this). They were
originally written to handle one character vector at a time (each
individual text product). Now they are set up to obtain a long vector
of all text products requested.
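
A simplified sketch of the chunking and rate-limit behavior described above (the function name and the NA handling are illustrative; the actual `download_text` may differ):

# Illustrative sketch: split links into groups of 80 and pause 10 seconds
# between groups, per the NHC's stated limit of 80 requests per 10 seconds.
library(httr)

download_text_sketch <- function(links, chunk_size = 80, delay = 10) {
  groups <- split(links, ceiling(seq_along(links) / chunk_size))
  out <- lapply(seq_along(groups), function(i) {
    if (i > 1) Sys.sleep(delay)  # delay between chunks, not before the first
    vapply(groups[[i]], function(link) {
      resp <- GET(link)
      if (http_error(resp)) {
        warning("Download failed: ", link)
        return(NA_character_)
      }
      content(resp, as = "text", encoding = "UTF-8")
    }, character(1))
  })
  # Drop failed downloads, mirroring the "bad results are removed" behavior.
  res <- unlist(out, use.names = FALSE)
  res[!is.na(res)]
}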