Text Product Refactoring #113

Open
timtrice opened this issue Mar 24, 2018 · 3 comments
Labels: High Priority, Technical Debt
Milestone: 0.2.1

Comments

timtrice commented Mar 24, 2018

Currently, rrricanes scrapes the National Hurricane Center's front-end website for tropical cyclone advisory data. Because of this setup, users are not able to download a specific advisory or a set of advisories within a given time period, among other limitations.

For example, if I wanted to download only the advisories for Hurricane Harvey in a given 72-hour period, I would not be able to. I would need to access a list of all tropical cyclones for that period, pass the storm's name to another function that would scrape that storm's archive page for the product, and then wait for all text products to be pulled, parsed, and reformatted into a tidy format.

This can be a time-consuming task. It is particularly noticeable when building the monthly releases for rrricanesdata.

The individual text files do exist on the NHC's FTP server. It is assumed these are posted in real time, but this cannot be guaranteed (modified dates appear to match the issue date, but the times are the same for all products, 1900 UTC).

There are two locations for these text products, depending on the storm being accessed. As of this writing (2018-03-24), all storms from 2016 and earlier are in the archive directory (ftp://ftp.nhc.noaa.gov/atcf/archive/; see the MESSAGES subdirectory). This directory does not contain storms for the 2017 season. Those are located in the following directories:

  • Public Advisory (ftp://ftp.nhc.noaa.gov/atcf/pub/)

  • Forecast Advisory (ftp://ftp.nhc.noaa.gov/atcf/mar/)

  • Storm Discussion (ftp://ftp.nhc.noaa.gov/atcf/dis/)

  • Wind Speed Probabilities (ftp://ftp.nhc.noaa.gov/atcf/wndprb/)

  • Tropical Cyclone Update (can't find)

  • GIS (ftp://ftp.nhc.noaa.gov/atcf/gis/)

A list of the "current year's" storms can also be found in the index subdirectory (ftp://ftp.nhc.noaa.gov/atcf/index/).

The most recent position of each storm can be found in the adv subdirectory (ftp://ftp.nhc.noaa.gov/atcf/adv/).
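
For reference, a single text product can be read straight from the FTP server once its path is known. A minimal sketch in R (the file name below is an assumption for illustration, not a verified path):

# Hypothetical example: read one public advisory directly over FTP.
# The exact file name is assumed for illustration only.
adv_url <- "ftp://ftp.nhc.noaa.gov/atcf/pub/al092017.public.025"
adv_text <- readLines(adv_url, warn = FALSE)
head(adv_text)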

I want to make accessing the FTP server the default, with a fallback to the NHC's front-end website. I do not want to create new functions to handle this. So, perhaps add a parameter users can pass if they explicitly want the front-end. Or, hit the FTP site and then, if the product does not exist, revert to the HTML website.
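
A rough sketch of that fallback idea (the helpers `get_product_ftp` and `get_product_html` are hypothetical placeholders, not existing rrricanes functions):

# Hypothetical sketch of FTP-first access with a fallback to the
# front-end website; helper names are placeholders only.
get_product <- function(stormid, product, source = c("ftp", "html")) {
  source <- match.arg(source)
  if (source == "html")
    return(get_product_html(stormid, product))
  tryCatch(
    get_product_ftp(stormid, product),
    error = function(e) {
      warning("Product not found on the FTP server; using the NHC website.")
      get_product_html(stormid, product)
    }
  )
}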

Note: FTP links apparently do not work on GitHub under standard markdown, nor as anchor elements.

@timtrice added the High Priority and Technical Debt labels Mar 24, 2018
@timtrice added this to the 0.2.1 milestone Mar 24, 2018
timtrice commented:

FTP ATCF: ftp://ftp.nhc.noaa.gov/atcf/

ATCF Notice: ftp://ftp.nhc.noaa.gov/atcf/NOTICE

ATCF README: ftp://ftp.nhc.noaa.gov/atcf/README

ATCF TROPICAL CYCLONE DATABASE Manual: https://www.nrlmry.navy.mil/atcf_web/docs/database/new/database.html

timtrice commented:

The FTP server seems very disorganized, and there is a chance the structure may change, which would break any functionality that depends on it.

At this time I'm going to leave the current handling of text products as is (will still clean up code, add comments, etc.).

I will add FTP handling as a new set of functions, perhaps adding "ftp" into the function names. For example, "get_fstadv" would have an FTP counterpart, "get_ftp_fstadv".

This seems to be the best method for now: it ensures previous code works as expected and also gives users a second option to obtain even more data.
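
Roughly along these lines (a hypothetical sketch of the naming idea only; it leans on `get_ftp_storm_data`, which is introduced in the commits below):

# Hypothetical sketch: an FTP counterpart that mirrors get_fstadv.
get_ftp_fstadv <- function(stormid) {
  get_ftp_storm_data(stormid, products = "fstadv")
}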

timtrice added a commit that referenced this issue Dec 24, 2018
Per issues #113, #114 and #115, added some alternate handling of
obtaining storm data.

`get_storm_list` - Retrieves a listing of all cyclones in a master
"database" on the NHC's FTP server. This master database lists all
known storms and includes some INVEST and GENESIS systems though no
advisories are issued for these.

This function should help users quickly find a storm by year, name,
strength and such. It is much faster than the current usage of
`get_storms` and should be the preferred method.

NOTE this data is incomplete; some variables (such as ending datetime)
are NA and other variables are just the status of the system at the
end of its lifespan, not the maximum status achieved. It should only
be used to list known cyclones.

`get_ftp_storm_data` is comparable to `get_storm_data` with the
exception that it does not take a vector of links but, rather, a key
(`stormid`). These are the unique identifiers for every tropical
cyclone.

The function will take the `stormid` and `products`, access the FTP
server and scrape the requested data. It then returns a dataframe.

NOTE one product request should be passed at a time. And, it is
encouraged that one key be passed at a time. Currently, there are no
time restrictions (as exist with `get_storm_data`). This is because
most cyclones will not have more than 80 text statements per
product (the requested limit per the NHC is 80 requests per 10
seconds). This should become a TODO but I'm not sure yet how I want
to handle this.

Function `get_ftp_dirs` is a helper function that will retrieve a
list of contents from an FTP directory.

Documentation has also been added.

Adding a vignette should be another TODO but I will wait until I have
more time to test the functionality and timing aspects.
timtrice commented:

Added some alternate handling of storm data.

get_storm_list - Retrieves a listing of all cyclones in a master "database" on the NHC's FTP server. This master database lists all known storms and includes some INVEST and GENESIS systems though no advisories are issued for these.

This function should help users quickly find a storm by year, name, strength and such. It is much faster than the current usage of get_storms and should be the preferred method.

NOTE this data is incomplete; some variables (such as ending datetime) are NA and other variables are just the status of the system at the end of its lifespan, not the maximum status achieved. It should only be used to list known cyclones.

get_ftp_storm_data is comparable to get_storm_data with the exception that it does not take a vector of links but, rather, a key (stormid). These are the unique identifiers for every tropical cyclone.

The function will take the stormid and products, access the FTP server and scrape the requested data. It then returns a dataframe.

NOTE one product request should be passed at a time. And, it is encouraged that one key be passed at a time. Currently, there are no time restrictions (as exist with get_storm_data). This is because most cyclones will not have more than 80 text statements per product (the requested limit per the NHC is 80 requests per 10 seconds). This should become a TODO but I'm not sure yet how I want to handle this.

Function get_ftp_dirs is a helper function that will retrieve a list of contents from an FTP directory.

Documentation has also been added.

Adding a vignette should be another TODO but I will wait until I have more time to test the functionality and timing aspects.

Examples

#' Load a list of all storms in the ftp's `storm_list` page
storm_list <- get_storm_list()
# Structure of the returned dataframe (dplyr::glimpse output):
Observations: 2,578
Variables: 21
$ STORM_NAME  <chr> "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNN...
$ RE          <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "...
$ X           <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L"...
$ R2          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R3          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R4          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ R5          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ CY          <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6...
$ YYYY        <int> 1851, 1851, 1851, 1851, 1851, 1851, 1852, 1852, 1852, 1852, 1852, 1853, 1853, 1853, 1853, 1853, 1853, 1...
$ TY          <chr> "HU", "HU", "TS", "HU", "TS", "TS", "HU", "HU", "HU", "HU", "HU", "TS", "TS", "HU", "HU", "TS", "HU", "...
$ I           <chr> "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"...
$ YYY1MMDDHH  <dttm> 1851-06-25 00:00:00, 1851-07-05 12:00:00, 1851-07-10 12:00:00, 1851-08-16 00:00:00, 1851-09-13 00:00:0...
$ YYY2MMDDHH  <dttm> 1851-06-28 00:00:00, 1851-07-05 12:00:00, 1851-07-10 12:00:00, 1851-08-27 18:00:00, 1851-09-16 18:00:0...
$ SIZE        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ GENESIS_NUM <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PAR1        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PAR2        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ PRIORITY    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ STORM_STATE <chr> "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARCHIVE", "ARC...
$ WT_NUMBER   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ STORMID     <chr> "AL011851", "AL021851", "AL031851", "AL041851", "AL051851", "AL061851", "AL011852", "AL021852", "AL0318...
# Return a dataframe of all fstadv products issued for the respective cyclones
AL092017 <- get_ftp_storm_data("AL092017", products = "fstadv")
AL142018 <- get_ftp_storm_data("AL142018", products = "fstadv")
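
get_ftp_dirs is not shown above; a minimal sketch of what a directory-listing helper along those lines might look like, assuming RCurl (the options used here are illustrative, not taken from the package source):

# Sketch of an FTP directory-listing helper; assumes RCurl.
library(RCurl)

get_ftp_dirs_sketch <- function(url) {
  # dirlistonly returns file/directory names only; some FTP servers
  # require disabling extended passive mode.
  listing <- getURL(url, dirlistonly = TRUE, ftp.use.epsv = FALSE)
  strsplit(listing, "\r?\n")[[1]]
}

get_ftp_dirs_sketch("ftp://ftp.nhc.noaa.gov/atcf/pub/")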

timtrice added a commit that referenced this issue Dec 24, 2018
timtrice added a commit that referenced this issue Dec 25, 2018
Expanded `get_ftp_storm_data` to read msg.zip files contained in
archives earlier than 1998. This opens up many years prior to 1998.
However, the format of the text products, particularly
the forecast/advisory and probabilities products, changed after 1997,
so the regex must be modified to accommodate older storms.
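Something like the following could be how a msg.zip archive is pulled apart, though the URL and file layout here are assumptions for illustration; rrricanes may handle it differently:

# Hypothetical sketch: download a pre-1998 msg.zip archive and read the
# text products inside it. The URL below is an assumed example path.
zip_url <- "ftp://ftp.nhc.noaa.gov/atcf/archive/MESSAGES/1995/msg.zip"
tmp <- tempfile(fileext = ".zip")
download.file(zip_url, tmp, mode = "wb", quiet = TRUE)
msg_files <- unzip(tmp, list = TRUE)$Name
texts <- lapply(msg_files, function(f) readLines(unz(tmp, f), warn = FALSE))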
timtrice added a commit that referenced this issue Jan 4, 2019
- `filter_products` maps through `links`, matches a pattern and
  returns a list. This is unnecessary. The patterns have been modified
  where needed to character vectors of length 1, and `grep` is used
  within each filter function to return matches.
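For instance, a filter function along the lines described in the commit above could now be as simple as this sketch (the pattern is illustrative, not the package's actual regex):

# Illustrative sketch of a grep-based filter; the pattern is an
# assumption, not the actual regex used in rrricanes.
filter_fstadv <- function(links) {
  grep("fstadv", links, value = TRUE)
}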
timtrice added a commit that referenced this issue Jan 4, 2019
- Core functions to obtain data have been modified to vectorize data
  rather than loop through individually as was done previously. This
  should enhance download speeds significantly.

- Moved all dplyr::progress_estimated calls to `get_url_contents`.
  Here, if the group of links is larger than 80, they are split into
  groups and cycled through with the use of the `download_text`
  function. A 10-second delay is applied if necessary, and a progress
  bar is shown if the option is enabled.

  Otherwise, all links are downloaded ASAP (a rough sketch of this
  chunking follows after this commit message).

- Implemented quality checks on download status within `download_text`.
  If the status returned from a URL is not successful, a warning is
  displayed with the offending link. The bad results are removed from
  the vector and processing continues.

With these changes, the product functions (`fstadv`, `public`, etc.)
must be rewritten, I believe (though I have not tested this). They were
originally written to handle one character vector at a time (each
individual text product). Now they are set up to obtain a long vector
of all text products requested.
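
A simplified sketch of the chunking and rate-limit behavior described above (the function name and the NA handling are illustrative; the actual `download_text` may differ):

# Illustrative sketch: split links into groups of 80 and pause 10 seconds
# between groups, per the NHC's stated limit of 80 requests per 10 seconds.
library(httr)

download_text_sketch <- function(links, chunk_size = 80, delay = 10) {
  groups <- split(links, ceiling(seq_along(links) / chunk_size))
  out <- lapply(seq_along(groups), function(i) {
    if (i > 1) Sys.sleep(delay)  # delay between chunks, not before the first
    vapply(groups[[i]], function(link) {
      resp <- GET(link)
      if (http_error(resp)) {
        warning("Download failed: ", link)
        return(NA_character_)
      }
      content(resp, as = "text", encoding = "UTF-8")
    }, character(1))
  })
  # Drop failed downloads, mirroring the "bad results are removed" behavior.
  res <- unlist(out, use.names = FALSE)
  res[!is.na(res)]
}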