Text Product Refactoring #113
FTP ATCF: ftp://ftp.nhc.noaa.gov/atcf/
ATCF Notice: ftp://ftp.nhc.noaa.gov/atcf/NOTICE
ATCF README: ftp://ftp.nhc.noaa.gov/atcf/README
ATCF TROPICAL CYCLONE DATABASE Manual: https://www.nrlmry.navy.mil/atcf_web/docs/database/new/database.html
The FTP server seems very disorganized and there is a chance the structure may change, which would break any functionality dependent upon it. At this time I'm going to leave the current handling of text products as is (I will still clean up code, add comments, etc.). I will add FTP handling as a new set of functions, perhaps adding "ftp" into the function names. For example, `get_fstadv` would have an FTP counterpart, `get_ftp_fstadv`. This seems to be the best method for now to ensure previous code works as expected while also giving a second option to obtain even more data.
Per issues #113, #114 and #115, added some alternate handling for obtaining storm data.

- `get_storm_list` retrieves a listing of all cyclones from a master "database" on the NHC's FTP server. This master database lists all known storms, including some INVEST and GENESIS systems for which no advisories are issued. The function should help users quickly find a storm by year, name, strength and so on. It is much faster than the current `get_storms` and should be the preferred method. NOTE: this data is incomplete; some variables (such as ending datetime) are NA, and other variables reflect only the status of the system at the end of its lifespan, not the maximum status achieved. It should only be used to list known cyclones.
- `get_ftp_storm_data` is comparable to `get_storm_data`, except that it does not take a vector of links but rather a key (`stormid`), the unique identifier for each tropical cyclone. The function takes the `stormid` and `products`, accesses the FTP server, scrapes the requested data and returns a dataframe. NOTE: one product request should be passed at a time, and it is encouraged that one key be passed at a time. Currently there are no time restrictions (as exist with `get_storm_data`), because most cyclones will not have more than 80 text statements per product (the NHC's requested limit is 80 requests per 10 seconds). Handling this should become a TODO, but I'm not sure yet how I want to approach it.
- `get_ftp_dirs` is a helper function that retrieves a list of contents from an FTP directory.

Documentation has also been added. Adding a vignette should be another TODO, but I will wait until I have more time to test the functionality and timing aspects.
Examples:

```r
# Load a list of all storms in the FTP's `storm_list` page
storm_list <- get_storm_list()

# Return a dataframe of all fstadv products issued for the respective cyclones
AL092017 <- get_ftp_storm_data("AL092017", products = "fstadv")
AL142018 <- get_ftp_storm_data("AL142018", products = "fstadv")
```
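For reference, here is a minimal sketch of what a directory-listing helper like `get_ftp_dirs` might look like, assuming the RCurl package; the package's actual implementation may differ.

```r
# Sketch: list the contents of an FTP directory as a character vector.
# Uses RCurl; the real get_ftp_dirs may be implemented differently.
get_ftp_dirs_sketch <- function(url) {
  listing <- RCurl::getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
  strsplit(listing, "\r?\n")[[1]]
}

# e.g. list the ATCF archive directory
get_ftp_dirs_sketch("ftp://ftp.nhc.noaa.gov/atcf/archive/")
```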
Expanded `get_ftp_storm_data` to read the msg.zip files contained in archives earlier than 1998. This opens up many years prior to 1998. However, the format of the text products, particularly forecast/advisory and probabilities, changed after 1997, so the regex must be modified to accommodate older storms.
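A minimal sketch of how one of these archived msg.zip files might be read with base R; the year directory and file name below are illustrative only and not verified against the actual FTP layout.

```r
# Sketch only: the path and zip file name are illustrative placeholders.
url <- "ftp://ftp.nhc.noaa.gov/atcf/archive/1997/messages/al011997_msg.zip"

# Download the zip to a temporary file, extract it, and read the text products.
zipfile <- tempfile(fileext = ".zip")
download.file(url, zipfile, mode = "wb")          # binary mode for zip archives
exdir <- file.path(tempdir(), "msg")
extracted <- unzip(zipfile, exdir = exdir)        # returns extracted file paths
products <- lapply(extracted, readLines)          # one character vector per product
```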
- `filter_products` maps through `links`, matches a pattern and returns a list. This is unnecessary. Modified the patterns where needed to a character vector of length 1 and now use `grep` within each filter function to return matches (a small sketch follows this list).
- Core functions to obtain data have been modified to vectorize data rather than loop through links individually as was done previously. This should enhance download speeds significantly.
- Moved all `dplyr::progress_estimated` calls to `get_url_contents`. Here, if the group of links is larger than 80, they are split into groups and cycled through with the `download_text` function. A 10-second delay is applied if necessary, and a progress bar is shown if the option is enabled. Otherwise, all links are downloaded immediately (a rough sketch of this batching idea follows this list).
- Implemented quality checks on download status within `download_text`. If the status returned from a URL is not successful, a warning is displayed with the failing link. The bad results are removed from the vector and processing continues.

With these changes, the product functions (`fstadv`, `public`, etc.) must be rewritten, I believe (though I have not tested this). They were originally written to handle one character vector at a time (each individual text product); now they are set up to receive a long vector of all requested text products.
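As an illustration of the filter change above, each filter function can keep a single length-1 pattern and return matches directly with `grep`; the pattern below is a placeholder, not the package's actual regex.

```r
# Sketch: filter product links with a single length-1 pattern and grep().
# The pattern is illustrative only.
filter_fstadv <- function(links) {
  pattern <- "fstadv"
  grep(pattern, links, value = TRUE)
}
```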
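And a rough sketch of the batching idea described for `get_url_contents`, assuming httr for the requests; the function and argument names here are assumptions, not the package's actual interface.

```r
# Sketch: download links in groups of <= 80, pausing 10 seconds between groups
# to respect the NHC's requested rate limit, and dropping any link whose
# request did not return a successful status.
download_in_groups <- function(links, group_size = 80, delay = 10) {
  groups <- split(links, ceiling(seq_along(links) / group_size))
  out <- character(0)
  for (i in seq_along(groups)) {
    responses <- lapply(groups[[i]], httr::GET)
    ok <- vapply(responses, function(r) !httr::http_error(r), logical(1))
    if (any(!ok)) {
      warning("Could not download: ", paste(groups[[i]][!ok], collapse = ", "))
    }
    out <- c(out, vapply(responses[ok], httr::content, character(1),
                         as = "text", encoding = "UTF-8"))
    if (i < length(groups)) Sys.sleep(delay)  # 10-second pause between groups
  }
  out
}
```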
Currently, rrricanes scrapes the National Hurricane Center's front-end website for tropical cyclone advisory data. Because of this setup, users are not able to download a specific advisory or a set of advisories within a given time period, among other limitations.
For example, if I wanted to download only the advisories for Hurricane Harvey in a given 72-hour period, I would not be able to. I would need to access a list of all tropical cyclones for that period, pass the storm's name to another function that would scrape that storm's archive page for the product, and then wait for all text products to be pulled, parsed, and reformatted into a tidy format.
This can be a time-consuming task. It is particularly noticeable when building the monthly releases for rrricanesdata.
The individual text files do exist on the NHC's FTP server. It is assumed these are issued in real time, but this cannot be guaranteed (modified dates appear to match the issue date, but the times are the same for all products at 1900 UTC).
There are two locations for these text products, depending on the storm being accessed. As of this writing (2018-03-24), all storms from 2016 and prior are in the archive directory (ftp://ftp.nhc.noaa.gov/atcf/archive/; see the MESSAGES subdirectory). This directory does not contain storms for the 2017 season. Those are located in the following directories (a simple product-to-directory lookup is sketched below):
Public Advisory (ftp://ftp.nhc.noaa.gov/atcf/pub/)
Forecast Advisory (ftp://ftp.nhc.noaa.gov/atcf/mar/)
Storm Discussion (ftp://ftp.nhc.noaa.gov/atcf/dis/)
Wind Speed Probabilities (ftp://ftp.nhc.noaa.gov/atcf/wndprb/)
Tropical Cyclone Update (can't find)
GIS (ftp://ftp.nhc.noaa.gov/atcf/gis/)
A list of the "current year's" storms can also be found in the index subdirectory (ftp://ftp.nhc.noaa.gov/atcf/index/).
The most recent position of each storm can be found in the adv subdirectory (ftp://ftp.nhc.noaa.gov/atcf/adv/)
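As a sketch, the current-season directories listed above could be kept in a simple lookup; the product keys are assumptions matching rrricanes product names, and archived storms would instead resolve to the archive directory.

```r
# Sketch: map product names to their FTP directories for current-season storms.
# Product keys are assumptions; archived storms live under
# ftp://ftp.nhc.noaa.gov/atcf/archive/ instead.
ftp_product_dirs <- c(
  public = "ftp://ftp.nhc.noaa.gov/atcf/pub/",
  fstadv = "ftp://ftp.nhc.noaa.gov/atcf/mar/",
  discus = "ftp://ftp.nhc.noaa.gov/atcf/dis/",
  wndprb = "ftp://ftp.nhc.noaa.gov/atcf/wndprb/"
)

ftp_product_dirs[["fstadv"]]
#> [1] "ftp://ftp.nhc.noaa.gov/atcf/mar/"
```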
I want to make accessing the FTP server the default, with a fallback to the NHC's front-end website. I do not want to create new functions to handle this. So, perhaps add a parameter users can pass if they explicitly want the front-end. Or, hit the FTP site and then, if the product does not exist, revert to the HTML website.
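A rough sketch of that fallback logic, with hypothetical helper names (`get_ftp_product`, `get_html_product`) standing in for whatever the package actually uses:

```r
# Sketch of the fallback idea. get_ftp_product() and get_html_product() are
# hypothetical helpers representing the FTP and front-end scrapers.
get_product <- function(stormid, product, source = c("ftp", "html")) {
  source <- match.arg(source)
  if (identical(source, "html")) {
    return(get_html_product(stormid, product))  # user explicitly wants the front-end
  }
  # Try the FTP server first; if the product is missing or the request fails,
  # fall back to scraping the HTML website.
  result <- tryCatch(get_ftp_product(stormid, product), error = function(e) NULL)
  if (is.null(result)) result <- get_html_product(stormid, product)
  result
}
```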
Note: FTP links apparently do not work on GitHub, either as standard markdown links or as anchor elements.