records returned with lat lon in varying CRS but CRS or datum/geoid not provided #38

Closed
tphilippi opened this issue Nov 14, 2019 · 3 comments

Comments

@tphilippi

Records returned from calls to ridigbio::idig_search_records() can have lat lon components of geopoint in mixed coordinate systems. However, the geodeticDatum variable is not included in fields = "all", so the only indication of an issue is that the flags list includes "geopoint_datum_error".
What is needed is the value of geodeticDatum, the EPSG code, or the CRS, so that the user can make valid use of the lat lon coordinates.

Simple Example with mix of WGS84/epsg:4326 and NAD27 lat lon:

library(ridigbio)
test <- idig_search_records(rq = list(scientificname = "Rana cascadae",
                            geopoint = list(type="geo_bounding_box", 
                                            top_left=list(lat=40.85288, lon=-121.9347), 
                                            bottom_right=list(lat=-40.14134, lon= -120.8895)
                                     )
                            ),
                    fields = "all")

test[test$uuid %in% c("58999bd0-d35a-4bbf-9695-a51732807867",
                      "74bec6f7-a4d8-4f8a-805a-ed63fa23fd38"),]
# NAD 27
idig_view_records(uuid = "58999bd0-d35a-4bbf-9695-a51732807867")$data
# WGS 84
idig_view_records(uuid = "74bec6f7-a4d8-4f8a-805a-ed63fa23fd38")$data

https://www.idigbio.org/portal/records/58999bd0-d35a-4bbf-9695-a51732807867
https://www.idigbio.org/portal/records/74bec6f7-a4d8-4f8a-805a-ed63fa23fd38

idig_meta_fields() shows that the geodeticDatum in that portal page exists:

    .. ..$ fieldName: chr "data.dwc:geodeticDatum"

but I cannot translate that fieldName to one that works in the fields = parameter for idig_search_records().

Note that this is also an issue for the Darwin Core Archives returned from the portal, where the geodeticDatum field is not returned, and thus the values for idigbio:geoPoint in occurrence.csv cannot be properly interpreted.

So, my ask is:

  1. a note about the lat lon values be added to the man page for idig_search_records()
  2. the geodeticDatum field be included in any download table that includes geoPoint or lat lon.
  3. an example on the man page for either idig_meta_fields() or idig_search_records() on taking a field name and then using it in the fields parameter of idig_search_records()
  4. geodeticDatum be added to the Darwin Core Archive response from https://www.idigbio.org/portal/search. I've posted a comment on the feedback for that portal, and mentioned this issue in that comment.

Also, see ropensci/spocc#223 for a simple use-case where gbif and iDigBio return different lat lon for the same occurrence.

@roncanepa

@tphilippi, thank you very much for bringing this to our attention and for the excellent examples. You are correct that this is less than ideal, and there are a few different things wrapped up in all of this.

Regarding points 1 and 3, I've been wanting to improve the documentation and examples for the ridigbio package/repo and will add your suggestions to the list. I've created a new issue to begin tracking documentation changes and suggestions here: #39. In related news, some of us at iDigBio are in the process of forming a community working group focused on consuming iDigBio and other specimen-based API data via R, and I'm hoping that improvements to the documentation and additional examples will also result from the work done there.

This issue also has two components: the data-quality adjustments that iDigBio makes when ingesting data, and what we make available when searching and downloading. Regarding the former, I haven't been on the project very long, so there may have been discussions that I'm not aware of about adjustments of this type (reprojection, etc.) upon ingest: whether they were considered and then shelved, or have been sitting on a TODO list. That will be a much deeper issue. Regarding the latter, this is thankfully something that's much easier for us to deal with.

One thing that can help with API search requests is that you can specify fields within the raw provider data to be returned by including them in the fields list, like so when using ridigbio (where I've shortened the field list for example purposes):

test <- idig_search_records(rq = list(scientificname = "Rana cascadae",
                            geopoint = list(type="geo_bounding_box", 
                                            top_left=list(lat=40.85288, lon=-121.9347), 
                                            bottom_right=list(lat=-40.14134, lon= -120.8895)
                                     )
                            ),
                    fields = c("uuid", "data.dwc:geodeticDatum"))

This isn't the greatest from a usability standpoint, in that you'd have to then specify every single field that you're interested in, but it at least makes the data available. fields = "all" in ridigbio isn't very clear about what it's actually doing under the hood and unfortunately doesn't follow the principle of least surprise.

I also note that when doing so, we see in the underlying data that the coordinate systems are listed in both abbreviated and expanded form within provider data, e.g., "NAD27" and "North American Datum 1927", but that's a relatively minor issue.

Regarding availability of this information within the downloads: since iDigBio does not apply adjustments, geodeticDatum does not appear in the occurrence.csv file, but it does appear in the occurrence_raw.csv file. Note that this is true whether the search is done via the web portal (using the "mapping" tab to specify the bounding box portion of your example) or by calling the download API directly, as they both use the same underlying code. An example of calling the download API directly:

https://api.idigbio.org/v2/download/?rq={"scientificname":"Rana cascadae","geopoint": {
"type": "geo_bounding_box",
"top_left": {
"lat": 40.85288,
"lon": -121.9347
},
"bottom_right": {
"lat": -40.14134,
"lon": -120.8895
}
}}

I hope that this helps you get the information you need from your search results.

@tphilippi
Author

@roncanepa
Thanks. Your answer gives me what I need for both the live hits to the API via ridigbio and my processing of larger Darwin Core Archive files from the portal.

Your example of exactly how to specify that field name gets me over this hurdle. [Documentation in point 3 above would be great.] I already specify a set of field names because fields = "all" returns only fields with at least 1 non-missing value, so repeated calls with different bounding boxes give dataframes that don't simply rbind.
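For readers who hit the same rbind problem but cannot fix the field list up front, here is a minimal base-R sketch of an alternative: pad each data frame with the columns it lacks before binding. The helper name rbind_fill is hypothetical (it is not part of ridigbio), and the two small data frames stand in for results of separate idig_search_records() calls.

```r
# Hypothetical helper (not part of ridigbio): bind two data frames whose
# column sets differ, filling columns missing from either one with NA.
rbind_fill <- function(a, b) {
  cols <- union(names(a), names(b))
  for (nm in setdiff(cols, names(a))) a[[nm]] <- rep(NA, nrow(a))
  for (nm in setdiff(cols, names(b))) b[[nm]] <- rep(NA, nrow(b))
  rbind(a[cols], b[cols])
}

# Stand-ins for two search results that returned different columns.
res1 <- data.frame(uuid = "x", lat = 1, stringsAsFactors = FALSE)
res2 <- data.frame(uuid = "y", datum = "NAD27", stringsAsFactors = FALSE)
combined <- rbind_fill(res1, res2)
```

Specifying the same explicit fields vector in every call, as described above, avoids the need for this padding step.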

I agree that the different values for the geodeticDatum field are a minor problem. As a DwC term, the recommended controlled value is an EPSG code:
https://terms.tdwg.org/wiki/dwc:geodeticDatum
https://dwc.tdwg.org/terms/#geodeticDatum
but in the real world, older records have what they have.

My opinion is that if new iDigBio records use EPSG, iDigBio need not work to clean those old values: gbif does that when it ingests from iDigBio or the same museums. I am completely satisfied by being able to take my returned iDigBio data, then work through translating the geodeticDatum values to full CRS/EPSG myself. Some values will be ambiguous, and I will need to go back to the contributing dataset and look at the full information if I really need that record. iDigBio has data provenance built-in so I can do that. Exposing geodeticDatum allows me to take care of the easy cases in bulk and only track down a small handful of important, ambiguous records.
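As a rough illustration of handling the easy cases in bulk, the sketch below maps a few common geodeticDatum spellings to EPSG codes in base R. The lookup table and the function name datum_to_epsg are assumptions made for this example, not anything provided by ridigbio or iDigBio; the table is deliberately tiny, and any value it does not recognize comes back as NA so it can be reviewed by hand.

```r
# Hypothetical lookup table: normalized datum string -> EPSG code of the
# corresponding geographic CRS. Extend it as new spellings turn up.
datum_epsg <- c(
  WGS84                   = 4326L,
  WORLDGEODETICSYSTEM1984 = 4326L,
  EPSG4326                = 4326L,
  NAD27                   = 4267L,
  NORTHAMERICANDATUM1927  = 4267L,
  NAD83                   = 4269L,
  NORTHAMERICANDATUM1983  = 4269L
)

# Hypothetical helper: strip punctuation and whitespace, upper-case, then
# look the result up. Unrecognized or missing values return NA.
datum_to_epsg <- function(x) {
  key <- toupper(gsub("[^A-Za-z0-9]", "", as.character(x)))
  unname(datum_epsg[match(key, names(datum_epsg))])
}

datum_to_epsg(c("NAD27", "North American Datum 1927", "WGS 84", "unknown"))
```

Records that come back NA are the ambiguous ones that still need to be traced back to the contributing dataset.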

Your note that occurrence_raw.csv has dwc:geodeticDatum will help me with my DwC-A files from the portal (we have national parks in countries AS, GU, MP, PR, US, VI, and adjacent to CA & MX). At the least, that gives me additional info beyond the parsed flag value "geopoint_datum_error" noting an issue. Again, I won't be able to resolve all coordinates, but I will be able to be confident in a larger fraction of records.

@roncanepa

Glad to hear that this will allow you to proceed. Please let us know if you have any other questions or encounter other problems. I'll also add these documentation improvements to the list.
