Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

31 Adding metadata retrieval to get_record() #37

Conversation

Turtle6665
Copy link
Contributor

@Turtle6665 Turtle6665 commented Feb 9, 2024

Fixes #31.

List of changes

  • All the metadata can be saved by get_records().
  • A get_metadata argument has been added to get_records(). Possible values are id (only include insp_order), and all (include all available metadata). Default is id.
  • The rename_by_id() has been modified to accept field IDs not present in inspection_fields.

A possible improvement I currently see

  • Currently, when using get_record(get_metadata="all"), it's difficult to tell apart metadata and data columns. Saving metadata in a totally different way could be the way to go, but I don't know how it could be clear and efficient at the same time.

Turtle6665 and others added 3 commits February 9, 2024 17:48
Linked to inbo#31:

A small tweek to get `object_id`, `inspection_id` and `insp_order` when using `get_records()`. They are saved alongside data in the output.

The `rename_by_id()` has been slightly modified to accept id's not present in inspection_fields.
Linked to inbo#31

All the metadata can be saved by `get_records()`.
A get_metadata flag has been added to `get_records()`.
@PietrH PietrH added the enhancement New feature or request label Feb 12, 2024
@PietrH PietrH added this to the v0.1 milestone Feb 12, 2024
PietrH
PietrH previously requested changes Feb 12, 2024
Copy link
Member

@PietrH PietrH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having a go at this, some notes:

  • get_metadata is a verb, I don't like using verbs for function arguments but like to reserve them for functions. In this case, I think it's justified, but maybe we should go for: include_metadata or show_metadata instead?

  • I know I'm not the best example for this package, but we ought to cover changes with new tests. Let's include some in this PR so we know we aren't breaking anything. I'll help out. It's a bit silly for me to mention this since I haven't done it myself here, but I think this is where we should draw the line. Especially if you we are going to slow down development soon. No worries if this is completely new to you, either just experiment and we'll go trough it together, or ask me for some examples. https://testthat.r-lib.org/

Can you add tests for:

get_record():

  • The new argument you added
  • The new assertation you added

rename_by_id():

  • new functionality

If you have multiple outcomes, test both, so one for argument TRUE one for argument FALSE. If you expect an error, it's alright to cover the error only, but it's best practise to match it as exactly as possible (and not just any error).

I'll add some more tests, either in this PR or in a separate one. Depending on how quick you are/your preference!

  • Before submitting for review, please run R CMD CHECK locally (check under the build menu in RStudio or via devtools). The CI will catch it, but it's neater if your PR doesn't immediately fail. I've generated the documentation and now it's passing.

  • There are some stylistic things, are you using a linter? Consider lintr! Have a look at style.tidyverse.org, no biggie. Usually in review I'll just fix style issues for the contributor, but in this case I'll leave the choice up to you. Let me know if you'd like me to have a go or would like to have a look at it yourself.

Well done! Really appreciate it. 👍

@PietrH
Copy link
Member

PietrH commented Feb 12, 2024

* Currently, when using `get_record(get_metadata=TRUE)`, it's difficult to tell apart metadata and data columns. Saving metadata in a totally different way could be the way to go, but I don't know how it could be clear and efficient at the same time.

I tend to agree. We have to keep the user of the package in mind. In this case, you'd be the user so with the optional flag, I think we'd be fine leaving it in the data. However, a solution I see is to return an object that has the metadata as a child or attribute. rgbif works this way for some function returns. If we go this route, I'd love to use this as an excuse to give S7 a go: https://rconsortium.github.io/S7/articles/S7.html

Also, it's a better design pattern to use an enum even if there only two choices: https://design.tidyverse.org/boolean-strategies.html, are you open to giving this pattern a try?

@PietrH
Copy link
Member

PietrH commented Feb 12, 2024

Just realised why we don't have any tests, because of the risk of exposing API keys. The solution to this is vcr as is used in rgbif.

https://books.ropensci.org/http-testing/ for more information.

With this in mind it might be better to just keep all the testing in a separate PR that we can work together on but I'll be responsible for.

@Turtle6665
Copy link
Contributor Author

get_metadata is a verb, I don't like using verbs for function arguments but like to reserve them for functions. In this case, I think it's justified, but maybe we should go for: include_metadata or show_metadata instead?

I will then go for include_metadata, to me, it's more aligned with what this argument does (it includes the metadata in the returned data). Except if we use enum (see below).

  • Before submitting for review, please run R CMD CHECK locally (check under the build menu in RStudio or via devtools). The CI will catch it, but it's neater if your PR doesn't immediately fail. I've generated the documentation and now it's passing.
  • There are some stylistic things, are you using a linter? Consider lintr! Have a look at style.tidyverse.org, no biggie. Usually in review I'll just fix style issues for the contributor, but in this case I'll leave the choice up to you. Let me know if you'd like me to have a go or would like to have a look at it yourself.

I had never heard of these, but they are definitely worth the use. Thanks for the tips ! I will use lintr from now on and for this PR.

Also, it's a better design pattern to use an enum even if there only two choices: https://design.tidyverse.org/boolean-strategies.html, are you open to giving this pattern a try?

Yes, I can try it, it should not take too long. The enum to replace the get_metadata could be metadata = c("discard","include"). You may have more favorable options ?

@PietrH
Copy link
Member

PietrH commented Feb 12, 2024

What do you think about none, all, and id ? What should be the default? None? Id? I believe including an id by default is a great idea

@Turtle6665
Copy link
Contributor Author

@PietrH, this is indeed a good idea. For the included id, which of the following id would you consider the best choice to return alongside the data:

  • only getting the id metadata (which is in the form of "inspection_id_insp_order"). This would probably be less confusing for users as when using the get_metedata=all, the id columns whould be the same.
  • the three columns object_id, inspection_id and insp_order. This would not break if, in the future, observations would be saved in multiple objects.
  • only the insp_order as it's what the GUI does when exporting from IAsset as xlsx (the column is renamed as ID) and the one on the GUI map. This is therefore more consistent with the current GUI behavior.

@PietrH
Copy link
Member

PietrH commented Feb 14, 2024

@Turtle6665 I'm having some trouble visualizing the first option. Could you give a (mock) example? Just a few rows of fake data is fine!

@Turtle6665
Copy link
Contributor Author

@PietrH, This would be the first option:
image
The second:
image
and the third:
image
and the third bis (to be more consistent with the GUI export but strange considering the id columns would be different from the one when using all):
image

@PietrH
Copy link
Member

PietrH commented Feb 14, 2024

My vote goes for the third option

inbo#31
- Adding  `id`, `none` and `all` to `get_metadata` argument of `get_records()`.
- Used lintr to spot and change small code formating issues.
Remove warning "missing '.' global variable" by using an anonymous function.

inbo#31
inbo#31
change based on lintr comments.
@Turtle6665
Copy link
Contributor Author

@PietrH, The get_metadata is now an enum with the three options.
By testing a bit, I found out that the due to this line, the number of observations differ in the case with get_metadata="none" (currently 10463) and the other two (currently 10488). As some observations with different insp_orders have the same data (they are duplicates in IAsset), filtering out (with dplyr::distinct()) only based on the data and not with the metadata means we remove more observations than with the other two options.

Removing this row is not the way to go as some observations are in still in double (same data and same metadata).
All those insp_order have at least one double entry in the returndata : "59923" "59942" "59638" "59695" "59752" "59771" "59790" "59809" "59828" "59847" "59866" "59885" "59904"

@PietrH
Copy link
Member

PietrH commented Feb 15, 2024

How about only removing the columns after you've done the distinct?

Are you sure this is an issue? You are more suitable to answer this than me because I don't actually use the iAsset data. But to me it doesn't seem important to get the same number of rows with all metadata options. If you are interested in finding duplicates, you'd want to set metadata appropriately wouldn't you?

On second thought, why are we even offering metadata none, why would you want to get the data without the id? You could still get rid of it yourself if you wanted to anyway.

I propose we get rid of the option none and solve this issue that way.

PS: When you are done working on this, please request my review again and I'll have a more in depth look

@PietrH
Copy link
Member

PietrH commented Feb 23, 2024

I haven't found the time to look at this yet, but it's on my radar.

@SanderDevisscher
Copy link
Collaborator

@PietrH can this PR be approved ?

@PietrH
Copy link
Member

PietrH commented Mar 29, 2024

@SanderDevisscher Can you have a look at the changes starting with 32b7dd0, I trust your judgement. Make sure the CI tests pass.

If nothing else we should setup some unit tests for this package. Without them reviewing is becoming more and more difficult.

Copy link
Collaborator

@SanderDevisscher SanderDevisscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works like I would expect it to

@SanderDevisscher SanderDevisscher dismissed their stale review March 29, 2024 12:28

Requested changes were implemented

@PietrH
Copy link
Member

PietrH commented Mar 29, 2024

Thanks for the review Sander, ok to merge for me. I'll leave the honors to Turtle6665

@SanderDevisscher SanderDevisscher dismissed PietrH’s stale review March 29, 2024 12:29

changes were implemented

@SanderDevisscher SanderDevisscher merged commit decafcc into inbo:main Mar 29, 2024
7 checks passed
@PietrH PietrH removed their request for review March 29, 2024 12:29
@SanderDevisscher
Copy link
Collaborator

@Turtle6665 is no longer active at our team so I did the merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get_record() doesn't give an ID for each observation.
3 participants