Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unit testing, renaming functions, adding docs #124

Merged
merged 99 commits into from
Oct 17, 2024
Merged

Unit testing, renaming functions, adding docs #124

merged 99 commits into from
Oct 17, 2024

Conversation

RayStick
Copy link
Member

@RayStick RayStick commented Sep 13, 2024

Closes #38

Overall description

This PR adds unit testing for this R package.

In order to do this, the main package function(s) needed to be split up into smaller functions, in order to test. When writing tests, it was sometimes obvious that functions should be improved in various ways, and a few sub-optimal coding things were found and corrected.

It also became obvious that there were 2 parts to this package as a whole - the browsing and the mapping. This has been made clearer now, by splitting up two functions, and adding some nice plots.

Changes implemented (by directory)

Parent directory:

  • Updated the .Rbuildignore and .gitignore files to better reflect the package structure
    • there were 3 .gitignore files and I have now combined into 1
  • DESCRIPTION and NAMESPACE reflect changes to the package, specifically to do with how imports and exports are handled
  • README.md reflects the new code changes

R directory:

There are now 4 main functions that a user can interact with:

  1. browseMetadata.R (new!)
  2. mapMetadata.R(previously domain_mapping.R)
  3. mapMetadata_compare_outputs.R (previously compare_sessions.R)
  4. mapMetadata_convert_outputs.R(previously convert_outputs.R)

All of the other files in the R directory are either:

  1. Sub-functions that are called within one of the 4 functions above
  2. Package data

data directory:

The same 2 dataframes were made multiple times across the functions. Therefore, to reduce lines of code, they have now been included in the package data:

  • data/log_Output.rda
  • data/Output.rda

These can now be read into the functions with one line e.g. Output <- get("Output")

inst directory:

Some example inputs and outputs have been moved or added, so they can be referenced in the README, or other documentation throughout the package.

man directory:

All new functions require documentation via .Rd files

tests directory:

All functions require unit tests, written with the testthat package.

Checklist to make it ready for review (for @RayStick):

  • The title of this PR is clear and self-explantory.
  • I added any appropriate labels to this PR.
  • Appropriately handle imports across /R and /test directory
  • Split browseMetadata.R into smaller functions?
  • Ensure all unit tests pass
  • Do user testing to ensure all functions that user will interact with still work as intended - write notes for reviewer as I go
  • Ensure README reflect the new code changes
  • Ensure devtools::check() is still happy

Checklist for reviewers (@Rainiefantasy )

Please feel free comment on my PR while it's a draft and give me feedback on the development!

  • Accept Rachael's apology for how long this PR is - we can have a 1:1 to bring context before you review
  • First read the new README file for an overview of what has changed from user perspective
  • Look at all the changed files (high-level) using the ' Changes implemented (by directory)' section above as a guide
  • Complete user testing of all 4 functions to ensure all functions work as intended and match the README guide (see below)
  • Rachael do final checks of README - after code changes have happened, and check devtools::check() is still happy

Tips for user testing (@Rainiefantasy)

  1. Open up R Studio with nothing in your env etc.
  2. setwd('your-path/test_dir')
  3. remove.packages("browseMetadata") - you may need to specify a path
  4. devtools::install_github("aim-rsf/browseMetadata", ref = 'big-refactor')
  5. library(browseMetadata)

Testing browseMetadata.R

  • First run as the README suggests (https://github.com/aim-rsf/browseMetadata/blob/big-refactor/README.md#browsemetadatar) using package files and no outputdir
  • Continue to use package files, but change the output_dir
  • Then use some different inputs files e.g. ADBE and EDUW datasets
  • For each run above, check there are (1) 3 file outputs (2) they are in the output dir you expect and (3) the html outputs can be opened in the browser and they look sensible

Testing mapMetadata.R

  • First run in demo mode
    • Process 1 table in a run, delete outputs
    • Process 1 table in a run, keep outputs
    • Process 2 tables in a run, check that the COPY function kicks in
  • Then run outside of demo mode (changing the input arguments in various ways to ensure function still works)
  • This is the main function to test (the other 3 are much simpler) so please test it with many variations :)

Testing mapMetadata_convert_outputs.R

  • This function hasn't changed so only a quick test should do the trick:

mapMetadata_convert_outputs(output_csv = 'OUTPUT_xxx.csv', output_dir = /path/test_dir/')

Check that the above call outputs L-OUTPUT_xxx.csv and that any rows that had multiple categorizations have now been split onto their own rows.

Testing mapMetadata_compare_outputs.R

  • This function has only changed a little but a quick test should do the trick
  • Remember you have the files in the inst/inputs folder to point towards as quick inputs for some of the arguments

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 13, 2024
@RayStick RayStick changed the title big refactor - renaming and adding tests Unit testing, renaming functions, adding docs Sep 13, 2024
@RayStick RayStick added the enhancement Feature improvement or addition label Sep 13, 2024
@RayStick RayStick added this to the before rOpenSci milestone Sep 13, 2024
@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Oct 14, 2024

Just a small query:
When I run: browseMetadata(json_file = demo_json_file)
I get:
ℹ Three outputs have been saved to your output directory.
ℹ Open the two html files in your browser for full screen viewing.
$table_fig
$empty_fig
returned. What is the $empty_fig returned for?

@RayStick
Copy link
Member Author

RayStick commented Oct 14, 2024

Just a small query:
When I run: browseMetadata(json_file = demo_json_file)
I get:
ℹ Three outputs have been saved to your output directory.
ℹ Open the two html files in your browser for full screen viewing.
$table_fig
$empty_fig
returned. What is the $empty_fig returned for?

Good question. It is because these are the names of the two plots 'table_fig' and 'empty_fig' in the code. 'table_fig' is the html output that returns the table, and 'empty_fig' is the bar chart that counts how many variables have empty descriptions. I am not sure how to suppress this console output whilst also allowing the figures to be returned in the 'Viewer' tab but there is likely a way, if we wanted to do that. Also, if 'empty_fig' is a confusing name we could call it 'barplot_fig' instead. Perhaps this is even clearer:
$table_html
$barplot_html

@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Oct 14, 2024

browseMetadata()

  1. tested with different directories, valid & invalid and both work as expected
  2. tested with invalid file, threw error, as expected:
    Error in readLines(file, warn = FALSE) : 'con' is not a connection
    mapMetadata()
  3. input wrong format, throws error as expected:
    ✖ Your input is in the wrong format. Reference the allowable list of integers and try again.
  4. input out of range, throws error as expected:
    ✖ One of your inputs is out of range! Reference the allowable list of integers and try again.
  5. Ignore below - solved by doing the following:
1. Open up R Studio with nothing in your env etc.
2. setwd('your-path/test_dir')
3. remove.packages("browseMetadata") - you may need to specify a path
4. devtools::install_github("aim-rsf/browseMetadata", ref = 'big-refactor')
5. library(browseMetadata)

as per your suggestion, and it works now!
Did I miss something, I hope I didn't do something wrong!
When running mapMetadata() I'm getting an error thrown when I process a table (table 4, here):
Optional free text note about this table (or press enter to continue): n ℹ There are 11 data elements (variables) in this table. ℹ 11 left to process ℹ Data element 1 of 11 Error in add_row(): ! New rows can't add columns. ✖ Can't find columns DataElement, DataElement_N, Domain_code, and Notein.data. Run rlang::last_trace() to see where the error occurred.

@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Oct 14, 2024

Continuing testing mapMetadata

  1. out of range values for domain mapping, error as expected:
    ! Formatting is invalid or integer out of range. Provide one integer or a comma seperated list of integers.
  2. Out of range re-categorisation, error as expected:
    ✖ One of your inputs is out of range! Reference the allowable list of integers and try again.
  3. Input in wrong format, error as expected:
    <simpleError in scan(file = "", what = 0): scan() expected 'a real', got 'mapMetadata()'> ✖ Your input is in the wrong format. Reference the allowable list of integers and try again.
  4. Recategorised BIRTH_ORDER variable domain from 7 -> 1 to check if it shows up with new domain code, and it does. :)
image
  1. If processing same table more than once in the same session, it skips duplicates, which is good!
  2. If processing same table in a new session, it recognises the history of the previous one and auto-categorises as expected, which is good :)!

Copying from previous session(s): [1] "OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-14-14-46-03.csv"
And in output, notes column:
COPIED FROM: CHILD

@Rainiefantasy
Copy link
Contributor

Question - running:
demo_json_file <- system.file("inputs/national_community_child_health_database_(ncchd)_20240405T130125.json", package = "browseMetadata")
followed by
browseMetadata(json_file = demo_json_file, output_dir = NULL)
works fine,

When changing demo file:
demo_json_file <- system.file("inputs/**annual_district_birth_extract_(adbe)_20230908T111217.json", package = "browseMetadata**")
followed by this:
browseMetadata(json_file = demo_json_file, output_dir = NULL)
I get this error:
Error in fromJSON(file = json_file) : attempt to set index 1/1 in SET_STRING_ELT In addition: Warning message: In file(con, "r") : file("") only supports open = "w+" and open = "w+b": using the former
Doesn't work? Not sure if I did something wrong 😸

@Rainiefantasy
Copy link
Contributor

The arguments in the documentation is also a bit unclear - i.e. under Arguments for browseMetadata:
The metadata file. This should be a json download from the metadata catalogue. By default, 'data/json_metadata.rda' is used - run '?json_metadata' to see how it was created.
It's unclear if I need to rerun the steps specified in ?json_metadata, as it says by default the rda file is used, but pretty sure it's the json right?
Let me know if that doesn't make sense and I can explain more!

@RayStick
Copy link
Member Author

RayStick commented Oct 15, 2024

Question - running:

Ah, I see the confusion. It should be:
non_demo_json_file <- "inputs/**annual_district_birth_extract_(adbe)_20230908T111217.json".
browseMetadata(json_file = non_demo_json_file, output_dir = NULL)

The 'system.file' syntax is only used for package data. Let me see if it would be simple to change the code so that you don't have to give any inputs when running browseMetadata in demo mode (as like mapMetadata).

The arguments in the documentation is also a bit unclear - i.e. under Arguments for browseMetadata:

Nice catch! This is an error. I copied this over from mapMetadata

Solution

Please see my commit 'make demo run simpler' which should have solved this. You may want to re-installed the package again

@Rainiefantasy
Copy link
Contributor

The 'system.file' syntax is only used for package data. Let me see if it would be simple to change the code so that you don't have to give any inputs when running browseMetadata in demo mode (as like mapMetadata).

Yes this would be great! if it's a default then it would be great to have that as the default parameter so you don't have to specify it :)

@Rainiefantasy
Copy link
Contributor

Please see my commit 'make demo run simpler' which should have solved this. You may want to re-installed the package again

Thanks for fixing :) Pulled changes and reinstalled - I'm going to try rerun now

@Rainiefantasy
Copy link
Contributor

It's worked :-) great!
Ran:
> non_demo_json_file <- "/Users/mmohammad/Documents/GIT/browse-SAIL/inst/inputs/annual_district_birth_extract_(adbe)_20230908T111217.json"
> browseMetadata(json_file = non_demo_json_file, output_dir = NULL)

@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Oct 15, 2024

Maybe this is being picky, but in terms of storage it may be nicer to have the BROWSE_datasetX_V.csv in a more normalised format, i.e. trying to remove duplicate rows if possible. So instead of

  • Empty, Table, N_Variables
  • Yes, Blood_test, 2
  • No, Blood_test, 6

Having something like this:

  • Table, Empty, Total
  • Blood_test, 2, 8

Means there's less data to store/less rows but it's got the same info :)

Not necessary though, so please ignore if it's a bit long to implement!

@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Oct 15, 2024

Continued testing browseMetadata

Then use some different inputs files e.g. ADBE and EDUW datasets
For each run above, check there are (1) 3 file outputs (2) they are in the output dir you expect and (3) the html outputs can be opened in the browser and they look sensible

  1. ADBE dataset, i.e. annual_district_birth_extract_(adbe)_20230908T111217.json
  • three outputs, saved in the directory expected, and html outputs (bar chart and dataset description) rendering correctly in browser. All 3 outputs look good
  1. Education dataset, i.e. education_wales_(eduw)_20230911T163539.json
  • three outputs, saved in directory expected, and html outputs (bar chart and dataset description) rendering correctly in browser.

@Rainiefantasy
Copy link
Contributor

Testing mapMetadata.R

First run in demo mode
Process 1 table in a run, delete outputs
Process 1 table in a run, keep outputs
Process 2 tables in a run, check that the COPY function kicks in
Then run outside of demo mode (changing the input arguments in various ways to ensure function still works)
This is the main function to test (the other 3 are much simpler) so please test it with many variations :)

  1. Demo mode works great!
  • Processed table 3 of 13:
✔ Final categorisations saved in:
OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_REFR_IMM_VAC_2024-10-15-16-52-02.csv
✔ Session log saved in:
LOG_NationalCommunityChildHealthDatabase(NCCHD)_REFR_IMM_VAC_2024-10-15-16-52-02.csv
✔ A summary plot has been saved:
PLOT_NationalCommunityChildHealthDatabase(NCCHD)_REFR_IMM_VAC_2024-10-15-16-52-02.png

outputs look good, then deleted
2. Have kept outputs and reran the function - copy function works as expected :)
3. Changed lookup to only include some data elements, i.e.:

DataElement, DomainLabel, DomainCode
NA,No Match / Unsure,0
AVAIL_FROM_DT,Metadata,1
ALF_E,ID,2
MOTHER_ALF_E,ID,2

and can see that only those matched from lookup are autocategorised:

DataElement Domain_code             Note
1         ALF_E           2 AUTO CATEGORISED
6 AVAIL_FROM_DT           1 AUTO CATEGORISED

ℹ These are the auto categorised data elements. Enter row numbers for those you want to edit:

Output also reflects this:
image

  • Question: lookup table specified overrides any auto categorisation from previous table outputs - is this the behaviour you would want? I.e. would you prefer the lookup table to be prioritised as opposed to user specification for a domain categorisation?

@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Oct 15, 2024

  1. Changing the lookup file to include domains, which aren't included in the domain list file, eg line 4:
Screenshot 2024-10-15 at 17 24 51

However this still renders and you'll see the domain code 111 in the plot, without a key.
Screenshot 2024-10-15 at 17 25 46

Not sure if you want any functionality to catch out if people are using lookup files with erroneous domain codes/that don't exist in the domain list? see what you think! 😸

@Rainiefantasy
Copy link
Contributor

  1. Changing domain list file and look up file and can see the changes for new domain codes.
    lookup file:
Screenshot 2024-10-16 at 12 20 02

domain list file:
Screenshot 2024-10-16 at 12 21 12

plots tab after running:
mapMetadata(look_up_file='.../inst/inputs/look_up_test.csv', domain_file='.../inst/inputs/domain_list_demo.csv',json_file ='.../inst/inputs/national_community_child_health_database_(ncchd)_20240405T130125.json')
Screenshot 2024-10-16 at 12 19 02

ALF_E data element categorised correctly in output:
Screenshot 2024-10-16 at 12 23 43

Plot looks correct too, i.e. a bar for new domain for the right quantity:

PLOT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-17-52

@Rainiefantasy
Copy link
Contributor

Testing mapMetadata_convert_outputs function:

Ran mapMetadata_convert_outputs(output_csv = 'OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-17-52.csv', output_dir = '.../test-browseMetadata/')

Output format:
L-OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-17-52.csv
as expected,

Check that the above call outputs L-OUTPUT_xxx.csv and that any rows that had multiple categorizations have now been split onto their own rows.

Screenshot 2024-10-16 at 12 55 26 looks good as well :)

@Rainiefantasy
Copy link
Contributor

Lastly, testing mapMetadata_compare_outputs:

mapMetadata_compare_outputs(session_dir ='.../test-browseMetadata', session1_base='NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-17-52',session2_base = 'NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-56-01', json_file = '.../inst/inputs/national_community_child_health_database_(ncchd)_20240405T130125.json', domain_file='.../inst/inputs/domain_list_demo.csv')

prints output:
✔ Your concensus categorisations have been saved to CONCENSUS_OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-14-40-27.csv

looks good!

Output file:
Screenshot 2024-10-16 at 14 44 46

Minor points:

  • the timestamp looks like a non-standard format? would it be possible to standardise this? for eg dd/mm/yy hh:mm:ss or YYYY/MM/DD hh:mm:sss.000 , something like that to make dinstinction between the date and time stamp clear?
  • Do we also want a timestamp of when the concensus was taken?
  • minor typo - I think it's consensus not concensus? :)

@Rainiefantasy
Copy link
Contributor

When domains don't match,
Screenshot 2024-10-16 at 14 53 46
Domain code join also looks good, after giving consensus:
Screenshot 2024-10-16 at 14 54 10

I think that's it - all looks good to me, really great work 😸 sorry for the very very long chaotic response, hopefully it's mostly 'thumbs up' and shouldn't require much work!
❤️

@RayStick
Copy link
Member Author

RayStick commented Oct 16, 2024

@Rainiefantasy
Thanks SO much for all your testing. So valuable.
We have chatted on slack about all the above queries, and I have resolved queries/suggested changes.
The dev checks show no warnings or errors 🥳
If you can have one more check of the README (I just updated it) - feedback if there are any final edits, but if not please check 'approve' on your review so I know I can merge :D

Copy link
Contributor

@Rainiefantasy Rainiefantasy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this PR! 😸

@RayStick RayStick merged commit a8b6985 into main Oct 17, 2024
2 checks passed
@RayStick RayStick deleted the big-refactor branch October 17, 2024 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement Feature improvement or addition internal Changes related to GH workflows, actions, apps
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a test for the function with testthat
2 participants