Unit testing, renaming functions, adding docs #124

RayStick · 2024-09-13T15:52:53Z

Closes #38

Overall description

This PR adds unit testing for this R package.

To make the code testable, the main package function(s) had to be split into smaller functions. Writing the tests also exposed places where functions could be improved, and a few sub-optimal coding practices were found and corrected.

It also became clear that the package as a whole has 2 parts: the browsing and the mapping. This has been made clearer now, by splitting these into two functions, and adding some nice plots.

Changes implemented (by directory)

Parent directory:

Updated the .Rbuildignore and .gitignore files to better reflect the package structure
- there were 3 .gitignore files, which I have now combined into 1
DESCRIPTION and NAMESPACE reflect changes to the package, specifically to do with how imports and exports are handled
README.md reflects the new code changes

`R` directory:

There are now 4 main functions that a user can interact with:

browseMetadata.R (new!)
mapMetadata.R (previously domain_mapping.R)
mapMetadata_compare_outputs.R (previously compare_sessions.R)
mapMetadata_convert_outputs.R (previously convert_outputs.R)

All of the other files in the R directory are either:

Sub-functions that are called within one of the 4 functions above
Package data

`data` directory:

The same 2 dataframes were made multiple times across the functions. Therefore, to reduce lines of code, they have now been included in the package data:

data/log_Output.rda
data/Output.rda

These can now be read into the functions with one line e.g. Output <- get("Output")
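As a rough illustration of the idea, the sketch below saves an empty data frame to an .rda file and reads it back with `get()`. The column names are an assumption for illustration (the real `data/Output.rda` may hold different columns), and inside an installed package the explicit `load()` step would normally not be needed.

```r
# Hypothetical stand-in for the packaged data frame; the real columns in
# data/Output.rda may differ.
Output <- data.frame(
  DataElement = character(0),
  Domain_code = character(0),
  Note = character(0),
  stringsAsFactors = FALSE
)

# Store it the way package data is stored (an .rda file), then remove it.
rda_path <- file.path(tempdir(), "Output.rda")
save(Output, file = rda_path)
rm(Output)

# Loading the .rda restores the object under its original name ...
load(rda_path)

# ... so a function can pick up a fresh, empty copy with one line.
Output <- get("Output")
str(Output)
```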

`inst` directory:

Some example inputs and outputs have been moved or added, so they can be referenced in the README, or other documentation throughout the package.

`man` directory:

All new functions require documentation via .Rd files

`tests` directory:

All functions require unit tests, written with the testthat package.
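For readers unfamiliar with testthat, a minimal sketch of what one such test could look like follows. The helper `count_empty_descriptions()` is hypothetical and is not one of this package's functions; it only stands in for a small sub-function that is easy to test in isolation.

```r
library(testthat)

# Hypothetical helper, not a function from this package: counts how many
# variable descriptions are missing or contain only whitespace.
count_empty_descriptions <- function(descriptions) {
  sum(is.na(descriptions) | trimws(descriptions) == "")
}

test_that("missing and blank descriptions are counted as empty", {
  descriptions <- c("Date of birth", "", NA, "  ")
  expect_equal(count_empty_descriptions(descriptions), 3)
})

test_that("a fully described table has no empty descriptions", {
  expect_equal(count_empty_descriptions(c("Age", "Sex")), 0)
})
```

Splitting a large function into small helpers like this is what makes each behaviour checkable with a few short expectations.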

Checklist to make it ready for review (for @RayStick):

The title of this PR is clear and self-explanatory.
I added any appropriate labels to this PR.
Appropriately handle imports across /R and /tests directory
Split browseMetadata.R into smaller functions?
Ensure all unit tests pass
Do user testing to ensure all functions that user will interact with still work as intended - write notes for reviewer as I go
Ensure README reflects the new code changes
Ensure devtools::check() is still happy

Checklist for reviewers (@Rainiefantasy )

Please feel free to comment on my PR while it's a draft and give me feedback on the development!

Accept Rachael's apology for how long this PR is - we can have a 1:1 to bring context before you review
First read the new README file for an overview of what has changed from user perspective
Look at all the changed files (high-level) using the ' Changes implemented (by directory)' section above as a guide
Complete user testing of all 4 functions to ensure all functions work as intended and match the README guide (see below)
Rachael do final checks of README - after code changes have happened, and check devtools::check() is still happy

Tips for user testing (@Rainiefantasy)

Open up RStudio with nothing in your env etc.
setwd('your-path/test_dir')
remove.packages("browseMetadata") - you may need to specify a path
devtools::install_github("aim-rsf/browseMetadata", ref = 'big-refactor')
library(browseMetadata)

Testing browseMetadata.R

First run as the README suggests (https://github.com/aim-rsf/browseMetadata/blob/big-refactor/README.md#browsemetadatar) using package files and no outputdir
Continue to use package files, but change the output_dir
Then use some different input files e.g. ADBE and EDUW datasets
For each run above, check there are (1) 3 file outputs (2) they are in the output dir you expect and (3) the html outputs can be opened in the browser and they look sensible

Testing mapMetadata.R

First run in demo mode
- Process 1 table in a run, delete outputs
- Process 1 table in a run, keep outputs
- Process 2 tables in a run, check that the COPY function kicks in
Then run outside of demo mode (changing the input arguments in various ways to ensure function still works)
This is the main function to test (the other 3 are much simpler) so please test it with many variations :)

Testing mapMetadata_convert_outputs.R

This function hasn't changed so only a quick test should do the trick:

mapMetadata_convert_outputs(output_csv = 'OUTPUT_xxx.csv', output_dir = '/path/test_dir/')

Check that the above call outputs L-OUTPUT_xxx.csv and that any rows that had multiple categorizations have now been split onto their own rows.
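To make the expected result concrete, here is a base R sketch of that kind of wide-to-long split. It assumes multiple categorisations are stored as a comma-separated string in a `Domain_code` column; the example rows are made up, and this is not the package's own implementation.

```r
# Hypothetical wide-format output: one row per data element, with several
# domain codes held in a single comma-separated string.
wide <- data.frame(
  DataElement = c("ALF_E", "BIRTH_ORDER"),
  Domain_code = c("2", "1,7"),
  stringsAsFactors = FALSE
)

# Split each Domain_code string into its individual codes.
codes <- strsplit(wide$Domain_code, ",", fixed = TRUE)

# Repeat each data element once per code to build the long format.
long <- data.frame(
  DataElement = rep(wide$DataElement, lengths(codes)),
  Domain_code = as.integer(trimws(unlist(codes))),
  stringsAsFactors = FALSE
)
print(long)
```

When checking the L-OUTPUT file, a data element with two codes should therefore appear on two rows, one per code.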

Testing mapMetadata_compare_outputs.R

This function has only changed a little but a quick test should do the trick
Remember you have the files in the inst/inputs folder to point towards as quick inputs for some of the arguments

Rainiefantasy · 2024-10-14T12:00:11Z

Just a small query:
When I run: browseMetadata(json_file = demo_json_file)
I get:
ℹ Three outputs have been saved to your output directory.
ℹ Open the two html files in your browser for full screen viewing.
$table_fig
$empty_fig
returned. What is the $empty_fig returned for?

RayStick · 2024-10-14T13:11:38Z

Just a small query:
When I run: browseMetadata(json_file = demo_json_file)
I get:
ℹ Three outputs have been saved to your output directory.
ℹ Open the two html files in your browser for full screen viewing.
$table_fig
$empty_fig
returned. What is the $empty_fig returned for?

Good question. It is because these are the names of the two plots 'table_fig' and 'empty_fig' in the code. 'table_fig' is the html output that returns the table, and 'empty_fig' is the bar chart that counts how many variables have empty descriptions. I am not sure how to suppress this console output whilst also allowing the figures to be returned in the 'Viewer' tab but there is likely a way, if we wanted to do that. Also, if 'empty_fig' is a confusing name we could call it 'barplot_fig' instead. Perhaps this is even clearer:
$table_html
$barplot_html
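One possible way to silence the console listing is to return the list with `invisible()`, sketched below. The function and its placeholder strings are hypothetical stand-ins for the real figure objects, and whether the figures would still show in the 'Viewer' tab (they would probably need printing explicitly inside the function) has not been checked here.

```r
# Hypothetical stand-in for a function that returns two figures.
# Placeholder strings are used instead of real plot objects.
browse_sketch <- function() {
  figs <- list(
    table_html = "<table>...</table>",
    barplot_html = "<div>...</div>"
  )
  # invisible() stops R auto-printing the list at the console,
  # while still returning it to callers who assign the result.
  invisible(figs)
}

browse_sketch()          # nothing is printed to the console
figs <- browse_sketch()  # the figures are still available by name
names(figs)
```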

Rainiefantasy · 2024-10-14T13:26:41Z

browseMetadata()

tested with different directories, valid & invalid and both work as expected
tested with invalid file, threw error, as expected:
Error in readLines(file, warn = FALSE) : 'con' is not a connection
mapMetadata()
input wrong format, throws error as expected:
✖ Your input is in the wrong format. Reference the allowable list of integers and try again.
input out of range, throws error as expected:
✖ One of your inputs is out of range! Reference the allowable list of integers and try again.
Ignore below - solved by doing the following:

1. Open up R Studio with nothing in your env etc.
2. setwd('your-path/test_dir')
3. remove.packages("browseMetadata") - you may need to specify a path
4. devtools::install_github("aim-rsf/browseMetadata", ref = 'big-refactor')
5. library(browseMetadata)

as per your suggestion, and it works now!
Did I miss something, I hope I didn't do something wrong!
When running mapMetadata() I'm getting an error thrown when I process a table (table 4, here):
Optional free text note about this table (or press enter to continue): n
ℹ There are 11 data elements (variables) in this table.
ℹ 11 left to process
ℹ Data element 1 of 11
Error in add_row():
! New rows can't add columns.
✖ Can't find columns DataElement, DataElement_N, Domain_code, and Note in .data.
Run rlang::last_trace() to see where the error occurred.

Rainiefantasy · 2024-10-14T13:53:47Z

Continuing testing mapMetadata

out of range values for domain mapping, error as expected:
! Formatting is invalid or integer out of range. Provide one integer or a comma seperated list of integers.
Out of range re-categorisation, error as expected:
✖ One of your inputs is out of range! Reference the allowable list of integers and try again.
Input in wrong format, error as expected:
<simpleError in scan(file = "", what = 0): scan() expected 'a real', got 'mapMetadata()'>
✖ Your input is in the wrong format. Reference the allowable list of integers and try again.
Recategorised BIRTH_ORDER variable domain from 7 -> 1 to check if it shows up with new domain code, and it does. :)

If processing same table more than once in the same session, it skips duplicates, which is good!
If processing same table in a new session, it recognises the history of the previous one and auto-categorises as expected, which is good :)!

Copying from previous session(s): [1] "OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-14-14-46-03.csv"
And in output, notes column:
COPIED FROM: CHILD

Rainiefantasy · 2024-10-14T16:57:17Z

Question - running:
demo_json_file <- system.file("inputs/national_community_child_health_database_(ncchd)_20240405T130125.json", package = "browseMetadata")
followed by
browseMetadata(json_file = demo_json_file, output_dir = NULL)
works fine,

When changing demo file:
demo_json_file <- system.file("inputs/annual_district_birth_extract_(adbe)_20230908T111217.json", package = "browseMetadata")
followed by this:
browseMetadata(json_file = demo_json_file, output_dir = NULL)
I get this error:
Error in fromJSON(file = json_file) : attempt to set index 1/1 in SET_STRING_ELT
In addition: Warning message:
In file(con, "r") : file("") only supports open = "w+" and open = "w+b": using the former
Doesn't work? Not sure if I did something wrong 😸

Rainiefantasy · 2024-10-14T16:59:36Z

The arguments in the documentation are also a bit unclear - i.e. under Arguments for browseMetadata:
The metadata file. This should be a json download from the metadata catalogue. By default, 'data/json_metadata.rda' is used - run '?json_metadata' to see how it was created.
It's unclear if I need to rerun the steps specified in ?json_metadata, as it says by default the rda file is used, but pretty sure it's the json right?
Let me know if that doesn't make sense and I can explain more!

RayStick · 2024-10-15T07:12:30Z

Question - running:

Ah, I see the confusion. It should be:
non_demo_json_file <- "inputs/annual_district_birth_extract_(adbe)_20230908T111217.json"
browseMetadata(json_file = non_demo_json_file, output_dir = NULL)

The 'system.file' syntax is only used for package data. Let me see if it would be simple to change the code so that you don't have to give any inputs when running browseMetadata in demo mode (as with mapMetadata).

The arguments in the documentation are also a bit unclear - i.e. under Arguments for browseMetadata:

Nice catch! This is an error. I copied this over from mapMetadata

Solution

Please see my commit 'make demo run simpler' which should have solved this. You may want to reinstall the package.

Rainiefantasy · 2024-10-15T10:16:36Z

The 'system.file' syntax is only used for package data. Let me see if it would be simple to change the code so that you don't have to give any inputs when running browseMetadata in demo mode (as with mapMetadata).

Yes this would be great! if it's a default then it would be great to have that as the default parameter so you don't have to specify it :)
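A sketch of how such a default could work is below. The helper name `resolve_json_file()` is hypothetical and this is not necessarily how the 'make demo run simpler' commit does it. It relies on the fact that `system.file()` returns an empty string when nothing matches, which is also consistent with the `file("")` warning reported earlier in this thread.

```r
# Hypothetical helper: pick the metadata file, falling back to the demo
# file shipped with the package when no file is supplied.
resolve_json_file <- function(json_file = NULL) {
  if (is.null(json_file)) {
    # Demo mode: use the JSON file from the installed package.
    json_file <- system.file(
      "inputs/national_community_child_health_database_(ncchd)_20240405T130125.json",
      package = "browseMetadata"
    )
  }
  # system.file() returns "" when nothing matches, so fail early with a
  # clear message instead of passing an empty path on to the JSON reader.
  if (!nzchar(json_file) || !file.exists(json_file)) {
    stop("Cannot find the metadata file: '", json_file, "'", call. = FALSE)
  }
  json_file
}

# Usage with a user-supplied file (a temporary stand-in here).
tmp_json <- tempfile(fileext = ".json")
writeLines("{}", tmp_json)
resolve_json_file(tmp_json)
```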

Rainiefantasy · 2024-10-15T10:17:43Z

Please see my commit 'make demo run simpler' which should have solved this. You may want to reinstall the package.

Thanks for fixing :) Pulled changes and reinstalled - I'm going to try rerunning now

Rainiefantasy · 2024-10-15T10:22:38Z

It's worked :-) great!
Ran:
> non_demo_json_file <- "/Users/mmohammad/Documents/GIT/browse-SAIL/inst/inputs/annual_district_birth_extract_(adbe)_20230908T111217.json"
> browseMetadata(json_file = non_demo_json_file, output_dir = NULL)

Rainiefantasy · 2024-10-15T10:29:26Z

Maybe this is being picky, but in terms of storage it may be nicer to have the BROWSE_datasetX_V.csv in a more normalised format, i.e. trying to remove duplicate rows if possible. So instead of

Empty, Table, N_Variables
Yes, Blood_test, 2
No, Blood_test, 6

Having something like this:

Table, Empty, Total
Blood_test, 2, 8

Means there's less data to store/fewer rows but it's got the same info :)

Not necessary though, so please ignore if it's a bit long to implement!
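For reference, the reshaping suggested above could be sketched in base R as follows, using the Blood_test rows from the example. The column names are taken from the example; everything else is an illustrative assumption rather than package code.

```r
# Current layout from the example above: one row per (Empty, Table) pair.
current <- data.frame(
  Empty = c("Yes", "No"),
  Table = c("Blood_test", "Blood_test"),
  N_Variables = c(2L, 6L),
  stringsAsFactors = FALSE
)

tables <- unique(current$Table)

# Proposed layout: one row per table, holding the count of variables with
# empty descriptions and the total number of variables.
normalised <- data.frame(
  Table = tables,
  Empty = vapply(tables, function(tbl) {
    sum(current$N_Variables[current$Table == tbl & current$Empty == "Yes"])
  }, integer(1), USE.NAMES = FALSE),
  Total = vapply(tables, function(tbl) {
    sum(current$N_Variables[current$Table == tbl])
  }, integer(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(normalised)
```

The non-empty count is still recoverable as `Total - Empty`, so no information is lost.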

Rainiefantasy · 2024-10-15T10:50:16Z

Continued testing browseMetadata

Then use some different inputs files e.g. ADBE and EDUW datasets
For each run above, check there are (1) 3 file outputs (2) they are in the output dir you expect and (3) the html outputs can be opened in the browser and they look sensible

ADBE dataset, i.e. annual_district_birth_extract_(adbe)_20230908T111217.json

three outputs, saved in the directory expected, and html outputs (bar chart and dataset description) rendering correctly in browser. All 3 outputs look good

Education dataset, i.e. education_wales_(eduw)_20230911T163539.json

three outputs, saved in directory expected, and html outputs (bar chart and dataset description) rendering correctly in browser.

Rainiefantasy · 2024-10-15T16:20:40Z

Testing mapMetadata.R

First run in demo mode
Process 1 table in a run, delete outputs
Process 1 table in a run, keep outputs
Process 2 tables in a run, check that the COPY function kicks in
Then run outside of demo mode (changing the input arguments in various ways to ensure function still works)
This is the main function to test (the other 3 are much simpler) so please test it with many variations :)

Demo mode works great!

1. Processed table 3 of 13:

✔ Final categorisations saved in:
OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_REFR_IMM_VAC_2024-10-15-16-52-02.csv
✔ Session log saved in:
LOG_NationalCommunityChildHealthDatabase(NCCHD)_REFR_IMM_VAC_2024-10-15-16-52-02.csv
✔ A summary plot has been saved:
PLOT_NationalCommunityChildHealthDatabase(NCCHD)_REFR_IMM_VAC_2024-10-15-16-52-02.png

outputs look good, then deleted
2. Have kept outputs and reran the function - copy function works as expected :)
3. Changed lookup to only include some data elements, i.e.:

DataElement, DomainLabel, DomainCode
NA,No Match / Unsure,0
AVAIL_FROM_DT,Metadata,1
ALF_E,ID,2
MOTHER_ALF_E,ID,2

and can see that only those matched from lookup are autocategorised:

    DataElement Domain_code             Note
1         ALF_E           2 AUTO CATEGORISED
6 AVAIL_FROM_DT           1 AUTO CATEGORISED

ℹ These are the auto categorised data elements. Enter row numbers for those you want to edit:

Output also reflects this:

Question: the specified lookup table overrides any auto categorisation copied from previous table outputs - is this the behaviour you would want? I.e. would you prefer the lookup table to take priority over the user's earlier domain categorisations?

Rainiefantasy · 2024-10-15T16:26:56Z

Changed the lookup file to include domain codes that aren't in the domain list file, e.g. line 4:

However, this still renders and you'll see the domain code 111 in the plot, without a key.

Not sure if you want any functionality to catch people using lookup files with erroneous domain codes, i.e. codes that don't exist in the domain list? See what you think! 😸
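If such a check were wanted, a minimal base R sketch might look like this. The allowed codes and the lookup rows are hypothetical (the lookup mirrors the earlier example, with one code changed to 111), and the real files may use different column names.

```r
# Hypothetical set of domain codes defined in the domain list file.
allowed_codes <- c(0L, 1L, 2L)

# Hypothetical lookup table; the last row uses a code (111) that is not
# defined in the domain list.
lookup <- data.frame(
  DataElement = c(NA, "AVAIL_FROM_DT", "ALF_E", "MOTHER_ALF_E"),
  DomainCode = c(0L, 1L, 2L, 111L),
  stringsAsFactors = FALSE
)

# Codes used in the lookup but missing from the domain list.
unknown_codes <- setdiff(lookup$DomainCode, allowed_codes)

if (length(unknown_codes) > 0) {
  # Warn rather than stop, so the user can decide how to proceed.
  warning(
    "Lookup file contains domain codes not in the domain list: ",
    paste(unknown_codes, collapse = ", "),
    call. = FALSE
  )
}
```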

Rainiefantasy · 2024-10-16T11:26:56Z

Changing domain list file and look up file and can see the changes for new domain codes.
lookup file:

domain list file:

plots tab after running:
mapMetadata(look_up_file='.../inst/inputs/look_up_test.csv', domain_file='.../inst/inputs/domain_list_demo.csv',json_file ='.../inst/inputs/national_community_child_health_database_(ncchd)_20240405T130125.json')

ALF_E data element categorised correctly in output:

Plot looks correct too, i.e. a bar for new domain for the right quantity:

Rainiefantasy · 2024-10-16T11:55:39Z

Testing mapMetadata_convert_outputs function:

Ran mapMetadata_convert_outputs(output_csv = 'OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-17-52.csv', output_dir = '.../test-browseMetadata/')

Output format:
L-OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-17-52.csv
as expected,

Check that the above call outputs L-OUTPUT_xxx.csv and that any rows that had multiple categorizations have now been split onto their own rows.

looks good as well :)

Rainiefantasy · 2024-10-16T13:48:47Z

Lastly, testing mapMetadata_compare_outputs:

mapMetadata_compare_outputs(session_dir ='.../test-browseMetadata', session1_base='NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-17-52',session2_base = 'NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-12-56-01', json_file = '.../inst/inputs/national_community_child_health_database_(ncchd)_20240405T130125.json', domain_file='.../inst/inputs/domain_list_demo.csv')

prints output:
✔ Your concensus categorisations have been saved to CONCENSUS_OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_CHILD_2024-10-16-14-40-27.csv

looks good!

Output file:

Minor points:

The timestamp looks like a non-standard format. Would it be possible to standardise this, e.g. dd/mm/yy hh:mm:ss or YYYY/MM/DD hh:mm:ss.000, to make the distinction between the date and the time clear?
Do we also want a timestamp of when the concensus was taken?
minor typo - I think it's consensus not concensus? :)
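As one illustration of the timestamp point, the hyphen-only stamp used in the file names can be parsed and re-emitted in a date-then-time layout with base R. The target format and the UTC time zone are assumptions for this sketch, not necessarily what was adopted in the package.

```r
# Timestamp in the style currently used in the output file names.
stamp <- "2024-10-16-12-17-52"

# Parse it, naming each component explicitly. UTC is an assumption here.
parsed <- as.POSIXct(stamp, format = "%Y-%m-%d-%H-%M-%S", tz = "UTC")

# Re-emit it with a space between the date and the time, and colons
# between the time components.
standardised <- format(parsed, "%Y-%m-%d %H:%M:%S")
print(standardised)
#> [1] "2024-10-16 12:17:52"
```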

Rainiefantasy · 2024-10-16T13:55:35Z

When domains don't match,

Domain code join also looks good, after giving consensus:

I think that's it - all looks good to me, really great work 😸 sorry for the very very long chaotic response, hopefully it's mostly 'thumbs up' and shouldn't require much work!
❤️

RayStick · 2024-10-16T16:37:38Z

@Rainiefantasy
Thanks SO much for all your testing. So valuable.
We have chatted on slack about all the above queries, and I have resolved queries/suggested changes.
The dev checks show no warnings or errors 🥳
If you can have one more check of the README (I just updated it) - feedback if there are any final edits, but if not please check 'approve' on your review so I know I can merge :D

Rainiefantasy

Approving this PR! 😸

big refactor - renaming and adding tests

d6da8f2

github-actions bot assigned RayStick Sep 13, 2024

github-actions bot added the documentation label Sep 13, 2024

RayStick added 5 commits September 13, 2024 16:56

add more example files

89d33f6

shorten

5b91a72

correct

c6f6b54

simplify

442c344

shorten!

9ffb2aa

RayStick changed the title ~~big refactor - renaming and adding tests~~ Unit testing, renaming functions, adding docs Sep 13, 2024

RayStick added the enhancement label Sep 13, 2024

RayStick added this to the before rOpenSci milestone Sep 13, 2024

RayStick requested a review from Rainiefantasy September 13, 2024 16:48

RayStick mentioned this pull request Sep 13, 2024

(OLD) Unit testing, renaming functions, adding docs #121

Closed

RayStick added 17 commits September 16, 2024 10:56

correct

ec49724

check imports

81c7592

check imports

48ae692

check imports

0f9e086

check imports

96621a7

check imports

d886130

check imports

6e8cdc6

check imports

f24820d

check imports

af2d914

check imports

0e7f649

check imports

4b6fa54

check imports

2fe2f3d

check imports

483920a

check imports

c0b8e29

check imports

11cee22

check imports

cc04700

check imports

a18b2a4

responding to MM comment

9ec6b57

RayStick added 2 commits October 14, 2024 14:04

output clarity

4f230cd

clearer syntax

7789b4d

make demo run simpler

ade68c6

RayStick added 4 commits October 16, 2024 15:35

rename plot names for more clarity in console output

db4c509

standardise timestamp within the csv files

7dcc971

correct sp.mistake

30f2cd8

explain function arguments better

e0b9683

Rainiefantasy approved these changes Oct 17, 2024


RayStick merged commit a8b6985 into main Oct 17, 2024
2 checks passed

RayStick deleted the big-refactor branch October 17, 2024 09:58
