Re-work the csv output logs & validate user inputs #104

RayStick · 2024-07-04T08:52:07Z

❗ Sorry, this is a big PR ... not the best practice ...

Closes #102

Package version number has now been included in the csv output logs. This could be important to track which version number created the csv outputs, particularly if the compare_csv_outputs command is being used to compare outputs created from different package versions
When doing this, I realised it made sense to split the csv output log into two files not one - one file is about the session itself (LOG) and the other csv file contains the actually categorisations made by the user (OUTPUT). Previously there was a lot of repetition in the OUTPUT csv file. They can be linked via timestamp column.
Because there are now two csv outputs, the input arguments for compare_csv_outputs had to be adjusted
See the README which reflects these changes to the code

Closes #91

See line 143 of domain_mapping which allows the user to add an optional free text note about this table. They always have the option to skip a table and not process it as well

Closes #97

From working on compare_csv_outputs I don't see an issue with two output files having the same initials and therefore think this issue can be closed - let me know if you spot anything.

Closes #82

This commit 0dc8c36 made changes to the user_categorisation to validate and standardize user inputs
Smaller changes were made to domain_mapping as there is now a 4th argument for user_categorisation

Other changes:

DESCRIPTION file version number brought up to date
Fixed a bug in compare_csv_outputs and validated that all the checks work

Checklist before review:

Please comment on my PR while it's a draft and give me feedback on the development!
I added everything I wanted to add to this PR, please review!
The title of this PR is clear and self-explantory.
I added any appropriate labels to this PR.

Rainiefantasy · 2024-07-08T11:28:15Z

Package version number has now been included in the csv output logs. This could be important to track which version number created the csv outputs, particularly if the compare_csv_outputs command is being used to compare outputs created from different package versions

Is this the browseMetadata column - I can see v 1.0.1 so that's great 😸 !

When doing this, I realised it made sense to split the csv output log into two files not one - one file is about the session itself (LOG) and the other csv file contains the actually categorisations made by the user (OUTPUT). Previously there was a lot of repetition in the OUTPUT csv file. They can be linked via timestamp column.

That's fair enough - the timestamp is in the filename as well so makes sense can use it as an identifier 👍 Can see both files, looks good!

Because there are now two csv outputs, the input arguments for compare_csv_outputs had to be adjusted
See the README which reflects these changes to the code

When trying to run the compare function:
compare_csv_outputs('LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv','OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv','LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv','OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv','Documents/GIT/browse-SAIL/data-raw/national_community_child_health_database_(ncchd)_20240405T130125.json')
I get a prompt for a mismatch and to 'ℹ Provide concensus decision for this DataElement:' but when I enter a number (one of the two domain classifications) it errors with:
Error in user_categorisation(selectTable_df$Label[datavar], selectTable_df$Description[datavar], : argument "domain_code_max" is missing, with no default
Not sure if I'm doing something wrong here?
Other than that, I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?

Rainiefantasy · 2024-07-08T11:43:03Z

See line 143 of domain_mapping which allows the user to add an optional free text note about this table. They always have the option to skip a table and not process it as well

This looks good to me! ✨

From working on compare_csv_outputs I don't see an issue with two output files having the same initials and therefore think this issue can be closed - let me know if you spot anything.

Sounds good! l agree- means the function can be used as a sanity check for a user as well :)

This commit 0dc8c36 made changes to the user_categorisation to validate and standardize user inputs.
Smaller changes were made to domain_mapping as there is now a 4th argument for user_categorisation

Only looked briefly at code here but looks ok at glance 👍

DESCRIPTION file version number brought up to date
Fixed a bug in compare_csv_outputs and validated that all the checks work
I seem to still have issues with that function, can double check I've got the latest version

Rainiefantasy · 2024-07-08T11:54:24Z

Another minor feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, I'm wondering if it's necessary after every classification. Personally I think having just one prompt at the end to check your classifications and if you're happy or want to amend, is enough. See what you think though this is just my experience in using it 😸

RayStick · 2024-07-08T11:58:03Z

Another minor feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, I'm wondering if it's necessary after every classification. Personally I think having just one prompt at the end to check your classifications and if you're happy or want to amend, is enough. See what you think though this is just my experience in using it 😸

I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the amount of prompts given to me as the user annoying XD but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.

RayStick · 2024-07-08T12:46:48Z

@Rainiefantasy your errors with the compare_csv_outputs.R function were due to me not updating the call to user_categorisation function. See line 134 of 9cad2f6 where I have now added 'max(Code$Code)' as an input argument, and this solved the error you got.

adding 'max(Code$Code)' meant I needed to add a new input argument to this function, 'domain_file' (see line 10 and 15) and the validation check can be removed (old lines 30-34) because it is now just assumed the same domain_file was used in each session
reading in the domain_list file allows for the plot to be shown again, like in domain_mapping (lines 83-90)

To answer your question 'I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?'

the LOG_ csv files are read in in order to perform all the validation checks i.e. is it valid for these two csv files to be compared
if all validation checks are passed then the two OUTPUT_ csv files are compared
does that make sense?
I made changes to the input argument structure (see lines 6-10, line 15, 22-27) to make it more intuitive and less repetitive

Finally, I changed the function name to compare_sessions

What do you think? Can you run again with same inputs please and see if it no longer errors? Thanks!

Rainiefantasy

Ah wonderful it works now 😃 ! Worked nicely and I like how you don't have to repeat the filenames for both the log and output 🚀 . Approved for merge! 🤞 (And thank you for explaining the reasoning behind why the files are needed. :)

Rainiefantasy · 2024-07-08T13:17:16Z

I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the amount of prompts given to me as the user annoying XD but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.

Just spotted this, that's no problem - it's mostly a preference thing so I guess it doesn't matter. 😆 But I agree, it's probably because of the testing you notice it more!

RayStick · 2024-07-08T13:18:11Z

Thanks for reviewing this long PR @Rainiefantasy - appreciate it!

RayStick added 4 commits July 3, 2024 16:32

update to latest version

bf64d0b

split outputs & track package version

0b53bbf

change to match 2 csv inputs and fix bugs

e576c1f

update docs

1e2c0e7

github-actions bot assigned RayStick Jul 4, 2024

github-actions bot added the documentation Improvements or additions to documentation label Jul 4, 2024

update to match code changes

3532aa5

RayStick requested a review from Rainiefantasy July 4, 2024 09:16

RayStick added enhancement Feature improvement or addition bug-fix labels Jul 4, 2024

RayStick changed the title ~~Track version~~ Re-work the csv output logs and how they compared Jul 4, 2024

RayStick changed the title ~~Re-work the csv output logs and how they compared~~ Re-work the csv output logs and how they are compared Jul 4, 2024

RayStick marked this pull request as ready for review July 4, 2024 09:21

validate user input

0dc8c36

RayStick changed the title ~~Re-work the csv output logs and how they are compared~~ Re-work the csv output logs & validate user inputs Jul 4, 2024

correct sp mistake & update README

b4c7c12

RayStick added 2 commits July 8, 2024 13:34

correct error & simplify inputs

9cad2f6

rename

7fac9ed

Rainiefantasy approved these changes Jul 8, 2024

View reviewed changes

RayStick merged commit 2f82eac into main Jul 8, 2024
2 checks passed

RayStick deleted the track-version branch July 8, 2024 13:18

RayStick mentioned this pull request Jul 11, 2024

Add in table dependencies (quicker for user and more valid) #105

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-work the csv output logs & validate user inputs #104

Re-work the csv output logs & validate user inputs #104

RayStick commented Jul 4, 2024 •

edited

Loading

Rainiefantasy commented Jul 8, 2024 •

edited

Loading

Rainiefantasy commented Jul 8, 2024

Rainiefantasy commented Jul 8, 2024

RayStick commented Jul 8, 2024 •

edited

Loading

RayStick commented Jul 8, 2024 •

edited

Loading

Rainiefantasy left a comment •

edited

Loading

Rainiefantasy commented Jul 8, 2024 •

edited

Loading

RayStick commented Jul 8, 2024

Re-work the csv output logs & validate user inputs #104

Re-work the csv output logs & validate user inputs #104

Conversation

RayStick commented Jul 4, 2024 • edited Loading

Checklist before review:

Rainiefantasy commented Jul 8, 2024 • edited Loading

Rainiefantasy commented Jul 8, 2024

Rainiefantasy commented Jul 8, 2024

RayStick commented Jul 8, 2024 • edited Loading

RayStick commented Jul 8, 2024 • edited Loading

Rainiefantasy left a comment • edited Loading

Choose a reason for hiding this comment

Rainiefantasy commented Jul 8, 2024 • edited Loading

RayStick commented Jul 8, 2024

RayStick commented Jul 4, 2024 •

edited

Loading

Rainiefantasy commented Jul 8, 2024 •

edited

Loading

RayStick commented Jul 8, 2024 •

edited

Loading

RayStick commented Jul 8, 2024 •

edited

Loading

Rainiefantasy left a comment •

edited

Loading

Rainiefantasy commented Jul 8, 2024 •

edited

Loading