Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-work the csv output logs & validate user inputs #104

Merged
merged 9 commits into from
Jul 8, 2024
Merged

Conversation

RayStick
Copy link
Member

@RayStick RayStick commented Jul 4, 2024

❗ Sorry, this is a big PR ... not the best practice ...

Closes #102

  • Package version number has now been included in the csv output logs. This could be important to track which version number created the csv outputs, particularly if the compare_csv_outputs command is being used to compare outputs created from different package versions
  • When doing this, I realised it made sense to split the csv output log into two files not one - one file is about the session itself (LOG) and the other csv file contains the actually categorisations made by the user (OUTPUT). Previously there was a lot of repetition in the OUTPUT csv file. They can be linked via timestamp column.
  • Because there are now two csv outputs, the input arguments for compare_csv_outputs had to be adjusted
  • See the README which reflects these changes to the code

Closes #91

  • See line 143 of domain_mapping which allows the user to add an optional free text note about this table. They always have the option to skip a table and not process it as well

Closes #97

  • From working on compare_csv_outputs I don't see an issue with two output files having the same initials and therefore think this issue can be closed - let me know if you spot anything.

Closes #82

  • This commit 0dc8c36 made changes to the user_categorisation to validate and standardize user inputs
  • Smaller changes were made to domain_mapping as there is now a 4th argument for user_categorisation

Other changes:

  • DESCRIPTION file version number brought up to date
  • Fixed a bug in compare_csv_outputs and validated that all the checks work

Checklist before review:

  • Please comment on my PR while it's a draft and give me feedback on the development!
  • I added everything I wanted to add to this PR, please review!
  • The title of this PR is clear and self-explantory.
  • I added any appropriate labels to this PR.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 4, 2024
@RayStick RayStick requested a review from Rainiefantasy July 4, 2024 09:16
@RayStick RayStick added enhancement Feature improvement or addition bug-fix labels Jul 4, 2024
@RayStick RayStick changed the title Track version Re-work the csv output logs and how they compared Jul 4, 2024
@RayStick RayStick changed the title Re-work the csv output logs and how they compared Re-work the csv output logs and how they are compared Jul 4, 2024
@RayStick RayStick marked this pull request as ready for review July 4, 2024 09:21
@RayStick RayStick changed the title Re-work the csv output logs and how they are compared Re-work the csv output logs & validate user inputs Jul 4, 2024
@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Jul 8, 2024

Package version number has now been included in the csv output logs. This could be important to track which version number created the csv outputs, particularly if the compare_csv_outputs command is being used to compare outputs created from different package versions

Is this the browseMetadata column - I can see v 1.0.1 so that's great 😸 !

When doing this, I realised it made sense to split the csv output log into two files not one - one file is about the session itself (LOG) and the other csv file contains the actually categorisations made by the user (OUTPUT). Previously there was a lot of repetition in the OUTPUT csv file. They can be linked via timestamp column.

That's fair enough - the timestamp is in the filename as well so makes sense can use it as an identifier 👍 Can see both files, looks good!

Because there are now two csv outputs, the input arguments for compare_csv_outputs had to be adjusted
See the README which reflects these changes to the code

When trying to run the compare function:
compare_csv_outputs('LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv','OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-05_16-07-38.599493.csv','LOG_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv','OUTPUT_NationalCommunityChildHealthDatabase(NCCHD)_BLOOD_TEST_2024-07-08_12-03-30.429336.csv','Documents/GIT/browse-SAIL/data-raw/national_community_child_health_database_(ncchd)_20240405T130125.json')
I get a prompt for a mismatch and to 'ℹ Provide concensus decision for this DataElement:' but when I enter a number (one of the two domain classifications) it errors with:
Error in user_categorisation(selectTable_df$Label[datavar], selectTable_df$Description[datavar], : argument "domain_code_max" is missing, with no default
Not sure if I'm doing something wrong here?
Other than that, I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?

@Rainiefantasy
Copy link
Contributor

See line 143 of domain_mapping which allows the user to add an optional free text note about this table. They always have the option to skip a table and not process it as well

This looks good to me! ✨

From working on compare_csv_outputs I don't see an issue with two output files having the same initials and therefore think this issue can be closed - let me know if you spot anything.

Sounds good! l agree- means the function can be used as a sanity check for a user as well :)

This commit 0dc8c36 made changes to the user_categorisation to validate and standardize user inputs.
Smaller changes were made to domain_mapping as there is now a 4th argument for user_categorisation

Only looked briefly at code here but looks ok at glance 👍

DESCRIPTION file version number brought up to date
Fixed a bug in compare_csv_outputs and validated that all the checks work
I seem to still have issues with that function, can double check I've got the latest version

@Rainiefantasy
Copy link
Contributor

Another minor feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, I'm wondering if it's necessary after every classification. Personally I think having just one prompt at the end to check your classifications and if you're happy or want to amend, is enough. See what you think though this is just my experience in using it 😸

@RayStick
Copy link
Member Author

RayStick commented Jul 8, 2024

Another minor feedback - I may have suggested it'd be good to have a choice to amend classifications in case of a mistake, but the more I'm using it, I'm wondering if it's necessary after every classification. Personally I think having just one prompt at the end to check your classifications and if you're happy or want to amend, is enough. See what you think though this is just my experience in using it 😸

I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the amount of prompts given to me as the user annoying XD but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.

@RayStick
Copy link
Member Author

RayStick commented Jul 8, 2024

@Rainiefantasy your errors with the compare_csv_outputs.R function were due to me not updating the call to user_categorisation function. See line 134 of 9cad2f6 where I have now added 'max(Code$Code)' as an input argument, and this solved the error you got.

  • adding 'max(Code$Code)' meant I needed to add a new input argument to this function, 'domain_file' (see line 10 and 15) and the validation check can be removed (old lines 30-34) because it is now just assumed the same domain_file was used in each session
  • reading in the domain_list file allows for the plot to be shown again, like in domain_mapping (lines 83-90)

To answer your question 'I'm not sure why you need all four files as opposed to just having the two classification files as you are comparing the output - was there a reason for this?'

  • the LOG_ csv files are read in in order to perform all the validation checks i.e. is it valid for these two csv files to be compared
  • if all validation checks are passed then the two OUTPUT_ csv files are compared
  • does that make sense?
  • I made changes to the input argument structure (see lines 6-10, line 15, 22-27) to make it more intuitive and less repetitive

Finally, I changed the function name to compare_sessions

What do you think? Can you run again with same inputs please and see if it no longer errors? Thanks!

Copy link
Contributor

@Rainiefantasy Rainiefantasy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah wonderful it works now 😃 ! Worked nicely and I like how you don't have to repeat the filenames for both the log and output 🚀 . Approved for merge! 🤞 (And thank you for explaining the reasoning behind why the files are needed. :)

@Rainiefantasy
Copy link
Contributor

Rainiefantasy commented Jul 8, 2024

I wondered if it was necessary as well, but actually both Dan and Nida suggested this feature would be useful. I think when I am in 'test mode' I find the amount of prompts given to me as the user annoying XD but that's because I am doing it quickly. Going through it properly with a real use-case, I can see the extra functionality to correct things can be good, so I'll leave this functionality in for now.

Just spotted this, that's no problem - it's mostly a preference thing so I guess it doesn't matter. 😆 But I agree, it's probably because of the testing you notice it more!

@RayStick
Copy link
Member Author

RayStick commented Jul 8, 2024

Thanks for reviewing this long PR @Rainiefantasy - appreciate it!

@RayStick RayStick merged commit 2f82eac into main Jul 8, 2024
2 checks passed
@RayStick RayStick deleted the track-version branch July 8, 2024 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-fix documentation Improvements or additions to documentation enhancement Feature improvement or addition
Projects
None yet
2 participants