-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Solution for delimiter reading issue (#210) utilizing dynamic checkpoints #211
Conversation
@@ -158,28 +159,30 @@ def keyval_from_csv(conn, script_params): | |||
# Needs omero-py 5.9.1 or later | |||
temp_name = temp_file.name | |||
with open(temp_name, 'rt', encoding='utf-8-sig') as file_handle: | |||
file_length = len(file_handle.read(-1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of reading the whole file to get the length, you can get this from the original_file object above (before you open the file)..
file_length = original_file.size.val
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when I wrote this I was almost sure that a more elegant solution would exist.
Thanks!
if parsing fails at |
The build is failing with flake8 errors:
|
Yes, and no. I think that might be a general discussion. Why not immediatly read 100% and take the safest route? But, if you have strong feelings about this then I will kill the 75% checkpoint. |
Yes, I did quite some testing, primarily with fringe .csvs where lots of cells where missing or other delimiters where intermingled. |
bump |
Re 75% - I just thought that if the last 25% of the file had some useful content that could help with picking a delimiter then it would make sense not to ignore it, since you care more about actually getting the right value at this point (if it's failed on a smaller portion of the file) than you do about speed, since this will only be used at a small portion of occasions? |
Hi @JensWendt - apologies, I should have asked earlier, but could you please fill out a CLA form as described at https://ome-contributing.readthedocs.io/en/latest/cla.html |
Would you be okay with another fourth nested try for reading the whole file? |
here you go: |
Great, thanks for that. This PR will be included in an upcoming OMERO.server release. |
Hello,
as detaild in #210 I implemented 3 dynamic checkpoints (at 25%/50%/75%) instead of fixed 500/1000/2000 characters.
I tried it with a fairly big .csv of 1.5 million characters and it worked as fast as the original solution.
I also tried around a bit with a smaller .csv with a lot of deletions and random other delimiters in between.
It performed reasonably well for such fringe cases, meaning it could not resolve the correct delimiter in 2 of 10 or so cases.
Please take a look and tell me what you think.
Thank you for your time!