
Solution for delimiter reading issue (#210) utilizing dynamic checkpoints #211

Merged
2 commits merged into ome:develop on Sep 29, 2023

Conversation

JensWendt
Contributor

Hello,

As detailed in #210, I implemented 3 dynamic checkpoints (at 25%/50%/75% of the file length) instead of the fixed 500/1000/2000 characters.

I tried it with a fairly big .csv of 1.5 million characters and it worked as fast as the original solution.

I also experimented with a smaller .csv containing many deleted cells and random other delimiters mixed in.
It performed reasonably well on such fringe cases, failing to resolve the correct delimiter in only about 2 of 10 cases.
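For reference, the dynamic-checkpoint idea can be sketched roughly like this. This is a minimal standalone sketch using Python's `csv.Sniffer`; the function and variable names are illustrative, not the script's actual code:

```python
import csv

def sniff_delimiter(data):
    """Guess the delimiter of CSV content `data`, sampling at 25%,
    50% and 75% of the file length instead of fixed 500/1000/2000
    character cut-offs."""
    file_length = len(data)
    for fraction in (0.25, 0.50, 0.75):
        sample = data[:int(file_length * fraction)]
        try:
            return csv.Sniffer().sniff(sample).delimiter
        except csv.Error:
            continue  # sample too ambiguous; try a larger one
    # all checkpoints failed
    raise csv.Error("Could not determine delimiter")
```

Because each checkpoint is a fraction of the real file length, the approach scales with file size rather than assuming the first couple of thousand characters are representative.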

Please take a look and tell me what you think.
Thank you for your time!

@@ -158,28 +159,30 @@ def keyval_from_csv(conn, script_params):
# Needs omero-py 5.9.1 or later
temp_name = temp_file.name
with open(temp_name, 'rt', encoding='utf-8-sig') as file_handle:
file_length = len(file_handle.read(-1))
Member

Instead of reading the whole file to get the length, you can get this from the original_file object above (before you open the file).

file_length = original_file.size.val

Contributor Author

Yes, when I wrote this I was almost sure that a more elegant solution would exist.
Thanks!

@will-moore
Member

if parsing fails at 50% of the file, maybe it makes sense to simply read ALL of the file for the final try?
Hopefully this won't be needed very often, but if it is then I expect most users would be willing to wait a tiny bit longer to give them the best chance of correctly reading the file?

@will-moore
Member

The build is failing with flake8 errors:

./omero/annotation_scripts/KeyVal_from_csv.py:167:21: E128 continuation line under-indented for visual indent
./omero/annotation_scripts/KeyVal_from_csv.py:172:80: E501 line too long (81 > 79 characters)
./omero/annotation_scripts/KeyVal_from_csv.py:174:25: E128 continuation line under-indented for visual indent
./omero/annotation_scripts/KeyVal_from_csv.py:179:80: E501 line too long (88 > 79 characters)
./omero/annotation_scripts/KeyVal_from_csv.py:181:25: E128 continuation line under-indented for visual indent

@will-moore
Member

Testing on a small csv file worked fine for me. I'm not going to do extensive testing on lots of csvs, but I can confirm with the test csv (on issue above), the original code doesn't parse the file correctly but this code does.

[Screenshot: 2023-09-07 at 12:16:29]

@JensWendt
Contributor Author

if parsing fails at 50% of the file, maybe it makes sense to simply read ALL of the file for the final try? Hopefully this won't be needed very often, but if it is then I expect most users would be willing to wait a tiny bit longer to give them the best chance of correctly reading the file?

Yes and no. I think that might be a general discussion: why not immediately read 100% and take the safest route?
Admittedly, the checkpoints are arbitrary and not chosen with sound reasoning.
But we are talking about timespans in the single-digit milliseconds (according to %%timeit in the Jupyter notebook sandbox I use for trying this out). So it might not be too important where to put the checkpoints?

But, if you have strong feelings about this then I will kill the 75% checkpoint.

@JensWendt
Contributor Author

Testing on a small csv file worked fine for me. I'm not going to do extensive testing on lots of csvs, but I can confirm with the test csv (on issue above), the original code doesn't parse the file correctly but this code does.

Yes, I did quite some testing, primarily with fringe .csvs where lots of cells were missing or other delimiters were intermingled.
But in the end we will have to rely on the community as testers, and if something weird comes up I will try to fix it quickly.

@JensWendt
Contributor Author

bump

@will-moore
Member

Re 75% - I just thought that if the last 25% of the file had some useful content that could help with picking a delimiter, it would make sense not to ignore it. At that point (after failing on a smaller portion of the file) you care more about actually getting the right value than about speed, since this path will only be hit on a small portion of occasions.
But I don't have any evidence or stats on how much more effective it would be, so happy to go for what you've got.

@will-moore will-moore merged commit ce9f5c9 into ome:develop Sep 29, 2023
1 check passed
@will-moore
Member

Hi @JensWendt - apologies, I should have asked earlier, but could you please fill out a CLA form as described at https://ome-contributing.readthedocs.io/en/latest/cla.html
This is a requirement to cover all of OME's GPL-licensed projects
Many thanks!

@JensWendt
Contributor Author

Re 75% - I just thought that if the last 25% of the file had some useful content that could help with picking a delimiter, it would make sense not to ignore it. At that point (after failing on a smaller portion of the file) you care more about actually getting the right value than about speed, since this path will only be hit on a small portion of occasions. But I don't have any evidence or stats on how much more effective it would be, so happy to go for what you've got.

Would you be okay with a fourth nested try that reads the whole file?
It seems I will be touching the script again soon anyway.
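The "read ALL of the file for the final try" idea would not strictly need a fourth nested try/except; folded into the same checkpoint loop it could look like this (a rough standalone sketch under the same assumptions as above, not the script's actual code):

```python
import csv

def sniff_delimiter(data):
    """Sample at 25%, 50% and 75% of the file, then fall back to
    sniffing the whole file as the final attempt."""
    for fraction in (0.25, 0.50, 0.75, 1.0):
        sample = data[:int(len(data) * fraction)]
        try:
            return csv.Sniffer().sniff(sample).delimiter
        except csv.Error:
            continue  # too ambiguous at this checkpoint
    raise csv.Error("Could not determine delimiter")
```

Adding 1.0 as a final checkpoint gives the best chance of a correct guess while only paying the full-file cost when the smaller samples fail.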

@JensWendt
Contributor Author

Hi @JensWendt - apologies, I should have asked earlier, but could you please fill out a CLA form as described at ome-contributing.readthedocs.io/en/latest/cla.html This is a requirement to cover all of OME's GPL-licensed projects Many thanks!

here you go:
Binder1.pdf

@will-moore
Member

Great, thanks for that.

This PR will be included in an upcoming OMERO.server release.
Unfortunately the scripts don't get released unless we release the server, which is a limitation. But feel free to continue to improve the script.
