
Solution for delimiter reading issue (#210) utilizing dynamic checkpoints #211

Merged
2 commits merged into ome:develop on Sep 29, 2023

Conversation

JensWendt
Contributor

Hello,

As detailed in #210, I implemented 3 dynamic checkpoints (at 25%/50%/75% of the file length) instead of the fixed 500/1000/2000 characters.

I tried it with a fairly big .csv of 1.5 million characters and it worked as fast as the original solution.

I also experimented with a smaller .csv containing many deleted cells and random other delimiters mixed in.
It performed reasonably well on such fringe cases, failing to resolve the correct delimiter in only about 2 of 10 cases.
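For reference, the dynamic-checkpoint idea can be sketched roughly like this. This is a minimal standalone sketch using Python's `csv.Sniffer`; the function and variable names are illustrative, not the script's actual code:

```python
import csv

def sniff_delimiter(data):
    """Guess the delimiter of CSV content `data`, sampling at 25%,
    50% and 75% of the file length instead of fixed 500/1000/2000
    character cut-offs."""
    file_length = len(data)
    for fraction in (0.25, 0.50, 0.75):
        sample = data[:int(file_length * fraction)]
        try:
            return csv.Sniffer().sniff(sample).delimiter
        except csv.Error:
            continue  # sample too ambiguous; try a larger one
    # all checkpoints failed
    raise csv.Error("Could not determine delimiter")
```

Because each checkpoint is a fraction of the real file length, the approach scales with file size rather than assuming the first couple of thousand characters are representative.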

Please take a look and tell me what you think.
Thank you for your time!

@@ -158,28 +159,30 @@ def keyval_from_csv(conn, script_params):
# Needs omero-py 5.9.1 or later
temp_name = temp_file.name
with open(temp_name, 'rt', encoding='utf-8-sig') as file_handle:
file_length = len(file_handle.read(-1))
Member

Instead of reading the whole file to get the length, you can get this from the original_file object above (before you open the file).

file_length = original_file.size.val

Contributor Author

Yes, when I wrote this I was almost sure that a more elegant solution would exist.
Thanks!

@will-moore
Member

if parsing fails at 50% of the file, maybe it makes sense to simply read ALL of the file for the final try?
Hopefully this won't be needed very often, but if it is then I expect most users would be willing to wait a tiny bit longer to give them the best chance of correctly reading the file?

@will-moore
Member

The build is failing with flake8 errors:

./omero/annotation_scripts/KeyVal_from_csv.py:167:21: E128 continuation line under-indented for visual indent
./omero/annotation_scripts/KeyVal_from_csv.py:172:80: E501 line too long (81 > 79 characters)
./omero/annotation_scripts/KeyVal_from_csv.py:174:25: E128 continuation line under-indented for visual indent
./omero/annotation_scripts/KeyVal_from_csv.py:179:80: E501 line too long (88 > 79 characters)
./omero/annotation_scripts/KeyVal_from_csv.py:181:25: E128 continuation line under-indented for visual indent

@will-moore
Member

Testing on a small csv file worked fine for me. I'm not going to do extensive testing on lots of csvs, but I can confirm with the test csv (on issue above), the original code doesn't parse the file correctly but this code does.

[Screenshot: 2023-09-07 at 12:16:29]

@JensWendt
Contributor Author

if parsing fails at 50% of the file, maybe it makes sense to simply read ALL of the file for the final try? Hopefully this won't be needed very often, but if it is then I expect most users would be willing to wait a tiny bit longer to give them the best chance of correctly reading the file?

Yes and no. I think that might be a general discussion: why not immediately read 100% and take the safest route?
Admittedly, the checkpoints are arbitrary and not chosen with sound reasoning.
But we are talking about timespans in the single-digit milliseconds (according to %%timeit in the Jupyter notebook sandbox I use for trying this out). So it might not be too important where to put the checkpoints?

But, if you have strong feelings about this then I will kill the 75% checkpoint.

@JensWendt
Contributor Author

Testing on a small csv file worked fine for me. I'm not going to do extensive testing on lots of csvs, but I can confirm with the test csv (on issue above), the original code doesn't parse the file correctly but this code does.

Yes, I did quite some testing, primarily with fringe .csvs where lots of cells were missing or other delimiters were intermingled.
But in the end we will have to rely on the community as testers, and if something weird comes up I will try to fix it quickly.

@JensWendt
Contributor Author

bump

@will-moore
Member

Re 75% - I just thought that if the last 25% of the file had some useful content that could help with picking a delimiter, it would make sense not to ignore it. At that point (after failing on a smaller portion of the file) you care more about actually getting the right value than about speed, since this path will only be hit on a small portion of occasions.
But I don't have any evidence or stats on how much more effective it would be, so happy to go for what you've got.

@will-moore will-moore merged commit ce9f5c9 into ome:develop Sep 29, 2023
1 check passed
@will-moore
Member

Hi @JensWendt - apologies, I should have asked earlier, but could you please fill out a CLA form as described at https://ome-contributing.readthedocs.io/en/latest/cla.html
This is a requirement to cover all of OME's GPL-licensed projects
Many thanks!

@JensWendt
Contributor Author

Re 75% - I just thought that if the last 25% of the file had some useful content that could help with picking a delimiter, it would make sense not to ignore it. At that point (after failing on a smaller portion of the file) you care more about actually getting the right value than about speed, since this path will only be hit on a small portion of occasions. But I don't have any evidence or stats on how much more effective it would be, so happy to go for what you've got.

Would you be okay with a fourth nested try that reads the whole file?
It seems I will be touching the script again soon anyway.
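The "read ALL of the file for the final try" idea would not strictly need a fourth nested try/except; folded into the same checkpoint loop it could look like this (a rough standalone sketch under the same assumptions as above, not the script's actual code):

```python
import csv

def sniff_delimiter(data):
    """Sample at 25%, 50% and 75% of the file, then fall back to
    sniffing the whole file as the final attempt."""
    for fraction in (0.25, 0.50, 0.75, 1.0):
        sample = data[:int(len(data) * fraction)]
        try:
            return csv.Sniffer().sniff(sample).delimiter
        except csv.Error:
            continue  # too ambiguous at this checkpoint
    raise csv.Error("Could not determine delimiter")
```

Adding 1.0 as a final checkpoint gives the best chance of a correct guess while only paying the full-file cost when the smaller samples fail.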

@JensWendt
Contributor Author

Hi @JensWendt - apologies, I should have asked earlier, but could you please fill out a CLA form as described at ome-contributing.readthedocs.io/en/latest/cla.html This is a requirement to cover all of OME's GPL-licensed projects Many thanks!

here you go:
Binder1.pdf

@will-moore
Member

Great, thanks for that.

This PR will be included in an upcoming OMERO.server release.
Unfortunately the scripts don't get released unless we release the server, which is a limitation. But feel free to continue to improve the script.
