
USPTO images missing for the "small" train script #13

Open
rytheranderson opened this issue Jan 9, 2024 · 2 comments


@rytheranderson

Hello, I've recently been attempting to retrain MolScribe using the scripts you provide in the scripts directory. First of all, thanks for providing all your data and training scripts; they are extremely helpful.

Second, when running train_uspto_joint_chartok.sh I get a series of missing image warnings like:

[ WARN:[email protected]] global loadsave.cpp:248 findDecoder imread_('data/uspto_mol/2002/20020723/US06423704-20020723/US06423704-20020723-C00135.TIF'): can't open/read file: check file path/integrity

I downloaded the ZIP from the link provided in the README (https://www.dropbox.com/s/3podz99nuwagudy/uspto_mol.zip?dl=0) and unzipped it into data. The warnings do not appear when running train_uspto_joint_chartok_1m680k.sh. The problem is that uspto_mol/train_200k.csv lists paths to images that are not included in the ZIP archive.
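To see how many listed images are affected, a check along these lines could help. This is only a minimal sketch: the function name `find_missing_images` is hypothetical, and the column name `file_path` and the assumption that the CSV paths are relative to the data directory are guesses that should be checked against the actual CSV header.

```python
import csv
from pathlib import Path


def find_missing_images(csv_path, data_root="data", path_column="file_path"):
    """Return the image paths listed in csv_path that do not exist on disk.

    Assumptions (not confirmed by the repository): the CSV has a header row,
    the image path lives in a column named `path_column`, and each path is
    relative to `data_root`.
    """
    missing = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if path_column not in (reader.fieldnames or []):
            raise KeyError(f"column {path_column!r} not found in {csv_path}")
        for row in reader:
            image_path = Path(data_root) / row[path_column]
            # Record every listed image that is not present as a regular file.
            if not image_path.is_file():
                missing.append(str(image_path))
    return missing
```

Calling `find_missing_images("data/uspto_mol/train_200k.csv")` and comparing the length of the result with the number of rows would show whether only some of the images are absent or all of them.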

It would be good to be able to run the smaller training set for quicker comparisons to your saved checkpoint. Let me know if this is fixable. Thanks for your time and this model!

@thomas0809
Owner

Sorry for the late reply. In our paper, we only kept the model trained with 1M synthetic data and 680K patent data. Therefore, only these data are released, and we encourage using them for future comparisons.

If you still want that 200K data, please send me an email at [email protected]. I can give you a link for private download.

@rytheranderson
Author

Thanks for the explanation; the 200k data isn't necessary for me. It may be good, however, to note in the 200k training script or the README that this data is not released.
