-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add disk caching #90
Add disk caching #90
Conversation
LGTM - does it work? We'll want to integrate with the CLI, have you thought about how to configure the cache dir? (I was think appdirs might be a good approach, ultimately?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment about cache version
Easiest way to get a uuid is to save to tempfile and read back in |
I've modified this to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should merge this so we have it, but it's probably overkill in a few different ways.
-
diskcache is really quite complex. After spending 10 mins with the docs, it looks like it's using relational DB plus a bunch of stuff around timing out keys. We really don't need this, and it also means we'd have to be much more careful about managing opening and closing the cache than we are here. We could just store the files (as CSVs, even) based on their keys and it would be much simpler and easier to maintain.
-
The key generation policy is clever, but expensive to run and hard to test. Just incrementing a number when we change the dataframe format isn't that hard, and it'll make key generation fast.
cache.py
Outdated
try: | ||
os.makedirs(cache_dir) | ||
logger.info(f"Set cache_dir to {cache_dir}") | ||
except OSError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also happen, e.g. if there wasn't sufficient permissions, or if it already existed and was a file, right? So, the log error will be misleading.
Why not use the pathlib API directly?
cache_dir.mkdir(exist_ok=True, parents=True)
I think this takes care of all the required corner cases and raises a sensible error on weirdnesses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, much simpler. Fixed.
@property | ||
def file_uuid(self): | ||
return self.ts.file_uuid | ||
|
||
@cached_property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, so there's two layers of caching here. I had to think a bit about what this means - so the first time the property is called, we fall through to the disk_cache version, and then afterwards (within that process) it'll be cached by functools.
From the diskcache docs: "Cache objects are thread-safe and may be shared between threads. Two Cache objects may also reference the same directory from separate threads or processes. In this way, they are also process-safe and support cross-process communication." So I don't think we need to worry about opening and closing
Yeah, fair point. I replaced it with a version string. |
Did you push the changes here @benjeffery? |
Whoops, should be there now! |
In principle - in practise, sqlite hasn't worked well in my experience without carefully handling the context managers. Let's try it out and see anyway. |
WIP. Currently the test tree sequence has no UUID so trying to find a way to get one.