
Add disk caching #90

Merged
merged 1 commit into from
Sep 19, 2023

Conversation

@benjeffery (Member) commented Sep 12, 2023

WIP. Currently the test tree sequence has no UUID, so I'm trying to find a way to get one.

@jeromekelleher (Member)

LGTM - does it work?

We'll want to integrate with the CLI; have you thought about how to configure the cache dir? (I was thinking appdirs might be a good approach, ultimately?)

@jeromekelleher (Member) left a comment

see comment about cache version

model.py (review comment, resolved)
@jeromekelleher (Member)

Easiest way to get a uuid is to save to a tempfile and read it back in.
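The round-trip might be sketched like this. It's a stdlib-only simulation of the pattern: in the real code the writer and reader would be tskit's `ts.dump` and `tskit.load`, and the UUID is stamped by tskit at write time; `dump_with_uuid` and the JSON payload here are purely illustrative stand-ins.

```python
import json
import tempfile
import uuid
from pathlib import Path

def dump_with_uuid(data, path):
    # Stand-in for a writer (like ts.dump) that stamps a fresh
    # UUID into the file at write time.
    payload = {"file_uuid": str(uuid.uuid4()), "data": data}
    Path(path).write_text(json.dumps(payload))

def load(path):
    # Stand-in for tskit.load.
    return json.loads(Path(path).read_text())

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.trees"
    dump_with_uuid({"num_trees": 3}, path)  # save to a temp file...
    ts = load(path)                         # ...and read back in
    # the reloaded copy now carries a file_uuid
```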

@benjeffery (Member, Author)

I've modified this to use appdirs, and to key the cache on a hash of the source files of the functions in the call tree of the cached function. This means that changes to the pages code won't trigger an invalidation, which will help a lot with dev.

@jeromekelleher (Member) left a comment

I think we should merge this so we have it, but it's probably overkill in a few different ways.

  1. diskcache is really quite complex. After spending 10 mins with the docs, it looks like it's using a relational DB plus a bunch of stuff around timing out keys. We really don't need this, and it also means we'd have to be much more careful about managing opening and closing the cache than we are here. We could just store the files (as CSVs, even) based on their keys and it would be much simpler and easier to maintain.

  2. The key generation policy is clever, but expensive to run and hard to test. Just incrementing a number when we change the dataframe format isn't that hard, and it'll make key generation fast.
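The simpler alternative described in point 1 might look roughly like this. A stdlib sketch, not the PR's code; `CACHE_VERSION` and the key layout are illustrative, and it folds in the incremented-number idea from point 2:

```python
# Store each cached result as a plain CSV file keyed by the input's UUID
# plus a hand-bumped version number.
import csv
from pathlib import Path

CACHE_VERSION = 1  # bump when the dataframe format changes

def cache_path(cache_dir, file_uuid, name):
    return Path(cache_dir) / f"{file_uuid}-v{CACHE_VERSION}-{name}.csv"

def get_or_compute(cache_dir, file_uuid, name, compute):
    path = cache_path(cache_dir, file_uuid, name)
    if path.exists():
        # Cache hit: just read the rows back.
        with path.open(newline="") as f:
            return list(csv.reader(f))
    # Cache miss: compute, then persist for next time.
    rows = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

Invalidation is then just bumping `CACHE_VERSION`, and inspecting the cache is an `ls` on the directory.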

cache.py (outdated):

    try:
        os.makedirs(cache_dir)
        logger.info(f"Set cache_dir to {cache_dir}")
    except OSError:
Member

This could also happen, e.g. if there weren't sufficient permissions, or if it already existed and was a file, right? So the log error will be misleading.

Why not use the pathlib API directly?

    cache_dir.mkdir(exist_ok=True, parents=True)

I think this takes care of all the required corner cases and raises a sensible error on weirdnesses.
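For what it's worth, those corner cases can be checked quickly (stdlib only; the `cache_dir`/`blocker` names are just for this demo):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "a" / "b" / "cache"
    cache_dir.mkdir(parents=True, exist_ok=True)  # creates missing parents
    cache_dir.mkdir(parents=True, exist_ok=True)  # already a dir: no error
    created = cache_dir.is_dir()

    blocker = Path(tmp) / "occupied"
    blocker.write_text("a file, not a directory")
    raised = False
    try:
        blocker.mkdir(parents=True, exist_ok=True)
    except FileExistsError:
        raised = True  # exist_ok only forgives existing *directories*
```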

Member Author

Nice, much simpler. Fixed.

cache.py (review comment, resolved)
    @property
    def file_uuid(self):
        return self.ts.file_uuid

    @cached_property
Member

Right, so there are two layers of caching here. I had to think a bit about what this means: the first time the property is accessed, we fall through to the disk_cache version, and afterwards (within that process) it'll be cached by functools.
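A stdlib sketch of those two layers, with a plain pickle file standing in for the diskcache store (the `Model` class and attribute names here are illustrative, not the PR's):

```python
import pickle
from functools import cached_property
from pathlib import Path

class Model:
    def __init__(self, cache_dir, file_uuid):
        self.cache_dir = Path(cache_dir)
        self.file_uuid = file_uuid
        self.computed = 0  # instrumentation for this sketch only

    def _disk_cached(self, key, compute):
        # Layer 2: the on-disk store, shared across processes.
        path = self.cache_dir / f"{self.file_uuid}-{key}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())
        value = compute()
        path.write_bytes(pickle.dumps(value))
        return value

    @cached_property  # Layer 1: in-memory, per process
    def mutations_df(self):  # illustrative name
        def compute():
            self.computed += 1
            return {"rows": 10}
        return self._disk_cached("mutations_df", compute)
```

Within one process the second access never touches disk; a fresh process (simulated below by a fresh instance) finds the value on disk instead of recomputing.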

@benjeffery (Member, Author)

  1. diskcache is really quite complex. After spending 10 mins with the docs, it looks like it's using a relational DB plus a bunch of stuff around timing out keys. We really don't need this, and it also means we'd have to be much more careful about managing opening and closing the cache than we are here. We could just store the files (as CSVs, even) based on their keys and it would be much simpler and easier to maintain.

From the diskcache docs: "Cache objects are thread-safe and may be shared between threads. Two Cache objects may also reference the same directory from separate threads or processes. In this way, they are also process-safe and support cross-process communication." So I don't think we need to worry about opening and closing.

  2. The key generation policy is clever, but expensive to run and hard to test. Just incrementing a number when we change the dataframe format isn't that hard, and it'll make key generation fast.

Yeah, fair point. I replaced it with a version string.

@jeromekelleher (Member)

Did you push the changes here @benjeffery?

@benjeffery (Member, Author)

Whoops, should be there now!

@jeromekelleher jeromekelleher merged commit e85d3b7 into tskit-dev:main Sep 19, 2023
3 checks passed
@jeromekelleher (Member)

From the diskcache docs: "Cache objects are thread-safe and may be shared between threads. Two Cache objects may also reference the same directory from separate threads or processes. In this way, they are also process-safe and support cross-process communication." So I don't think we need to worry about opening and closing

In principle. In practice, sqlite hasn't worked well in my experience without careful handling of the context managers. Let's try it out and see anyway.

@benjeffery benjeffery deleted the caching branch September 20, 2023 06:29