Update file dependencies for up-to-date tasks. #431

Open
wants to merge 1 commit into
base: master

tillahoffmann

This PR updates file dependencies in the doit database even if the task is already up to date. The change avoids repeatedly hashing large files whose timestamp has changed but whose content has not.

Consider the following task which simply copies large_file.txt to output.txt.

def task_copy():
    return {
        "actions": ["cp large_file.txt output.txt"],
        "targets": ["output.txt"],
        "file_dep": ["large_file.txt"],
    }

The first time doit runs, it saves the timestamp, size, and md5 hash of large_file.txt. On the second run, doit smartly skips calculating the md5 hash of large_file.txt because the timestamps match. So far so good.

Now suppose the timestamp changes but the content does not. This might happen if we delete an intermediate file which is then regenerated. On the next run, doit will compute the md5 hash of large_file.txt and skip the task because it's up to date, as expected. But it won't update the timestamp in the database, so every subsequent run computes the md5 hash of large_file.txt again.
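
The effect can be reproduced in a toy model that is independent of doit's internals. The sketch below is only an illustration: `ToyDepStore`, `refresh_when_up_to_date`, and `count_hashes` are hypothetical names, not part of doit, and the store is a plain dict rather than the real database. It counts how many times the md5 is computed over three consecutive "runs" after a touch-like timestamp change.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ToyDepStore:
    """Hypothetical stand-in for the dependency database (not doit's API)."""

    def __init__(self, refresh_when_up_to_date):
        self.refresh_when_up_to_date = refresh_when_up_to_date
        self.state = {}     # path -> (mtime, size, md5)
        self.md5_calls = 0  # how often the expensive hash was computed

    def _md5(self, path):
        self.md5_calls += 1
        return file_md5(path)

    def save(self, path):
        stat = os.stat(path)
        self.state[path] = (stat.st_mtime, stat.st_size, self._md5(path))

    def is_modified(self, path):
        """Check timestamp, then size, then md5."""
        timestamp, size, md5 = self.state[path]
        stat = os.stat(path)
        if stat.st_mtime == timestamp:
            return False
        if stat.st_size != size:
            return True
        current = self._md5(path)
        if current != md5:
            return True
        # Content is unchanged: optionally remember the new timestamp so
        # that the next check can stop at the cheap timestamp comparison.
        if self.refresh_when_up_to_date:
            self.state[path] = (stat.st_mtime, stat.st_size, current)
        return False


def count_hashes(refresh, runs=3):
    """Count md5 computations over `runs` checks after a touch-like change."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "large_file.txt")
        with open(path, "w") as handle:
            handle.write("unchanged content")
        store = ToyDepStore(refresh)
        store.save(path)
        store.md5_calls = 0  # ignore the hash taken on the initial save
        # Emulate `touch`: move the mtime forward without changing content.
        stat = os.stat(path)
        os.utime(path, (stat.st_atime, stat.st_mtime + 10))
        for _ in range(runs):
            assert not store.is_modified(path)
        return store.md5_calls


if __name__ == "__main__":
    print("without refresh:", count_hashes(False))
    print("with refresh:   ", count_hashes(True))
```

Without the refresh, the hash is recomputed on every check; with it, only the first check after the timestamp change pays for the hash.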

This PR ensures the file dependencies are updated in the database even if the task is already up to date. Here's a concrete example using touch to update the timestamp. I've modified the check_modified function to report some debugging information (see end of description for details).

$ (master) rm -f .doit.db  # Start clean.
$ (master) doit
.  copy
$ (master) doit
-- copy
$ (master) touch large_file.txt
$ (master) doit
large_file.txt was modified at 15:53:09.664308; expected 15:51:36.076443
-- copy
$ (master) doit  # Evaluates md5 hash again (and will indefinitely).
large_file.txt was modified at 15:53:09.664308; expected 15:51:36.076443
-- copy
$ (check_modified) rm -f .doit.db  # Start clean.
$ (check_modified) doit
.  copy
$ (check_modified) doit
-- copy
$ (check_modified) touch large_file.txt
$ (check_modified) doit
large_file.txt was modified at 15:51:36.076443; expected 15:49:30.170537
-- copy
$ (check_modified) doit  # Does not evaluate md5 hash again (updated timestamp saved in previous run).
-- copy

Updated check_modified to report debug information.

    def check_modified(self, file_path, file_stat, state):
        """
        Check if file in file_path is modified from previous "state".
        """
        timestamp, size, file_md5 = state

        # 1 - if timestamp is not modified file is the same
        if file_stat.st_mtime == timestamp:
            return False

        from datetime import datetime
        print(f"{file_path} was modified at {datetime.fromtimestamp(file_stat.st_mtime).time()}; "
              f"expected {datetime.fromtimestamp(timestamp).time()}")

        # 2 - if size is different file is modified
        if file_stat.st_size != size:
            return True

        # 3 - check md5
        return file_md5 != get_file_md5(file_path)
