
[RFC] SPFS Safe Deletion #1011

Open
rydrman opened this issue Mar 22, 2024 · 4 comments
Labels
agenda item Items to be brought up at the next dev meeting

Comments

@rydrman
Collaborator

rydrman commented Mar 22, 2024

Background and Problem Space

On our server, we have a race condition with respect to the spfs clean process. Consider the following scenario:

  1. A manifest containing object OLD is untagged
  2. A clean process starts, collects all tags and begins looking for orphaned objects
  3. A manifest upload starts that contains OLD which appears to exist on the server
  4. The clean process removes OLD as it is an orphan
  5. The upload process completes and tags a manifest containing OLD (which is now corrupt as OLD is missing)

The use of object age in the cleaner is not helpful here, because OLD can be of any age.

Proposal

I propose a safe-deletion process for spfs that works in two stages: data is first marked for deletion, then removed at a later time. This two-stage removal process would allow administrators to leverage a second age-out window for deleted data. This window would need to be tuned for each workflow to fully eliminate race conditions, but a safe and reasonable default would suffice for all currently known cases.

Step-by-Step

  1. Update the object and payload repositories to be able to "safe-delete" objects by adding a ".deleted" suffix to the file (or similar).
    At this stage the repository would appear to no longer contain the object via has_object. If the object/payload was read, though, the repository would recognize the deleted instance and restore it instead of failing. Crucially, this restore should make the object "new" again in the eyes of the cleaner.
  2. Update the cleaner (and clean cli) to use this safe deletion by default
  3. Add a new phase/process to the cleaner that looks for aged-out ".deleted" data and removes it for good.
rydrman added the agenda item label Mar 22, 2024
@jrray
Collaborator

jrray commented Mar 23, 2024

Yeah, that's an interesting case that the age-based protections don't help with. I don't think the .deleted idea will really solve this race condition. It may make it less likely in your workflow, but I believe it will still be possible to get into a situation where a non-garbage object remains marked deleted for an indefinite period; if that object is not read for whatever reason, it will still be subject to deletion at some point. It also introduces a lot of complexity -- more states that objects can be in -- and causes read operations to become write operations.

I think the fundamental race problem is that there is no way to write both a manifest object and its child objects into the database atomically. So there is a window of time between when the child objects are checked for existence or written (and can be considered garbage) and when the manifest object is written, creating the hard reference to those child objects. This window actually extends up to the point where a tag is written that references the manifest(s).

I suggest a different approach to tackle the race more directly using a cooperative locking mechanism. What is needed is an extra way to track object liveness that is separate from walking references from tags. Here's what would happen:

  • Any operation intending to write new objects to the db would create a "transaction" where some exclusive lock is held.
  • If writing a manifest, for example, all the child object digests are added to the transaction.
  • The lock can be released as soon as all the digests are protected; the objects themselves do not need to be written to the db yet.
  • spfs clean is updated to respect these transactions -- no objects present in any transaction(s) are considered garbage. It cannot delete any objects without holding the transaction lock.

The lock would be short lived and still allow concurrency for writes. spfs clean would have to block writes after it has collected a list of objects it would like to delete to ensure none of those objects become live again.

@rydrman
Collaborator Author

rydrman commented Mar 27, 2024

I'm happy to explore the idea of transactions, but it sounds like you are suggesting a global lock on the whole DB. That makes me nervous, but maybe a more detailed discussion of the implementation would be good.

One consideration for us is that the clean operation can take a significant amount of time on our repository, and it's not reasonable to ask the server to stop accepting requests during that period, because we now have processes writing new data during most hours of the day.

spfs clean would have to block writes after it has collected a list of objects it would like to delete to ensure none of those objects become live again.

It seems to me that this point is not actually true, since you would need to block writes even while collecting the list of objects lest you select something for deletion that was re-added while scanning (the original problem that spawned this).

As I think about the transactions, I'm hard-pressed to see how we actually rid ourselves of a similar race condition, but I'm keen to keep talking about it.

I arrived at the .deleted approach because our proposed solution internally was to have two repositories, the second being a graveyard and a fallback to the primary one, so that we could move "deleted" objects there and have them age out of the graveyard once they had been deleted for long enough. While considering all of the re-implementation of the cleaner we would need in order to have it transfer instead of remove, it occurred to me that this second repository could basically be embedded transparently in the first. I know it's not perfect, but it allows clean and check to be combined into a much more robust maintenance workflow, provided that the full age-out and check intervals can be guaranteed to be longer than any one write operation.

@jrray
Collaborator

jrray commented Mar 27, 2024

spfs clean would have to block writes after it has collected a list of objects it would like to delete to ensure none of those objects become live again.

It seems to me that this point is not actually true, since you would need to block writes even while collecting the list of objects lest you select something for deletion that was re-added while scanning (the original problem that spawned this).

This is not true. spfs clean is only allowed to delete objects while holding the lock. It is not allowed to delete any objects added to the no-delete list. It is free to gather information about objects it wants to delete without holding the lock, but it must respect the no-delete list once attempting to delete them. Objects cannot be added to the no-delete list without holding the lock.

@rydrman
Collaborator Author

rydrman commented May 22, 2024

From the meeting today:

  • clean is also a very heavy operation, especially on large repositories that use shared infrastructure such as NFS. Is there a way we can (through this process) allow the clean process to run slowly in the background?

  • Possibly create a new repo that layers over the first, making the original one readonly. From there, you could walk the original repo and copy up any attached data (hopefully by hard link). Though you would still need to account for new writes, ensuring that all attached objects for new writes are also copied up. You would likely still need some way to ensure that no write operations are in-flight when the split happens. Maybe the same write-ahead log idea as above.

  • Likely this needs to be handled internally to the repository so that there's no physical movement of NFS repositories that needs to be communicated to other clients.

  • A clean journal that logs what is considered garbage and when, with tunable criteria for how many times and/or how long something needs to be considered garbage before it is removed
