-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Repair: describe ring per table and without cache #3718
Conversation
On the second thought, it is also possible to keep repair plan organized per keyspace / table. It might require less changes. |
e070900
to
573d0f5
Compare
We could also change progress to keep stats of only currently repaired table, but this can be a part of another PR. |
Dtests passed except for one, not connected test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My brain compiler is too weak to verify that all changes made to generator won't break the repair process.
The code itself (of generator.go) looks good.
I believe that integration-tests of the repair should catch pottential bugs. Do you agree with me ? Maybe it's worth to execute the SCT sanity against this branch too ?
@karol-kokoszka It seems like additional fields in Ring added in #3747 would be useful in this PR. Having the dc -> rf mapping would make keyspace filtering and max parallel calculation way cleaner. I will refactor and rebase this PR on top of master when #3747 gets merged. |
MaxRingParallel is a cleaner and more comprehensible way of calculating max repair parallelism based on Ring replication strategy.
ShouldRepairRing is a cleaner and more comprehensible way of checking if Ring should be repaired.
This change increases readability.
This change increases readability.
This approach stores only a single ring description at a time (instead of all of them) and makes it easier to move to per table ring descriptions.
573d0f5
to
f5bff87
Compare
@karol-kokoszka I decided to fix #3745 in this PR as it is used in refactored code. I also tried to address all comments above. Sorry for a large PR, but could you please take a look at it again? I have some problems with building and running tests on jenkins. I will retry them tomorrow. |
1129e55
to
f5bff87
Compare
But you closed the PR. |
Yes, I mentioned above that I decided to have those fixes here as they are important for this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@karol-kokoszka are we ok to merge or do we want to investigate the |
I checked it and looks that it failed on your run. Previous builds failed due to spot instance being terminated. |
|
Merge it, then please check the output of this job execution after merging to master. If it will appear once again without the exact reason in logs (as it is now), please fill the ticket in https://github.com/scylladb/scylla-cluster-tests |
Actions look good after merging to master. |
Right now, repair plan contains descriptions of all repaired rings which are stored at the keyspace level. This has two main issues:
The main goal of this PR is remove ring descriptions from plan and query them when needed during an actual repair.
In order to achieve that, generator responsibility is passed to generator and tableGenerator. Generator is responsible for querying ring description and creating tableGenerators, which take care of repairing given table.
This introduces a lot of code refactoring, so in order to make repair logic more comprehensible I added:
Currently, determining max parallel or if a keyspace should be repaired is convoluted as it requires traversing all token ranges. The changes to DescribeRing replication strategy calculation make it possible to do it by looking at the replication strategy and replication factor, which is easier to understand, explain and test.
This PR tried not to touch repair progress manager, but it might be good to refactor it as well.
Fixes #3745