
Smart Iteration Algorithms #17

Open
adamchainz opened this issue Mar 21, 2015 · 4 comments

@adamchainz
Owner

The algorithm used by SmartChunkedIterator is currently restricted to the one from pt-online-schema-change, with filtering added. This performs badly when matching rows are sparsely distributed.

There are other algorithms we could use; several are listed in the pt-table-sync documentation. The current algorithm appears there (approximately) as 'chunk'. The 'nibble' strategy would work well for sparse queries - it's the same one pt-archiver uses, relying on LIMIT to determine how many rows get fetched.

The algorithm could be another argument to SmartChunkedIterator - maybe even defaulting to 'auto', with a heuristic to pick between the algorithms.
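For illustration, here's a minimal sketch of the two strategies over a Django queryset with an integer primary key. The function names and structure are hypothetical, not django-mysql's actual API:

```python
# Hypothetical sketch - not django-mysql's API. Both functions take a
# Django queryset whose model has an integer pk.

def iter_chunk_style(queryset, chunk_size):
    """'chunk' strategy (like pt-online-schema-change): walk fixed-width
    pk ranges. A mostly-empty range still costs a query that matches
    almost nothing."""
    max_pk = queryset.order_by("-pk").values_list("pk", flat=True).first()
    if max_pk is None:
        return
    low = 0
    while low <= max_pk:
        yield queryset.filter(pk__gte=low, pk__lt=low + chunk_size)
        low += chunk_size


def iter_nibble_style(queryset, chunk_size):
    """'nibble' strategy (like pt-archiver): use LIMIT to grab the next
    chunk_size matching rows, so sparse regions are crossed in one query."""
    last_pk = None
    while True:
        window = queryset.order_by("pk")
        if last_pk is not None:
            window = window.filter(pk__gt=last_pk)
        pks = list(window.values_list("pk", flat=True)[:chunk_size])
        if not pks:
            return
        yield queryset.filter(pk__in=pks)
        last_pk = pks[-1]
```

The trade-off: the chunk style issues one query per pk range regardless of how many rows match, while the nibble style's cost tracks the number of matching rows, at the price of an extra LIMIT query per chunk.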

@adamchainz adamchainz changed the title pt-archiver style iteration Smart Iteration Algorithms Dec 8, 2015
@Geekfish

Hi @adamchainz

I know it's a bit of an old ticket, but I'd be interested in working on this.
Would you have any suggestions for a setup to test the efficiency of different solutions outside of unit tests?

Many thanks for your great work!

@adamchainz
Owner Author

I think you really need a big table with realistic fragmentation to investigate this properly. Also look in the source code for pt-table-sync.

Are you having any problems with the current algorithm?

@Geekfish

We ran into some inefficiency with queries that return sparse results over big tables. This was partly due to the query conditions, but also due to deleted rows (for example, ids earlier in the table being more sparse).
We have found workarounds for our cases (e.g. not using chunks at all for queries with smaller counts, as sketched below), and the iterator still works nicely most of the time, but I was personally interested in a better solution for sparse distributions.
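For reference, a rough sketch of that workaround, assuming django-mysql's iter_smart_chunks() queryset method is available on the model; the threshold is illustrative:

```python
# Rough sketch of the workaround: skip chunking when the result set is
# small. The threshold is illustrative.
SMALL_RESULT_THRESHOLD = 10_000

def iter_maybe_chunked(queryset):
    if queryset.count() <= SMALL_RESULT_THRESHOLD:
        # Small result set: one unchunked pass is cheaper than probing
        # chunk boundaries across a mostly-empty pk range.
        yield queryset
    else:
        yield from queryset.iter_smart_chunks()
```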

@adamchainz
Owner Author

Yeah, sparse results are difficult because there's no predicting when the density of results will change. All the current algorithm does is react to this, by decreasing the chunk size rapidly.
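A simplified sketch of that kind of reaction, using matched rows per chunk as the feedback signal; the signal and constants here are illustrative, not django-mysql's actual tuning:

```python
# Simplified sketch of reactive chunk sizing: aim for a target number of
# matched rows per chunk. Shrinks sharply when a chunk comes back dense,
# grows when it comes back sparse. Illustrative only.
def next_chunk_size(current_size, rows_matched,
                    target_rows=1000, min_size=1, max_size=100_000):
    if rows_matched == 0:
        # Completely sparse region: widen the window to skip it faster.
        return min(current_size * 2, max_size)
    # Rescale toward target_rows matched rows per chunk; a dense chunk
    # (rows_matched >> target_rows) cuts the size sharply.
    scaled = int(current_size * target_rows / rows_matched)
    return max(min_size, min(scaled, max_size))
```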
