-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement lazy Union
operation
#1557
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1557 +/- ##
==========================================
+ Coverage 88.45% 88.95% +0.50%
==========================================
Files 362 364 +2
Lines 27455 33179 +5724
Branches 3705 3723 +18
==========================================
+ Hits 24285 29516 +5231
- Misses 1938 2431 +493
Partials 1232 1232 ☔ View full report in Codecov by Sentry. |
src/engine/Union.cpp
Outdated
IdTable{getResultWidth(), allocator()}, | ||
_columnOrigins); | ||
} else { | ||
std::vector<size_t> permutation = computePermutation<true>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm almost certain that for the left table this permutation will always be the identity function, so it might only be required to add columns for undef in this case
} | ||
permutation[targetColIdx] = originIndex; | ||
} | ||
return permutation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In retrospective I'm not quite sure if there's an actual performance benefit to compute the permutation once and reuse it for all id tables or if it would be possible to just inline it into the algorithm that does the actual swapping, saving the memory allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be better to just store pairs of indices of swap operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is perfectly fine as is, the assumption always is, that columns and blocks are sufficiently large, s.t. you can do something mildly expensive for each of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial questions.
} | ||
permutation[targetColIdx] = originIndex; | ||
} | ||
return permutation; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is perfectly fine as is, the assumption always is, that columns and blocks are sufficiently large, s.t. you can do something mildly expensive for each of them.
src/engine/Union.cpp
Outdated
// i + 1 because once everything expect the last column is in the correct | ||
// position, the last column is already in the correct position so the last | ||
// iteration would always swap the last column with itself. | ||
for (size_t i = 0; i + 1 < permutation.size(); ++i) { | ||
size_t ind = permutation[i]; | ||
while (ind < i) { | ||
ind = permutation[ind]; | ||
} | ||
idTable.swapColumns(i, ind); | ||
} | ||
return idTable; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty sure that the IdTable class has a setColumnSubset
function or something, that can be used to apply a permutation, saving you the logic here + testing it thoroughly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it reallocates unnecessary memory (by creating a second vector/array and replacing the original one) instead of swapping the elements in place like it's done here, but it definitely works.
src/engine/Union.cpp
Outdated
if (result2->isFullyMaterialized()) { | ||
co_yield computeUnion(IdTable{getResultWidth(), allocator()}, | ||
result2->idTable(), _columnOrigins); | ||
} else { | ||
std::vector<size_t> permutation = computePermutation<false>(); | ||
for (IdTable& idTable : result2->idTables()) { | ||
co_yield transformToCorrectColumnFormat(std::move(idTable), permutation); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is quite some code duplication here. With my above proposed change, you can write a helper,
that can be parametrized on result1/2
and the corresponding permutation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and no. I could write a helper function but this would wrap the whole thing in another generator object which I was trying to avoid here. Of course, most likely the overhead wouldn't be too big here and in the future compilers will most likely be smart enough to compile the layer away. But this is my reason why I didn't add a helper function here and instead tried to keep the repeated code as short as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a very minor readability suggestion is remaining
src/engine/Union.cpp
Outdated
std::vector<ColumnIndex> permutation(_columnOrigins.size()); | ||
for (size_t targetColIdx = 0; targetColIdx < _columnOrigins.size(); | ||
++targetColIdx) { | ||
ColumnIndex originIndex = _columnOrigins.at(targetColIdx)[left ? 0 : 1]; | ||
if (originIndex == NO_COLUMN) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don'r resize, but maximally reserve the permutation
, then do a range-for over the _columnOrigins
with push_back
on the permutation
.
That makes the code a little more readable.
And you can precompute the resIdx = left?0:1
then you don't repeat this pattern (less error-prone).
Conformance check passed ✅No test result changes. |
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Allow the
Union
operation to lazily output its result.