Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lazy Union operation #1557

Merged
merged 10 commits into from
Oct 18, 2024
Merged

Implement lazy Union operation #1557

merged 10 commits into from
Oct 18, 2024

Conversation

RobinTF
Copy link
Collaborator

@RobinTF RobinTF commented Oct 15, 2024

Allow the Union operation to lazily output its result.

Copy link

codecov bot commented Oct 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.95%. Comparing base (81a9350) to head (bb544ba).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1557      +/-   ##
==========================================
+ Coverage   88.45%   88.95%   +0.50%     
==========================================
  Files         362      364       +2     
  Lines       27455    33179    +5724     
  Branches     3705     3723      +18     
==========================================
+ Hits        24285    29516    +5231     
- Misses       1938     2431     +493     
  Partials     1232     1232              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

IdTable{getResultWidth(), allocator()},
_columnOrigins);
} else {
std::vector<size_t> permutation = computePermutation<true>();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost certain that for the left table this permutation will always be the identity function, so it might only be required to add columns for undef in this case

}
permutation[targetColIdx] = originIndex;
}
return permutation;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In retrospective I'm not quite sure if there's an actual performance benefit to compute the permutation once and reuse it for all id tables or if it would be possible to just inline it into the algorithm that does the actual swapping, saving the memory allocation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be better to just store pairs of indices of swap operations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is perfectly fine as is, the assumption always is, that columns and blocks are sufficiently large, s.t. you can do something mildly expensive for each of them.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial questions.

src/engine/Union.cpp Show resolved Hide resolved
}
permutation[targetColIdx] = originIndex;
}
return permutation;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is perfectly fine as is, the assumption always is, that columns and blocks are sufficiently large, s.t. you can do something mildly expensive for each of them.

Comment on lines 269 to 280
// i + 1 because once everything expect the last column is in the correct
// position, the last column is already in the correct position so the last
// iteration would always swap the last column with itself.
for (size_t i = 0; i + 1 < permutation.size(); ++i) {
size_t ind = permutation[i];
while (ind < i) {
ind = permutation[ind];
}
idTable.swapColumns(i, ind);
}
return idTable;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure that the IdTable class has a setColumnSubset function or something, that can be used to apply a permutation, saving you the logic here + testing it thoroughly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it reallocates unnecessary memory (by creating a second vector/array and replacing the original one) instead of swapping the elements in place like it's done here, but it definitely works.

src/engine/Union.cpp Outdated Show resolved Hide resolved
Comment on lines 297 to 305
if (result2->isFullyMaterialized()) {
co_yield computeUnion(IdTable{getResultWidth(), allocator()},
result2->idTable(), _columnOrigins);
} else {
std::vector<size_t> permutation = computePermutation<false>();
for (IdTable& idTable : result2->idTables()) {
co_yield transformToCorrectColumnFormat(std::move(idTable), permutation);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is quite some code duplication here. With my above proposed change, you can write a helper,
that can be parametrized on result1/2 and the corresponding permutation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. I could write a helper function but this would wrap the whole thing in another generator object which I was trying to avoid here. Of course, most likely the overhead wouldn't be too big here and in the future compilers will most likely be smart enough to compile the layer away. But this is my reason why I didn't add a helper function here and instead tried to keep the repeated code as short as possible.

src/engine/Union.h Outdated Show resolved Hide resolved
test/UnionTest.cpp Outdated Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a very minor readability suggestion is remaining

Comment on lines 247 to 251
std::vector<ColumnIndex> permutation(_columnOrigins.size());
for (size_t targetColIdx = 0; targetColIdx < _columnOrigins.size();
++targetColIdx) {
ColumnIndex originIndex = _columnOrigins.at(targetColIdx)[left ? 0 : 1];
if (originIndex == NO_COLUMN) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don'r resize, but maximally reserve the permutation, then do a range-for over the _columnOrigins with push_back on the permutation.
That makes the code a little more readable.
And you can precompute the resIdx = left?0:1 then you don't repeat this pattern (less error-prone).

@sparql-conformance
Copy link

Copy link

sonarcloud bot commented Oct 18, 2024

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@joka921 joka921 merged commit e53d783 into ad-freiburg:master Oct 18, 2024
22 checks passed
@RobinTF RobinTF deleted the lazy-union branch October 18, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants