GH-4232 performance issue in VALUES clause query #4330

Draft: wants to merge 4 commits into main

abrokenjester
Contributor

GitHub issue resolved: #4232

Briefly describe the changes proposed in this PR:

  • Ensure a VALUES clause always covers the entire group graph pattern. This is the generically correct way to parse VALUES clauses. An optimizer can potentially look at the ordering in the algebra to push the VALUES clause down into the join tree (by inspecting which parts of the tree use variables bound in the VALUES clause).
  • Add benchmarks.

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@hmottestad
Contributor

Results from my laptop:

Benchmark                                          Mode  Cnt    Score   Error  Units
SPARQLValuesClauseBenchmark.simpleEquivalentQuery  avgt    5    0.606 ± 0.007  ms/op
SPARQLValuesClauseBenchmark.valuesOptionalQuery    avgt    5  298.065 ± 6.825  ms/op

The query with the VALUES clause uses ~300ms per query while the other query uses ~0.6ms.

@abrokenjester
Contributor Author

On my laptop:

Benchmark                                          Mode  Cnt    Score    Error  Units
SPARQLValuesClauseBenchmark.simpleEquivalentQuery  avgt    5    0.956 ±  0.093  ms/op
SPARQLValuesClauseBenchmark.valuesOptionalQuery    avgt    5  452.636 ± 23.386  ms/op

Looks like you have a slightly better laptop :)

@abrokenjester
Contributor Author

Query explanation simple query:

Projection (resultSizeActual=505)
╠══ProjectionElemList
║     ProjectionElem "parent"
║     ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator) (resultSizeActual=505)
   ├──StatementPattern (resultSizeEstimate=504, resultSizeActual=504)
   │     Var (name=parent)
   │     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
   │     Var (name=_const_7e619dcc_uri, value=http://www.w3.org/2002/07/owl#Class, anonymous)
   └──Join (JoinIterator) (resultSizeActual=2)
      ╠══StatementPattern (costEstimate=50, resultSizeEstimate=9.9K, resultSizeActual=2)
      ║     Var (name=child)
      ║     Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
      ║     Var (name=parent)
      ╚══StatementPattern (costEstimate=1, resultSizeEstimate=504, resultSizeActual=2)
            Var (name=child)
            Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
            Var (name=_const_7e619dcc_uri, value=http://www.w3.org/2002/07/owl#Class, anonymous)

Query explanation values clause query:

Projection (resultSizeActual=505)
╠══ProjectionElemList
║     ProjectionElem "parent"
║     ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator) (resultSizeActual=505)
   ├──Join (JoinIterator) (resultSizeActual=504)
   │  ╠══BindingSetAssignment ([[type=http://www.w3.org/2002/07/owl#Class]]) (costEstimate=1, resultSizeEstimate=1, resultSizeActual=1)
   │  ╚══StatementPattern (costEstimate=504, resultSizeEstimate=504, resultSizeActual=504)
   │        Var (name=parent)
   │        Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
   │        Var (name=type, value=http://www.w3.org/2002/07/owl#Class)
   └──Join (JoinIterator) (resultSizeActual=2)
      ╠══StatementPattern (costEstimate=11, resultSizeEstimate=508, resultSizeActual=254.0K)
      ║     Var (name=child)
      ║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
      ║     Var (name=type)
      ╚══StatementPattern (costEstimate=1, resultSizeEstimate=9.9K, resultSizeActual=2)
            Var (name=child)
            Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
            Var (name=parent)
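
The 254.0K on the first pattern of the right-hand join is consistent with 504 × 504 = 254,016: the (?child, type, ?type) pattern is evaluated once for each of the 504 left-hand bindings and matches all 504 classes each time. A minimal Python sketch of that effect follows. It is a toy nested-loop model, not RDF4J code, and the data is assumed (504 resources typed owl:Class, two of them subclasses of the same parent), not the benchmark's generated dataset.

```python
# Toy model of the OPTIONAL part of the query; NOT RDF4J code.
# Assumed data: 504 classes, two subClassOf links pointing at the same parent.
TYPE, SUBCLASS, OWL_CLASS = "rdf:type", "rdfs:subClassOf", "owl:Class"

classes = [f"c{i}" for i in range(504)]
by_predicate = {
    TYPE: [(c, OWL_CLASS) for c in classes],
    SUBCLASS: [("c1", "c0"), ("c2", "c0")],
}


def is_var(term):
    return term.startswith("?")


def match(pattern, binding):
    """Yield extended bindings for one (subject, predicate, object) pattern."""
    s, p, o = (binding.get(term, term) for term in pattern)  # substitute bound vars
    for subj, obj in by_predicate[p]:
        if (is_var(s) or s == subj) and (is_var(o) or o == obj):
            extended = dict(binding)
            if is_var(s):
                extended[s] = subj
            if is_var(o):
                extended[o] = obj
            yield extended


def left_join(right_patterns):
    """Evaluate the right-hand patterns once per left binding, in the given order.

    Returns (rows produced by each pattern, total result size).
    """
    rows_per_pattern = [0] * len(right_patterns)
    result_size = 0
    for parent in classes:  # left side: ?parent rdf:type ?type, with ?type = owl:Class
        current = [{"?parent": parent, "?type": OWL_CLASS}]
        for i, pattern in enumerate(right_patterns):
            current = [b for binding in current for b in match(pattern, binding)]
            rows_per_pattern[i] += len(current)
        result_size += max(len(current), 1)  # OPTIONAL keeps unmatched left rows
    return rows_per_pattern, result_size


type_pattern = ("?child", TYPE, "?type")
subclass_pattern = ("?child", SUBCLASS, "?parent")

type_first = left_join([type_pattern, subclass_pattern])
subclass_first = left_join([subclass_pattern, type_pattern])
print(type_first)      # ([254016, 2], 505)
print(subclass_first)  # ([2, 2], 505)
```

Both orders return the same 505 rows; only the number of intermediate rows produced by the first pattern differs, which matches the shape of the two explanations.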

@abrokenjester
Contributor Author

I still can't see exactly what goes wrong, other than that it seems to be a pathological case in terms of the data shape. I'm starting to wonder if we should have an optimizer that duplicates the BindingSetAssignment into the right arg of the left join. We can't pre-bind values in the right arg of the left join, but perhaps we could optimize to something like this:

Projection (resultSizeActual=505)
╠══ProjectionElemList
║     ProjectionElem "parent"
║     ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator) (resultSizeActual=505)
   ├──Join (JoinIterator) (resultSizeActual=504)
   │  ╠══BindingSetAssignment ([[type=http://www.w3.org/2002/07/owl#Class]]) (costEstimate=1, resultSizeEstimate=1, resultSizeActual=1)
   │  ╚══StatementPattern (costEstimate=504, resultSizeEstimate=504, resultSizeActual=504)
   │        Var (name=parent)
   │        Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
   │        Var (name=type, value=http://www.w3.org/2002/07/owl#Class)
   └──Join (JoinIterator) (resultSizeActual=2)
      ╠══ Join
           -> BindingSetAssignment ([[type=http://www.w3.org/2002/07/owl#Class]]) (costEstimate=1, resultSizeEstimate=1, resultSizeActual=1)
           ╠══StatementPattern (costEstimate=11, resultSizeEstimate=508, resultSizeActual=254.0K)
           ║     Var (name=child)
           ║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
           ║     Var (name=type)
           ╚══StatementPattern (costEstimate=1, resultSizeEstimate=9.9K, resultSizeActual=2)
                 Var (name=child)
                 Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
                 Var (name=parent)

(ugly manual editing by me to get the idea across)

I'm not yet sure if that is something that is generally legal tbh. Just recording my thoughts for the next time I revisit.
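
For the record, the mechanics of that rewrite can be sketched on a toy algebra. The node classes below are plain Python dataclasses with hypothetical names borrowed from the plan output; they are not RDF4J's algebra classes, and the function only shows the tree manipulation. Whether the rewrite is legal in general is exactly the open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingSetAssignment:
    bindings: tuple  # e.g. (("type", "owl:Class"),)


@dataclass(frozen=True)
class StatementPattern:
    subj: str
    pred: str
    obj: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class LeftJoin:
    left: object
    right: object


def variables(node):
    """Names of the variables ('?x' terms, without the '?') used below a node."""
    if isinstance(node, StatementPattern):
        return {t[1:] for t in (node.subj, node.pred, node.obj) if t.startswith("?")}
    if isinstance(node, BindingSetAssignment):
        return {name for name, _ in node.bindings}
    return variables(node.left) | variables(node.right)


def assignments(node):
    """BindingSetAssignment nodes reachable through Join nodes only."""
    if isinstance(node, BindingSetAssignment):
        return [node]
    if isinstance(node, Join):
        return assignments(node.left) + assignments(node.right)
    return []


def duplicate_into_optional(left_join):
    """Copy each BSA from the left arg into the right arg if they share a variable."""
    right = left_join.right
    for bsa in assignments(left_join.left):
        if variables(bsa) & variables(right):
            right = Join(bsa, right)
    return LeftJoin(left_join.left, right)


bsa = BindingSetAssignment((("type", "owl:Class"),))
plan = LeftJoin(
    Join(bsa, StatementPattern("?parent", "rdf:type", "?type")),
    Join(
        StatementPattern("?child", "rdf:type", "?type"),
        StatementPattern("?child", "rdfs:subClassOf", "?parent"),
    ),
)
rewritten = duplicate_into_optional(plan)
```

The left argument is left untouched; the right argument becomes a Join of the copied BSA with the original right-hand join, which is the shape of the hand-edited plan.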

@hmottestad
Contributor

In this case I think it's an issue with the join optimizer. The join optimizer doesn't really need to be limited by scoping issues when it calculates cardinality, as long as any optimizations are legal.

The join optimizer could assume that the binding set assignment will apply and use it to both calculate the size and to ignore the positive effect that binding the ?type variable early will have.

abrokenjester and others added 3 commits January 21, 2023 10:08

This is the generically correct way to parse VALUES clauses. An
optimizer can potentially look at the ordering in the algebra to push
the values clause down into the join tree (by inspecting which parts of
the tree have variables bound in the VALUES clause).

This benchmark uses generated data conforming to the query pattern, and
executes performance tests on both the variant with a VALUES clause, and
(as a baseline) the simple equivalent query.

Unfortunately, so far I have been unable to reproduce any significant performance difference.
abrokenjester force-pushed the GH-4232-inlinedata-parsing branch from 60e2402 to 33ce6d6 on January 20, 2023 21:08
@abrokenjester
Contributor Author

abrokenjester commented Jan 20, 2023

> In this case I think it's an issue with the join optimizer. The join optimizer doesn't really need to be limited by scoping issues when it calculates cardinality, as long as any optimizations are legal.
>
> The join optimizer could assume that the binding set assignment will apply and use it to both calculate the size and to ignore the positive effect that binding the ?type variable early will have.

I don't quite understand what you're getting at. As far as I can tell the problem is that the join optimizer calculates an initial cost for the (?child, type, ?type) pattern that is lower than it should be due to its base cardinality being lower than that of the (?child, subClassOf, ?parent) pattern. I'm struggling to see how to accurately compensate here though, to be honest. How can we inform it at the time of cost calculation that ?type is different from ?parent because the former is set by a BSA?
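
One possible reading of the suggestion, as a rough sketch only: a variable set by a single-row BSA has a known value, so it can be substituted into the pattern before asking for its cardinality, while a variable bound at runtime by the left arg (here ?parent) has an unknown value and can only be discounted heuristically. The Python below is a toy cost model, not RDF4J's statistics or join optimizer. Only 508, 504 and ~9.9K come from the explanations above; the distinct-object counts and the uniformity heuristic are assumptions made up for illustration.

```python
# Toy statistics, NOT RDF4J code. 508, 504 and ~9.9K are taken from the
# query explanations; the distinct-object counts are hypothetical.
CARDINALITY = {
    ("?", "rdf:type", "?"): 508,
    ("?", "rdf:type", "owl:Class"): 504,
    ("?", "rdfs:subClassOf", "?"): 9900,
}
DISTINCT_OBJECTS = {"rdf:type": 5, "rdfs:subClassOf": 990}  # assumed


def estimate(pattern, known_values, join_bound):
    """Estimated rows for one evaluation of `pattern`.

    known_values: variables whose value is fixed up front (single-row BSA).
    join_bound:   variables bound at runtime by the left arg, value unknown.
    Only object positions are discounted; subjects are ignored in this sketch.
    """
    subj, pred, obj = (known_values.get(term, term) for term in pattern)
    key = tuple("?" if term.startswith("?") else term for term in (subj, pred, obj))
    rows = CARDINALITY[key]
    if obj in join_bound:
        # Uniformity assumption: rows are spread evenly over the distinct objects.
        rows /= DISTINCT_OBJECTS[pred]
    return rows


def order(patterns, known_values, join_bound):
    """Cheapest-first ordering of the patterns under the toy estimate."""
    return sorted(patterns, key=lambda p: estimate(p, known_values, join_bound))


type_pattern = ("?child", "rdf:type", "?type")
subclass_pattern = ("?child", "rdfs:subClassOf", "?parent")
patterns = [type_pattern, subclass_pattern]

# Ignoring the outer scope: 508 vs 9900, so the type pattern goes first.
naive_order = order(patterns, known_values={}, join_bound=set())

# Using the BSA value and the left-arg binding: 504 vs 10, so subClassOf goes first.
informed_order = order(
    patterns, known_values={"?type": "owl:Class"}, join_bound={"?parent"}
)
```

Under these assumed numbers the informed ordering puts the subClassOf pattern first, which is the order the simple equivalent query ends up with.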

abrokenjester force-pushed the GH-4232-inlinedata-parsing branch from 33ce6d6 to 7794d0f on January 20, 2023 23:36

Successfully merging this pull request may close these issues.

Performance regression for queries using OPTIONAL and VALUES in 3.7.4/4.1.0 vs 2.5.5
3 participants