[Discussion] Higher performance "remote" validation #226

Open
ashleysommer opened this issue Apr 10, 2024 · 5 comments

Comments

@ashleysommer
Collaborator

ashleysommer commented Apr 10, 2024

This is something I've been thinking about for a while, since it was originally introduced in #174

The issue is PySHACL is primarily designed to run on Graphs in memory using the RDFLib memory store. There are two primary reasons for this:

  1. PySHACL copies the target datagraph into a new in-memory graph and operates on that copy, to avoid polluting the input graph.
  2. PySHACL uses native RDFLib graph operations (eg, graph.subjects(), graph.objects(), graph.rest()). These are atomic graph operations that read directly from the underlying graph store, and they are hand-built and hand-tweaked for each SHACL constraint to achieve maximum performance.

These two concerns do not translate well to "remote" graphs, where remote means graphs that are not in an RDFLib local store but live in a Graph Store service and are accessed via a SPARQL endpoint. This can be the case if you're validating against a sparqlstore or sparqlconnector graph in RDFLib, or using SPARQLWrapper on your graph.

In the remote case, it is not efficient and often not desirable (or not possible) to create a full in-memory working-copy of the remote graph into a memory-backed rdflib graph. And it is very bad for performance if we're running atomic graph lookup operations via the SPARQL connector, because this results in tens or hundreds of individual synchronous SPARQL queries executed against the remote graph for each constraint evaluated.
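To make the round-trip cost concrete, here is a minimal toy model. It is only a sketch: `ToyEndpoint` and both helper functions are hypothetical names that stand in for a remote SPARQL service, and a "round trip" is just a counter, not a real network call. It checks a minCount-style constraint once with one lookup per focus node and once with a single batched query.

```python
# Toy model only: ToyEndpoint is hypothetical and simulates a remote service.
class ToyEndpoint:
    def __init__(self, triples):
        self.triples = set(triples)
        self.round_trips = 0  # number of simulated network requests

    def objects(self, subject, predicate):
        """One atomic lookup costs one simulated round trip."""
        self.round_trips += 1
        return [o for (s, p, o) in self.triples if s == subject and p == predicate]

    def count_values(self, subjects, predicate):
        """One batched query (think GROUP BY) costs one simulated round trip."""
        self.round_trips += 1
        counts = {s: 0 for s in subjects}
        for (s, p, _o) in self.triples:
            if p == predicate and s in counts:
                counts[s] += 1
        return counts


def violations_atomic(endpoint, focus_nodes, predicate, min_count):
    # One request per focus node, like graph.objects() over a SPARQL connector.
    return [n for n in focus_nodes if len(endpoint.objects(n, predicate)) < min_count]


def violations_batched(endpoint, focus_nodes, predicate, min_count):
    # A single request that returns all the counts at once.
    counts = endpoint.count_values(focus_nodes, predicate)
    return [n for n in focus_nodes if counts[n] < min_count]


# 100 people; every tenth one has no ex:name value.
triples = [(f"ex:person{i}", "ex:name", f"Name {i}") for i in range(100) if i % 10 != 0]
focus_nodes = [f"ex:person{i}" for i in range(100)]

atomic = ToyEndpoint(triples)
batched = ToyEndpoint(triples)
atomic_result = violations_atomic(atomic, focus_nodes, "ex:name", 1)
batched_result = violations_batched(batched, focus_nodes, "ex:name", 1)
print(atomic.round_trips, batched.round_trips)
```

Both strategies report the same focus nodes, but the atomic version issues one request per focus node while the batched version issues one in total.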

So I'm proposing a new mode of operation for PySHACL, some kind of "SPARQL-optimised" form, or "remote" mode, that will cause PySHACL to use purpose-built SPARQL queries to perform validation instead of RDFLib graph operations. This will be an implementation of the "driver only" interpretation of PySHACL as proposed in #174. The key distinction is that this new mode will not replace the normal operating mode of PySHACL, and will not affect performance for users who primarily use in-memory graph validation.
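As an illustration of what a purpose-built query could look like, the sketch below builds a single SELECT query for a minCount-style check. The function name `min_count_query` is hypothetical and the query shape is an assumption made for this example; it is not the query text PySHACL generates. It also assumes the inputs are trusted full IRIs and that targets are selected by a plain `rdf:type` match with no subclass reasoning.

```python
def min_count_query(target_class: str, path: str, min_count: int) -> str:
    """Build one query that returns focus nodes with fewer than min_count values.

    Hypothetical helper: target_class and path must be trusted full IRIs,
    since no escaping or validation is done here.
    """
    if min_count < 1:
        raise ValueError("a min_count below 1 can never be violated")
    return (
        "SELECT ?focus (COUNT(?value) AS ?n)\n"
        "WHERE {\n"
        f"  ?focus a <{target_class}> .\n"
        # OPTIONAL keeps focus nodes that have no value at all (count 0).
        f"  OPTIONAL {{ ?focus <{path}> ?value }}\n"
        "}\n"
        "GROUP BY ?focus\n"
        f"HAVING (COUNT(?value) < {min_count})"
    )


query = min_count_query("http://example.org/Person", "http://example.org/name", 1)
print(query)
```

The whole constraint is evaluated by the remote service in one request, and only the violating focus nodes travel back over the wire.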

There are some questions to think about:

  1. Could this be a commandline switch or validator argument? Something the user can switch on manually? Or should it be auto-detected if the user is passing in a sparqlconnector, sparqlstore or SPARQLWrapper graph? Can we simply use a https:// SPARQL endpoint as the graph argument on the commandline and have it work automatically?
  2. As we're not creating a working-copy of the graph for the validation, does that mean we must avoid polluting the source graph? That means we cannot do any OWL/RDFS inferencing, no SHACL Rules can be applied in this mode, and SHACL functions must also be turned off (as these can pollute the graph too) in remote mode.
  3. Are there some cases when we do want to pollute the graph? Eg, trying to use PySHACL as a SHACL Rule engine, where you do want the new triples to appear in the source graph. This doesn't make sense to do in an in-memory local graph, but I see the utility of doing it on a remote.
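One way to picture the first two questions is the sketch below. Everything in it is hypothetical (`ValidationOptions`, `choose_mode`, `check_remote_options` and their fields are not PySHACL names). It assumes an explicit switch takes priority and that URL-scheme detection is only a fallback, since an http(s) URL could just as well point at a downloadable RDF file rather than a SPARQL endpoint.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class ValidationOptions:
    # Hypothetical option names, for illustration only.
    inference: str = "none"  # eg "none", "rdfs", "owlrl"
    advanced: bool = False   # SHACL Rules and SHACL functions


def looks_like_http_url(data_graph_arg: str) -> bool:
    parsed = urlparse(data_graph_arg)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def choose_mode(data_graph_arg, remote=None):
    """An explicit switch wins; otherwise fall back to a URL-scheme heuristic."""
    if remote is not None:
        return "remote" if remote else "local"
    return "remote" if looks_like_http_url(data_graph_arg) else "local"


def check_remote_options(mode, options):
    """List the options that would pollute the source graph in remote mode."""
    problems = []
    if mode == "remote":
        if options.inference != "none":
            problems.append("inferencing would add triples to the source graph")
        if options.advanced:
            problems.append("SHACL Rules and functions can add triples to the source graph")
    return problems


opts = ValidationOptions(inference="rdfs", advanced=True)
print(choose_mode("https://example.org/sparql"), check_remote_options("remote", opts))
```

Reporting the conflicting options, rather than silently switching them off, keeps the read-only guarantee visible to the user.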
@ashleysommer ashleysommer changed the title Higher performance "remote" validation [Discussion] Higher performance "remote" validation Apr 10, 2024
@ashleysommer
Collaborator Author

Note, the PR to address this discussed feature was merged in #233.
Is there any reason to keep this discussion topic open?

@floresbakker

Yes Ashley, perhaps to address the third question you raised yourself? :) I understand from the PR that there is no option to modify the remote source data graph. I am a heavy user of your Advanced Features implementation and would love to see performance improvements in that area. For larger datasets, modifying data in a remote source data graph can perform better than in-memory handling. What are your thoughts on this? :)

@ashleysommer
Collaborator Author

ashleysommer commented Oct 2, 2024

@floresbakker

You're right. When that PR was merged there was a deliberate choice to make the sparql-mode strictly read-only.

There are two major reasons for that:

  1. The architecture of the PySHACL codebase was designed and built from the start to only be a SHACL Validator, with the goal of never modifying the user's DataGraph. This directive is in the W3C SHACL Specification, and PySHACL follows that strictly.
  • Since the release of PySHACL in 2017, many users have asked, for various reasons, to ignore the SHACL Spec and deliberately modify their DataGraph. This was hacky to implement, and there are now some unsupported workarounds in PySHACL to do this (aka "inplace mode"), but they are not recommended.
  • When the SHACL Advanced-Features spec was released, the introduction of SHACL Rules meant users might want to run SHACL Rules expansion directly on the input DataGraph, using PySHACL validation only for the side-effect of expanding their datagraph. Again, this takes advantage of the hacky workarounds ("inplace mode") in the PySHACL Validator architecture.
  • The SPARQL-mode remote-DataGraph implementation is only in the PySHACL main codepath, meaning it is not available in "inplace" mode. Adding this would add much more complexity to PySHACL's graph management routines.
  2. Write-endpoints for SPARQL backends from different vendors are inconsistent. PySHACL uses the built-in rdflib SPARQLConnector backend to create a read-only graph-source based on a queryable SPARQL Endpoint, but SPARQL services often have a different endpoint for writing, and require specific write patterns, eg, RDF4J API, transactions, locking, commits, POST vs GET, payload content type (raw vs x-www-form-urlencoded).
    Supporting all of those different write-endpoint kinds and patterns in PySHACL would be a nightmare, or require the use of a third-party dependency like SPARQLWrapper.
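The "raw vs urlencoded" payload difference can be sketched with the standard library alone. The SPARQL 1.1 Protocol defines two ways to POST an update: the update string sent directly as the body, or as a form field named `update`. The endpoint URL and function names below are hypothetical, and the requests are only constructed, never sent; vendor-specific concerns such as transactions, locking and authentication are not covered.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical write endpoint; real services often use a separate update URL.
ENDPOINT = "https://example.org/repositories/demo/statements"
UPDATE = 'INSERT DATA { <http://example.org/s> <http://example.org/p> "o" }'


def update_request_direct(endpoint: str, update: str) -> Request:
    """Update via POST directly: the raw update string is the request body."""
    return Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


def update_request_form(endpoint: str, update: str) -> Request:
    """Update via URL-encoded POST: the update goes in a form field named 'update'."""
    return Request(
        endpoint,
        data=urlencode({"update": update}).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


direct = update_request_direct(ENDPOINT, UPDATE)
form = update_request_form(ENDPOINT, UPDATE)
print(direct.data)
print(form.data)
```

Even in this smallest case a client has to know which of the two encodings a given service accepts, before any of the vendor-specific patterns come into play.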

@floresbakker

Indeed, the concept of remote endpoints can make things quite complex, taking into account the variety of triple stores and services that can be found nowadays. Would the option of supporting a remote datagraph that is facilitated only by RDFLib be helpful here, or am I missing the point (and I fear I do)?

For me personally, I am using PyShacl and RDFlib for research, prototyping and demonstrations, and not for production purposes. That does not take away the importance of these two libraries for me; they are instrumental in getting things done. Anything that could help and improve showing the powers of SHACL through the use of PyShacl and RDFlib is a good thing in my book. Performance is one of the challenges, especially as I tend to work with many focus nodes & complex rules in a SHACL run. Correctness of the engine is another focal point.

@ashleysommer
Collaborator Author

ashleysommer commented Oct 7, 2024

"a remote datagraph that is facilitated only by RDFLib be helpful here, or am I missing the point"

Indeed. The point of enabling remote validation is for the cases where the datagraph is already in another (non-RDFLib) system (eg. GraphDB, Stardog, Jena, Neptune, etc). If the datagraph is loaded in a remote rdflib instance, then PySHACL could be run directly in-memory in that system, no need for remote validation.

This feature is for those who have moved out of the research/prototype phase and into production usage. It is common to do dev and prototype work with the RDFLib in-memory store, and move to a high performance triplestore for production use. The merged changes enable running PySHACL against those backends.

I understand every usecase is different, but in your case of doing research, prototyping, and demonstrations, I question why the ability to validate (and potentially modify) data on remote SPARQL service endpoints is required.

2 participants