[Discussion] Higher performance "remote" validation #226
Note: the PR addressing this feature was merged in #233.
Yes Ashley, perhaps to address the third question you raised yourself? :) I understand from the PR that there is no option to modify the remote source data graph. I am a heavy user of your Advanced Features implementation and would love to see performance improvements in that area. Modifying data in a remote source data graph can perform better for larger datasets than in-memory handling. What are your thoughts on this? :)
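For illustration, here is a minimal sketch of the kind of remote modification being asked about: a SHACL-AF style rule pushed to the store as a single SPARQL UPDATE, so inferred triples are written server-side in one round-trip. This is not what PySHACL's merged SPARQL mode does (that mode is read-only); the endpoint URLs and the `ex:` vocabulary are placeholders.

```python
# Sketch only: a SHACL-AF style rule computed entirely in the remote store,
# instead of materialising results in a local RDFLib graph first.
# PySHACL's merged SPARQL mode deliberately does NOT do this (read-only).
from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

store = SPARQLUpdateStore(
    query_endpoint="https://example.org/repo/query",    # placeholder
    update_endpoint="https://example.org/repo/update",  # placeholder
)
remote = Graph(store=store)

# One round-trip does the work server-side, rather than N local inserts.
remote.update("""
    PREFIX ex: <http://example.org/>
    INSERT { ?gp ex:grandparentOf ?child }
    WHERE  { ?gp ex:parentOf ?p . ?p ex:parentOf ?child }
""")
```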
You're right. When that PR was merged, there was a deliberate choice to make the sparql-mode strictly read-only. There are two major reasons for that:
Indeed, the concept of remote endpoints can make things quite complex, taking into account the variety of triple stores and services found nowadays. Would the option of supporting a remote datagraph facilitated only by RDFLib be helpful here, or am I missing the point (and I fear I am)? Personally, I use PySHACL and RDFLib for research, prototyping and demonstrations, not for production purposes. That does not take away the importance of these two libraries for me; they are instrumental in getting things done. Anything that could help show the power of SHACL through PySHACL and RDFLib is a good thing in my book. Performance is one of the challenges, especially as I tend to work with many focus nodes and complex rules in a SHACL run. Correctness of the engine is another focal point.
Indeed. The point of enabling remote validation is for cases where the datagraph is already in another (non-RDFLib) system (e.g. GraphDB, Stardog, Jena, Neptune, etc.). If the datagraph is loaded in a remote RDFLib instance, then PySHACL could be run directly in-memory in that system, with no need for remote validation. This feature is for those who have moved out of the research/prototype phase and into production usage. It is common to do dev and prototype work with the RDFLib in-memory store, then move to a high-performance triplestore for production use. The merged changes enable running PySHACL against those backends. I understand every use case is different, but in your case of doing research, prototyping, and demonstrations, I question why the ability to validate (and potentially modify) data on remote SPARQL service endpoints is required.
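Concretely, remote validation in the merged feature looks roughly like the sketch below: the data graph is an RDFLib graph backed by a remote SPARQL endpoint, the shapes graph stays local, and validation runs read-only against the endpoint. The `sparql_mode` keyword is my reading of the merged PR and the endpoint URL is a placeholder; check the PySHACL changelog for the exact API.

```python
# A minimal sketch of read-only remote validation, assuming the API from the
# merged PR. The sparql_mode keyword name is an assumption.
from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore
from pyshacl import validate

# Data graph lives in a remote triplestore, accessed via its SPARQL endpoint.
data_graph = Graph(store=SPARQLStore(query_endpoint="https://example.org/repo/query"))

# Shapes graph stays local and in-memory.
shapes = Graph().parse("shapes.ttl", format="turtle")

conforms, results_graph, results_text = validate(
    data_graph,
    shacl_graph=shapes,
    sparql_mode=True,  # assumed flag: query the endpoint, never copy the graph
)
print(results_text)
```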
This is something I've been thinking about for a while, since it was originally introduced in #174.
The issue is that PySHACL is primarily designed to run on Graphs in memory using the RDFLib memory store. There are two primary reasons for this:
These two concerns do not translate well to "remote" graphs, where remote means graphs that are not in an RDFLib local store; they live in a Graph Store service and are accessed via a SPARQL endpoint. This can be the case if you're validating against a `sparqlstore` or `sparqlconnector` graph in RDFLib, or using a `SparqlWrapper()` on your graph.

In the remote case, it is not efficient and often not desirable (or not possible) to create a full in-memory working copy of the remote graph in a memory-backed RDFLib graph. And it is very bad for performance if we're running atomic graph lookup operations via the SPARQL connector, because this results in tens or hundreds of individual synchronous SPARQL queries executed against the remote graph for each constraint evaluated.
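To make that cost concrete, here is a sketch of what naive constraint evaluation over a `SPARQLStore`-backed graph looks like; every atomic triple-pattern lookup below is its own synchronous HTTP SPARQL query. The endpoint URL and the `ex:` vocabulary are placeholders.

```python
# Illustration of the performance problem: each graph operation against a
# SPARQLStore-backed graph becomes a separate synchronous SPARQL query.
from rdflib import Graph, Namespace, RDF
from rdflib.plugins.stores.sparqlstore import SPARQLStore

EX = Namespace("http://example.org/")
g = Graph(store=SPARQLStore(query_endpoint="https://example.org/repo/query"))

# One HTTP round-trip to enumerate the focus nodes...
for person in g.subjects(RDF.type, EX.Person):
    # ...then another round-trip per focus node, per constraint evaluated,
    # e.g. a naive sh:minCount 1 check on ex:name:
    if len(list(g.objects(person, EX.name))) < 1:
        print(f"{person} violates minCount 1 on ex:name")
```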
So I'm proposing a new mode of operation for PySHACL, some kind of "SPARQL-optimised" or "remote" mode that will cause PySHACL to use purpose-built SPARQL queries to perform validation instead of RDFLib graph operations. This would be an implementation of the "driver only" interpretation of PySHACL as proposed in #174. The key distinction is that this new mode will not replace the normal operating mode of PySHACL, and will not affect performance for users who primarily use in-memory graph validation.
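As a rough illustration of the "purpose-built SPARQL queries" idea (not PySHACL's actual internals), the same `sh:minCount` check from the sketch above collapses into a single query that returns every violating focus node in one round-trip. The query shape, function name, and `ex:` vocabulary are illustrative only.

```python
# Sketch: one purpose-built query per constraint, instead of one atomic
# lookup per focus node. Runs server-side on the remote endpoint.
MIN_COUNT_VIOLATIONS = """
    PREFIX ex: <http://example.org/>
    SELECT ?focus WHERE {
        ?focus a ex:Person .
        FILTER NOT EXISTS { ?focus ex:name ?value }
    }
"""

def min_count_violations(graph):
    """One SPARQL round-trip instead of one query per focus node."""
    return [row.focus for row in graph.query(MIN_COUNT_VIOLATIONS)]
```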
There are some questions to think about:

- Should the new mode be enabled automatically when PySHACL detects the datagraph is a `sparqlconnector`, `sparqlstore` or `SparqlWrapper` graph?
- Can we simply use a `https://` SPARQL endpoint as the graph argument on the commandline and have it work automatically? (One possible shape for this is sketched below.)
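One possible answer to the auto-detection question, sketched under the assumption that inspecting the graph's backing store class is acceptable; the helper names here are hypothetical and not part of PySHACL.

```python
# Hypothetical helpers for auto-detecting remote graphs and accepting an
# endpoint URL directly on the commandline.
from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore

def looks_remote(graph: Graph) -> bool:
    """True when the graph is backed by a SPARQL endpoint, not local memory."""
    return isinstance(graph.store, SPARQLStore)

def graph_from_cli_arg(arg: str) -> Graph:
    """Treat an http(s):// commandline argument as a SPARQL endpoint URL."""
    if arg.startswith(("http://", "https://")):
        return Graph(store=SPARQLStore(query_endpoint=arg))
    return Graph().parse(arg)  # otherwise parse as a local file
```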