Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put 0 as explicit lower bound in regex patterns #342

Open
mhrimaz opened this issue Aug 19, 2024 · 9 comments
Open

Put 0 as explicit lower bound in regex patterns #342

mhrimaz opened this issue Aug 19, 2024 · 9 comments

Comments

@mhrimaz
Copy link

mhrimaz commented Aug 19, 2024

@s-heppner mentioned in admin-shell-io/aas-specs#426 (comment) that zero is optional in regex quantifiers. I don't think that is true, at least in Java this doesn't work.

image

This occurs in language tags and probably other places: https://github.com/aas-core-works/aas-core-meta/blob/31d6afd348a86f43cc658cc5bffab49f1e47bd24/aas_core_meta/v3_1.py#L286C6-L286C46

@mristin
Copy link
Contributor

mristin commented Aug 19, 2024

@mhrimaz There is no single regex language, unfortunately. We correctly transpile the patterns in Java:
https://github.com/aas-core-works/aas-core3.0-java/blob/4aa354e023b8f003fff4706c4bf519e22fb09d38/src/main/java/aas_core/aas3_0/verification/Verification.java#L211

In most regex pattern languages (or better said, dialects?), zero is indeed optional. It all depends on which regex language and which regex engine you use.

Can you please elaborate a bit more on the issue?

@mhrimaz
Copy link
Author

mhrimaz commented Aug 19, 2024

Well, if you want to validate the shacl shape(https://github.com/admin-shell-io/aas-specs/blob/9a118059bea5e9cf12405db7c871b5d5869acaa4/schemas/rdf/shacl-schema.ttl#L39), using a java shacl validator like Apache Jena, the issue appears again.

So I think having 0 is indeed better—more explicit and probably less troublesome. Does adding zero cause any issues with other engines?

@mristin
Copy link
Contributor

mristin commented Aug 20, 2024

@mhrimaz:

So I think having 0 is indeed better—more explicit and probably less troublesome. Does adding zero cause any issues with other engines?

Not as far as I know. @sebbader-sap @sebbader @s-heppner can you please double-check and let me know how to change the schemas?

XSD and JSON Schema probably suffer from this issue in Java as well?

@mhrimaz please also note that the Java standard regex library runs in exponential time. For some patterns, you have to use another library. This is fixed in the aas-core-java (and aas-core-cpp).

@s-heppner
Copy link
Collaborator

I'd be very careful with any Regex changes, they seem to often make the problem worse than better.
However, I will bring this up in the respective IDTA working groups and will get back to you with an answer.

@mhrimaz
Copy link
Author

mhrimaz commented Aug 20, 2024

XSD and JSON Schema probably suffer from this issue in Java as well?

In official repo (https://github.com/admin-shell-io/aas-specs/blob/9a118059bea5e9cf12405db7c871b5d5869acaa4/schemas/json/aas.json#L39) and in code-gen repo (https://github.com/aas-core-works/aas-core-codegen/blob/e1d8c19a08afaa18485c78fe1ff90934daa5decd/test_data/jsonschema/test_main/aas_core_meta.v3/expected_output/schema.json#L39C66-L40C11) language regex have zero in regex qunatifier ({0,2})

Same in XSD. Apparently, the regexes are not the same in SHACL.

@mristin
Copy link
Contributor

mristin commented Aug 20, 2024

@mhrimaz do I understand correctly: this issue affects only SHACL?

@mristin mristin changed the title Invalid regex pattern for language tags Put 0 as explicit lower bound in regex patterns Aug 20, 2024
@mhrimaz
Copy link
Author

mhrimaz commented Aug 20, 2024

@mristin
Copy link
Contributor

mristin commented Aug 20, 2024

@mhrimaz thanks for fdigging into this! Aas-core-meta is fine as it is parsed, and Python requires no zeros.

I'll fix SHACL then soon, and @s-heppner can put it then on aas-specs repo.

mristin added a commit to aas-core-works/aas-core-codegen that referenced this issue Aug 22, 2024
We included the regex pattern as-is from the input. Instead, with this
patch, we parse it from the input and re-render it into the canonical
form so that many more regex engines can work with it.

For example, in the input, we omit the minimum bound 0 (*e.g.*,
``{,4}``), which breaks with the Java regex engine beneath
the SHACL validator. Now, the pattern is correctly rendered with an
explicit 0 (``{0,4}``).

Discovered in [aas-core-meta issue 342].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
@mristin
Copy link
Contributor

mristin commented Aug 22, 2024

@mhrimaz I made the fix in aas-core-codegen so that we use the canonical regex renderer. Can you please test on your side that the canonical patterns work as expected? Here's the new SHACL:
https://github.com/aas-core-works/aas-core-codegen/pull/517/files#diff-123eac332ab2df0595966e11a9b259234031ebeae72724978db116956ac5f7e0

Once you give me a go, I'll merge the pull request in aas-core-codegen, and then we can update aas-specs repository.

mristin added a commit to aas-core-works/aas-core-codegen that referenced this issue Aug 26, 2024
We included the regex pattern as-is from the input. Instead, with this
patch, we parse it from the input and re-render it into the form that we
also use in Java. We pick Java regex dialect as most SHACL validators in
the wild rely on Java as platform, so we decide to serve this user base
with priority.

For example, in the input meta-model specification, we omit the minimum
bound 0 (*e.g.*, ``{,4}``), which breaks with the Java regex engine
beneath the SHACL validator. Now, the pattern is correctly rendered with an
explicit 0 (``{0,4}``).

Discovered in [aas-core-meta issue 342].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
mristin added a commit to aas-core-works/aas-core-codegen that referenced this issue Aug 27, 2024
We included the regex pattern as-is from the input. Instead, with this
patch, we parse it from the input and re-render it into the form that we
also use in Java. We pick Java regex dialect as most SHACL validators in
the wild rely on Java as platform, so we decide to serve this user base
with priority.

For example, in the input meta-model specification, we omit the minimum
bound 0 (*e.g.*, ``{,4}``), which breaks with the Java regex engine
beneath the SHACL validator. Now, the pattern is correctly rendered with an
explicit 0 (``{0,4}``).

Discovered in [aas-core-meta issue 342].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
mristin added a commit to aas-core-works/aas-core-codegen that referenced this issue Aug 29, 2024
We included the regex pattern as-is from the input which caused problems
with the regex engines as the patterns in the meta-model are written in
a Python dialect (and assuming that the regex engine works on UTF-32
characters). However, most regex engines in the wild operating on SHACL
(*e.g.*, Java SHACL validators) use UTF-16 to represent the text and do
not support some parts of the Python regex dialect. For example, in
the input meta-model specification, we omit the minimum bound 0
(*e.g.*, ``{,4}``), which breaks with the Java regex engine beneath
the SHACL validator.

Instead, with this patch, we parse the pattern from the specification
and re-render it into the form that we also use in JSON Schema. We pick
JSON Schema regex dialect as most SHACL validators in the wild can deal
with it, in particular those based on Java as a platform. Hence, we
decide to serve this user base with priority.

Discovered in [aas-core-meta issue 342].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
mristin added a commit to aas-core-works/aas-core-codegen that referenced this issue Aug 29, 2024
We included the regex pattern as-is from the input which caused problems
with the regex engines as the patterns in the meta-model are written in
a Python dialect (and assuming that the regex engine works on UTF-32
characters). However, most regex engines in the wild operating on SHACL
(*e.g.*, Java SHACL validators) use UTF-16 to represent the text and do
not support some parts of the Python regex dialect. For example, in
the input meta-model specification, we omit the minimum bound 0
(*e.g.*, ``{,4}``), which breaks with the Java regex engine beneath
the SHACL validator.

Instead, with this patch, we parse the pattern from the specification
and re-render it into the form that we also use in JSON Schema. We pick
JSON Schema regex dialect as most SHACL validators in the wild can deal
with it, in particular those based on Java as a platform. Hence, we
decide to serve this user base with priority.

Discovered in [aas-core-meta issue 342].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
s-heppner added a commit to admin-shell-io/aas-specs that referenced this issue Aug 30, 2024
We included the regex pattern as-is from the input which caused problems
with the regex engines as the patterns in the meta-model are written in
a Python dialect (and assuming that the regex engine works on UTF-32
characters). However, most regex engines in the wild operating on SHACL
(*e.g.*, Java SHACL validators) use UTF-16 to represent the text and do
not support some parts of the Python regex dialect. For example, in
the input meta-model specification, we omit the minimum bound 0
(*e.g.*, ``{,4}``), which breaks with the Java regex engine beneath
the SHACL validator.

Instead, with this patch, we parse the pattern from the specification
and re-render it into the form that we also use in JSON Schema. We pick
JSON Schema regex dialect as most SHACL validators in the wild can deal
with it, in particular those based on Java as a platform. Hence, we
decide to serve this user base with priority.

Discovered in [aas-core-meta issue 342].
Fixed in [aas-core-codegen commit e22cc].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
[aas-core-codegen commit e22cc]: aas-core-works/aas-core-codegen@e22ccae
s-heppner added a commit to admin-shell-io/aas-specs that referenced this issue Nov 15, 2024
We included the regex pattern as-is from the input which caused problems
with the regex engines as the patterns in the meta-model are written in
a Python dialect (and assuming that the regex engine works on UTF-32
characters). However, most regex engines in the wild operating on SHACL
(*e.g.*, Java SHACL validators) use UTF-16 to represent the text and do
not support some parts of the Python regex dialect. For example, in
the input meta-model specification, we omit the minimum bound 0
(*e.g.*, ``{,4}``), which breaks with the Java regex engine beneath
the SHACL validator.

Instead, with this patch, we parse the pattern from the specification
and re-render it into the form that we also use in JSON Schema. We pick
JSON Schema regex dialect as most SHACL validators in the wild can deal
with it, in particular those based on Java as a platform. Hence, we
decide to serve this user base with priority.

Discovered in [aas-core-meta issue 342].
Fixed in [aas-core-codegen commit e22cc].

[aas-core-meta issue 342]: aas-core-works/aas-core-meta#342
[aas-core-codegen commit e22cc]: aas-core-works/aas-core-codegen@e22ccae
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants