Add Explanation of RegEx and Reason for AASd-130 #381

Open
wants to merge 8 commits into
base: IDTA-01001-3-1_working

Conversation

g1zzm0
Collaborator

@g1zzm0 g1zzm0 commented Mar 8, 2024

This was necessary because there was broad disagreement among
committees and individuals about what the AASd-130 constraint says,
what the RegEx in AASd-130 means, and why the constraint was defined
in the first place.

@g1zzm0 g1zzm0 added the labels "documentation" (Improvements or additions to documentation) and "requires workstream approval" (strategic decision in spec team needed) on Mar 8, 2024
Collaborator

@mristin mristin left a comment

Please see the comments and suggestions.

Comment on lines +75 to +77
The string can include common characters like tabs, newlines, carriage returns, and spaces.
It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP) which are represented using surrogate pairs in UTF-16 encoding.
It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.
Collaborator

Suggested change (remove these three lines):
The string can include common characters like tabs, newlines, carriage returns, and spaces.
It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP) which are represented using surrogate pairs in UTF-16 encoding.
It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.

Collaborator Author

@g1zzm0 g1zzm0 Mar 8, 2024

I think someone will need this

Collaborator

The decision was to support what XML Schema 1.0 supports. Marko suggests restricting it further, correct?

Contributor

No, I don't think that removing these three lines would restrict anything further as they are only explanatory for the reader.

Contributor

"It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages."

Recalling our discussion, we could reformulate this sentence a bit to make it more specific:
"It assumes that the entire string adheres to the rules of UTF-16 encoding, which is the current standard way of representing a wide range of characters from different languages."

As far as I understand the context, a UTF-32-enabled application would represent a file slightly differently, with no surrogate pairs needed, and therefore the regex pattern representing this constraint would need to look different for it. But the whole UTF-16 vs. UTF-32 separation does not affect the constraint itself but its representation in the schemas.
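
The distinction drawn in this comment can be sketched in Python, whose `re` module matches on Unicode code points rather than UTF-16 code units. This is a minimal sketch under stated assumptions, not the pattern used in the schemas: `CODE_POINT_PATTERN` is an assumed code-point rendering of the AASd-130 character set, and `matches_as_utf16_units` is a hypothetical helper that applies the same rule to UTF-16 code units, where characters beyond the BMP show up as surrogate pairs.

```python
import re
import struct

# Assumed code-point form of the AASd-130 character set. Python's `re`
# needs \UXXXXXXXX (eight hex digits) for code points above U+FFFF.
CODE_POINT_PATTERN = re.compile(
    r"^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$"
)


def matches_as_code_points(text: str) -> bool:
    """Check the text with a code-point based regex engine."""
    return CODE_POINT_PATTERN.match(text) is not None


def matches_as_utf16_units(text: str) -> bool:
    """Hypothetical helper: apply the same rule to UTF-16 code units."""
    data = text.encode("utf-16-le", errors="surrogatepass")
    units = struct.unpack(f"<{len(data) // 2}H", data)
    i = 0
    while i < len(units):
        unit = units[i]
        if unit in (0x09, 0x0A, 0x0D) or 0x20 <= unit <= 0xD7FF or 0xE000 <= unit <= 0xFFFD:
            i += 1  # allowed BMP character, one code unit
        elif (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2  # well-formed surrogate pair, two code units
        else:
            return False
    return True


for sample in ["plain text", "emoji \U0001F600", "NUL \x00", "noncharacter \uFFFE"]:
    print(repr(sample), matches_as_code_points(sample), matches_as_utf16_units(sample))
```

Both functions are meant to accept and reject the same strings; only the way the supplementary-plane range is written differs, which is the point made above about the constraint versus its representation.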

Contributor

"But the whole UTF-16 vs. UTF-32 separation does not affect the constraint itself but its representation in the schemas."

So how about we replace the above sentence with something like a design decision: "For the current versions of the specification, this constraint is represented as a regex pattern expecting UTF-16 compliant applications"?

Collaborator

"For the current versions"? What does this mean? It is not clear to me what we actually request and expect, today and in the future.

Contributor

Let me try another formulation:

"Note: The constraint AASd-130 is represented as a regex pattern expecting UTF-16 compliant applications. It might be necessary to adjust this pattern for UTF-32 compliant applications in future versions of this specification."

@BirgitBoss BirgitBoss added this to the V3.0.1 milestone Mar 27, 2024
@BirgitBoss
Collaborator

Is it V3.0.1 or V3.1? What is the related issue?

[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]: Defines a character class that allows various Unicode characters.

\x09: ASCII horizontal tab.
Collaborator

The following list seems to be redundant to the list above? Remove one of them?
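
For readers who want to check what the character class quoted above admits, a small Python sketch follows. It assumes that `\u00010000-\u0010FFFF` in the PR text denotes the range U+10000 to U+10FFFF, which Python's `re` writes as `\U00010000-\U0010FFFF`; the name `AASD_130` and the sample labels are illustrative only.

```python
import re

# Assumed Python rendering of the character class from the PR.
AASD_130 = re.compile(
    r"^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$"
)

samples = {
    "horizontal tab U+0009": "\u0009",
    "line feed U+000A": "\u000A",
    "carriage return U+000D": "\u000D",
    "space U+0020": "\u0020",
    "vertical tab U+000B": "\u000B",
    "NUL U+0000": "\u0000",
    "last allowed BMP character U+FFFD": "\uFFFD",
    "noncharacter U+FFFE": "\uFFFE",
    "lone surrogate U+D800": "\ud800",
    "first supplementary character U+10000": "\U00010000",
}

for label, char in samples.items():
    verdict = "allowed" if AASD_130.match(char) else "rejected"
    print(f"{label}: {verdict}")
```

The allowed ranges coincide with the `Char` production of XML 1.0, which fits the decision mentioned earlier in the thread to support what XML Schema 1.0 supports.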

@JoergNeidig

@sebbader-sap @sebbader Can you please review this issue?

Contributor

@sebbader-sap sebbader-sap left a comment

Note that the lists are not rendered properly in the GitHub preview. It could be that the Antora magic kicks in and solves it; I'm not sure.


This leads to the following regular expression:
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]$
Contributor

Suggested change
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]$
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$
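
The effect of the `*` added in this suggestion can be checked with a short Python sketch. The character class is an assumed Python rendering of the one in the PR (Python's `re` needs `\U00010000-\U0010FFFF` for the supplementary planes), and the names `single` and `repeated` are illustrative.

```python
import re

# Assumed Python rendering of the character class from the PR.
CHAR_CLASS = r"[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"

single = re.compile("^" + CHAR_CLASS + "$")     # pattern before the suggestion
repeated = re.compile("^" + CHAR_CLASS + "*$")  # pattern after the suggestion

for text in ["a", "ab", "", "a\x00b"]:
    print(
        repr(text),
        "single:", single.match(text) is not None,
        "repeated:", repeated.match(text) is not None,
    )
```

Without the quantifier the pattern describes one character rather than a whole string, so any multi-character value fails; with `*` a string of any length, including the empty string, passes as long as every character is in the class.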

@BirgitBoss BirgitBoss modified the milestones: V3.0.1, V3.1 Apr 12, 2024
Collaborator

@BirgitBoss BirgitBoss left a comment

@sebbader-sap
Contributor

The problem is that we can "serialise" AASd-130 into different regex patterns due to the fact that regex itself is underspecified. I tried to explain our decision for the UTF-16 representation a bit better, and that this representation might change when UTF-32-enabled regex libraries become more common.

@BirgitBoss
Collaborator

@g1zzm0 can you please have a look at the suggested changes of Sebastian and the merge?

BirgitBoss and others added 2 commits November 27, 2024 15:47
Co-authored-by: sebbader-sap <[email protected]>
Co-authored-by: sebbader-sap <[email protected]>
@BirgitBoss
Collaborator

@s-heppner could you please check what we have implemented in the schema? #381 (comment) Thank you

@s-heppner
Collaborator

In the v3.1 schema, the regex from the comment is implemented: aas-core-meta v3.1 L419

Is that as expected?

Labels
documentation Improvements or additions to documentation requires workstream approval strategic decision in spec team needed
6 participants