Add Explanation of RegEx and Reason for AASd-130 #381

Open
wants to merge 8 commits into
base: IDTA-01001-3-1_working
@@ -70,4 +70,38 @@ Note: The semanticId of a SpecificAssetId with the predefined name "gloablAssetI

{aasd130}

Constraint AASd-130 ensures that encoding and interoperability between different serializations is possible. It corresponds to the restrictions as defined for the XML Schema 1.0footnote:[https://www.w3.org/TR/xml/#charsets].

Therefore, we need to restrict an attribute of data type 'string' to the characters that can be represented in any exchange format and language.
Otherwise, strings in other formats such as JSON could not be converted to XML.

The string contains only valid Unicode characters, encoded in UTF-16 format.
The character set of XML includes (given as numerical code points and/or ranges in Unicode):

* 0x09: ASCII horizontal tab,
* 0x0A: ASCII linefeed (newline),
* 0x0D: ASCII carriage return,
* 0x20: ASCII space,
* 0x20 - 0xD7FF: the characters of the Basic Multilingual Plane below the surrogate block,
* 0xE000 - 0xFFFD: the characters of the Basic Multilingual Plane above the surrogate block, and
* 0x00010000-0x0010FFFF: all the characters beyond the Basic Multilingual Plane (*e.g.*, emoji).

The string can include common characters like tabs, newlines, carriage returns, and spaces.
It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP) which are represented using surrogate pairs in UTF-16 encoding.
It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.
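
As an illustration of the code point ranges listed above, here is a minimal Python sketch that checks a string code point by code point. The names `XML_CHAR_RANGES`, `is_xml_char`, and `satisfies_aasd_130` are hypothetical and not part of the specification. The ranges follow the `Char` production of XML 1.0, including 0xE000-0xFFFD, which also appears in the regular expression of this change.

```python
# Hypothetical helper names, for illustration only.
# Ranges follow the Char production of XML 1.0 (inclusive bounds).
XML_CHAR_RANGES = (
    (0x09, 0x09),         # horizontal tab
    (0x0A, 0x0A),         # linefeed
    (0x0D, 0x0D),         # carriage return
    (0x20, 0xD7FF),       # BMP below the surrogate block
    (0xE000, 0xFFFD),     # BMP above the surrogate block
    (0x10000, 0x10FFFF),  # supplementary planes
)


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point lies in one of the allowed ranges."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


def satisfies_aasd_130(text: str) -> bool:
    """Return True if every code point of the string is allowed."""
    return all(is_xml_char(ord(character)) for character in text)
```

Because Python strings are sequences of code points, this check does not need to deal with surrogate pairs; an implementation that works on UTF-16 code units would have to.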
Comment on lines +86 to +88
Collaborator

Suggested change
- The string can include common characters like tabs, newlines, carriage returns, and spaces.
- It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP) which are represented using surrogate pairs in UTF-16 encoding.
- It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.

Collaborator Author @g1zzm0, Mar 8, 2024

I think someone will need this

Collaborator

The decision was to support what XML Schema 1.0 supports. Marko suggests restricting it further, correct?

Contributor

No, I don't think that removing these three lines would restrict anything further as they are only explanatory for the reader.

Contributor

It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.

Remembering our discussion, we could reformulate this one a bit to make it more specific:
"It assumes that the entire string adheres to the rules of UTF-16 encoding, which is the current standard way of representing a wide range of characters from different languages."

As far as I understand the context, a UTF-32-enabled application would represent a file slightly differently, with no surrogate pairs needed, and therefore the regex pattern representing this constraint would need to look different for it. But the whole UTF-16 vs. UTF-32 separation does not affect the constraint itself but its representation in the schemas.
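
To make the UTF-16 vs. UTF-32 point concrete, here is a small Python sketch (the helper names `utf16_code_units` and `code_points` are hypothetical) showing how the same character beyond the BMP appears as two UTF-16 code units but as a single code point. A pattern evaluated over code units has to describe the surrogate pair, while a pattern evaluated over code points can use the range 0x10000-0x10FFFF directly.

```python
import struct


def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of the string when encoded as UTF-16."""
    data = text.encode("utf-16-be")
    # Each code unit is one big-endian unsigned 16-bit integer.
    return list(struct.unpack(f">{len(data) // 2}H", data))


def code_points(text: str) -> list[int]:
    """Return the Unicode code points of the string (as in UTF-32)."""
    return [ord(character) for character in text]


emoji = "\U0001F600"
print([hex(unit) for unit in utf16_code_units(emoji)])  # two code units (a surrogate pair)
print([hex(point) for point in code_points(emoji)])     # one code point
```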

Contributor

But the whole UTF-16 vs. UTF-32 separation does not affect the constraint itself but its representation in the schemas.

So how about we replace the above sentence with something like a design decision: "For the current versions of the specification, this constraint is represented as a regex pattern expecting UTF-16 compliant applications"?

Collaborator

"for the current versions"? what does this mean? It is not clear to me what we really request and expect (in the future and today).

Contributor

Let me try another formulation:

"Note: The constraint AASd-130 is represented as a regex pattern expecting UTF-16 compliant applications. It might be necessary to adjust this pattern for UTF-32 compliant applications in future versions of this specification."


This leads to the following regular expression:
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]$
Contributor

Suggested change
- ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]$
+ ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$


g1zzm0 marked this conversation as resolved.
Where:
^: Asserts the start of the string.
[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]: Defines a character class that allows various Unicode characters, with the following elements:

\x09: ASCII horizontal tab.
Collaborator

The following list seems to be redundant to the list above? Remove one of them?

\x0A: ASCII linefeed (newline).
\x0D: ASCII carriage return.
\x20: ASCII space.
-: Represents a range.
\uD7FF: The last code point before the surrogate block (0xD800-0xDFFF) of the Basic Multilingual Plane (BMP).
\uE000-\uFFFD: Represents the remainder of the BMP after the surrogate block, from the start of the Private Use Area up to the replacement character 0xFFFD (0xFFFE and 0xFFFF are excluded).
\u00010000-\u0010FFFF: Represents the code points of the supplementary planes beyond the BMP, which UTF-16 encodes as surrogate pairs.
*: Allows for zero or more occurrences of the characters within the character class.

$: Asserts the end of the string.
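
For readers who want to try the pattern, the following is a minimal sketch of how it could be written for Python's `re` module. It is an adaptation and not the pattern used in the schemas: Python expects `\U` with eight hex digits for code points beyond the BMP, and `fullmatch` takes the place of the `^` and `$` anchors. The names `AASD_130_PATTERN` and `matches_aasd_130` are hypothetical.

```python
import re

# Python adaptation of the character class above (an assumption, not the
# schema pattern): \U with eight hex digits addresses the supplementary planes.
AASD_130_PATTERN = re.compile(
    r"[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def matches_aasd_130(text: str) -> bool:
    """Return True if the whole string consists of allowed characters."""
    # fullmatch requires the pattern to cover the entire string,
    # which corresponds to anchoring with ^ and $.
    return AASD_130_PATTERN.fullmatch(text) is not None
```

With this sketch, an empty string, ordinary text with tabs and newlines, and characters beyond the BMP are accepted, while control characters such as 0x00 or 0x0B and the code points 0xFFFE and 0xFFFF are rejected.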