[Feature Request] Enhance String Handling Consistency for Unicode and ASCII in KCL #1783
Labels
good first issue: Good for newcomers
help wanted: Extra attention is needed
lang-design: Issues or PRs related to kcl language design and KEPs
stdlib: Issues or PRs related to kcl standard libraries
Feature Request
Is your feature request related to a problem? Please describe:
Currently, string operations in KCL seem inconsistent when handling Unicode and ASCII strings. For example, the len function returns the byte length for Unicode strings but the character count for ASCII strings. Similarly, index-based operations like find and rfind appear to work with byte offsets, while slicing seems to handle code points. This inconsistency makes it challenging to work with multi-byte Unicode strings effectively.
Below are my test cases and the results in KCL v0.11.0-alpha.1:
Execution Results:
Issues Observed:
len Function:
- For "一.三", len returns 7, which seems to represent the byte length.
- For "1.3", len returns 3, representing the character count.
Index Operations (index, rindex, find, rfind):
Describe the feature you'd like:
Introduce a built-in method (or an enhancement to existing methods) that allows consistent handling of strings, either entirely based on bytes or entirely on code points. Specifically:
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy:
Adding such methods would simplify handling Unicode strings for KCL users. For example:
- string.char_count() could return the number of code points in a string.
- find/rfind could include an option to operate on code points.
Documentation should include examples of how these methods work with both ASCII and multi-byte Unicode strings.
By implementing this, developers would find it easier to handle strings in KCL, especially in scenarios involving mixed character sets.
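To make the byte versus code point distinction concrete, here is a minimal sketch in Python (not KCL) using the two strings from the report. The helper names byte_len, char_count, find_bytes and find_chars are hypothetical and exist only for illustration; they are not KCL built-ins. The sketch assumes UTF-8 encoding, which matches the reported length of 7 for "一.三".

```python
# Illustrative only: Python stand-ins for byte-based vs code-point-based
# string operations. None of these helpers are part of KCL.

def byte_len(s: str) -> int:
    """Length of s in UTF-8 bytes."""
    return len(s.encode("utf-8"))

def char_count(s: str) -> int:
    """Number of Unicode code points in s (Python's len counts code points)."""
    return len(s)

def find_bytes(s: str, sub: str) -> int:
    """Byte offset of the first occurrence of sub in UTF-8 encoded s, or -1."""
    return s.encode("utf-8").find(sub.encode("utf-8"))

def find_chars(s: str, sub: str) -> int:
    """Code point index of the first occurrence of sub in s, or -1."""
    return s.find(sub)

if __name__ == "__main__":
    for text in ("一.三", "1.3"):
        print(
            repr(text),
            "bytes:", byte_len(text),
            "code points:", char_count(text),
            "find '.' (byte offset):", find_bytes(text, "."),
            "find '.' (code point index):", find_chars(text, "."),
        )
    # '一.三' -> bytes: 7, code points: 3, byte offset: 3, code point index: 1
    # '1.3'   -> bytes: 3, code points: 3, byte offset: 1, code point index: 1
```

For pure ASCII input the two views agree, which is why the difference only shows up with multi-byte characters. Whichever unit KCL settles on, documenting it and using it uniformly across len, the index functions and slicing would remove the ambiguity.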