[Feature Request] Enhance String Handling Consistency for Unicode and ASCII in KCL #1783

Open
suin opened this issue Dec 11, 2024 · 1 comment · May be fixed by #1793
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed), lang-design (Issues or PRs related to kcl language design and KEPs), stdlib (Issues or PRs related to kcl standard libraries)

suin commented Dec 11, 2024

Feature Request

Is your feature request related to a problem? Please describe:
Currently, string operations in KCL seem inconsistent when handling Unicode and ASCII strings. For example, the len function returns the byte length for Unicode strings but the character count for ASCII strings. Similarly, index-based operations like find and rfind appear to work with byte offsets, while slicing seems to handle code points. This inconsistency makes it challenging to work with multi-byte Unicode strings effectively.

Below are my test cases and the results in KCL v0.11.0-alpha.1:

test_unicode = lambda {
    string = "一.三"
    len = len(string)
    index = string.index(".")
    rindex = string.rindex(".")
    find = string.find(".")
    rfind = string.rfind(".")
    before_separator = string[0:index:]
    after_separator = string[index + 1:len:]
    print({
        len: len,
        index: index,
        rindex: rindex,
        find: find,
        rfind: rfind,
        before_separator: before_separator,
        after_separator: after_separator
    })
}

test_ascii = lambda {
    string = "1.3"
    len = len(string)
    index = string.index(".")
    rindex = string.rindex(".")
    find = string.find(".")
    rfind = string.rfind(".")
    before_separator = string[0:index:]
    after_separator = string[index + 1:len:]
    print({
        len: len,
        index: index,
        rindex: rindex,
        find: find,
        rfind: rfind,
        before_separator: before_separator,
        after_separator: after_separator
    })
}

Execution Results:

test_unicode: PASS (14ms)
{'len': 7, 'index': 3, 'rindex': 3, 'find': 3, 'rfind': 3, 'before_separator': '一.三', 'after_separator': ''}

test_ascii: PASS (15ms)
{'len': 3, 'index': 1, 'rindex': 1, 'find': 1, 'rfind': 1, 'before_separator': '1', 'after_separator': '3'}

Issues Observed:

  1. len Function:

    • For the Unicode string "一.三", len returns 7, which seems to represent the byte length.
    • For the ASCII string "1.3", len returns 3, representing the character count.
  2. Index Operations (index, rindex, find, rfind):

    • Both Unicode and ASCII strings return indices that appear to be based on byte offsets rather than character positions.
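
The mismatch behind these results can be reproduced outside KCL. Below is a minimal Rust sketch, assuming (as the output above suggests) that the search methods return byte offsets while slicing counts code points. The helper `slice_by_chars` is hypothetical and only mimics code-point slicing; it is not KCL's implementation.

```rust
/// Hypothetical helper: slices by code point positions, mimicking what
/// KCL's slicing appears to do in the results above (an assumption).
fn slice_by_chars(s: &str, start: usize, end: usize) -> String {
    s.chars().skip(start).take(end.saturating_sub(start)).collect()
}

fn main() {
    let s = "一.三";
    let byte_len = s.len(); // UTF-8 byte length
    let byte_index = s.find('.').unwrap(); // byte offset of "."

    // Feeding byte-based values into code-point slicing reproduces the report:
    // the "before" part swallows the whole string and the "after" part is empty.
    let before = slice_by_chars(s, 0, byte_index);
    let after = slice_by_chars(s, byte_index + 1, byte_len);
    println!("len={byte_len} index={byte_index} before={before:?} after={after:?}");
}
```

For the ASCII string "1.3" the two units coincide, which is why the same code splits it correctly.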

Describe the feature you'd like:
Introduce a built-in method (or an enhancement to existing methods) that allows consistent handling of strings, either entirely based on bytes or entirely on code points. Specifically:

  1. A method to count characters (code points) in a string instead of bytes.
  2. Index-based operations that work with code point positions rather than byte offsets.

Describe alternatives you've considered:

  • Using external libraries or utilities (kcl plugins) to preprocess strings outside KCL before working with them.
  • Manually handling byte offsets and converting them to code point indices, which is error-prone and inefficient.
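
To show what the manual conversion in the second alternative involves, here is a rough Rust sketch. The function name `byte_to_char_index` is made up for illustration. It needs a linear scan of the prefix on every call, which is part of why the approach is inefficient.

```rust
/// Hypothetical helper: converts a byte offset into a code point index.
/// Returns None if the offset is out of range or falls inside a
/// multi-byte character.
fn byte_to_char_index(s: &str, byte_offset: usize) -> Option<usize> {
    if !s.is_char_boundary(byte_offset) {
        return None;
    }
    // Count the code points that precede the offset (linear scan).
    Some(s[..byte_offset].chars().count())
}

fn main() {
    let s = "一.三";
    let byte_index = s.find('.').unwrap();
    let char_index = byte_to_char_index(s, byte_index);
    println!("byte offset {byte_index} -> code point index {char_index:?}");
}
```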

Teachability, Documentation, Adoption, Migration Strategy:
Adding such methods would simplify handling Unicode strings for KCL users. For example:

  • string.char_count() could return the number of code points in a string.
  • Modifications to find/rfind could include an option to operate on code points.
    Documentation should include examples of how these methods work with both ASCII and multi-byte Unicode strings.
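
As a sketch of the intended semantics only (written in Rust, not KCL), the two suggestions could behave as follows. `char_count` mirrors the name proposed above, while `find_chars` is a hypothetical name for a code-point variant of find. Returning `Option` is a Rust convention here, not a proposal for KCL's return value.

```rust
/// Number of code points, i.e. what the proposed `char_count()` could return.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

/// Hypothetical code-point variant of `find`: position of the first match,
/// counted in code points rather than bytes.
fn find_chars(s: &str, needle: &str) -> Option<usize> {
    s.find(needle)
        .map(|byte_offset| s[..byte_offset].chars().count())
}

fn main() {
    // With code-point semantics, the Unicode and ASCII inputs behave alike.
    for s in ["一.三", "1.3"] {
        println!("{s}: char_count={} find={:?}", char_count(s), find_chars(s, "."));
    }
}
```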

By implementing this, developers would find it easier to handle strings in KCL, especially in scenarios involving mixed character sets.

He1pa commented Dec 16, 2024

According to the current implementation, both Unicode and ASCII strings are processed by byte offset. This is also common practice in other languages.
For example, in Rust:

fn main() {
    let a: &str = "一.三";
    println!("{}", a.len()); // 7
    println!("{}", a.chars().count()); // 3
    let c: String = "一.三".to_string();
    println!("{}", c.len());   // 7 (String::len also counts bytes)
}

I think it is reasonable to add a chars() function to strings.

He1pa added the good first issue, help wanted, lang-design, and stdlib labels on Dec 16, 2024
jellllly420 linked a pull request on Dec 18, 2024 that will close this issue