[Feature Request] Enhance String Handling Consistency for Unicode and ASCII in KCL #1783

suin · 2024-12-11T07:47:43Z

Feature Request

Is your feature request related to a problem? Please describe:
Currently, string operations in KCL seem inconsistent when handling Unicode and ASCII strings. For example, the len function returns the byte length for Unicode strings but the character count for ASCII strings. Similarly, index-based operations like find and rfind appear to work with byte offsets, while slicing seems to handle code points. This inconsistency makes it challenging to work with multi-byte Unicode strings effectively.

Below are my test cases and the results in KCL v0.11.0-alpha.1:

test_unicode = lambda {
    string = "一.三"
    len = len(string)
    index = string.index(".")
    rindex = string.rindex(".")
    find = string.find(".")
    rfind = string.rfind(".")
    before_separator = string[0:index:]
    after_separator = string[index + 1:len:]
    print({
        len: len,
        index: index,
        rindex: rindex,
        find: find,
        rfind: rfind,
        before_separator: before_separator,
        after_separator: after_separator
    })
}

test_ascii = lambda {
    string = "1.3"
    len = len(string)
    index = string.index(".")
    rindex = string.rindex(".")
    find = string.find(".")
    rfind = string.rfind(".")
    before_separator = string[0:index:]
    after_separator = string[index + 1:len:]
    print({
        len: len,
        index: index,
        rindex: rindex,
        find: find,
        rfind: rfind,
        before_separator: before_separator,
        after_separator: after_separator
    })
}

Execution Results:

test_unicode: PASS (14ms)
{'len': 7, 'index': 3, 'rindex': 3, 'find': 3, 'rfind': 3, 'before_separator': '一.三', 'after_separator': ''}

test_ascii: PASS (15ms)
{'len': 3, 'index': 1, 'rindex': 1, 'find': 1, 'rfind': 1, 'before_separator': '1', 'after_separator': '3'}

Issues Observed:

len Function:
- For the Unicode string "一.三", len returns 7, which seems to represent the byte length.
- For the ASCII string "1.3", len returns 3, representing the character count.
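
Index Operations (index, rindex, find, rfind):
- For "一.三", all four return 3, which is the byte offset of "." rather than its character position (1). For "1.3" they return 1, where byte offset and character position coincide.
- Slicing counts code points, so string[0:index] with the byte offset 3 returns the whole string "一.三", and after_separator comes out empty.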
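
The mismatch in the results above can be reproduced outside KCL. The following Rust sketch assumes, based only on the output reported here, that the index functions return byte offsets while slicing counts code points. `slice_code_points` is a hypothetical helper written for illustration, not KCL's implementation.

```rust
/// Hypothetical helper: slice by code point positions, clamping at the end
/// of the string. This mimics the slicing behaviour reported in the issue.
fn slice_code_points(s: &str, start: usize, end: usize) -> String {
    s.chars()
        .skip(start)
        .take(end.saturating_sub(start))
        .collect()
}

fn main() {
    let s = "一.三";
    let len = s.len(); // UTF-8 byte length
    let index = s.find('.').expect("separator present"); // byte offset

    // Feeding byte offsets into a code point slice yields the wrong parts.
    let before = slice_code_points(s, 0, index);
    let after = slice_code_points(s, index + 1, len);

    println!("len={len} index={index} before={before:?} after={after:?}");
}
```

With "一.三" this gives a length of 7, an index of 3, the whole string as `before`, and an empty `after`, matching the test_unicode output.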

Describe the feature you'd like:
Introduce a built-in method (or an enhancement to existing methods) that allows consistent handling of strings, either entirely based on bytes or entirely on code points. Specifically:

- A method to count characters (code points) in a string instead of bytes.
- Index-based operations that work with code point positions rather than byte offsets.

Describe alternatives you've considered:

- Using external libraries or utilities (kcl plugins) to preprocess strings outside KCL before working with them.
- Manually handling byte offsets and converting them to code point indices, which is error-prone and inefficient.
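
To illustrate what the manual conversion involves, here is a minimal Rust sketch. `byte_to_char_index` is a hypothetical name, not an existing KCL or Rust API. It shows both pain points: the offset must be checked against character boundaries, and each conversion scans the prefix.

```rust
/// Hypothetical helper: convert a UTF-8 byte offset into a code point index.
/// Returns None when the offset is out of range or falls inside a
/// multi-byte character.
fn byte_to_char_index(s: &str, byte_offset: usize) -> Option<usize> {
    if !s.is_char_boundary(byte_offset) {
        return None;
    }
    // Counting requires scanning the prefix, so each conversion is O(n).
    Some(s[..byte_offset].chars().count())
}

fn main() {
    let s = "一.三";
    let byte_index = s.find('.').expect("separator present");
    let char_index = byte_to_char_index(s, byte_index).expect("valid boundary");

    // With a code point index, code point based slicing splits correctly.
    let before: String = s.chars().take(char_index).collect();
    let after: String = s.chars().skip(char_index + 1).collect();
    println!("{char_index} {before} {after}");
}
```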

Teachability, Documentation, Adoption, Migration Strategy:
Adding such methods would simplify handling Unicode strings for KCL users. For example:

- string.char_count() could return the number of code points in a string.
- Modifications to find/rfind could include an option to operate on code points.
- Documentation should include examples of how these methods work with both ASCII and multi-byte Unicode strings.
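
As a sketch of the semantics proposed here, written in Rust rather than KCL: `char_count`, `find_code_point`, and `rfind_code_point` are hypothetical names chosen for illustration, and the actual design could differ.

```rust
/// Hypothetical: number of code points rather than bytes.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

/// Hypothetical: position of the first match, counted in code points.
fn find_code_point(s: &str, needle: &str) -> Option<usize> {
    // str::find returns a byte offset on a character boundary,
    // so counting the characters before it is safe.
    s.find(needle).map(|byte| s[..byte].chars().count())
}

/// Hypothetical: position of the last match, counted in code points.
fn rfind_code_point(s: &str, needle: &str) -> Option<usize> {
    s.rfind(needle).map(|byte| s[..byte].chars().count())
}

fn main() {
    for s in ["一.三", "1.3"] {
        println!(
            "{s}: count={} find={:?} rfind={:?}",
            char_count(s),
            find_code_point(s, "."),
            rfind_code_point(s, ".")
        );
    }
}
```

Under these semantics the Unicode and ASCII strings from the test cases give the same count and the same positions, which is the consistency this request asks for.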

By implementing this, developers would find it easier to handle strings in KCL, especially in scenarios involving mixed character sets.

He1pa · 2024-12-16T03:19:03Z

The current implementation handles both Unicode and ASCII strings by byte offset, which is also the more common practice.
For example, in Rust:

fn main() {
    let a: &str = "一.三";
    println!("{}", a.len()); // 7
    println!("{}", a.chars().count()); // 3
    let c: String = "一.三".to_string();
    println!("{}", c.len()); // 7
}

I think it is reasonable to add a chars() function to the string type.
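
A rough Rust sketch of how such a chars() function could address the original test case. It assumes the function returns one single-character string per code point; the return type actually chosen for KCL is not stated in this thread, and the free function `chars` below is purely illustrative.

```rust
/// Hypothetical equivalent of a `chars()` string method:
/// one single-character string per code point.
fn chars(s: &str) -> Vec<String> {
    s.chars().map(String::from).collect()
}

fn main() {
    let parts = chars("一.三");

    // The list length is the code point count, and list positions are
    // code point positions, so no byte offsets are involved.
    let count = parts.len();
    let index = parts
        .iter()
        .position(|c| c.as_str() == ".")
        .expect("separator present");

    let before = parts[..index].concat();
    let after = parts[index + 1..].concat();
    println!("count={count} index={index} before={before} after={after}");
}
```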

He1pa added the good first issue, help wanted, lang-design, and stdlib labels Dec 16, 2024

jellllly420 linked a pull request Dec 18, 2024 that will close this issue

feat: add chars method for builtin str #1793

Open

16 tasks

He1pa mentioned this issue Dec 19, 2024

Error when traversing unicode string #1796

Open
