Unicode support #44

benman1 · 2020-03-30T16:50:19Z

Please implement unicode string support.

In C++, std::wstring is a wrapper for wchar_t*, just as std::string is a wrapper for char*. wchar_t is defined in C as well [1]. A similar API is Glib::ustring (from glibmm, which is C++ rather than C).

The major difference from std::string is that each character takes sizeof(wchar_t) bytes (4 on most Unix-like platforms, 2 on Windows) rather than 1.
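To make the byte-versus-character distinction concrete for UTF-8, where a code point takes one to four bytes, here is a minimal C sketch that counts code points by skipping continuation bytes. The name utf8_count is hypothetical, and the function assumes its input is already valid UTF-8.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string that is ASSUMED to be
 * valid UTF-8: every byte that is not a continuation byte (10xxxxxx)
 * starts a new code point. No validation is performed. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string of four CJK characters this returns 4, while strlen reports 12, since each of those characters occupies three bytes in UTF-8.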

jacereda · 2020-09-09T15:22:46Z

I would stick to UTF-8 encoded strings and just implement a glyphs function for getting an array of Unicode code points.

aep · 2020-09-09T15:27:50Z

that sounds like a good idea to me, assuming by array you mean an iterator.
is there a portable C library for doing that?
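For illustration, a code point iterator over a UTF-8 byte range could look roughly like the C sketch below. It is a plain branch-based decoder, not any of the libraries mentioned in this thread, and the names utf8_iter, utf8_iter_make and utf8_iter_next are hypothetical. It rejects truncated sequences, overlong encodings, surrogates and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

struct utf8_iter {
    const unsigned char *p;    /* next undecoded byte */
    const unsigned char *end;  /* one past the last byte */
    uint32_t val;              /* last decoded code point */
};

struct utf8_iter utf8_iter_make(const char *mem, size_t size)
{
    struct utf8_iter it;
    it.p = (const unsigned char *)mem;
    it.end = it.p + size;
    it.val = 0;
    return it;
}

/* Returns 1 and stores the code point in it->val,
 * 0 at end of input, -1 on malformed input. */
int utf8_iter_next(struct utf8_iter *it)
{
    /* smallest code point that legitimately needs 1..4 bytes */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t avail = (size_t)(it->end - it->p);
    size_t len, i;
    uint32_t cp;
    unsigned char b0;

    if (avail == 0)
        return 0;

    b0 = it->p[0];
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return -1;               /* stray continuation or invalid lead byte */

    if (len > avail)
        return -1;                /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((it->p[i] & 0xC0) != 0x80)
            return -1;            /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(it->p[i] & 0x3F);
    }

    /* reject overlong forms, surrogates, and out-of-range values */
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    it->val = cp;
    it->p += len;
    return 1;
}
```

A caller would loop with while (utf8_iter_next(&it) == 1) and read it.val on each step; filling an array of code points is the same loop with a store.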

aep · 2020-09-09T15:31:23Z

is this sane? https://github.com/adricoin2010/UTF8-Iterator

looks suspiciously simple

jacereda · 2020-09-09T16:11:33Z

Not so simple, I prefer this one:

https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

Scroll down to the bottom, there's a better implementation than the one on top.

aep · 2020-09-15T09:51:03Z

ok i'm implementing this now, but i need everyone to comment on the api and its features.

here's my first draft.

string::String is similar to slice::Slice, except there's no MutSlice and the iterators are on utf8 codepoints rather than bytes
string::buffer::StringBuffer is similar to buffer::Buffer and autocasts to String

pub fn main() {

   /// construct a new string reference from a borrowed null terminated char*
   let s = string::from_cstr("你好世界");

   /// len() counts codepoints or scalar values?
   err::assert(s.len() == 4);

   /// iterator over codepoints
   for let mut it = s.iter(); it.next(); {
       /// each codepoint is yielded as a u32 (4 bytes)
       u32 ch = it.val;
   }

   /// note the lack of char indexing.
   /// this is not possible and would be rejected:
   // u32 ch = s[2];

   // but you can convert it to a slice for byte indexing
   let sl = s.as_utf8_slice();
   u8 meh = sl.mem[2];

   // or copy to a vec
   new[item = u32, +100] v = s.to_vec();
   u32 bleh = v.items[2];

   /// return string as null terminated utf8 char*
   printf("%s", s.cstr());

   /// concat two strings using a string buffer
   new[+1000] b = string::buffer::make();
   b.append(string::from_cstr("hello world"));
   b.append(string::from_cstr("  "));
   b.append(string::from_cstr("你好世界"));
   
   /// borrow a buffer as str
   let x = b.as_str();   

   /// split
   usize mut iterator = 0;
   let s1 = x.split(" ", &iterator);
   let s2 = x.split(" ", &iterator);

   /// compare
   err::assert(!s1.eq(s2));

   /// substring comparison
   err::assert(s2.starts_with(string::from_cstr("你")));

}

we MIGHT also completely replace char* with string::String some day, removing the explicit calls to from_cstr, but not until we're sure string is ready
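As a rough idea of what split with a caller-held cursor and starts_with might lower to in C, here is a sketch under assumed names (struct str, str_split, str_starts_with are hypothetical and not the module's actual API). Both operate on bytes. That is safe for valid UTF-8 because the encoding is self-synchronizing: a valid encoded separator or prefix can only match at a code point boundary.

```c
#include <stddef.h>
#include <string.h>

/* borrowed view of UTF-8 bytes; does not own the memory */
struct str { const char *mem; size_t size; };

/* Store the next token of s delimited by sep in *tok, starting at byte
 * offset *cursor, and advance *cursor past the separator.
 * Returns 1 if a token was produced, 0 once the input is exhausted. */
int str_split(struct str s, const char *sep, size_t *cursor, struct str *tok)
{
    size_t seplen = strlen(sep);
    size_t start, i;

    if (seplen == 0 || *cursor > s.size)
        return 0;

    start = *cursor;
    i = start;
    while (i + seplen <= s.size && memcmp(s.mem + i, sep, seplen) != 0)
        i++;

    tok->mem = s.mem + start;
    if (i + seplen <= s.size) {      /* separator found at offset i */
        tok->size = i - start;
        *cursor = i + seplen;
    } else {                         /* no separator left: rest of input */
        tok->size = s.size - start;
        *cursor = s.size + 1;        /* mark as exhausted */
    }
    return 1;
}

/* Byte-wise prefix test. */
int str_starts_with(struct str s, struct str prefix)
{
    return prefix.size <= s.size && memcmp(s.mem, prefix.mem, prefix.size) == 0;
}
```

Note that with this sketch two adjacent separators produce an empty token between them; whether the real split should do the same is a design decision the draft leaves open.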

jwerle · 2020-09-15T11:09:31Z

Give me a few

jwerle · 2020-09-15T13:15:31Z

This API looks pretty straightforward and absolutely needed. I am happy that we took the approach to rewrite the string module with utf8 in mind!

My only (unrelated) question is:

// or copy to a vec
new[item = u32, +100] v = s.to_vec();

Can we do this now? (new constructor from an "instance" method)

aep · 2020-09-15T13:29:01Z

oh right, i actually forgot that's broken, thanks for the reminder

opened #123

jacereda · 2020-09-15T21:01:43Z

Why is string needed? Wouldn't a uiter() for iterating over unicode code points on a slice suffice?

aep · 2020-09-15T21:30:53Z

technically yes, but string manipulation behaves differently on unicode vs bytes. having to prefix all functions with unicode_ (unicode_split etc) seems awkward, and the type is effectively free as it's just emitted as a fat pointer to C

also slice holds any arbitrary binary data, string holds null terminated utf8. this distinction is useful in api contracts and automatic mapping to other type systems

aep · 2020-09-15T21:37:22Z

actually i wonder if we can use attached type aliases to implement it as specialized slice.

type String = slice::Slice[nullterm(self.mem), utf8(self.mem)];

edit: never mind, still would have to prefix utf8 specific functions, which is weird. but String can just inherit from slice by first-member rule, so you can use it as if it was a slice.
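In C terms, the fat pointer and the first-member rule could be pictured like this. The sketch uses assumed struct and function names, not the compiler's actual output: a string struct whose first member is a slice can be handed to any function that takes a slice, because C guarantees that a pointer to a struct and a pointer to its first member refer to the same address.

```c
#include <stddef.h>
#include <stdint.h>

/* arbitrary binary data: pointer plus length (a "fat pointer") */
struct slice {
    const uint8_t *mem;
    size_t size;
};

/* same layout, with an extra contract that is not checked here:
 * bytes.mem holds valid UTF-8 and bytes.mem[bytes.size] == 0 */
struct string {
    struct slice bytes;   /* must stay the first member */
};

/* a function written against plain slices */
size_t slice_count_byte(const struct slice *s, uint8_t b)
{
    size_t n = 0, i;
    for (i = 0; i < s->size; i++) {
        if (s->mem[i] == b)
            n++;
    }
    return n;
}

/* view a string as a slice; no copy, same address */
const struct slice *string_as_slice(const struct string *s)
{
    return &s->bytes;
}
```

The conversion only goes one way for free: turning a slice into a string would need a check for the terminator and for UTF-8 validity.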

sternenseemann · 2021-01-25T13:21:09Z

I'd recommend using Julia's utf8proc, which is reasonably lightweight and supports UTF-8 decoding and encoding (from and to codepoints) as well as other features that are definitely needed for proper Unicode handling, like normalization and grapheme clustering.

aep added the "enhancement" label (New feature or request) on Apr 1, 2020

jwerle mentioned this issue on Apr 9, 2020: 'utf8' module zx-project/zx#5 (Open)

aep mentioned this issue on Jul 17, 2020: replace borrow with theory expression #94 (Merged)

aep added the "bounty" label (sponsorship by devguard available) on Jul 22, 2020

aep added the "need-realworld-feedback" label (feedback from users needed on how the decision would affect real world use) and removed the "bounty" label on Sep 15, 2020
