Fix crashes with invalid utf-8. #59

kurnevsky · 2024-12-17T16:44:23Z

According to the discussion in emacs bug#74922 it's possible that emacs passes invalid strings to dynamic libraries.

Fixes #58

According to the discussion in emacs bug#74922 it's possible that emacs passes invalid strings to dynamic libraries.

ubolonton · 2024-12-20T03:00:32Z

Thanks for the investigation and the fix. Great finding!

According to the discussion in emacs bug#74922 it's possible that emacs passes invalid strings to dynamic libraries.

That's inconsistent with the answer here, though it's consistent with my observation in #28 (comment).

Either way, Emacs's documentation here is too fuzzy, leaving this an undefined behavior. First, we'll need to figure out what Emacs's intended behavior is. Then, we can decide on the suitable approach. Similar to how we dealt with the Emacs's GC bug in #2.

kurnevsky · 2024-12-20T06:48:59Z

@Eli-Zaretskii could you please check the answer above?

Eli-Zaretskii · 2024-12-20T08:03:18Z

@Eli-Zaretskii could you please check the answer above?

Sorry, I don't understand what you are asking me to check. Which "answer above", and what to check in it?

kurnevsky · 2024-12-20T10:11:23Z

Can we decide that the answer here is not correct and it's expected that copy_string_contents can return invalid utf-8? And therefore we need to validate it in the external library.

Eli-Zaretskii · 2024-12-20T12:41:02Z

Can we decide that the answer here is not correct and it's expected that copy_string_contents can return invalid utf-8? And therefore we need to validate it in the external library.

I don't know. I didn't yet have time to step through the code with a debugger and see what happens there and why. And FWIW, it doesn't help that the original bug report doesn't provide any reproduction recipe except by using a module written in Rust, which I cannot even compile on my development system.

All I can say at this point is that the module-support code in emacs-module.c seems to behave according to the documented interface of the function from coding.c it is using, so this ought to behave as you expect. But the Emacs encoding machinery is very complex, and its design doesn't inherently guarantee that any raw bytes will be flagged; instead, it makes every effort to produce those raw bytes back as they were input, thereby leaving the handling of this to the application. It could very well be that what you see is one consequence of this basic design. I also doubt that this particular behavior was extensively tested, since the assumption was that modules deal with text, not with binary junk.

If you are in a hurry to get the answers, I suggest that you step with GDB through the code and tell us what you see, including why Emacs doesn't signal an error in this particular case.

But in general, I don't recommend modules to rely on behavior that is only documented in comments to the source code. If you need valid UTF-8 strings, I suggest to verify that in the first place.

Having said all that, I have this issue on my todo, and will get to investigating it eventually, as my time permits.

Eli-Zaretskii · 2024-12-21T12:08:44Z

I think I know what happened: your module called copy_string_contents with a unibyte string. In that case, copy_string_contents will return you the original string without doing anything. The code in copy_string_contents that signals an error relies on the fact that encoding the input string yields nil if the input includes non-Unicode characters. But that cannot be established with unibyte strings, because a unibyte string doesn't hold characters, it holds raw bytes.

What you should do is make sure the string passed to copy_string_contents is a multibyte string. If I do that, i.e.

    (switch-to-buffer "foo")
    (set-buffer-multibyte t)
    (insert-file-contents "/path/to/wg-private-pc.age")
    (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I get the result you expected.

You need to realize that copy_string_contents is a variant of text-encoding routines: it encodes the input multibyte string in
UTF-8. The encoding routines in Emacs always return unibyte strings without doing anything, because a unibyte string is already encoded, or at least is supposed to be encoded.

And before you ask: no, copy_string_contents cannot by itself signal an error if passed a unibyte string, because a unibyte string can legitimately be a valid UTF-8 string. So in this case, copy_string_contents relies on the caller to make sure the input is valid UTF-8.

Fix crashes with invalid utf-8.

297bdd7

According to the discussion in emacs bug#74922 it's possible that emacs passes invalid strings to dynamic libraries.

kurnevsky force-pushed the utf branch from dd6fec5 to 297bdd7 Compare December 17, 2024 16:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix crashes with invalid utf-8. #59

Fix crashes with invalid utf-8. #59

kurnevsky commented Dec 17, 2024

ubolonton commented Dec 20, 2024 •

edited

Loading

kurnevsky commented Dec 20, 2024

Eli-Zaretskii commented Dec 20, 2024

kurnevsky commented Dec 20, 2024

Eli-Zaretskii commented Dec 20, 2024

Eli-Zaretskii commented Dec 21, 2024

Fix crashes with invalid utf-8. #59

Are you sure you want to change the base?

Fix crashes with invalid utf-8. #59

Conversation

kurnevsky commented Dec 17, 2024

ubolonton commented Dec 20, 2024 • edited Loading

kurnevsky commented Dec 20, 2024

Eli-Zaretskii commented Dec 20, 2024

kurnevsky commented Dec 20, 2024

Eli-Zaretskii commented Dec 20, 2024

Eli-Zaretskii commented Dec 21, 2024

ubolonton commented Dec 20, 2024 •

edited

Loading