Switch parser to multi-byte processing #118

chrisduerr · 2024-12-20T02:39:13Z

This patch overhauls the Parser::advance API to operate on byte slices instead of individual bytes, which allows for additional performance optimizations.

VTE does not support C1 escapes and C0 escapes always start with an escape character. This makes it possible to simplify processing if a byte stream is determined to not contain any escapes. The memchr crate provides a battle-tested implementation for SIMD-accelerated byte searches, which is why this implementation makes use of it.

VTE also only supports UTF-8 characters in the ground state, which means that the new non-escape parsing path is able to rely completely on std's str::from_utf8, since memchr gives us the full length of the plain text character buffer. This allows us to completely remove utf8parse and all related code.
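To make the fast path described above concrete, here is a minimal sketch, not the code from this patch. It assumes the only byte that needs to be located is ESC (0x1b), and it uses a plain std iterator search as a dependency-free stand-in for `memchr::memchr`. The names `split_at_escape` and `plain_text` are hypothetical.

```rust
/// The C0 escape character (ESC) that starts an escape sequence.
const ESC: u8 = 0x1b;

/// Splits `bytes` into the plain-text run before the first ESC and the rest.
///
/// Stand-in for `memchr::memchr(ESC, bytes)`: the iterator search keeps this
/// sketch dependency-free, but it is not SIMD-accelerated.
fn split_at_escape(bytes: &[u8]) -> (&[u8], &[u8]) {
    let idx = bytes.iter().position(|&b| b == ESC).unwrap_or(bytes.len());
    bytes.split_at(idx)
}

/// Decodes the whole plain-text run with a single `str::from_utf8` call
/// instead of feeding a UTF-8 state machine byte by byte.
fn plain_text(bytes: &[u8]) -> Option<&str> {
    let (text, _rest) = split_at_escape(bytes);
    std::str::from_utf8(text).ok()
}
```

Because the search yields the full length of the text run up front, the decoder never has to carry UTF-8 state from one byte to the next within that run.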

We also make use of memchr in the synchronized escape handling in ansi.rs, since it relies heavily on scanning large amounts of text for the extension/termination escape sequences.

chrisduerr · 2024-12-20T02:41:00Z

Performance seems objectively better. There might be other things that can be improved but these are all the ideas I could come up with.

Obviously a breaking change, but I think that's fine considering how long VTE has been stable for. I think intentional breakage is better than providing the old API in a way that would be slower, that way people are aware that they should switch to a buffer-based approach.

Probably makes sense to resolve alacritty/vtebench#40 before merging this, but I don't expect it to impact this PR in any way.


nixpulvis · 2024-12-20T15:49:16Z

examples/parselog.rs

-                    statemachine.advance(&mut performer, *byte);
-                }
-            },
+            Ok(n) => statemachine.advance(&mut performer, &buf[..n]),


I would expect most people not to have trouble migrating, especially given how nice this example's change is.

I would expect most people not to have trouble migrating, especially given how nice this example's change is.

Yes I agree that most people will likely have less code after migration, since most implementations already were using buffers anyway. That's one of the motivating factors behind this patch and also why I wanted to intentionally break things to make people aware of the faster API.
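For readers migrating their own code, the buffer-based pattern from the example diff might look roughly like the sketch below. `MockParser` and `CountingPerformer` are hypothetical stand-ins that only mirror the shape of the slice-based `advance` call; they are not the VTE types, and the 2048-byte buffer size is an arbitrary choice.

```rust
use std::io::{self, Read};

/// Hypothetical performer that only counts the bytes it is handed.
#[derive(Default)]
struct CountingPerformer {
    bytes_seen: usize,
}

/// Hypothetical parser mirroring the shape of the slice-based `advance` call.
#[derive(Default)]
struct MockParser {
    calls: usize,
}

impl MockParser {
    fn advance(&mut self, performer: &mut CountingPerformer, bytes: &[u8]) {
        self.calls += 1;
        performer.bytes_seen += bytes.len();
    }
}

/// Reads `input` in chunks and passes each chunk to the parser in one call,
/// replacing the old inner loop over individual bytes.
fn feed<R: Read>(
    mut input: R,
    parser: &mut MockParser,
    performer: &mut CountingPerformer,
) -> io::Result<()> {
    let mut buf = [0u8; 2048];
    loop {
        match input.read(&mut buf) {
            Ok(0) => return Ok(()),
            Ok(n) => parser.advance(performer, &buf[..n]),
            Err(err) if err.kind() == io::ErrorKind::Interrupted => continue,
            Err(err) => return Err(err),
        }
    }
}
```

Callers that already read into a buffer only drop their inner per-byte loop, which is why the migration tends to shrink the code.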

nixpulvis · 2024-12-20T15:56:30Z

src/definitions.rs

-    Ground = 12,
-    OscString = 13,
-    SosPmApcString = 14,
-    Utf8 = 15,


I'm curious how this ended up being unnecessary.

There are only two places where UTF-8 content can possibly appear, since escapes are generally ASCII: in the ground state (outside of escape sequences) and in OSCs.

Since the ground state is now handled by finding the full extent of the plain text in the byte buffer and converting it using str::from_utf8, we don't need a byte-by-byte parser for this anymore. And OSCs were never using the Utf8 state since we're just passing the OSC data as raw bytes.
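One detail a whole-buffer conversion has to deal with is a chunk that ends in the middle of a multi-byte character. The sketch below shows how std's `Utf8Error` distinguishes that case from genuinely invalid input. How this patch actually handles chunk boundaries is not stated in this thread, so the `Decoded` type and `decode_chunk` function are assumptions made for illustration only.

```rust
use std::str;

/// Outcome of decoding one chunk of ground-state bytes (hypothetical type).
#[derive(Debug, PartialEq)]
enum Decoded<'a> {
    /// The whole chunk was valid UTF-8.
    Complete(&'a str),
    /// A valid prefix followed by a character truncated at the end of the chunk.
    Partial(&'a str, &'a [u8]),
    /// A valid prefix followed by bytes that can never form valid UTF-8.
    Invalid(&'a str),
}

fn decode_chunk(bytes: &[u8]) -> Decoded<'_> {
    match str::from_utf8(bytes) {
        Ok(text) => Decoded::Complete(text),
        Err(err) => {
            let (valid, rest) = bytes.split_at(err.valid_up_to());
            // The prefix up to `valid_up_to()` is guaranteed to be valid UTF-8.
            let valid = str::from_utf8(valid).unwrap_or("");
            match err.error_len() {
                // `None` means the input ended in the middle of a character.
                None => Decoded::Partial(valid, rest),
                Some(_) => Decoded::Invalid(valid),
            }
        }
    }
}
```

A caller could keep the `Partial` remainder and prepend it to the next chunk, while an `Invalid` result would typically be rendered as a replacement character.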

nixpulvis · 2024-12-20T15:57:51Z

src/definitions.rs

 }

+// NOTE: Removing the unused actions prefixed with `_` will reduce performance.


Might be nice to add a short explanation as to why. Or at least, I'm curious.

So am I, let me know if you ever find out.

nixpulvis · 2024-12-20T16:01:33Z

src/lib.rs

 //! # Differences from original state machine description
 //!
 //! * UTF-8 Support for Input
 //! * OSC Strings can be terminated by 0x07
-//! * Only supports 7-bit codes. Some 8-bit codes are still supported, but they no longer work in
-//!   all states.


Is this a new change or just an outdated comment? I seem to remember this changing before.

Yeah it doesn't seem entirely accurate even for the latest version of stable VTE. Especially the '8-bit codes are still supported, but they no longer work in all states' part just seems unnecessarily confusing.

It's easier to just support only 7-bit codes and not make any exceptions.

nixpulvis · 2024-12-20T16:06:08Z

src/table.rs

 // Generate state changes at compile-time
-pub static STATE_CHANGES: [[u8; 256]; 16] = state_changes();
+pub static STATE_CHANGES: [[u8; 256]; 13] = state_changes();


😄

I'm sorta surprised and impressed this could be minimized.

Getting rid of the ground state certainly helped. But the other two were just unnecessary to begin with it seems.
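For context on what "generate state changes at compile-time" means, here is a toy sketch of a `const fn` that fills a transition table sized by the number of states. The three states and their transitions are hypothetical and far simpler than the real table; unlike the patched table, this toy keeps a ground state purely to stay readable.

```rust
/// Hypothetical state count and ids, chosen only for illustration.
const STATES: usize = 3;
const GROUND: u8 = 0;
const ESCAPE: u8 = 1;
const CSI: u8 = 2;

/// Builds the transition table; being a `const fn`, it can be evaluated by
/// the compiler when initializing the static below.
const fn state_changes() -> [[u8; 256]; STATES] {
    let mut table = [[GROUND; 256]; STATES];
    let mut state = 0;
    while state < STATES {
        // Default: every byte keeps the parser in its current state.
        let mut byte = 0;
        while byte < 256 {
            table[state][byte] = state as u8;
            byte += 1;
        }
        // ESC starts a new escape sequence from any state.
        table[state][0x1b] = ESCAPE;
        state += 1;
    }
    // `ESC [` enters the toy CSI state.
    table[ESCAPE as usize][b'[' as usize] = CSI;
    // A final byte in 0x40..=0x7e ends the toy CSI sequence.
    let mut byte = 0x40;
    while byte <= 0x7e {
        table[CSI as usize][byte] = GROUND;
        byte += 1;
    }
    table
}

// Generated at compile time; the outer dimension shrinks with every removed state.
pub static STATE_CHANGES: [[u8; 256]; STATES] = state_changes();
```

Each state costs one 256-byte row in this layout, so dropping states directly shrinks the table.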

chrisduerr requested a review from kchibisov December 20, 2024 02:39

chrisduerr force-pushed the need_for_speed branch from fbe3273 to 9503eaf Compare December 20, 2024 02:41

chrisduerr force-pushed the need_for_speed branch from 9503eaf to 6c3695b Compare December 20, 2024 02:46

nixpulvis reviewed Dec 20, 2024


chrisduerr mentioned this pull request Dec 20, 2024

Add support for custom parsing of APC, SOS and PM sequences. #115

Open
