-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch parser to multi-byte processing #118
base: master
Are you sure you want to change the base?
Conversation
Performance seems objectively better. There might be other things that can be improved but these are all the ideas I could come up with. Obviously a breaking change, but I think that's fine considering how long VTE has been stable for. I think intentional breakage is better than providing the old API in a way that would be slower, that way people are aware that they should switch to a buffer-based approach. Probably makes sense to resolve alacritty/vtebench#40 before merging this, but I don't expect it to impact this PR in any way. |
fbe3273
to
9503eaf
Compare
This patch overhauls the `Parser::advance` API to operate on byte slices instead of individual bytes, which allows for additional performance optimizations. VTE does not support C1 escapes and C0 escapes always start with an escape character. This makes it possible to simplify processing if a byte stream is determined to not contain any escapes. The `memchr` crate provides a battle-tested implementation for SIMD-accelerated byte searches, which is why this implementation makes use of it. VTE also only supports UTF8 characters in the ground state, which means that the new non-escape parsing path is able to rely completely on STD's `str::from_utf8` since `memchr` gives us the full length of the plain text character buffer. This allows us to completely remove `utf8parse` and all related code. We also make use of `memchr` in the synchronized escape handling in `ansi.rs`, since it realies heavily on scanning large amounts of text for the extension/termination escape sequences.
9503eaf
to
6c3695b
Compare
statemachine.advance(&mut performer, *byte); | ||
} | ||
}, | ||
Ok(n) => statemachine.advance(&mut performer, &buf[..n]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect most people not to have trouble migrating, especially given how nice this example's change is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect most people not to have trouble migrating, especially given how nice this example's change is.
Yes I agree that most people will likely have less code after migration, since most implementations already were using buffers anyway. That's one of the motivating factors behind this patch and also why I wanted to intentionally break things to make people aware of the faster API.
Ground = 12, | ||
OscString = 13, | ||
SosPmApcString = 14, | ||
Utf8 = 15, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious how this ended up being unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only two places where UTF8 content can possibly appear, since escapes are generally ascii: In the ground state, so outside of the escape sequences and in OSCs.
Since the ground state is now handled by looking up the entire byte buffer and converting using str::from_utf8
, we don't need a byte-by-byte parser for this anymore. And OSCs were never using the Utf8 state since we're just passing the OSC data as raw bytes.
} | ||
|
||
// NOTE: Removing the unused actions prefixed with `_` will reduce performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to add a short explanation as to why. Or at least, I'm curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So am I, let me know if you ever find out.
//! # Differences from original state machine description | ||
//! | ||
//! * UTF-8 Support for Input | ||
//! * OSC Strings can be terminated by 0x07 | ||
//! * Only supports 7-bit codes. Some 8-bit codes are still supported, but they no longer work in | ||
//! all states. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a new change or just an outdated comment? I seem to remember this changing before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it doesn't seem entirely accurate even for the latest version of stable VTE. Especially the '8-bit codes are still supported, but they no longer work in all states' part just seems unnecessarily confusing.
It's easier to just support only 7-bit codes and not make any exceptions.
// Generate state changes at compile-time | ||
pub static STATE_CHANGES: [[u8; 256]; 16] = state_changes(); | ||
pub static STATE_CHANGES: [[u8; 256]; 13] = state_changes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄
I'm sorta surprised and impressed this could be minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting rid of the ground state certainly helped. But the other two were just unnecessary to begin with it seems.
This patch overhauls the
Parser::advance
API to operate on byte slices instead of individual bytes, which allows for additional performance optimizations.VTE does not support C1 escapes and C0 escapes always start with an escape character. This makes it possible to simplify processing if a byte stream is determined to not contain any escapes. The
memchr
crate provides a battle-tested implementation for SIMD-accelerated byte searches, which is why this implementation makes use of it.VTE also only supports UTF8 characters in the ground state, which means that the new non-escape parsing path is able to rely completely on STD's
str::from_utf8
sincememchr
gives us the full length of the plain text character buffer. This allows us to completely removeutf8parse
and all related code.We also make use of
memchr
in the synchronized escape handling inansi.rs
, since it realies heavily on scanning large amounts of text for the extension/termination escape sequences.