Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch parser to multi-byte processing #118

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

chrisduerr
Copy link
Member

This patch overhauls the Parser::advance API to operate on byte slices instead of individual bytes, which allows for additional performance optimizations.

VTE does not support C1 escapes and C0 escapes always start with an escape character. This makes it possible to simplify processing if a byte stream is determined to not contain any escapes. The memchr crate provides a battle-tested implementation for SIMD-accelerated byte searches, which is why this implementation makes use of it.

VTE also only supports UTF8 characters in the ground state, which means that the new non-escape parsing path is able to rely completely on STD's str::from_utf8 since memchr gives us the full length of the plain text character buffer. This allows us to completely remove utf8parse and all related code.

We also make use of memchr in the synchronized escape handling in ansi.rs, since it realies heavily on scanning large amounts of text for the extension/termination escape sequences.

@chrisduerr chrisduerr requested a review from kchibisov December 20, 2024 02:39
@chrisduerr
Copy link
Member Author

Performance seems objectively better. There might be other things that can be improved but these are all the ideas I could come up with.

Obviously a breaking change, but I think that's fine considering how long VTE has been stable for. I think intentional breakage is better than providing the old API in a way that would be slower, that way people are aware that they should switch to a buffer-based approach.

Probably makes sense to resolve alacritty/vtebench#40 before merging this, but I don't expect it to impact this PR in any way.

This patch overhauls the `Parser::advance` API to operate on byte slices
instead of individual bytes, which allows for additional performance
optimizations.

VTE does not support C1 escapes and C0 escapes always start with an
escape character. This makes it possible to simplify processing if a
byte stream is determined to not contain any escapes. The `memchr` crate
provides a battle-tested implementation for SIMD-accelerated byte
searches, which is why this implementation makes use of it.

VTE also only supports UTF8 characters in the ground state, which means
that the new non-escape parsing path is able to rely completely on STD's
`str::from_utf8` since `memchr` gives us the full length of the plain
text character buffer. This allows us to completely remove `utf8parse`
and all related code.

We also make use of `memchr` in the synchronized escape handling in
`ansi.rs`, since it realies heavily on scanning large amounts of text
for the extension/termination escape sequences.
statemachine.advance(&mut performer, *byte);
}
},
Ok(n) => statemachine.advance(&mut performer, &buf[..n]),
Copy link
Contributor

@nixpulvis nixpulvis Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect most people not to have trouble migrating, especially given how nice this example's change is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect most people not to have trouble migrating, especially given how nice this example's change is.

Yes I agree that most people will likely have less code after migration, since most implementations already were using buffers anyway. That's one of the motivating factors behind this patch and also why I wanted to intentionally break things to make people aware of the faster API.

Ground = 12,
OscString = 13,
SosPmApcString = 14,
Utf8 = 15,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how this ended up being unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only two places where UTF8 content can possibly appear, since escapes are generally ascii: In the ground state, so outside of the escape sequences and in OSCs.

Since the ground state is now handled by looking up the entire byte buffer and converting using str::from_utf8, we don't need a byte-by-byte parser for this anymore. And OSCs were never using the Utf8 state since we're just passing the OSC data as raw bytes.

}

// NOTE: Removing the unused actions prefixed with `_` will reduce performance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to add a short explanation as to why. Or at least, I'm curious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So am I, let me know if you ever find out.

//! # Differences from original state machine description
//!
//! * UTF-8 Support for Input
//! * OSC Strings can be terminated by 0x07
//! * Only supports 7-bit codes. Some 8-bit codes are still supported, but they no longer work in
//! all states.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a new change or just an outdated comment? I seem to remember this changing before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it doesn't seem entirely accurate even for the latest version of stable VTE. Especially the '8-bit codes are still supported, but they no longer work in all states' part just seems unnecessarily confusing.

It's easier to just support only 7-bit codes and not make any exceptions.

// Generate state changes at compile-time
pub static STATE_CHANGES: [[u8; 256]; 16] = state_changes();
pub static STATE_CHANGES: [[u8; 256]; 13] = state_changes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

I'm sorta surprised and impressed this could be minimized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of the ground state certainly helped. But the other two were just unnecessary to begin with it seems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants