Does rust-sitter support external scanners? #44

Open
sleexyz opened this issue May 26, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@sleexyz

sleexyz commented May 26, 2023

Tree-sitter allows parsers to specify external scanners to support e.g. context-sensitive indent/dedent tokenization.

Is this supported in rust-sitter?
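
For readers unfamiliar with the use case, here is a minimal sketch of the indent/dedent idea in plain Rust. It is not tree-sitter or rust-sitter code: `Tok` and `layout_tokens` are hypothetical names, it only counts leading spaces, and it processes the whole input at once rather than token by token as a real external scanner would. The point is that the tokens emitted for a line depend on a stack built up from earlier lines, which a context-free lexer rule cannot express.

```rust
/// Tokens produced by this hypothetical layout pass.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Tok {
    Indent,
    Dedent,
    Line(String),
}

/// Turns leading spaces into `Indent`/`Dedent` tokens using a stack of
/// indentation widths. Blank lines are skipped. A dedent to a width that
/// was never pushed is simply closed down to the nearest enclosing level.
fn layout_tokens(src: &str) -> Vec<Tok> {
    let mut stack = vec![0usize]; // widths of the enclosing blocks
    let mut out = Vec::new();
    for line in src.lines() {
        let text = line.trim_start_matches(' ');
        if text.is_empty() {
            continue;
        }
        let width = line.len() - text.len();
        if width > *stack.last().unwrap() {
            stack.push(width);
            out.push(Tok::Indent);
        }
        while width < *stack.last().unwrap() {
            stack.pop();
            out.push(Tok::Dedent);
        }
        out.push(Tok::Line(text.to_string()));
    }
    // Close any blocks still open at end of input.
    while stack.len() > 1 {
        stack.pop();
        out.push(Tok::Dedent);
    }
    out
}
```

Calling `layout_tokens("a\n  b\nc\n")` yields the line `a`, an `Indent`, the line `b`, a `Dedent`, and the line `c`.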

@shadaj
Member

shadaj commented Aug 2, 2023

Hi, great question! Unfortunately not yet, but this has been something I've been thinking about for a while.

I think ideally we can support arbitrary closures passed in as additional parameters to Rust Sitter annotations to provide custom scanner logic, which then gets bound to the runtime appropriately. But I haven't messed around with scanners too much myself, so I need to do a bit more research.

@shadaj shadaj added the enhancement New feature or request label Aug 2, 2023
@ilonachan
Contributor

I've been playing around with external scanners for a while in a fork, and I've managed to at least replicate the C API in a way where it can be fulfilled using a Rust trait impl. There are still some things I haven't managed to do yet because I don't understand rust-sitter very well, such as... letting the user put external tokens into the grammar. But I have tested it with anonymous external tokens, which don't need to be registered.
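
To illustrate what "the C API as a Rust trait" could look like: tree-sitter's external scanner interface has five entry points (create, destroy, serialize, deserialize, scan). The sketch below is an assumption about the general shape, not the code from the fork and not a rust-sitter API. `Lexer`, `ExternalScanner`, `StrLexer`, and `NestedComments` are all hypothetical names, and `StrLexer` is a toy stand-in for tree-sitter's `TSLexer` so the example can run without the runtime. The sample scanner recognises nested block comments, a common reason to reach for an external scanner.

```rust
/// Hypothetical stand-in for a few operations of tree-sitter's `TSLexer`.
trait Lexer {
    /// Next character, or `None` at end of input.
    fn lookahead(&self) -> Option<char>;
    /// Consume one character; `skip = true` would treat it as whitespace.
    fn advance(&mut self, skip: bool);
    /// Mark the current position as the end of the token.
    fn mark_end(&mut self);
}

/// Hypothetical trait with one method per C entry point; `create` and
/// `destroy` map onto `Default` and `Drop`.
trait ExternalScanner: Default {
    /// `valid[i]` says whether external token `i` is acceptable here.
    /// Returns the id of the recognised token, if any.
    fn scan(&mut self, lexer: &mut dyn Lexer, valid: &[bool]) -> Option<u16>;
    /// Write the scanner state into `buf`, returning the bytes used.
    fn serialize(&self, buf: &mut [u8]) -> usize;
    /// Restore the scanner state from previously serialized bytes.
    fn deserialize(&mut self, bytes: &[u8]);
}

/// Toy lexer over an in-memory string, for demonstration only.
struct StrLexer {
    chars: Vec<char>,
    pos: usize,
    end: usize,
}

impl StrLexer {
    fn new(src: &str) -> Self {
        StrLexer { chars: src.chars().collect(), pos: 0, end: 0 }
    }
}

impl Lexer for StrLexer {
    fn lookahead(&self) -> Option<char> {
        self.chars.get(self.pos).copied()
    }
    fn advance(&mut self, _skip: bool) {
        if self.pos < self.chars.len() {
            self.pos += 1;
        }
    }
    fn mark_end(&mut self) {
        self.end = self.pos;
    }
}

/// Id of the single external token in this example.
const BLOCK_COMMENT: u16 = 0;

/// Stateless scanner for `/* ... /* ... */ ... */`.
#[derive(Default)]
struct NestedComments;

impl ExternalScanner for NestedComments {
    fn scan(&mut self, lexer: &mut dyn Lexer, valid: &[bool]) -> Option<u16> {
        if !valid.get(BLOCK_COMMENT as usize).copied().unwrap_or(false) {
            return None;
        }
        // The token must open with "/*".
        if lexer.lookahead() != Some('/') {
            return None;
        }
        lexer.advance(false);
        if lexer.lookahead() != Some('*') {
            return None;
        }
        lexer.advance(false);
        let mut depth = 1u32;
        while depth > 0 {
            // `?` gives up with `None` on an unterminated comment.
            match lexer.lookahead()? {
                '/' => {
                    lexer.advance(false);
                    if lexer.lookahead() == Some('*') {
                        lexer.advance(false);
                        depth += 1;
                    }
                }
                '*' => {
                    lexer.advance(false);
                    if lexer.lookahead() == Some('/') {
                        lexer.advance(false);
                        depth -= 1;
                    }
                }
                _ => lexer.advance(false),
            }
        }
        lexer.mark_end();
        Some(BLOCK_COMMENT)
    }
    fn serialize(&self, _buf: &mut [u8]) -> usize {
        0 // nothing to persist between tokens
    }
    fn deserialize(&mut self, _bytes: &[u8]) {}
}
```

Note that `scan` can only report which token matched, which is the limitation described in the next paragraph.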

There are a few limitations that I could see: one is that the custom scanner's output is really dumb. It's just the matched token ID, if any, with no room for payload whatsoever. Maybe something could be done by cleverly abusing the scanner's statefulness...? The whole scanner state is stored in the edit tree though, which might impact performance a lot.
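
To make the state cost concrete: tree-sitter asks the scanner to serialize its full state into a fixed-size byte buffer whenever it recognises a token (1024 bytes, `TREE_SITTER_SERIALIZATION_BUFFER_SIZE`, at the time of writing), and hands those bytes back on deserialize. A rough sketch of what that round trip might look like for an indentation stack follows. `IndentState` and its methods are hypothetical and not part of rust-sitter or the fork.

```rust
/// Hypothetical scanner state: the stack of open indentation widths.
#[derive(Debug, Default, PartialEq, Eq)]
struct IndentState {
    widths: Vec<u16>,
}

impl IndentState {
    /// Copy the state into `buf` as little-endian `u16`s.
    /// Returns the number of bytes written, or `None` if it does not fit.
    fn serialize(&self, buf: &mut [u8]) -> Option<usize> {
        let needed = self.widths.len() * 2;
        if needed > buf.len() {
            return None;
        }
        for (chunk, w) in buf.chunks_exact_mut(2).zip(&self.widths) {
            chunk.copy_from_slice(&w.to_le_bytes());
        }
        Some(needed)
    }

    /// Rebuild the state from bytes produced by `serialize`.
    /// A trailing odd byte, if any, is ignored.
    fn deserialize(bytes: &[u8]) -> Self {
        let widths = bytes
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        IndentState { widths }
    }
}
```

Any payload smuggled through the scanner state would have to fit in that buffer and would be copied on every recognised token, which is why the performance concern seems plausible.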

For now the only thing we get in the final tree is a plain string terminal. I don't know how to retrieve that yet and put it into the user-tagged struct (and that might be impossible to do for the anonymous tokens anyway), but if that works there could also be a transform closure similar to what the leaf nodes have.

One thing that actually seems pretty powerful is the ability for an external token to also have an internal definition, using grammar rules as a fallback. Defining that would require some kind of PhantomData-like trait where the generic type is the important information, but the data that the user gets might just be the plain string from the custom scanner. I don't know if tree-sitter returns a full tree in the other case though, or if it also flattens that into a plain string token. (I also think this approach could help implement the TOKEN rule type from tree-sitter if desired, but that's completely separate.)
