Hi, great question! Unfortunately not yet, but this has been something I've been thinking about for a while.
I think ideally we could support arbitrary closures passed in as additional parameters to Rust Sitter annotations to provide custom scanner logic, which would then get bound to the runtime appropriately. But I haven't messed around with scanners too much myself, so I need to do a bit more research.
I've been playing around with external scanners for a while in a fork, and I've managed to at least replicate the C API in a way where it can be fulfilled using a Rust trait impl. There are still some things I haven't managed to do yet because I don't understand rust-sitter very well, such as... letting the user put external tokens into the grammar. But I have tested it with anonymous external tokens, which don't need to be registered.
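To make the idea concrete, here is a minimal sketch of what a trait-based mirror of the C API could look like. It is not the code from the fork: `Lexer`, `ExternalScanner`, `StrLexer`, `HashRunScanner` and `HASH_RUN` are all hypothetical names made up for illustration. The only thing taken from tree-sitter is the shape of the C interface (create, destroy, scan, serialize, deserialize, and a lexer with lookahead, advance and mark_end). Here `Default` stands in for create and `Drop` for destroy.

```rust
/// Hypothetical stand-in for the parts of tree-sitter's `TSLexer` used here.
pub trait Lexer {
    /// Next character, or `None` at end of input.
    fn lookahead(&self) -> Option<char>;
    /// Consume the lookahead character (`skip` would exclude it from the token).
    fn advance(&mut self, skip: bool);
    /// Mark the current position as the end of the token.
    fn mark_end(&mut self);
}

/// Hypothetical trait mirroring the C entry points.
/// `Default` plays the role of `create`; `Drop` plays the role of `destroy`.
pub trait ExternalScanner: Default {
    /// Returns the index of the matched external token, if any.
    fn scan(&mut self, lexer: &mut dyn Lexer, valid: &[bool]) -> Option<usize>;
    /// Writes the scanner state into `buffer`, returning the bytes written.
    fn serialize(&self, buffer: &mut [u8]) -> usize;
    /// Restores the scanner state from previously serialized bytes.
    fn deserialize(&mut self, buffer: &[u8]);
}

/// Toy in-memory lexer so the sketch can run without tree-sitter.
pub struct StrLexer {
    chars: Vec<char>,
    pos: usize,
    end: usize,
}

impl StrLexer {
    pub fn new(input: &str) -> Self {
        StrLexer { chars: input.chars().collect(), pos: 0, end: 0 }
    }
    /// Number of characters covered by the last marked token.
    pub fn token_end(&self) -> usize {
        self.end
    }
}

impl Lexer for StrLexer {
    fn lookahead(&self) -> Option<char> {
        self.chars.get(self.pos).copied()
    }
    fn advance(&mut self, _skip: bool) {
        if self.pos < self.chars.len() {
            self.pos += 1;
        }
    }
    fn mark_end(&mut self) {
        self.end = self.pos;
    }
}

/// Index of the single external token in this toy example.
pub const HASH_RUN: usize = 0;

/// Toy scanner: matches a run of `#` and remembers its length as state.
#[derive(Default)]
pub struct HashRunScanner {
    pub last_len: u8,
}

impl ExternalScanner for HashRunScanner {
    fn scan(&mut self, lexer: &mut dyn Lexer, valid: &[bool]) -> Option<usize> {
        // Only try to match when the parser says this token is acceptable.
        if !valid.get(HASH_RUN).copied().unwrap_or(false) {
            return None;
        }
        let mut len: u8 = 0;
        while lexer.lookahead() == Some('#') && len < u8::MAX {
            lexer.advance(false);
            len += 1;
        }
        if len == 0 {
            return None;
        }
        lexer.mark_end();
        self.last_len = len;
        Some(HASH_RUN)
    }

    fn serialize(&self, buffer: &mut [u8]) -> usize {
        match buffer.first_mut() {
            Some(slot) => {
                *slot = self.last_len;
                1
            }
            None => 0,
        }
    }

    fn deserialize(&mut self, buffer: &[u8]) {
        self.last_len = buffer.first().copied().unwrap_or(0);
    }
}
```

In a real binding, the `extern "C"` functions that tree-sitter expects would presumably forward to a trait impl like this, but that glue is left out here since it depends on how rust-sitter generates the parser.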
There are a few limitations that I could see: one is that the custom scanner's output is really dumb. It's just the matched token ID, if any, with no room for payload whatsoever. Maybe something could be done by cleverly abusing the scanner's statefulness...? The whole scanner state is stored in the edit tree though, which might impact performance a lot.
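As a rough illustration of the state-size concern, here is a sketch of how an indent stack might be packed into the fixed-size buffer the scanner gets for serialization. `IndentState` and `BUFFER_SIZE` are hypothetical names; the value 1024 is assumed to match tree-sitter's `TREE_SITTER_SERIALIZATION_BUFFER_SIZE`. Anything smuggled into the scanner state as a "payload" would have to fit in the same buffer and be copied on every serialize call.

```rust
/// Assumed to match tree-sitter's TREE_SITTER_SERIALIZATION_BUFFER_SIZE.
pub const BUFFER_SIZE: usize = 1024;

/// Hypothetical scanner state: the stack of open indentation widths.
#[derive(Debug, Default, PartialEq)]
pub struct IndentState {
    pub stack: Vec<u16>,
}

impl IndentState {
    /// Packs each width as two little-endian bytes and returns the number
    /// of bytes written. Entries that do not fit are silently dropped,
    /// which a real scanner would need to handle more carefully.
    pub fn serialize(&self, buffer: &mut [u8; BUFFER_SIZE]) -> usize {
        let mut written = 0;
        for &width in &self.stack {
            if written + 2 > BUFFER_SIZE {
                break;
            }
            buffer[written..written + 2].copy_from_slice(&width.to_le_bytes());
            written += 2;
        }
        written
    }

    /// Rebuilds the stack from bytes produced by `serialize`.
    pub fn deserialize(bytes: &[u8]) -> Self {
        let stack = bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect();
        IndentState { stack }
    }
}
```

The cost is proportional to the state size, so a small stack like this is cheap, while stuffing token payloads into the state would grow it with the input.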
For now the only thing we get in the final tree is a plain string terminal. I don't know how to retrieve that yet and put it into the user-tagged struct (and that might be impossible to do for the anonymous tokens anyway), but if that works there could also be a transform closure similar to what the leaf nodes have.
One thing that actually seems pretty powerful is the ability for an external token to also have an internal definition using grammar rules as a fallback. Defining that would require some kind of PhantomData-like trait where the generic type is the important information, but the data that the user gets might just be the plain string from the custom scanner. I don't know if tree-sitter returns a full tree in the other case though, or if it also flattens that into a plain string token. (I also think this approach could help implement the TOKEN rule type from tree-sitter if desired, but that's completely separate.)
Tree-sitter allows parsers to specify external scanners to support e.g. context-sensitive indent/dedent tokenization.
Is this supported in rust-sitter?