Generated programs are missing spaces #15
You're not missing anything! Hypothesmith is still a proof-of-concept, and while it's already in use (and finding bugs) for Black and LibCST there's plenty of low-hanging fruit to improve it. To quote the README:
Both approaches only target syntactic validity - the code can be parsed and compiled, but is otherwise utter nonsense. Producing semantically-valid code, where you can safely execute it with few or no uncaught exceptions, is "just" a matter of a few weeks of engineering time to express all the constraints (and swarm-testing optimisations etc). On the upside, that would allow CSmith-style differential testing of Python implementations or performance optimisations. Possibly a good topic from an internship?
Zac, maybe you missed my point? I am merely suggesting that you are not emitting spaces around NAME tokens, so that when you generate "A or A" it comes out as "AorA" -- which parses as a single identifier. I haven't actually looked at your code, but it shouldn't be too hard to fix this, should it? The simplest thing would be to just add a space after each token. Anyway, from your description it appears that from_node() and from_grammar() use completely different mechanisms to generate code? I missed that in the docs. :-(
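For anyone following along, here is a minimal sketch of the symptom using only the standard library `ast` module. It is not hypothesmith's code, and the helper name `top_level_expr` is made up for illustration; it only shows that the two strings parse to different kinds of node.

```python
import ast


def top_level_expr(source: str) -> ast.expr:
    """Parse a single expression and return its top-level AST node."""
    return ast.parse(source, mode="eval").body


spaced = top_level_expr("A or A")    # what the generator presumably meant
squashed = top_level_expr("AorA")    # what actually gets emitted

print(type(spaced).__name__)    # BoolOp
print(type(squashed).__name__)  # Name
```

Both strings are syntactically valid, so a check that only asks "does it compile?" cannot tell them apart.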
An idea for following the true grammar: maybe we can augment the PEG parser generator we use for Python 3.9 and newer to spit out the grammar definition in a way that's consumable by hypothesmith (or maybe lark?).
This is a known problem, but fixing it has not yet had the best cost/benefit tradeoff - I could solve it either by improving the grammar I'm using or by adding HypothesisWorks/hypothesis#2437. On the other hand spaces around
This would be excellent. Ideally this would also allow hypothesmith to generate strings valid for 3.9 vs 3.10 vs 3.11 independent of the version that it's running under. Supporting experimental grammar changes would also be nice, I imagine 😁
Okay, then you need to clearly document that from_grammar() is crap. :-)
Also, generating strings from the grammar means that you can never detect
bugs in the grammar itself (which is complex enough that it probably has
bugs). And the PEG grammar we use doesn't drill down to the character level
-- tokenization is still a separate pass, so turning a list of tokens into
a string still requires a strategy for inserting whitespace that's smarter
than "reject programs that don't parse" (since that would reject most
programs if you squoosh all tokens together).
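A small sketch of the point about tokenization being a separate pass, again standard library only. The token list and the helper `token_strings` are illustrative assumptions, not anything from hypothesmith or the PEG generator. It compares the tokens the tokenizer recovers when a token list is concatenated directly against joining with a single space.

```python
import io
import tokenize

# Token types that carry no source text of interest for this comparison.
_SKIP = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.ENDMARKER,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
}


def token_strings(source: str) -> list[str]:
    """Return the significant token strings the tokenizer sees in `source`."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in _SKIP
    ]


tokens = ["x", "=", "A", "or", "not", "B"]  # hypothetical generator output

squashed = "".join(tokens)   # "x=AornotB"
spaced = " ".join(tokens)    # "x = A or not B"

print(token_strings(squashed))          # ['x', '=', 'AornotB']
print(token_strings(spaced) == tokens)  # True
```

This only covers a single logical line. Newlines and indentation are significant in Python, so joining every token with a space is not a general solution for whole programs.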
It's already documented as pre-alpha! As a proof of concept, by my usual standards
Yep, and the Let me try a couple of stupid-but-fast ideas for deterministically adding spaces... [edit] nope, only works in <3% of cases, at least with the stupid approach. And smarter approaches look a lot like building on a different foundation, either LibCST (which tracks required whitespace locally) or custom logic on top of the PEG grammar.
I’d appreciate that. To me, there still is a huge difference between the “obvious” bug of not inserting spaces, and the subtler issues of bias.
Hi @gvanrossum, @Zac-HD
Any news on this 🙂 ? I'm trying to test my AST visitor on generated Python code, but I'm hitting @Zac-HD do you still have plans for this library? Could I be of any help?
Since I've taken leave from my PhD to work at Anthropic, I'm not actively working on this at the moment - but I'd be delighted to provide design advice and code review if you're interested in contributing, and that would be really helpful! The obvious place to start is in https://github.com/Zac-HD/hypothesmith/blob/master/src/hypothesmith/cst.py: If you're interested, feel free to ask any other questions or just jump in with a tiny (one-node-type) first PR! As a general rule, we want to start by making "leaf" nodes more efficient, because otherwise an efficient "branch" or "trunk" node can lead to much more rejection sampling overall.
After reading your comment a second time and carefully looking at the REGISTERED list and how it's used in the loop below, it all makes sense 🙂 Since I already played with each type of node
Pinging a recent maintainer of libcst: @zsol, do you have any advice on how to tackle this? To try and summarize, the goal is to create Hypothesis strategies that build libcst nodes. Maybe there are important things to know? Don't hesitate to ping someone else 😄
I've never personally written a hypothesis strategy so not sure I'll be of much help, but if you feel like something is missing in libcst itself that would help, I'm all for reviewing PRs to that end. Generally CST nodes are supposed to be "easy" to construct by hand, and there's a method to validate syntactical correctness (each node having its own rules, naturally) when types can't fully express the syntactical structure. I expect that generating trees that adhere to the validation rules will help make the strategy more efficient (by e.g. not even considering nodes with unbalanced parentheses around them).
Thanks for the quick reply. I see the
Never mind, I've taken the full list of nodes and am ordering them by category (binary ops, unary ops, boolean ops, etc.). This is quite straightforward, so I should be able to answer these questions by myself 🙂
Just to save it: List of nodes by category (wip)
# bases
[libcst.BaseAssignTargetExpression,],
[libcst.BaseAugOp,],
[libcst.BaseBinaryOp,],
[libcst.BaseBooleanOp,],
[libcst.BaseComp,],
[libcst.BaseCompOp,],
[libcst.BaseCompoundStatement,],
[libcst.BaseDelTargetExpression,],
[libcst.BaseDict,],
[libcst.BaseDictElement,],
[libcst.BaseElement,],
[libcst.BaseExpression,],
[libcst.BaseFormattedStringContent,],
[libcst.BaseList,],
[libcst.BaseNumber,],
[libcst.BaseParenthesizableWhitespace,],
[libcst.BaseSet,],
[libcst.BaseSimpleComp,],
[libcst.BaseSlice,],
[libcst.BaseSmallStatement,],
[libcst.BaseStatement,],
[libcst.BaseString,],
[libcst.BaseSuite,],
[libcst.BaseUnaryOp,],
# binary ops
[libcst.BinaryOperation,],
[libcst.Add,],
[libcst.AddAssign,],
[libcst.Subtract,],
[libcst.SubtractAssign,],
[libcst.Multiply,],
[libcst.MultiplyAssign,],
[libcst.Divide,],
[libcst.DivideAssign,],
[libcst.FloorDivide,],
[libcst.FloorDivideAssign,],
[libcst.Power,],
[libcst.PowerAssign,],
[libcst.Modulo,],
[libcst.ModuloAssign,],
[libcst.MatrixMultiply,],
[libcst.MatrixMultiplyAssign,],
[libcst.BitAnd,],
[libcst.BitAndAssign,],
[libcst.BitOr,],
[libcst.BitOrAssign,],
[libcst.BitXor,],
[libcst.BitXorAssign,],
[libcst.LeftShift,],
[libcst.LeftShiftAssign,],
[libcst.RightShift,],
[libcst.RightShiftAssign,],
# unary ops
[libcst.UnaryOperation,],
[libcst.BitInvert,],
[libcst.Minus,],
[libcst.Plus,],
# comparisons
[libcst.Equal,],
[libcst.NotEqual, st.just("!=")],
[libcst.GreaterThan,],
[libcst.GreaterThanEqual,],
[libcst.LessThan,],
[libcst.LessThanEqual,],
# boolean ops
[libcst.BooleanOperation,],
[libcst.And,],
[libcst.Or,],
[libcst.Not,],
# identity
[libcst.Is,],
[libcst.IsNot, infer, nonempty_whitespace, infer],
# membership
[libcst.In,],
[libcst.NotIn, infer, nonempty_whitespace, infer],
# built-in types
[libcst.Float,],
[libcst.Integer,],
[libcst.Imaginary,],
[libcst.SimpleString,],
# strings
[libcst.ConcatenatedString,],
[libcst.FormattedString,],
[libcst.FormattedStringExpression,],
[libcst.FormattedStringText,],
# built-in structures
[libcst.Dict,],
[libcst.List,],
[libcst.Set, nonempty_seq(libcst.Element, libcst.StarredElement)],
[libcst.Tuple,],
# non-whitespace tokens
[libcst.Comma,],
[libcst.Colon,],
[libcst.Dot,],
[libcst.LeftCurlyBrace,],
[libcst.LeftParen,],
[libcst.LeftSquareBracket,],
[libcst.RightCurlyBrace,],
[libcst.RightParen,],
[libcst.RightSquareBracket,],
[libcst.ParenthesizedWhitespace,],
[libcst.Semicolon,],
# whitespace tokens
[libcst.EmptyLine, infer, infer, infer],
[libcst.Newline,],
[libcst.SimpleWhitespace,],
[libcst.TrailingWhitespace, infer, infer],
# expressions
[libcst.Call,],
[libcst.Lambda,],
[libcst.Expr,],
[libcst.Slice,],
# keywords
[libcst.Del,],
[libcst.Assert,],
[libcst.Break,],
[libcst.Continue,],
[libcst.Ellipsis,],
[libcst.Pass,],
[libcst.Raise,],
[libcst.Return,],
[libcst.Yield,],
# conditions
[libcst.If,],
[libcst.Else,],
[libcst.IfExp,],
# loops
[libcst.For,],
[libcst.While,],
# match statements
[libcst.Match,],
[libcst.MatchAs,],
[libcst.MatchCase,],
[libcst.MatchClass,],
[libcst.MatchKeywordElement,],
[libcst.MatchList,],
[libcst.MatchMapping,],
[libcst.MatchMappingElement,],
[libcst.MatchOr,],
[libcst.MatchOrElement,],
[libcst.MatchPattern,],
[libcst.MatchSequence,],
[libcst.MatchSequenceElement,],
[libcst.MatchSingleton,],
[libcst.MatchStar,],
[libcst.MatchTuple,],
[libcst.MatchValue,],
# generators/comprehensions
[libcst.GeneratorExp,],
[libcst.CompFor,],
[libcst.CompIf,],
[libcst.DictComp,],
[libcst.ListComp,],
[libcst.SetComp,],
# exceptions
[libcst.Try,],
[libcst.ExceptStarHandler,],
[libcst.Finally,],
# imports
[libcst.From,],
[libcst.ImportAlias,],
[libcst.ImportStar,],
# functions/parameters
[libcst.FunctionDef,],
[libcst.Param,],
[libcst.Parameters,],
[libcst.ParamSlash,],
[libcst.ParamStar,],
# elements (?)
[libcst.Element,],
[libcst.DictElement,],
[libcst.StarredElement,],
[libcst.SubscriptElement,],
# comments
[libcst.Comment,],
# ??
[libcst.Annotation,],
[libcst.Arg,],
[libcst.AssignEqual,],
[libcst.AssignTarget,],
[libcst.AugAssign,],
[libcst.ClassDef,],
[libcst.ComparisonTarget,],
[libcst.Index,],
[libcst.Module],
[libcst.Name,],
[libcst.NameItem,],
[libcst.SimpleStatementLine,],
[libcst.SimpleStatementSuite,],
[libcst.StarredDictElement,],
[libcst.TryStar,],
[libcst.WithItem,],
I believe I can start by adding the built-in types nodes (like integer)! edit: Ah, no, integers and floats are already registered 😊
I must be missing something simple. I run this program:
This prints examples like these:
I am pretty sure it meant "A or A", not "AorA". (I saw more similar examples in other runs and variations of the program.)
It also occasionally prints a traceback and this error:
This is Python 3.9.2 on Windows.
I figure I'm doing something wrong or not understanding something?