Track the absolute position in the drivers of Parser #2861

adithyaov · 2024-10-08T09:54:45Z

No description provided.

adithyaov · 2024-10-10T11:25:24Z

In the stream function drivers, the abs-position argument is at the end. We should put it as the first argument just because stream and the stream constructors do the same.

Edit: Whatever is there should be fine.

adithyaov · 2024-10-12T07:32:44Z

Benchmark regressions need to be checked.

clintonmead · 2024-12-11T02:54:13Z

@harendra-kumar referred to this PR here: #2895 so I thought I'd throw in a few cents.

I don't completely understand this PR , but I would be hesitant about making position tracking a sort of hard coded thing. It looks like in this PR "position" is just the number of tokens, but I'm not sure that's a useful definition in a lot of cases. In some cases "line and column" is more useful than just an absolute position. But even that's sometimes not what you want. In a JSON document you might want a alice->bob->array element 9->charlie style position

The following parser transformer (which I mentioned in the issue) seems to achieve position tracking anyway:

limitParserInput :: forall m a b. Applicative m => Int -> StreamlyParser.Parser a m b -> StreamlyParser.Parser (Int, a) m (Int, b)
limitParserInput maxInputSize (StreamlyParser.Parser (stepFIn :: s -> a -> m (StreamlyParser.Step s b)) (initialIn :: m (StreamlyParser.Initial s b)) (extractFIn :: s -> m (StreamlyParser.Step s b))) = 
  StreamlyParser.Parser stepF initial extractF where
    stepF :: StrictIntPair s -> (Int, a) -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    stepF (StrictIntPair currentSize state) (inputSize, input) = let nextSize = currentSize + inputSize in case nextSize <= maxInputSize of 
      True -> BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) <$> stepFIn state input
      False -> pure $ StreamlyParser.Error "Hit parser size limit" -- this should be more informative
    initial :: m (StreamlyParser.Initial (StrictIntPair s) (Int, b))
    initial = BiFunctor.bimap (StrictIntPair 0) (0,) <$> initialIn
    extractF :: StrictIntPair s -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    extractF (StrictIntPair currentSize state) = BiFunctor.bimap (StrictIntPair currentSize) (currentSize,) <$> extractFIn state

To achieve what I think this PR does using the above function, you'll just need to pattern match on the stepFIn function, and if it fails, just add the current position to the error message. If you don't need to logic to fail on a maxInputSize just remove the argument and the handler in the case statement.

But this has the following benefits:

It doesn't introduce a cost on parsers that don't use positions AND
People can define their own concepts of position

By own concepts of "position", for example limitParserInput allows one to specify the size of incoming input tokens, for example. But hey, maybe if we were doing JSON parsing I could have a stack I could push/pop new nodes on/off when we have opening { and closing }.

That being said, I'm just starting to use this library, but it seems to me that position tracking can be done with just parser transforms instead of modifying the library to hard code the logic everywhere, but I may be missing something.

adithyaov · 2024-12-12T08:52:49Z

Hi @clintonmead , thank you for the detailed comment.

I can see one problem with the implementation of limitParserInput. It does not
account for any backtracking.

In the current implementation of parsers, the backtracking information is stored
in the Step and the number representing backtracking is in terms of the number
of tokens.

Extending your example, (I did not compile and check)

limitParserInput
    :: forall m a b. Applicative m
    => Int -> StreamlyParser.Parser a m b -> StreamlyParser.Parser a m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract
    where
    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, s)
            IDone b -> IDone (0, b)
            IError _ -> IError (show 0) -- Hacky way to encode error limited by
                                        -- the implementation.

    step (i, s) inp = do
        let i1 = i + 1
        case i1 <= maxInputSize of
              True -> do
                  res <- step0 state input
                  pure $ case res of
                      Partial back s1 -> Partial back (i1 - back, s1)
                      Continue back s1 -> Continue back (i1 - back, s1)
                      Done back s1 -> Done back (i1 - back, s1)
                      Error _ -> pure $ Error $ show i1
              False -> pure $ Error $ "Hit parser size limit" -- this should be
                                                              -- more
                                                              -- informative

    extract (i, s) = do
        res <- extract0 s
        pure $ case res of
            Partial back s1 -> Partial back (i - back, s1)
            Continue back s1 -> Continue back (i - back, s1)
            Done back b -> Done back (i - back, b)
            Error _ -> Error $ show i1

Although this accounts for backtracking, we only keep track of the
number of tokens. The size of the token is 1. We cannot have a custom size for
the token.

We can however solve this by implementing a driver that accounts for the token
size. The token size will become a first-class citizen.

type ParserWithToken a m b = Parser (Int, a) m (Int, b)

The diver for this parser would keep track of the token size and the
number of tokens. While backtracking, we also subtract the token size
appropriately. This can be very useful.

On a side note: I'll need to check if I'm missing anything in limitParserInput.
If this is enough to keep track of the absolute position, then I would prefer
this over changing the parser drivers.

adithyaov · 2024-12-12T09:02:43Z

A custom driver might not be required, if we can save the buffer in the parser
combinator itself, the parser combinator can do the job of the driver.

backtrackSize :: Int -> [Int] -> Int
backtrackSize i = sum . take i

backtrackBufBy :: Int -> [Int] -> Int
backtrackBufBy i = drop i

limitParserInput
    :: forall m a b. Applicative m
    => Int -> Parser a m b -> StreamlyParser.Parser (Int, a) m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract
    where
    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, [], s)
            IDone b -> IDone (0, [], b)
            IError _ -> IError (show 0) -- Hacky way to encode error limited by
                                        -- the implementation.

    step (accInp, backtrackBuf, s) (inpSize, inp) = do
        let newAccInp = accInp + inpSize
            newBacktrackBuf = inpSize:backtrackBuf
        case newAccInp <= maxInputSize of
              True ->
                  res <- step0 state input
                  pure $ case res of
                      Partial back s1 ->
                          Partial back
                              ( newAccInp - backtrackSize back newBacktrackBuf
                              , backtrackBufBy back newBacktrackBuf, s1
                              )
                      Continue back s1 -> ...
                      Done back s1 -> ...
                      Error _ -> ...
              False -> pure $ Error $ "Hit parser size limit" -- this should be
                                                              -- more
                                                              -- informative

    extract (i, s) = ...

adithyaov · 2024-12-12T09:09:59Z

Let me verify if this covers everything and solves the issue and your use case
properly. If so, we can implement this combinator or add a new parser driver
where the token size is a first-class citizen.

harendra-kumar · 2024-12-12T09:55:44Z

We can evaluate both the approaches. About the implicit tracking approach in this PR:

The position is quite useful a lot of times, so always passing it in the driver may not be too bad if the performance impact is not significant. Since it is always there, we can always use it to report useful information about the any error position in any parser.
Implicit accounting of position in the driver may be easy to use, as it is not visible everywhere but is there when you need to report it. Though it may have to be manipulated in any intermediate combinators that act like drivers.
Unboxed parsers are the most commonly used ones. Just byte position tracking makes sense for unboxed parsers as they work on byte array inputs. For generic/boxed parsers the position would be in terms of the input tokens but anyway we cannot look inside the tokens, so that is the only sensible way to report position.

About the proposed explicit tracking:

This seems less invasive as we do not need to change anything, but we need to try and use it to see if there are any difficult issues with it, how ergonomic it is to use etc. But the good thing is that it can be tried without much changes to the existing implementation.
The position information is flexible and it can potentially be anything.
For error position reporting maybe we can change the Error constructor to Error Int String.
We need to see how this will optimize in practical cases, what are the observed perf characteristics.

Is there any other potential third way like tracking the context in an underlying state monad?

adithyaov · 2024-12-12T10:24:10Z

@clintonmead Assuming parser combinators like the above exist.
Can you think out loud a sample use case of error reporting or position
tracking?

For a CSV file, we may want to track the position by: Line Number + Part Index
An example with pseudo-code will help with seeing what the usage would look like.

adithyaov mentioned this pull request Oct 8, 2024

Try using absolute position in parsers instead of relative #2500

Open

adithyaov force-pushed the track-abs-count-parser branch from 38f5c2c to 4a9e2e6 Compare October 10, 2024 11:18

adithyaov changed the base branch from master to parserk-driver-bug-fixes October 10, 2024 11:19

adithyaov linked an issue Oct 10, 2024 that may be closed by this pull request

Try using absolute position in parsers instead of relative #2500

Open

adithyaov force-pushed the track-abs-count-parser branch 2 times, most recently from 0719551 to 7bcca5b Compare October 11, 2024 08:45

adithyaov changed the base branch from parserk-driver-bug-fixes to master October 11, 2024 08:45

adithyaov force-pushed the track-abs-count-parser branch 3 times, most recently from 3f1949a to 57238e9 Compare October 11, 2024 12:33

Track the absolute position in the drivers of Parser

5721b6a

adithyaov force-pushed the track-abs-count-parser branch from 1b81447 to 5721b6a Compare October 11, 2024 13:54

harendra-kumar mentioned this pull request Dec 10, 2024

Question (and perhaps suggestion): transforming parsers to limit input size, the "proper way"? #2895

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Track the absolute position in the drivers of Parser #2861

Track the absolute position in the drivers of Parser #2861

adithyaov commented Oct 8, 2024

adithyaov commented Oct 10, 2024 •

edited

Loading

adithyaov commented Oct 12, 2024

clintonmead commented Dec 11, 2024

adithyaov commented Dec 12, 2024 •

edited

Loading

adithyaov commented Dec 12, 2024

adithyaov commented Dec 12, 2024

harendra-kumar commented Dec 12, 2024

adithyaov commented Dec 12, 2024 •

edited

Loading

Track the absolute position in the drivers of Parser #2861

Are you sure you want to change the base?

Track the absolute position in the drivers of Parser #2861

Conversation

adithyaov commented Oct 8, 2024

adithyaov commented Oct 10, 2024 • edited Loading

adithyaov commented Oct 12, 2024

clintonmead commented Dec 11, 2024

adithyaov commented Dec 12, 2024 • edited Loading

adithyaov commented Dec 12, 2024

adithyaov commented Dec 12, 2024

harendra-kumar commented Dec 12, 2024

adithyaov commented Dec 12, 2024 • edited Loading

adithyaov commented Oct 10, 2024 •

edited

Loading

adithyaov commented Dec 12, 2024 •

edited

Loading

adithyaov commented Dec 12, 2024 •

edited

Loading