Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track the absolute position in the drivers of Parser #2861

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

adithyaov
Copy link
Member

No description provided.

@adithyaov adithyaov force-pushed the track-abs-count-parser branch from 38f5c2c to 4a9e2e6 Compare October 10, 2024 11:18
@adithyaov adithyaov changed the base branch from master to parserk-driver-bug-fixes October 10, 2024 11:19
@adithyaov adithyaov linked an issue Oct 10, 2024 that may be closed by this pull request
@adithyaov
Copy link
Member Author

adithyaov commented Oct 10, 2024

In the stream function drivers, the abs-position argument is at the end. We should put it as the first argument just because stream and the stream constructors do the same.

Edit: Whatever is there should be fine.

@adithyaov adithyaov force-pushed the track-abs-count-parser branch 2 times, most recently from 0719551 to 7bcca5b Compare October 11, 2024 08:45
@adithyaov adithyaov changed the base branch from parserk-driver-bug-fixes to master October 11, 2024 08:45
@adithyaov adithyaov force-pushed the track-abs-count-parser branch 3 times, most recently from 3f1949a to 57238e9 Compare October 11, 2024 12:33
@adithyaov adithyaov force-pushed the track-abs-count-parser branch from 1b81447 to 5721b6a Compare October 11, 2024 13:54
@adithyaov
Copy link
Member Author

Benchmark regressions need to be checked.

@clintonmead
Copy link

@harendra-kumar referred to this PR here: #2895 so I thought I'd throw in a few cents.

I don't completely understand this PR , but I would be hesitant about making position tracking a sort of hard coded thing. It looks like in this PR "position" is just the number of tokens, but I'm not sure that's a useful definition in a lot of cases. In some cases "line and column" is more useful than just an absolute position. But even that's sometimes not what you want. In a JSON document you might want a alice->bob->array element 9->charlie style position

The following parser transformer (which I mentioned in the issue) seems to achieve position tracking anyway:

limitParserInput :: forall m a b. Applicative m => Int -> StreamlyParser.Parser a m b -> StreamlyParser.Parser (Int, a) m (Int, b)
limitParserInput maxInputSize (StreamlyParser.Parser (stepFIn :: s -> a -> m (StreamlyParser.Step s b)) (initialIn :: m (StreamlyParser.Initial s b)) (extractFIn :: s -> m (StreamlyParser.Step s b))) = 
  StreamlyParser.Parser stepF initial extractF where
    stepF :: StrictIntPair s -> (Int, a) -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    stepF (StrictIntPair currentSize state) (inputSize, input) = let nextSize = currentSize + inputSize in case nextSize <= maxInputSize of 
      True -> BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) <$> stepFIn state input
      False -> pure $ StreamlyParser.Error "Hit parser size limit" -- this should be more informative
    initial :: m (StreamlyParser.Initial (StrictIntPair s) (Int, b))
    initial = BiFunctor.bimap (StrictIntPair 0) (0,) <$> initialIn
    extractF :: StrictIntPair s -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    extractF (StrictIntPair currentSize state) = BiFunctor.bimap (StrictIntPair currentSize) (currentSize,) <$> extractFIn state

To achieve what I think this PR does using the above function, you'll just need to pattern match on the stepFIn function, and if it fails, just add the current position to the error message. If you don't need to logic to fail on a maxInputSize just remove the argument and the handler in the case statement.

But this has the following benefits:

  1. It doesn't introduce a cost on parsers that don't use positions AND
  2. People can define their own concepts of position

By own concepts of "position", for example limitParserInput allows one to specify the size of incoming input tokens, for example. But hey, maybe if we were doing JSON parsing I could have a stack I could push/pop new nodes on/off when we have opening { and closing }.

That being said, I'm just starting to use this library, but it seems to me that position tracking can be done with just parser transforms instead of modifying the library to hard code the logic everywhere, but I may be missing something.

@adithyaov
Copy link
Member Author

adithyaov commented Dec 12, 2024

Hi @clintonmead , thank you for the detailed comment.

I can see one problem with the implementation of limitParserInput. It does not
account for any backtracking.

In the current implementation of parsers, the backtracking information is stored
in the Step and the number representing backtracking is in terms of the number
of tokens.

Extending your example, (I did not compile and check)

limitParserInput
    :: forall m a b. Applicative m
    => Int -> StreamlyParser.Parser a m b -> StreamlyParser.Parser a m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract
    where
    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, s)
            IDone b -> IDone (0, b)
            IError _ -> IError (show 0) -- Hacky way to encode error limited by
                                        -- the implementation.

    step (i, s) inp = do
        let i1 = i + 1
        case i1 <= maxInputSize of
              True -> do
                  res <- step0 state input
                  pure $ case res of
                      Partial back s1 -> Partial back (i1 - back, s1)
                      Continue back s1 -> Continue back (i1 - back, s1)
                      Done back s1 -> Done back (i1 - back, s1)
                      Error _ -> pure $ Error $ show i1
              False -> pure $ Error $ "Hit parser size limit" -- this should be
                                                              -- more
                                                              -- informative

    extract (i, s) = do
        res <- extract0 s
        pure $ case res of
            Partial back s1 -> Partial back (i - back, s1)
            Continue back s1 -> Continue back (i - back, s1)
            Done back b -> Done back (i - back, b)
            Error _ -> Error $ show i1

Although this accounts for backtracking, we only keep track of the
number of tokens. The size of the token is 1. We cannot have a custom size for
the token.

We can however solve this by implementing a driver that accounts for the token
size. The token size will become a first-class citizen.

type ParserWithToken a m b = Parser (Int, a) m (Int, b)

The diver for this parser would keep track of the token size and the
number of tokens. While backtracking, we also subtract the token size
appropriately. This can be very useful.

On a side note: I'll need to check if I'm missing anything in limitParserInput.
If this is enough to keep track of the absolute position, then I would prefer
this over changing the parser drivers.

@adithyaov
Copy link
Member Author

A custom driver might not be required, if we can save the buffer in the parser
combinator itself, the parser combinator can do the job of the driver.

backtrackSize :: Int -> [Int] -> Int
backtrackSize i = sum . take i

backtrackBufBy :: Int -> [Int] -> Int
backtrackBufBy i = drop i

limitParserInput
    :: forall m a b. Applicative m
    => Int -> Parser a m b -> StreamlyParser.Parser (Int, a) m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract
    where
    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, [], s)
            IDone b -> IDone (0, [], b)
            IError _ -> IError (show 0) -- Hacky way to encode error limited by
                                        -- the implementation.

    step (accInp, backtrackBuf, s) (inpSize, inp) = do
        let newAccInp = accInp + inpSize
            newBacktrackBuf = inpSize:backtrackBuf
        case newAccInp <= maxInputSize of
              True ->
                  res <- step0 state input
                  pure $ case res of
                      Partial back s1 ->
                          Partial back
                              ( newAccInp - backtrackSize back newBacktrackBuf
                              , backtrackBufBy back newBacktrackBuf, s1
                              )
                      Continue back s1 -> ...
                      Done back s1 -> ...
                      Error _ -> ...
              False -> pure $ Error $ "Hit parser size limit" -- this should be
                                                              -- more
                                                              -- informative

    extract (i, s) = ...

@adithyaov
Copy link
Member Author

Let me verify if this covers everything and solves the issue and your use case
properly. If so, we can implement this combinator or add a new parser driver
where the token size is a first-class citizen.

@harendra-kumar
Copy link
Member

We can evaluate both the approaches. About the implicit tracking approach in this PR:

  • The position is quite useful a lot of times, so always passing it in the driver may not be too bad if the performance impact is not significant. Since it is always there, we can always use it to report useful information about the any error position in any parser.
  • Implicit accounting of position in the driver may be easy to use, as it is not visible everywhere but is there when you need to report it. Though it may have to be manipulated in any intermediate combinators that act like drivers.
  • Unboxed parsers are the most commonly used ones. Just byte position tracking makes sense for unboxed parsers as they work on byte array inputs. For generic/boxed parsers the position would be in terms of the input tokens but anyway we cannot look inside the tokens, so that is the only sensible way to report position.

About the proposed explicit tracking:

  • This seems less invasive as we do not need to change anything, but we need to try and use it to see if there are any difficult issues with it, how ergonomic it is to use etc. But the good thing is that it can be tried without much changes to the existing implementation.
  • The position information is flexible and it can potentially be anything.
  • For error position reporting maybe we can change the Error constructor to Error Int String.
  • We need to see how this will optimize in practical cases, what are the observed perf characteristics.

Is there any other potential third way like tracking the context in an underlying state monad?

@adithyaov
Copy link
Member Author

adithyaov commented Dec 12, 2024

@clintonmead Assuming parser combinators like the above exist.
Can you think out loud a sample use case of error reporting or position
tracking?

For a CSV file, we may want to track the position by: Line Number + Part Index
An example with pseudo-code will help with seeing what the usage would look like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Try using absolute position in parsers instead of relative
3 participants