Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question (and perhaps suggestion): transforming parsers to limit input size, the "proper way"? #2895

Open
clintonmead opened this issue Dec 9, 2024 · 4 comments

Comments

@clintonmead
Copy link

So I want to do some bounded space parsing. I'm collecting a bunch of tokens (in my case Word8, but it doesn't really matter) and combining them into a new token.

I wanted to just write my parser logic independently, and then "wrap" the parser in a way that transforms it into a new parser that fails if the input length is too large before generating a token. I don't want this "limit" logic to be part of every parser I write, I want to separate these concerns.

So I came up with something like this (excuse the verboseness, I've been quite explicit about the types):

import qualified Streamly.Internal.Data.Parser as StreamlyParser

-- Little optimisation so the state remains strict and getting the count doesn't need following a pointer.
-- This is not exposed in the interface of the function
data StrictIntPair a = StrictIntPair {-# UNPACK #-} !Int !a

limitParserInput :: forall m a b. (Applicative m) => Int -> StreamlyParser.Parser a m b -> StreamlyParser.Parser (Int, a) m (Int, b)
limitParserInput maxInputSize (StreamlyParser.Parser (stepFIn :: s -> a -> m (StreamlyParser.Step s b)) (initialIn :: m (StreamlyParser.Initial s b)) (extractFIn :: s -> m (StreamlyParser.Step s b))) = 
  StreamlyParser.Parser stepF initial extractF where
    stepF :: StrictIntPair s -> (Int, a) -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    stepF (StrictIntPair currentSize state) (inputSize, input) = let nextSize = currentSize + inputSize in case nextSize <= maxInputSize of 
      True -> BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) <$> stepFIn state input
      False -> pure $ StreamlyParser.Error "Hit parser size limit" -- this should be more informative
    initial :: m (StreamlyParser.Initial (StrictIntPair s) (Int, b))
    initial = BiFunctor.bimap (StrictIntPair 0) (0,) <$> initialIn
    extractF :: StrictIntPair s -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    extractF (StrictIntPair currentSize state) = BiFunctor.bimap (StrictIntPair currentSize) (currentSize,) <$> extractFIn state

So, basically, given a Parser a m b, I can generate a Parser (Int, a) m (Int, b) which means I can give a "size" to all my input elements, and set a "max size" for the output.

So far so good. Well almost. The issue is that if Parser (Int, a) m (Int, b) fails, I just get back a string, but if the underlying Parser a m b fails, I also just get a string. Short of hacky pattern matching on the string, I don't know where each one came from.

So then I tried this, instead writing a Parser wrapper that never fails, but instead throws it's error in ExceptT. Now I can type the error:

data StreamSizeLimitError = StreamSizeLimitError -- this should probably have some parameters detailing the error

limitParserInputTypedErr :: forall m a b. (Monad m) => Int -> StreamlyParser.Parser a m b -> StreamlyParser.Parser (Int, a) (ExceptT StreamSizeLimitError m) (Int, b)
limitParserInputTypedErr maxInputSize (StreamlyParser.Parser (stepFIn :: s -> a -> m (StreamlyParser.Step s b)) (initialIn :: m (StreamlyParser.Initial s b)) (extractFIn :: s -> m (StreamlyParser.Step s b))) = 
  StreamlyParser.Parser stepF initial extractF where
    stepF :: StrictIntPair s -> (Int, a) -> ExceptT StreamSizeLimitError m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    stepF (StrictIntPair currentSize state) (inputSize, input) = let nextSize = currentSize + inputSize in case nextSize <= maxInputSize of 
      True -> lift $ BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) <$> stepFIn state input
      False -> throwE StreamSizeLimitError -- this should be more informative
    initial :: ExceptT StreamSizeLimitError m (StreamlyParser.Initial (StrictIntPair s) (Int, b))
    initial = lift $ BiFunctor.bimap (StrictIntPair 0) (0,) <$> initialIn
    extractF :: StrictIntPair s -> ExceptT StreamSizeLimitError m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    extractF (StrictIntPair currentSize state) = lift $ BiFunctor.bimap (StrictIntPair currentSize) (currentSize,) <$> extractFIn state

But now I've got a parser transformer that only fails if the underlying parser fails. But because the transformer itself doesn't introduce failures, maybe I could write this as a fold transformer. And I can:

import qualified Streamly.Internal.Data.Fold as StreamlyFold

limitFoldInput :: forall m a b. (Monad m) => Int -> StreamlyFold.Fold m a b -> StreamlyFold.Fold (ExceptT StreamSizeLimitError m) (Int, a) (Int, b)
limitFoldInput maxInputSize (StreamlyFold.Fold (stepIn :: s -> a -> m (StreamlyFold.Step s b)) (initialIn :: m (StreamlyFold.Step s b)) (extractIn :: s -> m b) (extractFinal :: s -> m b)) = 
  StreamlyFold.Fold step initial extract finish where
    step :: StrictIntPair s -> (Int, a) -> ExceptT StreamSizeLimitError m (StreamlyFold.Step (StrictIntPair s) (Int, b))
    step (StrictIntPair currentSize state) (inputSize, input) = let nextSize = currentSize + inputSize in case nextSize <= maxInputSize of 
      True -> lift $ BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) <$> stepIn state input
      False -> throwE StreamSizeLimitError -- this should be more informative
    initial :: ExceptT StreamSizeLimitError m (StreamlyFold.Step (StrictIntPair s) (Int, b))
    initial = lift $ BiFunctor.bimap (StrictIntPair 0) (0,) <$> initialIn
    extract :: StrictIntPair s -> ExceptT StreamSizeLimitError m (Int, b)
    extract (StrictIntPair size state) = lift $ (size,) <$> extractFinal state
    finish :: StrictIntPair s -> ExceptT StreamSizeLimitError m (Int, b)
    finish (StrictIntPair size state) = lift $ (size,) <$> extractFinal state 

But now I can only use this to transform folds. Although I feel like there should be a function like:

mapFoldToParser :: (Fold m a b -> Fold m c d) -> Parser a m b -> Parser c m d

Although I haven't written one myself to prove it exists, and I couldn't find it in the docs.

So I think I've just got a bunch of vague questions, and would appreciate some guidance:

  1. Am I totally going about this in the wrong way for what I want to achieve?
  2. Am I completely destroying my efficiency by moving to ExceptT, and are there any sneaky space leak issues I need to watch out for?
  3. Should perhaps Parser a m b be instead Parser e a m b, and the constructor Error String be instead Error e? This would allow one to write (I think):
    limitParserInput :: (Applicative m) => Int -> Parser e a m b -> Parser (StreamSizeLimitError, e) (Int, a) m (Int, b) and perhaps this is better for efficiency/ergonomics?

My apologies for the somewhat verbose question, I just thought the best way to not completely go down the wrong rabbit hole was just to put it all out there the way I've been going.

@harendra-kumar
Copy link
Member

A fold can be converted to a parser without any problem, but a parser cannot be converted to a fold because we will lose the error. It is a one way street. Therefore, a function like mapFoldToParser cannot be written. So that approach won't work.

Parsers were designed to just succeed or fail without any type level distinction on why it failed. Failure is used for the Alternative implementation, if a parser fails for whatever reason we can opt to use an alternative parser. But we cannot make that decision based on the contents of the error. Parameterizing the Parser type with an error type would be a bigger change.

However, the wrapper parsers are free to handle the underlying parser's error in whatever way.
In your function here:

    stepF (StrictIntPair currentSize state) (inputSize, input) = let nextSize = currentSize + inputSize in case nextSize <= maxInputSize of 
      True -> lift $ BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) <$> stepFIn state input
      False -> throwE StreamSizeLimitError -- this should be more informative

You can case match on the inner parser result, handle the error case and transform the error string as you want. Thus the contents of the error string can be determined by this outer parser and not the inner one. You can tag the error with some keyword to indicate whether the error is due to inner parser failure or size overflow. The error would still be a String but you control the format of the string, so an outer parser can handle it deterministically if required.

@clintonmead
Copy link
Author

clintonmead commented Dec 10, 2024

Thanks for your quick reply @harendra-kumar!

Would you be open to a PR to add an error type to Parser?

Something like the following:

data Parser' e a m b =
  forall s. Parser
    (s -> a -> m (Step' e s b))
    (m (Initial s b))
    (s -> m (Step' e s b))

type Parser = Parser' String

data Step e s b
    = Partial !Int !s
    | Continue !Int !s
    | Done !Int !b
    | Error !e

data Initial e s b
    = IPartial !s
    | IDone !b
    | IError !e

And replacing all wherever one sees Parser with Parser' e?

I think this will need some further changes also. Some functions could just have their signature changed. For example:

fromFoldMaybe :: Monad m => String -> Fold m a (Maybe b) -> Parser a m b

just needs to be changed to:

fromFoldMaybe :: Monad m => e -> Fold m a (Maybe b) -> Parser' e a m b

But wherever there is a hardcoded string in an Error constructor, one will need to write a new function. For example:

peek :: Monad m => Parser a m a
peek = Parser step initial extract
    where
    initial = return $ IPartial ()
    step () a = return $ Done 1 a
    extract () = return $ Error "peek: end of input"

I think the best way to change it will be like this:

mapError :: (e1 -> e2) -> Parser e1 a m b -> Parser e2 a m b
mapError f (Parser step initial extract) = Parser (mapStepError step) (mapInitialError initial) extract where
  mapStepError = \case
    Error e -> Error (f e)
    x -> x
  mapInitialError = \case
    IError e -> IError (f e)
    x -> x

data SatisfyError = SatisfyErrorPredicateFailed | SatisfyErrorEndOfInput

satisfy' :: Monad m => (a -> Bool) -> Parser' SatisfyError a m a
satisfy' predicate = Parser step initial extract where
    initial = return $ IPartial ()
    step () a = return $
        if predicate a
        then Done 0 a
        else Error SatisfyErrorPredicateFailed
    extract () = return $ Error SatisfyErrorEndOfInput

satisfy :: Monad m => (a -> Bool) -> Parser a m a
satisfy = mapError errorMapping satisfy where
  errorMapping SatisfyErrorPredicateFailed = "satisfy: predicate failed"
  errorMapping SatisfyErrorEndOfInput = "satisfy: end of input"

but I'm not sure if the mapError call will be inlined and run at compile time, transforming satisfy' into the original satisfy. There's this alternative:

satisfy' :: Monad m => e -> e -> (a -> Bool) -> Parser' e a m a
satisfy' predicateFailedError endOfInputError predicate = Parser step initial extract where
    initial = return $ IPartial ()
    step () a = return $
        if predicate a
        then Done 0 a
        else Error predicateFailedError
    extract () = return $ Error endOfInputError

satisfy :: Monad m => (a -> Bool) -> Parser a m a
satisfy = satisfy "satisfy: predicate failed" "satisfy: end of input"

Perhaps this interferes less with optimisations, and the compiler can more easily recover the original satisfy code by just inlining the strings into satisfy'.

I ask all this because whilst String might do in the short term, I want to do things like collect line numbers and perhaps multiple errors at the same time, and serve this up via an API so that a frontend can display any parse errors in a sensible format to a potentially non-technical end user. Whilst doing all of this via string manipulation is possible it's very perl like and I would imagine hacky, and kind of defeats the purpose of using Haskell.

I'm happy to work on a PR myself to achieve this (I suspect it will be in the new year sometime) but I just wanted to check with you whether you thought this was a reasonable thing to do before I dive in.

@harendra-kumar
Copy link
Member

I am not averse to the idea, however, we need to think about a few things:

  • Clearly define some concrete use cases and what in general we want to achieve with it
  • How other libraries are dealing with the same problem?
  • Are there any other simpler ways to deal with the problem?
  • We need to carefully think about the trade-off between what new use cases we are serving vs ease-of-use of the library, is it worth the change? The library should not become intimidating to use by majority of users.
  • Keep backward compatibility as much as possible with minimal breaking changes
  • Performance impact is very important

For position tracking, line number reporting we have another change planned, see the PR #2861 . Basically we will have the stream position available everywhere which can be used to compute the location including line numbers. That PR has some performance impact, we need to see if we want to absorb the perf impact or have position tracking parsers separated from non-tracking ones.

Given that we can have position tracking separately, we need to see what additional benefits we will have with parameterized error. It can possibly be used to compose errors in a better way, but I would like some good, practical end user benefits listed down.

Thanks for proposing to work on this. Again, I am not against the idea but want to whet out pros and cons thoroughly before deciding to merge. A minimal prototype can bring out how complicated things become with this, and what we are getting into.

@clintonmead
Copy link
Author

I'm actually doing some custom CSV parsing.

I've looked on Hackage and I can't find a library that does all of the below:

  1. Not do silly things like call error
  2. Actually parses in a streaming manner in constant space (some get close, but technically can be blown out of the water with bad user input, like a REALLY big cell, and there's no real way to inject a short circuit error in that case).
  3. Has a nice interface to do the transformations we need to do (we need to reorder columns whilst streaming for example).

So I figure it would be less work to write something from scratch than try to completely change the architecture of an existing library.

I think I'll just get something working with streamly as it stands, keeps the boss happy in the short term. But part of the point of this exercise is to allow users to fix any issues in their uploaded CSVs, which could go beyond just being a valid CSV (missing data etc). In that case I'd probably want a structured error type.

So then I might try to patch your library, and try to convince the boss it allow me to open source the core of the CSV parsing library I'm writing, so you can see the PR and a use case together. That's likely only to be a new year thing though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants