Commit dc5a540

[Parsers] End Attoparsec chapter
1 parent 6ba0595 commit dc5a540

1 file changed

+214
-1
lines changed

en/lessons/parsers/attoparsec.md

Lines changed: 214 additions & 1 deletion
@@ -5,4 +5,217 @@ title: Attoparsec
{% include toc.html %}

[`attoparsec`](https://hackage.haskell.org/package/attoparsec) is a well-known
parser combinator library, noted especially for its performance.

# Definition

Let's have a look at its definition:

```haskell
newtype Parser i a = Parser {
      runParser :: forall r.
                   State i -> Pos -> More
                -> Failure i (State i)   r
                -> Success i (State i) a r
                -> IResult i r
    }

type family State i
type instance State ByteString = B.Buffer
type instance State Text = T.Buffer

type Failure i t   r = t -> Pos -> More -> [String] -> String
                       -> IResult i r
type Success i t a r = t -> Pos -> More -> a -> IResult i r

-- | Have we read all available input?
data More = Complete | Incomplete

newtype Pos = Pos { fromPos :: Int }
            deriving (Eq, Ord, Show, Num)

-- | The result of a parse.  This is parameterised over the type @i@
-- of string that was processed.
--
-- This type is an instance of 'Functor', where 'fmap' transforms the
-- value in a 'Done' result.
data IResult i r =
    Fail i [String] String
    -- ^ The parse failed.  The @i@ parameter is the input that had
    -- not yet been consumed when the failure occurred.  The
    -- @[@'String'@]@ is a list of contexts in which the error
    -- occurred.  The 'String' is the message describing the error, if
    -- any.
  | Partial (i -> IResult i r)
    -- ^ Supply this continuation with more input so that the parser
    -- can resume.  To indicate that no more input is available, pass
    -- an empty string to the continuation.
    --
    -- __Note__: if you get a 'Partial' result, do not call its
    -- continuation more than once.
  | Done i r
    -- ^ The parse succeeded.  The @i@ parameter is the input that had
    -- not yet been consumed (if any) when the parse succeeded.
```

This definition resembles `Megaparsec`'s continuation-passing parser, but with a far simpler (and more specialized) `State`.

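To make the `Partial` constructor more concrete, here is a minimal sketch of incremental parsing. It is a toy model over `String`, not `attoparsec`'s code: the `Result` type, the `feed` helper and the `digits` parser are all assumptions introduced for illustration. It only mirrors the documented convention that an empty chunk signals the end of the input.

```haskell
module Main where

import Data.Char (isDigit)

-- Toy counterpart of 'IResult', specialised to String input (an assumption,
-- not the library's type).
data Result a
  = Fail String                  -- error message
  | Partial (String -> Result a) -- suspended: waiting for the next chunk
  | Done String a                -- leftover input and parsed value

-- Hypothetical helper: hand one more chunk to a suspended parse.
-- In this toy, a finished result is returned unchanged.
feed :: Result a -> String -> Result a
feed (Partial k) chunk = k chunk
feed r _ = r

-- Parse a non-empty run of digits that may span several chunks.
digits :: String -> Result Int
digits = go ""
  where
    go acc chunk = case span isDigit chunk of
      (ds, "")
        | null chunk -> finish acc ""            -- empty chunk: end of input
        | otherwise  -> Partial (go (acc ++ ds)) -- ran out of input: suspend
      (ds, rest) -> finish (acc ++ ds) rest      -- hit a non-digit: stop
    finish "" _ = Fail "digits: expected at least one digit"
    finish ds rest = Done rest (read ds)

-- Render a result without needing a Show instance for the continuation.
render :: Show a => Result a -> String
render (Fail msg) = "Fail " ++ show msg
render (Partial _) = "Partial"
render (Done rest a) = "Done " ++ show rest ++ " " ++ show a

main :: IO ()
main = do
  let r1 = digits "12" -- more digits could follow, so the parse suspends
      r2 = feed r1 "34"
      r3 = feed r2 ""  -- the empty chunk ends the input
  mapM_ (putStrLn . render) [r1, r2, r3]
```

The first two results are `Partial`, and only the empty chunk turns the parse into a `Done` carrying `1234`.
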
# Running the Parser

Let's see how to run a `Parser` (for `ByteString`):

```haskell
parse :: Parser a -> ByteString -> Result a
parse m s = T.runParser m (buffer s) (Pos 0) Incomplete failK successK

failK :: Failure a
failK t (Pos pos) _more stack msg = Fail (Buf.unsafeDrop pos t) stack msg

successK :: Success a a
successK t (Pos pos) _more a = Done (Buf.unsafeDrop pos t) a

buffer :: ByteString -> Buffer
```

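`T.runParser` is the record accessor that unwraps the function stored in a `Parser`.

It is actually very simple: error handling is limited to the position (`Pos`) and
the current processing state, both of which are passed in directly. The two final
continuations only drop the consumed input before building a `Fail` or a `Done`.
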
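The way the two final continuations close the computation can be sketched with a stripped-down model. This is an assumption-laden toy, not the library's code: it works on `String`, drops `More` and the context stack, and the names `anyChar` and `Result` are introduced here for illustration only.

```haskell
{-# LANGUAGE RankNTypes #-}
module Main where

newtype Pos = Pos Int

-- Toy result type: unconsumed input plus either a message or a value.
data Result a = Fail String String | Done String a
  deriving (Eq, Show)

type Failure r   = String -> Pos -> String -> Result r
type Success a r = String -> Pos -> a -> Result r

-- A parser is a function taking the buffer, a position and two continuations.
newtype Parser a = Parser
  { runParser :: forall r. String -> Pos -> Failure r -> Success a r -> Result r }

-- Hypothetical primitive: consume one character by moving the position.
anyChar :: Parser Char
anyChar = Parser $ \t (Pos p) lose succ' ->
  case drop p t of
    (c : _) -> succ' t (Pos (p + 1)) c
    []      -> lose t (Pos p) "anyChar: end of input"

-- Running a parser means supplying the initial state and the two
-- terminal continuations, which turn the final state into a Result.
parse :: Parser a -> String -> Result a
parse m s = runParser m s (Pos 0) failK successK
  where
    failK t (Pos p) msg = Fail (drop p t) msg
    successK t (Pos p) a = Done (drop p t) a

main :: IO ()
main = do
  print (parse anyChar "ab")
  print (parse anyChar "")
```

The buffer is never modified while parsing; the unconsumed input is only computed once, at the very end, from the final position.
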
# Combinators

Let's have a look at some combinators, starting with the usual instances:

```haskell
instance Applicative (Parser i) where
    pure v = Parser $ \t !pos more _lose succ -> succ t pos more v

instance Alternative (Parser i) where
    f <|> g = Parser $ \t pos more lose succ ->
      let lose' t' _pos' more' _ctx _msg = runParser g t' pos more' lose succ
      in runParser f t pos more lose' succ

instance Monad (Parser i) where
    m >>= k = Parser $ \t !pos more lose succ ->
      let succ' t' !pos' more' a = runParser (k a) t' pos' more' lose succ
      in runParser m t pos more lose succ'
```

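One consequence of the `Alternative` instance is worth spelling out: the failure continuation `lose'` ignores the position reached by the first branch and reruns the second branch from the original `pos`, so a failed alternative always backtracks. The following sketch reproduces that behaviour in a simplified model. It is a toy over `String` with `Either` as result type; `char` and `parseOnly` here are local definitions, not the library's functions.

```haskell
{-# LANGUAGE RankNTypes #-}
module Main where

import Control.Applicative (Alternative (..))

type Failure r   = String -> Int -> String -> Either String r
type Success a r = String -> Int -> a -> Either String r

newtype Parser a = Parser
  { runParser :: forall r. String -> Int -> Failure r -> Success a r -> Either String r }

instance Functor Parser where
  fmap f p = Parser $ \t pos lose succ' ->
    runParser p t pos lose (\t' pos' a -> succ' t' pos' (f a))

instance Applicative Parser where
  pure v = Parser $ \t pos _lose succ' -> succ' t pos v
  pf <*> pa = Parser $ \t pos lose succ' ->
    runParser pf t pos lose $ \t' pos' f ->
      runParser pa t' pos' lose $ \t'' pos'' a -> succ' t'' pos'' (f a)

instance Alternative Parser where
  empty = Parser $ \t pos lose _succ -> lose t pos "empty"
  f <|> g = Parser $ \t pos lose succ' ->
    -- On failure, ignore the position reached by f and rerun g from pos.
    let lose' t' _pos' _msg = runParser g t' pos lose succ'
    in runParser f t pos lose' succ'

-- Match one specific character at the current position.
char :: Char -> Parser Char
char c = Parser $ \t pos lose succ' ->
  case drop pos t of
    (x : _) | x == c -> succ' t (pos + 1) x
    _                -> lose t pos ("char " ++ show c)

parseOnly :: Parser a -> String -> Either String a
parseOnly p s = runParser p s 0 (\_ _ msg -> Left msg) (\_ _ a -> Right a)

main :: IO ()
main = do
  -- The first branch consumes 'a' before failing on 'b';
  -- the second branch still starts again from position 0.
  print (parseOnly ((char 'a' *> char 'b') <|> (char 'a' *> char 'c')) "ac")
  print (parseOnly (char 'x' <|> char 'y') "z")
```

Because the position is a plain value passed to continuations, backtracking costs nothing more than reusing the old `pos`.
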
These instances are straightforward; some primitive combinators are more interesting:

```haskell
satisfy :: (Word8 -> Bool) -> Parser Word8
satisfy p = do
  h <- peekWord8'
  if p h
    then advance 1 >> return h
    else fail "satisfy"

-- | Match any byte, to perform lookahead.  Does not consume any
-- input, but will fail if end of input has been reached.
peekWord8' :: Parser Word8
peekWord8' = T.Parser $ \t pos more lose succ ->
    if lengthAtLeast pos 1 t
    then succ t pos more (Buf.unsafeIndex t (fromPos pos))
    else let succ' t' pos' more' bs' = succ t' pos' more' $! B.unsafeHead bs'
         in ensureSuspended 1 t pos more lose succ'

advance :: Int -> Parser ()
advance n = T.Parser $ \t pos more _lose succ ->
  succ t (pos + Pos n) more ()
```

Here comes the interesting part: `satisfy` peeks at the byte at the current
position and, if the predicate holds, only increments the position; the input
buffer itself is left untouched.

We can easily understand the reason behind `attoparsec`'s speed: basic error
handling and a small state.

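The "peek, then advance" pattern can be sketched without continuations. The model below is an assumption made for illustration: a parser is a function from an immutable `ByteString` and an index to either an error or a new index and a value. The names `peek'`, `advance`, `failP` and this `satisfy` are local to the sketch and work on `Char` rather than `Word8`.

```haskell
module Main where

import qualified Data.ByteString.Char8 as B
import Data.Char (isDigit)

-- The buffer is shared and never sliced; only the Int position changes.
newtype Parser a = Parser
  { runParser :: B.ByteString -> Int -> Either String (Int, a) }

instance Functor Parser where
  fmap f (Parser g) = Parser $ \t pos -> fmap (fmap f) (g t pos)

instance Applicative Parser where
  pure a = Parser $ \_ pos -> Right (pos, a)
  pf <*> pa = Parser $ \t pos -> do
    (pos', f) <- runParser pf t pos
    (pos'', a) <- runParser pa t pos'
    Right (pos'', f a)

instance Monad Parser where
  m >>= k = Parser $ \t pos -> do
    (pos', a) <- runParser m t pos
    runParser (k a) t pos'

failP :: String -> Parser a
failP msg = Parser $ \_ _ -> Left msg

-- Look at the current character without consuming it.
peek' :: Parser Char
peek' = Parser $ \t pos ->
  if pos < B.length t
    then Right (pos, B.index t pos)
    else Left "not enough input"

-- Consuming input is just index arithmetic.
advance :: Int -> Parser ()
advance n = Parser $ \_ pos -> Right (pos + n, ())

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = do
  h <- peek'
  if p h
    then advance 1 >> return h
    else failP "satisfy"

main :: IO ()
main = do
  print (runParser (satisfy isDigit) (B.pack "7up") 0)
  print (runParser ((,) <$> satisfy isDigit <*> satisfy isDigit) (B.pack "42") 0)
  print (runParser (satisfy isDigit) (B.pack "up") 0)
```

Unlike the real `peekWord8'`, this sketch fails immediately when the buffer is exhausted instead of suspending to ask for more input.
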
# Zepto

`attoparsec` comes with another parser combinator type: [`Zepto`](https://hackage.haskell.org/package/attoparsec-0.14.3/docs/Data-Attoparsec-Zepto.html):

> A tiny, highly specialized combinator parser for `ByteString` strings.
>
> While the main attoparsec module generally performs well, this module is particularly fast for simple non-recursive loops that should not normally result in failed parses.

Let's have a look:

```haskell
-- | A simple parser.
--
-- This monad is strict in its state, and the monadic bind operator
-- ('>>=') evaluates each result to weak head normal form before
-- passing it along.
newtype ZeptoT m a = Parser {
      runParser :: S -> m (Result a)
    }

type Parser a = ZeptoT Identity a

newtype S = S {
      input :: ByteString
    }

data Result a = Fail String
              | OK !a S
```

This is arguably the simplest parser combinator type you can come up with: a
function from the remaining input to a `Result`, wrapped in a monad `m`.

```haskell
instance (Monad m) => Applicative (ZeptoT m) where
    pure a = Parser $ \s -> return (OK a s)

instance Monad m => Alternative (ZeptoT m) where
    empty = fail "empty"

    a <|> b = Parser $ \s -> do
      result <- runParser a s
      case result of
        ok@(OK _ _) -> return ok
        _           -> runParser b s

instance Monad m => Monad (ZeptoT m) where
    m >>= k = Parser $ \s -> do
      result <- runParser m s
      case result of
        OK a s'  -> runParser (k a) s'
        Fail err -> return (Fail err)
```

The instances look tedious because each one pattern-matches on `Result`, but in
the end this is just the boilerplate needed to thread the state through the
function stored in `Parser`. Running a parser unwraps the final `Result` into an `Either`:

```haskell
parseT :: Monad m => ZeptoT m a -> ByteString -> m (Either String a)
parseT p bs = do
  result <- runParser p (S bs)
  case result of
    OK a _   -> return (Right a)
    Fail err -> return (Left err)
```

It comes with the following primitives:

```haskell
gets :: Monad m => (S -> a) -> ZeptoT m a
gets f = Parser $ \s -> return (OK (f s) s)

put :: Monad m => S -> ZeptoT m ()
put s = Parser $ \_ -> return (OK () s)
```

On top of these, the module defines its only parsing functions: `takeWhile`, `take`, `string`, and `atEnd`.

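To see how far `gets` and `put` go, here is a sketch of how `takeWhile` and `string` could be built from them. It is a simplified model, not the module's source: the monad transformer is dropped (a plain `Parser` instead of `ZeptoT m`), predicates work on `Char`, and `failP`, `parse` and `keyValue` are names introduced for this example.

```haskell
module Main where

import Prelude hiding (takeWhile)
import qualified Data.ByteString.Char8 as B

newtype S = S { input :: B.ByteString }

data Result a = Fail String | OK !a S

-- Simplified: no underlying monad, unlike ZeptoT.
newtype Parser a = Parser { runParser :: S -> Result a }

instance Functor Parser where
  fmap f m = Parser $ \s -> case runParser m s of
    OK a s'  -> OK (f a) s'
    Fail err -> Fail err

instance Applicative Parser where
  pure a = Parser $ \s -> OK a s
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad Parser where
  m >>= k = Parser $ \s -> case runParser m s of
    OK a s'  -> runParser (k a) s'
    Fail err -> Fail err

gets :: (S -> a) -> Parser a
gets f = Parser $ \s -> OK (f s) s

put :: S -> Parser ()
put s = Parser $ \_ -> OK () s

failP :: String -> Parser a
failP msg = Parser $ \_ -> Fail msg

-- Split the remaining input and keep the tail as the new state.
takeWhile :: (Char -> Bool) -> Parser B.ByteString
takeWhile p = do
  (h, t) <- gets (B.span p . input)
  put (S t)
  return h

-- Succeed only if the remaining input starts with the given bytes.
string :: B.ByteString -> Parser ()
string s = do
  i <- gets input
  if s `B.isPrefixOf` i
    then put (S (B.drop (B.length s) i))
    else failP "string"

parse :: Parser a -> B.ByteString -> Either String a
parse p bs = case runParser p (S bs) of
  OK a _   -> Right a
  Fail err -> Left err

-- Hypothetical example: split "key=value" at the first '='.
keyValue :: Parser (B.ByteString, B.ByteString)
keyValue = do
  key <- takeWhile (/= '=')
  string (B.pack "=")
  value <- takeWhile (const True)
  return (key, value)

main :: IO ()
main = do
  print (parse keyValue (B.pack "lang=haskell"))
  print (parse keyValue (B.pack "novalue"))
```

Every function is a thin wrapper over a `ByteString` operation, which fits the module's description as suited to simple non-recursive loops.
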
# Conclusion

`attoparsec` clearly focuses on performance. That said, it teaches two important lessons:

* It achieves that performance through the simplest design possible, sticking closely to its underlying data types (mostly `ByteString`)
* It comes with a second abstraction, which pushes these principles to the extreme

Interestingly, one might assume that achieving good performance requires more
cryptic code, yet this is the simplest implementation we have seen so far.
