forked from doublec/factor-articles
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathpegs.tex
218 lines (177 loc) · 6.5 KB
/
pegs.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
\chapter{Parsing Expression Grammars}\label{peg}
I'm working on a new parser combinator library for Factor based on
Parsing Expression Grammars and Packrat parsers. This is based on what
I learnt from writing a packrat parser in Javascript.
It's progressing quite well and already fixes some problems in the
existing parser combinator library I wrote. The main issue with that
one is it's not tail recursive and some combinations of parsers can
run out of call stack.
The new library is in the \texttt{peg} vocabulary. I haven't yet implemented
the packrat side of things though so it is slow on large grammars and
inputs.
I've also done a proof of concept of something I've been meaning to do
for awhile. That is writing a parser for EBNF (or similar) that
produces Factor code in the form of parser combinators to implement
the described grammar. The code for this is in the \texttt{peg.ebnf}
vocabulary. It allows you to embed an EBNF-like language and have
Factor words generated for each rule:
\begin{alltt}
<EBNF
digit = '1' | '2' | '3' | '4' .
number = digit { digit } .
expr = number {('+' | '-' ) number } .
EBNF>
\end{alltt}
This example would create three Factor words. \texttt{digit}, \texttt{number} and
\texttt{expr}. These words return parsers that can be used as normal:
\begin{alltt}
"123" number parse
"1" digit parse
"1+243+342" expr parse
\end{alltt}
The EBNF accepted allows for choice, zero or more repetition, optional
(exactly 0 or 1), and grouping. The generated AST is pretty ugly so by
default it works best as a syntax checker. You can modify the
generated AST with action productions:
\begin{alltt}
<EBNF
digit = '1' | '2' | '3' | '4' => convert-to-digit .
number = digit { digit } => convert-to-number .
expr = number '+' number => convert-to-expr .
EBNF>
\end{alltt}
An action is a factor word after the \texttt{=>}. The word receives the AST
produced from the rule on the stack and it can replace that with a new
value that will be used in the AST. So \texttt{convert-to-expr} above might
produce a tuple holding the expression values (by default, a sequence
of terms in the rule are stored in a vector):
\begin{alltt}
TUPLE: ast-expr lhs operator rhs ;
C: <ast-expr> ast-expr
: convert-to-expr ( old -- new )
first3 <ast-expr> ;
\end{alltt}
The generated code is currently pretty ugly, mainly due to it being a
quick proof of concept. I'll try doing a few grammars and tidy it up,
changing the interface if needed, as I go along.
As an experiment I did a grammar for the PL/0 programming
language. It's in \texttt{peg.pl0}. The grammar from the wikipedia article
is:
\begin{alltt}
program = block "." .
block = [ "const" ident "=" number {"," ident "=" number} ";"]
[ "var" ident {"," ident} ";"]
{ "procedure" ident ";" block ";" } statement .
statement = [ ident ":=" expression | "call" ident |
"begin" statement {";" statement } "end" |
"if" condition "then" statement |
"while" condition "do" statement ].
condition = "odd" expression |
expression ("="|"#"|"<"|"<="|">"|">=") expression .
expression = [ "+"|"-"] term { ("+"|"-") term}.
term = factor {("*"|"/") factor}.
factor = ident | number | "(" expression ")".
\end{alltt}
The Factor grammar is very similar:
\begin{verbatim}
: ident ( -- parser )
CHAR: a CHAR: z range
CHAR: A CHAR: Z range 2array choice repeat1
[ >string ] action ;
: number ( -- parser )
CHAR: 0 CHAR: 9 range repeat1 [ string>number ] action ;
<EBNF
program = block '.' .
block = [ 'const' ident '=' number { ',' ident '=' number } ';' ]
[ 'var' ident { ',' ident } ';' ]
{ 'procedure' ident ';' [ block ';' ] } statement .
statement = [ ident ':=' expression | 'call' ident |
'begin' statement {';' statement } 'end' |
'if' condition 'then' statement |
'while' condition 'do' statement ] .
condition = 'odd' expression |
expression ('=' | '#' | '<=' | '<' | '>=' | '>') expression .
expression = ['+' | '-'] term {('+' | '-') term } .
term = factor {('*' | '/') factor } .
factor = ident | number | '(' expression ')'
EBNF>
\end{verbatim}
This grammar as defined works and can parse PL/0 programs. I'll extend
this as I improve the EBNF routines, adding actions, etc to generated
a decent AST.
\section{Parsing Arithmetic Expressions}
This is it without parenthesis handling. I'll leave it as an exercise
to add that. I included code to compile the ast into a factor
expression. Use like:
\begin{verbatim}
"2+3*4-6/2" expr parse parse-result-ast .
"2+3*4-6/2" expr parse parse-result-ast ast-compile dup .
"2+3*4-6/2" expr parse parse-result-ast ast-compile call
\end{verbatim}
The grammar is a direct translation of the one shown in the wikipedia
peg article:
\begin{verbatim}
http://en.wikipedia.org/wiki/Parsing_expression_grammar
\end{verbatim}
The only complication is transforming the result of using 'repeat0'
into a tree by doing a fold. This is standard practice with parsing
expression grammars that don't handle left recursion. When I add left
recursion this won't be needed and the grammar will be simplified.
See this article for details:
\begin{verbatim}
http://vpri.org/pdf/packrat_TR-2007-002.pdf
\end{verbatim}
\begin{verbatim}
USING: sequences kernel math words math.parser namespaces peg ;
IN: scratchpad
: seq* ( quot -- parser )
{ } make seq ; inline
: choice* ( quot -- parser )
{ } make choice ; inline
TUPLE: operator lhs op rhs ;
C: <operator> operator
: operator-fold ( lhs seq -- value )
#! Perform a fold of a lhs, followed by a sequence of pairs being
#! { operator rhs } in to a tree structure of the correct precedence.
swap [ first2 <operator> ] reduce ;
: digits ( -- parser )
CHAR: 0 CHAR: 9 range repeat1 [ string>number ] action ;
: plus ( -- parser )
"+" token ;
: minus ( -- parser )
"-" token ;
: multiply ( -- parser )
"*" token ;
: divide ( -- parser )
"/" token ;
DEFER: expr
: value ( -- parser )
[ digits , [ expr ] delay , ] choice* ;
: product ( -- parser )
[
value ,
[
[ multiply , divide , ] choice* ,
value ,
] seq* repeat0 ,
] seq* [ first2 operator-fold ] action ;
: sum ( -- parser )
[
product ,
[
[ plus , minus , ] choice* ,
product ,
] seq* repeat0 ,
] seq* [ first2 operator-fold ] action ;
: expr ( -- parser )
sum ;
GENERIC: (compile) ( ast -- )
M: number (compile) ( ast -- )
, ;
M: operator (compile) ( ast -- )
dup operator-lhs (compile)
dup operator-rhs (compile)
operator-op "math" lookup , ;
: ast-compile ( ast -- quot )
[ (compile) ] [ ] make ;
\end{verbatim}