book/src/formality_core/parse.md
56 additions & 6 deletions
@@ -19,7 +19,18 @@ enum MyEnum {
 }
 ```
 
-### Ambiguity and greedy parsing
+### Succeeding, failing, and _almost_ succeeding
+
+When you attempt to parse something, you'll get back a `Result`: either the parse succeeded (`Ok`), or it didn't (`Err`). But we actually distinguish three outcomes:
+
+- Success: we parsed a value successfully. We generally implement a **greedy** parse, which means we will attempt to consume as many things as we can. As a simple example, imagine you are parsing a list of numbers. If the input is `"1, 2, 3"`, we could choose to parse just `[1, 2]` (or indeed just `[1]`), but we will instead parse the full list.
+  - For you parsing nerds, this is analogous to the commonly used rule to prefer shifts over reduces in LR parsers.
+- Failure: we tried to parse the value, but it clearly did not correspond to the thing we are looking for. This usually means that the first token was not a valid first token. This will give a not-very-helpful error message like "expected an `Expr`" (assuming we are parsing an `Expr`).
+- _Almost_ succeeded: this is a special case of failure where we got part-way through parsing, consuming some tokens, but then encountered an error. So for example if we had an input like `"1 / / 3"`, we might give an error like "expected an `Expr`, found `/`". Exactly how many tokens we have to consume before we consider something to have "almost" succeeded depends on the thing we are parsing (see the discussion on _commit points_ below).
+
+Both failure and "almost" succeeding correspond to a return value of `Err`. The difference is in the errors contained in the result. If there is a single error and it occurs at the start of the input (possibly after skipping whitespace), that is considered **failure**. Otherwise the parse "almost" succeeded. The distinction between failure and "almost" succeeding helps us to give better error messages, but it is also important for "optional" parsing or when parsing repeated items.
+
+### Resolving ambiguity, greedy parsing
 
 When parsing an enum there will be multiple possibilities. We will attempt to parse them all. If more than one succeeds, the parser will attempt to resolve the ambiguity by looking for the **longest match**. However, we don't just consider the number of characters, we look for a **reduction prefix**:
 
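To make the three outcomes from this hunk concrete, here is a minimal sketch. It is not the `formality_core` implementation: the names `ParseError`, `Outcome`, `classify`, `skip_whitespace`, and `parse_number_list` are hypothetical, and the sketch assumes each error records the byte offset at which it was detected. It shows a greedy list parse plus the "single error at the start of the input" rule that separates failure from "almost" succeeding.

```rust
/// Hypothetical error type: records where in the input the error was detected.
#[derive(Debug, Clone, PartialEq)]
struct ParseError {
    /// Byte offset into the original input.
    offset: usize,
    message: String,
}

/// The three outcomes described in the text.
#[derive(Debug, PartialEq)]
enum Outcome<T> {
    Success(T),
    Failure,
    AlmostSucceeded,
}

/// Apply the rule from the text: a single error located at the start of the
/// input (after skipping whitespace) is a failure; any other `Err` means the
/// parse "almost" succeeded.
fn classify<T>(input: &str, result: Result<T, Vec<ParseError>>) -> Outcome<T> {
    let start = input.len() - input.trim_start().len();
    match result {
        Ok(value) => Outcome::Success(value),
        Err(errors) if errors.len() == 1 && errors[0].offset == start => Outcome::Failure,
        Err(_) => Outcome::AlmostSucceeded,
    }
}

/// Return the offset of the first non-whitespace character at or after `pos`.
fn skip_whitespace(input: &str, pos: usize) -> usize {
    let rest = &input[pos..];
    pos + (rest.len() - rest.trim_start().len())
}

/// Toy greedy parser for a comma-separated list of numbers: it keeps
/// consuming entries for as long as another comma follows.
fn parse_number_list(input: &str) -> Result<Vec<u32>, Vec<ParseError>> {
    let mut numbers = Vec::new();
    let mut pos = 0;
    loop {
        pos = skip_whitespace(input, pos);
        let digits = input[pos..].chars().take_while(|c| c.is_ascii_digit()).count();
        // An empty or out-of-range digit run is reported at the current offset.
        let number = input[pos..pos + digits].parse::<u32>().map_err(|_| {
            vec![ParseError { offset: pos, message: "expected a number".to_string() }]
        })?;
        numbers.push(number);
        pos = skip_whitespace(input, pos + digits);
        if input[pos..].starts_with(',') {
            pos += 1; // greedy: a comma means another entry must follow
        } else {
            return Ok(numbers);
        }
    }
}
```

Under these assumptions, `classify("1, 2, 3", parse_number_list("1, 2, 3"))` is `Success([1, 2, 3])`, the input `"  x"` is a `Failure` (the only error sits at the first non-whitespace character), and `"1, x"` is `AlmostSucceeded` (the error occurs after `1,` has been consumed).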
@@ -53,11 +64,10 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
 - Most things are _terminals_ or _tokens_: this means they just match themselves:
   - For example, the `*` in `#[grammar($v0 * $v1)]` is a terminal, and it means to parse a `*` from the input.
   - Delimiters are accepted but must be matched, e.g., `( /* tokens */ )` or `[ /* tokens */ ]`.
-- Things beginning with `$` are _nonterminals_ -- they parse the contents of a field. The grammar for a field is generally determined from its type.
+- The `$` character is used to introduce special matches. Generally these are _nonterminals_, which means they parse the contents of a field, where the grammar for a field is determined by its type.
   - If fields have names, then `$field` should name the field.
   - For positional fields (e.g., the two `Expr` fields in `Mul(Expr, Expr)`), use `$v0`, `$v1`, etc.
-  - Exception: `$$` is treated as the terminal `'$'`.
-  - Nonterminals have various modes:
+  - Valid uses of `$` are as follows:
     - `$field` -- just parse the field's type
     - `$*field` -- the field must be a collection of `T` (e.g., `Vec<T>`, `Set<T>`) -- parse any number of `T` instances. Something like `[ $*field ]` would parse `[f1 f2 f3]`, assuming `f1`, `f2`, and `f3` are valid values for `field`.
     - `$,field` -- similar to the above, but uses a comma separated list (with optional trailing comma). So `[ $,field ]` will parse something like `[f1, f2, f3]`.
@@ -71,10 +81,50 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
     - `${field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`
     - `${?field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`, but accept empty string as empty vector
     - `$:guard <nonterminal>` -- parses `<nonterminal>` but only if the keyword `guard` is present. For example, `$:where $,where_clauses` would parse `where WhereClause1, WhereClause2, WhereClause3` but would also accept nothing (in which case, you would get an empty vector).
+    - `$!` -- marks a commit point, see the section on commit points below
+    - `$$` -- parse the terminal `$`
+
+### Commit points and greedy parsing
+
+When you parse an optional (e.g., `$?field`) or repeated (e.g., `$*field`) nonterminal, it raises an interesting question. We will attempt to parse the given field, but how do we treat an error? It could mean that the field is not present, but it also could mean a syntax error on the part of the user. To resolve this, we make use of the distinction between failure and _almost_ succeeding that we introduced earlier:
+
+- If parsing `field` outright **fails**, that means that the field was not present, and so the parse can continue with the field having its `Default::default()` value.
+- If parsing `field` **almost succeeds**, then we assume it was present, but there is a syntax error, and so parsing fails.
+
+The default rule is that parsing "almost" succeeds if it consumes at least one token. So, for example, if you had...
+
+```rust
+#[term]
+enum Projection {
+    #[grammar(.$v0)]
+    Field(Id),
+}
+```
 
-### Greediness
+...and you tried to parse `".#"`, that would "almost" succeed, because it would consume the `.` but then fail to find an identifier.
+
+Sometimes this rule is not quite right. For example, maybe the `Projection` type is embedded in another type like
+
+```rust
+#[term($*projections . #)]
+struct ProjectionsThenHash {
+    projections: Vec<Projection>,
+}
+```
+
+For `ProjectionsThenHash`, we would consider `".#"` to be a valid parse -- it starts out with no projections and then parses `.#`. But if you try this, you will get an error because the `.#` is considered to be an "almost success" of a projection.
+
+You can control this by indicating a "commit point" with `$!`. If `$!` is present, the parse is considered a failure unless the commit point has been reached. For our grammar above, modifying `Projection` to have a commit point _after_ the identifier will let `ProjectionsThenHash` parse as expected:
+
+```rust
+#[term]
+enum Projection {
+    #[grammar(.$v0 $!)]
+    Field(Id),
+}
+```
 
-Parsing is generally greedy. So `$*x` and `$,x`, for example, consume as many entries as they can. Typically this works best if `x` begins with some symbol that indicates whether it is present.
+See the `parser_torture_tests::commit_points` code for an example of this in action.
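
The effect of the commit point can be imitated with a small hand-written parser. This is only a sketch under stated assumptions, not how `formality_core` implements `$!`: the names `ParseErr`, `parse_projection`, and `parse_projections_then_hash` are hypothetical, identifiers are simplified to ASCII alphanumerics and `_`, and a boolean flag `commit_after_ident` stands in for placing `$!` after the identifier.

```rust
/// Hypothetical error type distinguishing the two kinds of `Err`.
#[derive(Debug, PartialEq)]
enum ParseErr {
    /// The item is simply absent; a repeated or optional parse may stop quietly.
    Failure,
    /// The item was started but is malformed; this is a real syntax error.
    AlmostSucceeded,
}

/// Parse `.ident`, returning the identifier and the remaining input.
/// `commit_after_ident` models a `$!` placed after the identifier.
fn parse_projection(input: &str, commit_after_ident: bool) -> Result<(String, &str), ParseErr> {
    let input = input.trim_start();
    let Some(rest) = input.strip_prefix('.') else {
        return Err(ParseErr::Failure); // no token consumed at all
    };
    let len = rest
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .count();
    if len == 0 {
        // The `.` was consumed but no identifier followed. Without a commit
        // point this is an "almost success"; with the commit point after the
        // identifier, we have not committed yet, so it is a plain failure.
        return Err(if commit_after_ident {
            ParseErr::Failure
        } else {
            ParseErr::AlmostSucceeded
        });
    }
    Ok((rest[..len].to_string(), &rest[len..]))
}

/// Parse zero or more projections followed by `. #` (a toy `$*projections . #`).
fn parse_projections_then_hash(
    mut input: &str,
    commit_after_ident: bool,
) -> Result<Vec<String>, ParseErr> {
    let mut projections = Vec::new();
    loop {
        match parse_projection(input, commit_after_ident) {
            Ok((name, rest)) => {
                projections.push(name);
                input = rest;
            }
            // Failure: no further projection is present, so stop repeating.
            Err(ParseErr::Failure) => break,
            // Almost succeeded: treat as a syntax error and give up.
            Err(ParseErr::AlmostSucceeded) => return Err(ParseErr::AlmostSucceeded),
        }
    }
    let Some(rest) = input.trim_start().strip_prefix('.') else {
        return Err(if projections.is_empty() {
            ParseErr::Failure
        } else {
            ParseErr::AlmostSucceeded
        });
    };
    let rest = rest
        .trim_start()
        .strip_prefix('#')
        .ok_or(ParseErr::AlmostSucceeded)?;
    if rest.trim().is_empty() {
        Ok(projections)
    } else {
        Err(ParseErr::AlmostSucceeded)
    }
}
```

With `commit_after_ident` set to `false`, the input `".#"` is rejected with `AlmostSucceeded`, mirroring the problem described in the diff. With it set to `true`, the same input parses to an empty list of projections, and `".a.b.#"` parses to `["a", "b"]`.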