Skip to content

Commit 8ea6a84

Browse files
authored
Merge pull request #191 from nikomatsakis/nikomatsakis-main
merge in parsing improvements
2 parents 57126c9 + 316adb5 commit 8ea6a84

File tree

18 files changed

+836
-148
lines changed

18 files changed

+836
-148
lines changed

book/src/formality_core/parse.md

Lines changed: 56 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,18 @@ enum MyEnum {
1919
}
2020
```
2121

22-
### Ambiguity and greedy parsing
22+
### Succeeding, failing, and _almost_ succeeding
23+
24+
When you attempt to parse something, you'll get back a `Result`: either the parse succeeded (`Ok`), or it didn't (`Err`). But we actually distinguish three outcomes:
25+
26+
- Success: we parsed a value successfully. We generally implement a **greedy** parse, which means we will attempt to consume as many things we can. As a simple example, imagine you are parsing a list of numbers. If the input is `"1, 2, 3"`, we could choose to parse just `[1, 2]` (or indeed just `[1]`), but we will instead parse the full list.
27+
- For you parsing nerds, this is analogous to the commonly used rule to prefer shifts over reduces in LR parsers.
28+
- Failure: we tried to parse the value, but it clearly did not correspond to the thing we are looking for. This usually means that the first token was not a valid first token. This will give a not-very-helpful error message like "expected an `Expr`" (assuming we are parsing an `Expr`).
29+
- _Almost_ succeeded: this is a special case of failure where we got part-way through parsing, consuming some tokens, but then encountered an error. So for example if we had an input like `"1 / / 3"`, we might give an error like "expected an `Expr`, found `/`". Exactly how many tokens we have to consume before we consider something to have 'almost' succeeded depends on the thing we are parsing (see the discussion on _commit points_ below).
30+
31+
Both failure and 'almost' succeeding correspond to a return value of `Err`. The difference is in the errors contained in the result. If there is a single error and it occurs at the start of the input (possibly after skipping whitespace), that is considered **failure**. Otherwise the parse "almost" succeeded. The distinction between failure and "almost" succeeding helps us to give better error messages, but it is also important for "optional" parsing or when parsing repeated items.
32+
33+
### Resolving ambiguity, greedy parsing
2334

2435
When parsing an enum there will be multiple possibilities. We will attempt to parse them all. If more than one succeeds, the parser will attempt to resolve the ambiguity by looking for the **longest match**. However, we don't just consider the number of characters, we look for a **reduction prefix**:
2536

@@ -53,11 +64,10 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
5364
- Most things are _terminals_ or _tokens_: this means they just match themselves:
5465
- For example, the `*` in `#[grammar($v0 * $v1)]` is a terminal, and it means to parse a `*` from the input.
5566
- Delimeters are accepted but must be matched, e.g., `( /* tokens */ )` or `[ /* tokens */ ]`.
56-
- Things beginning with `$` are _nonterminals_ -- they parse the contents of a field. The grammar for a field is generally determined from its type.
67+
- The `$` character is used to introduce special matches. Generally these are _nonterminals_, which means they parse the contents of a field, where the grammar for a field is determined by its type.
5768
- If fields have names, then `$field` should name the field.
5869
- For position fields (e.g., the T and U in `Mul(Expr, Expr)`), use `$v0`, `$v1`, etc.
59-
- Exception: `$$` is treated as the terminal `'$'`.
60-
- Nonterminals have various modes:
70+
- Valid uses of `$` are as follows:
6171
- `$field` -- just parse the field's type
6272
- `$*field` -- the field must be a collection of `T` (e.g., `Vec<T>`, `Set<T>`) -- parse any number of `T` instances. Something like `[ $*field ]` would parse `[f1 f2 f3]`, assuming `f1`, `f2`, and `f3` are valid values for `field`.
6373
- `$,field` -- similar to the above, but uses a comma separated list (with optional trailing comma). So `[ $,field ]` will parse something like `[f1, f2, f3]`.
@@ -71,10 +81,50 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
7181
- `${field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`
7282
- `${?field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`, but accept empty string as empty vector
7383
- `$:guard <nonterminal>` -- parses `<nonterminal>` but only if the keyword `guard` is present. For example, `$:where $,where_clauses` would parse `where WhereClause1, WhereClause2, WhereClause3` but would also accept nothing (in which case, you would get an empty vector).
84+
- `$!` -- marks a commit point, see the section on greediness below
85+
- `$$` -- parse the terminal `$`
86+
87+
### Commit points and greedy parsing
88+
89+
When you parse an optional (e.g., `$?field`) or repeated (e.g., `$*field`) nonterminal, it raises an interesting question. We will attempt to parse the given field, but how do we treat an error? It could mean that the field is not present, but it also could mean a syntax error on the part of the user. To resolve this, we make use of the distinction between failure and _almost_ succeeding that we introduced earlier:
90+
91+
- If parsing `field` outright **fails**, that means that the field was not present, and so the parse can continue with the field having its `Default::default()` value.
92+
- If parsing `field` **almost succeeds**, then we assume it was present, but there is a syntax error, and so parsing fails.
93+
94+
The default rule is that parsing "almost" succeeds if it consumes at least one token. So e.g. if you had...
95+
96+
```rust
97+
#[term]
98+
enum Projection {
99+
#[grammar(. $v0)]
100+
Field(Id),
101+
}
102+
```
74103

75-
### Greediness
104+
...and you tried to parse `".#"`, that would "almost" succeed, because it would consume the `.` but then fail to find an identifier.
105+
106+
Sometimes this rule is not quite right. For example, maybe the `Projection` type is embedded in another type like
107+
108+
```rust
109+
#[term($*projections . #)]
110+
struct ProjectionsThenHash {
111+
projections: Vec<Projection>,
112+
}
113+
```
114+
115+
For `ProjectionsThenHash`, we would consider `".#"` to be a valid parse -- it starts out with no projections and then parses `.#`. But if you try this, you will get an error because the `.#` is considered to be an "almost success" of a projection.
116+
117+
You can control this by indicating a "commit point" with `$!`. If `$!` is present, the parse is failure unless the commit point has been reached. For our grammar above, modifying `Projection` to have a commit point _after_ the identifier will let `ProjectionsThenHash` parse as expected:
118+
119+
```rust
120+
#[term]
121+
enum Projection {
122+
#[grammar(. $v0 $!)]
123+
Field(Id),
124+
}
125+
```
76126

77-
Parsing is generally greedy. So `$*x` and `$,x`, for example, consume as many entries as they can. Typically this works best if `x` begins with some symbol that indicates whether it is present.
127+
See the `parser_torture_tests::commit_points` code for an example of this in action.
78128

79129
### Default grammar
80130

crates/formality-core/src/cast.rs

Lines changed: 19 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -246,28 +246,27 @@ where
246246
}
247247
}
248248

249-
impl<A, B, A1, B1> UpcastFrom<(A1, B1)> for (A, B)
250-
where
251-
A1: Upcast<A>,
252-
B1: Upcast<B>,
253-
{
254-
fn upcast_from(term: (A1, B1)) -> Self {
255-
let (a1, b1) = term;
256-
(a1.upcast(), b1.upcast())
257-
}
249+
macro_rules! tuple_impl {
250+
($($from:ident),*; $($to:ident),*) => {
251+
impl<$($from,)* $($to,)*> UpcastFrom<($($from,)*)> for ($($to,)*)
252+
where
253+
$($from: Upcast<$to>,)*
254+
{
255+
#[allow(non_snake_case)]
256+
fn upcast_from(term: ($($from,)*)) -> Self {
257+
let ($($from,)*) = term;
258+
($($from.upcast(),)*)
259+
}
260+
}
261+
};
258262
}
259263

260-
impl<A, B, C, A1, B1, C1> UpcastFrom<(A1, B1, C1)> for (A, B, C)
261-
where
262-
A1: Upcast<A>,
263-
B1: Upcast<B>,
264-
C1: Upcast<C>,
265-
{
266-
fn upcast_from(term: (A1, B1, C1)) -> Self {
267-
let (a1, b1, c1) = term;
268-
(a1.upcast(), b1.upcast(), c1.upcast())
269-
}
270-
}
264+
tuple_impl!(A1, B1; A, B);
265+
tuple_impl!(A1, B1, C1; A, B, C);
266+
tuple_impl!(A1, B1, C1, D1; A, B, C, D);
267+
tuple_impl!(A1, B1, C1, D1, E1; A, B, C, D, E);
268+
tuple_impl!(A1, B1, C1, D1, E1, F1; A, B, C, D, E, F);
269+
tuple_impl!(A1, B1, C1, D1, E1, F1, G1; A, B, C, D, E, F, G);
271270

272271
#[macro_export]
273272
macro_rules! cast_impl {

crates/formality-core/src/collections.rs

Lines changed: 64 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,10 @@
33
44
use std::collections::{BTreeMap, BTreeSet};
55

6-
use crate::cast::{DowncastTo, Upcast, UpcastFrom, Upcasted};
6+
use crate::{
7+
cast::{DowncastTo, Upcast, UpcastFrom, Upcasted},
8+
Downcast,
9+
};
710

811
pub type Map<K, V> = BTreeMap<K, V>;
912
pub type Set<E> = BTreeSet<E>;
@@ -136,6 +139,31 @@ impl<T> DowncastTo<()> for Set<T> {
136139
}
137140
}
138141

142+
impl<A, T> DowncastTo<(A,)> for Set<T>
143+
where
144+
T: DowncastTo<A> + Ord,
145+
{
146+
fn downcast_to(&self) -> Option<(A,)> {
147+
if self.len() == 1 {
148+
let a: A = self.first().unwrap().downcast()?;
149+
Some((a,))
150+
} else {
151+
None
152+
}
153+
}
154+
}
155+
156+
impl<A, T> UpcastFrom<(A,)> for Set<T>
157+
where
158+
A: Clone + Upcast<T>,
159+
T: Ord + Clone,
160+
{
161+
fn upcast_from(term: (A,)) -> Self {
162+
let (a,) = term;
163+
set![a.upcast()]
164+
}
165+
}
166+
139167
impl<T> DowncastTo<()> for Vec<T> {
140168
fn downcast_to(&self) -> Option<()> {
141169
if self.is_empty() {
@@ -146,6 +174,31 @@ impl<T> DowncastTo<()> for Vec<T> {
146174
}
147175
}
148176

177+
impl<A, T> DowncastTo<(A,)> for Vec<T>
178+
where
179+
T: DowncastTo<A>,
180+
{
181+
fn downcast_to(&self) -> Option<(A,)> {
182+
if self.len() == 1 {
183+
let a: A = self.first().unwrap().downcast()?;
184+
Some((a,))
185+
} else {
186+
None
187+
}
188+
}
189+
}
190+
191+
impl<A, T> UpcastFrom<(A,)> for Vec<T>
192+
where
193+
A: Clone + Upcast<T>,
194+
T: Clone,
195+
{
196+
fn upcast_from(term: (A,)) -> Self {
197+
let (a,) = term;
198+
vec![a.upcast()]
199+
}
200+
}
201+
149202
macro_rules! tuple_upcast {
150203
($coll:ident : $($name:ident),*) => {
151204
/// Upcast from a tuple of iterable things into a collection.
@@ -187,13 +240,15 @@ tuple_upcast!(Vec: A, B, C, D);
187240
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
188241
pub struct Cons<T, C>(pub T, pub C);
189242

190-
impl<T, U> UpcastFrom<Cons<U, Set<T>>> for Set<T>
243+
impl<T, U, V> UpcastFrom<Cons<U, V>> for Set<T>
191244
where
192245
T: Ord + Clone,
193246
U: Upcast<T>,
247+
V: Upcast<Set<T>>,
194248
{
195-
fn upcast_from(term: Cons<U, Set<T>>) -> Self {
196-
let Cons(elem, mut set) = term;
249+
fn upcast_from(term: Cons<U, V>) -> Self {
250+
let Cons(elem, set) = term;
251+
let mut set: Set<T> = set.upcast();
197252
set.insert(elem.upcast());
198253
set
199254
}
@@ -214,13 +269,14 @@ where
214269
}
215270
}
216271

217-
impl<T, U> UpcastFrom<Cons<U, Vec<T>>> for Vec<T>
272+
impl<T, U, V> UpcastFrom<Cons<U, V>> for Vec<T>
218273
where
219-
T: Ord + Clone,
220274
U: Upcast<T>,
275+
V: Upcast<Vec<T>>,
221276
{
222-
fn upcast_from(term: Cons<U, Vec<T>>) -> Self {
223-
let Cons(elem, mut vec) = term;
277+
fn upcast_from(term: Cons<U, V>) -> Self {
278+
let Cons(elem, vec) = term;
279+
let mut vec: Vec<T> = vec.upcast();
224280
vec.insert(0, elem.upcast());
225281
vec
226282
}

crates/formality-core/src/judgment.rs

Lines changed: 70 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,14 @@
11
use std::cell::RefCell;
22

3-
use crate::{fixed_point::FixedPointStack, Set};
3+
use crate::{fixed_point::FixedPointStack, Fallible, Set};
4+
5+
mod assertion;
6+
pub use assertion::JudgmentAssertion;
47

58
mod proven_set;
69
pub use proven_set::{FailedJudgment, FailedRule, ProvenSet, RuleFailureCause, TryIntoIter};
710

11+
mod test_fallible;
812
mod test_filtered;
913
mod test_reachable;
1014

@@ -76,8 +80,18 @@ macro_rules! judgment_fn {
7680
$(let $input_name: $input_ty = $crate::Upcast::upcast($input_name);)*
7781

7882
$(
79-
// Assertions are preconditions
80-
assert!($assert_expr);
83+
// Assertions are preconditions.
84+
//
85+
// NB: we can't use `$crate` in this expression because of the `respan!` call,
86+
// which messes up `$crate` resolution. But we need the respan call for track_caller to properly
87+
// assign the span of the panic to the assertion expression and not the invocation of the judgment_fn
88+
// macro. Annoying! But our proc macros already reference `formality_core` so that seems ok.
89+
$crate::respan!(
90+
$assert_expr
91+
(
92+
formality_core::judgment::JudgmentAssertion::assert($assert_expr, stringify!($assert_expr));
93+
)
94+
);
8195
)*
8296

8397
$(
@@ -290,14 +304,21 @@ macro_rules! push_rules {
290304
// output of this rule, once all the conditions are evaluated.
291305

292306
(@body $args:tt; $inputs:tt; $step_index:expr; (if let $p:pat = $e:expr) $($m:tt)*) => {
293-
let value = &$e;
294-
if let $p = Clone::clone(value) {
295-
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
296-
} else {
297-
$crate::push_rules!(@record_failure $inputs; $step_index, $e; $crate::judgment::RuleFailureCause::IfLetDidNotMatch {
298-
pattern: stringify!($p).to_string(),
299-
value: format!("{:?}", value),
300-
});
307+
match $crate::judgment::try_catch(|| Ok($e)) {
308+
Ok(value) => {
309+
if let $p = Clone::clone(&value) {
310+
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
311+
} else {
312+
$crate::push_rules!(@record_failure $inputs; $step_index, $e; $crate::judgment::RuleFailureCause::IfLetDidNotMatch {
313+
pattern: stringify!($p).to_string(),
314+
value: format!("{:?}", value),
315+
});
316+
}
317+
}
318+
319+
Err(e) => {
320+
$crate::push_rules!(@record_failure $inputs; $step_index, $e; e);
321+
}
301322
}
302323
};
303324

@@ -402,10 +423,31 @@ macro_rules! push_rules {
402423
}
403424
};
404425

426+
(@body $args:tt; $inputs:tt; $step_index:expr; (let $p:ident /*[1]*/: $t:ty = $i:expr) $($m:tt)*) => {
427+
// [1] I'd prefer to have `$p:pat` but the follow-set rules don't allow for it.
428+
// That's dumb.
429+
match $crate::judgment::try_catch::<$t>(|| Ok($i)) {
430+
Ok(p) => {
431+
let $p = p;
432+
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
433+
}
434+
435+
Err(e) => {
436+
$crate::push_rules!(@record_failure $inputs; $step_index, $i; e);
437+
}
438+
}
439+
};
440+
405441
(@body $args:tt; $inputs:tt; $step_index:expr; (let $p:pat = $i:expr) $($m:tt)*) => {
406-
{
407-
let $p = $i;
408-
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
442+
match $crate::judgment::try_catch(|| Ok($i)) {
443+
Ok(p) => {
444+
let $p = p; // this enforces that `$p` is infalliblr
445+
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
446+
}
447+
448+
Err(e) => {
449+
$crate::push_rules!(@record_failure $inputs; $step_index, $i; e);
450+
}
409451
}
410452
};
411453

@@ -455,3 +497,17 @@ macro_rules! push_rules {
455497
}
456498
}
457499
}
500+
501+
/// Helper function that just calls `f` and returns the value.
502+
/// Used for implementing `judgement_fn` macro to allow expressions to include `?`.
503+
pub fn try_catch<R>(f: impl FnOnce() -> Fallible<R>) -> Result<R, RuleFailureCause> {
504+
match f() {
505+
Ok(v) => Ok(v),
506+
507+
// Kind of dumb that `Inapplicable` only includes a `String` and not an `anyhow::Error`
508+
// but it's super annoying to package one of those up in the way we want.
509+
Err(e) => Err(RuleFailureCause::Inapplicable {
510+
reason: e.to_string(),
511+
}),
512+
}
513+
}
Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
use std::fmt::Debug;
2+
3+
/// Helper trait for assertions in judgments.
4+
/// For each `assert(x)`, we will invoke `JudgmentAssertion::assert`.
5+
/// This allows us to support both booleans and results.
6+
pub trait JudgmentAssertion {
7+
fn assert(self, expr: &str);
8+
}
9+
10+
impl<E: Debug> JudgmentAssertion for Result<(), E> {
11+
#[track_caller]
12+
fn assert(self, expr: &str) {
13+
match self {
14+
Ok(()) => (),
15+
Err(e) => panic!("judgment assertion failed: `{expr}` got {e:?}"),
16+
}
17+
}
18+
}
19+
20+
impl JudgmentAssertion for bool {
21+
#[track_caller]
22+
fn assert(self, expr: &str) {
23+
assert!(self, "judgment assertion failed: `{expr}` is false");
24+
}
25+
}

0 commit comments

Comments
 (0)