Skip to content

merge in parsing improvements #191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 24, 2025
Merged
62 changes: 56 additions & 6 deletions book/src/formality_core/parse.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,18 @@ enum MyEnum {
}
```

### Ambiguity and greedy parsing
### Succeeding, failing, and _almost_ succeeding

When you attempt to parse something, you'll get back a `Result`: either the parse succeeded (`Ok`), or it didn't (`Err`). But we actually distinguish three outcomes:

- Success: we parsed a value successfully. We generally implement a **greedy** parse, which means we will attempt to consume as many things we can. As a simple example, imagine you are parsing a list of numbers. If the input is `"1, 2, 3"`, we could choose to parse just `[1, 2]` (or indeed just `[1]`), but we will instead parse the full list.
- For you parsing nerds, this is analogous to the commonly used rule to prefer shifts over reduces in LR parsers.
- Failure: we tried to parse the value, but it clearly did not correspond to the thing we are looking for. This usually means that the first token was not a valid first token. This will give a not-very-helpful error message like "expected an `Expr`" (assuming we are parsing an `Expr`).
- _Almost_ succeeded: this is a special case of failure where we got part-way through parsing, consuming some tokens, but then encountered an error. So for example if we had an input like `"1 / / 3"`, we might give an error like "expected an `Expr`, found `/`". Exactly how many tokens we have to consume before we consider something to have 'almost' succeeded depends on the thing we are parsing (see the discussion on _commit points_ below).

Both failure and 'almost' succeeding correspond to a return value of `Err`. The difference is in the errors contained in the result. If there is a single error and it occurs at the start of the input (possibly after skipping whitespace), that is considered **failure**. Otherwise the parse "almost" succeeded. The distinction between failure and "almost" succeeding helps us to give better error messages, but it is also important for "optional" parsing or when parsing repeated items.

### Resolving ambiguity, greedy parsing

When parsing an enum there will be multiple possibilities. We will attempt to parse them all. If more than one succeeds, the parser will attempt to resolve the ambiguity by looking for the **longest match**. However, we don't just consider the number of characters, we look for a **reduction prefix**:

Expand Down Expand Up @@ -53,11 +64,10 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
- Most things are _terminals_ or _tokens_: this means they just match themselves:
- For example, the `*` in `#[grammar($v0 * $v1)]` is a terminal, and it means to parse a `*` from the input.
- Delimeters are accepted but must be matched, e.g., `( /* tokens */ )` or `[ /* tokens */ ]`.
- Things beginning with `$` are _nonterminals_ -- they parse the contents of a field. The grammar for a field is generally determined from its type.
- The `$` character is used to introduce special matches. Generally these are _nonterminals_, which means they parse the contents of a field, where the grammar for a field is determined by its type.
- If fields have names, then `$field` should name the field.
- For position fields (e.g., the T and U in `Mul(Expr, Expr)`), use `$v0`, `$v1`, etc.
- Exception: `$$` is treated as the terminal `'$'`.
- Nonterminals have various modes:
- Valid uses of `$` are as follows:
- `$field` -- just parse the field's type
- `$*field` -- the field must be a collection of `T` (e.g., `Vec<T>`, `Set<T>`) -- parse any number of `T` instances. Something like `[ $*field ]` would parse `[f1 f2 f3]`, assuming `f1`, `f2`, and `f3` are valid values for `field`.
- `$,field` -- similar to the above, but uses a comma separated list (with optional trailing comma). So `[ $,field ]` will parse something like `[f1, f2, f3]`.
Expand All @@ -71,10 +81,50 @@ A grammar consists of a series of _symbols_. Each symbol matches some text in th
- `${field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`
- `${?field}` -- parse `{E1, E2, E3}`, where `field` is a collection of `E`, but accept empty string as empty vector
- `$:guard <nonterminal>` -- parses `<nonterminal>` but only if the keyword `guard` is present. For example, `$:where $,where_clauses` would parse `where WhereClause1, WhereClause2, WhereClause3` but would also accept nothing (in which case, you would get an empty vector).
- `$!` -- marks a commit point, see the section on greediness below
- `$$` -- parse the terminal `$`

### Commit points and greedy parsing

When you parse an optional (e.g., `$?field`) or repeated (e.g., `$*field`) nonterminal, it raises an interesting question. We will attempt to parse the given field, but how do we treat an error? It could mean that the field is not present, but it also could mean a syntax error on the part of the user. To resolve this, we make use of the distinction between failure and _almost_ succeeding that we introduced earlier:

- If parsing `field` outright **fails**, that means that the field was not present, and so the parse can continue with the field having its `Default::default()` value.
- If parsing `field` **almost succeeds**, then we assume it was present, but there is a syntax error, and so parsing fails.

The default rule is that parsing "almost" succeeds if it consumes at least one token. So e.g. if you had...

```rust
#[term]
enum Projection {
#[grammar(. $v0)]
Field(Id),
}
```

### Greediness
...and you tried to parse `".#"`, that would "almost" succeed, because it would consume the `.` but then fail to find an identifier.

Sometimes this rule is not quite right. For example, maybe the `Projection` type is embedded in another type like

```rust
#[term($*projections . #)]
struct ProjectionsThenHash {
projections: Vec<Projection>,
}
```

For `ProjectionsThenHash`, we would consider `".#"` to be a valid parse -- it starts out with no projections and then parses `.#`. But if you try this, you will get an error because the `.#` is considered to be an "almost success" of a projection.

You can control this by indicating a "commit point" with `$!`. If `$!` is present, the parse is failure unless the commit point has been reached. For our grammar above, modifying `Projection` to have a commit point _after_ the identifier will let `ProjectionsThenHash` parse as expected:

```rust
#[term]
enum Projection {
#[grammar(. $v0 $!)]
Field(Id),
}
```

Parsing is generally greedy. So `$*x` and `$,x`, for example, consume as many entries as they can. Typically this works best if `x` begins with some symbol that indicates whether it is present.
See the `parser_torture_tests::commit_points` code for an example of this in action.

### Default grammar

Expand Down
39 changes: 19 additions & 20 deletions crates/formality-core/src/cast.rs
Original file line number Diff line number Diff line change
Expand Up @@ -246,28 +246,27 @@ where
}
}

impl<A, B, A1, B1> UpcastFrom<(A1, B1)> for (A, B)
where
A1: Upcast<A>,
B1: Upcast<B>,
{
fn upcast_from(term: (A1, B1)) -> Self {
let (a1, b1) = term;
(a1.upcast(), b1.upcast())
}
macro_rules! tuple_impl {
($($from:ident),*; $($to:ident),*) => {
impl<$($from,)* $($to,)*> UpcastFrom<($($from,)*)> for ($($to,)*)
where
$($from: Upcast<$to>,)*
{
#[allow(non_snake_case)]
fn upcast_from(term: ($($from,)*)) -> Self {
let ($($from,)*) = term;
($($from.upcast(),)*)
}
}
};
}

impl<A, B, C, A1, B1, C1> UpcastFrom<(A1, B1, C1)> for (A, B, C)
where
A1: Upcast<A>,
B1: Upcast<B>,
C1: Upcast<C>,
{
fn upcast_from(term: (A1, B1, C1)) -> Self {
let (a1, b1, c1) = term;
(a1.upcast(), b1.upcast(), c1.upcast())
}
}
tuple_impl!(A1, B1; A, B);
tuple_impl!(A1, B1, C1; A, B, C);
tuple_impl!(A1, B1, C1, D1; A, B, C, D);
tuple_impl!(A1, B1, C1, D1, E1; A, B, C, D, E);
tuple_impl!(A1, B1, C1, D1, E1, F1; A, B, C, D, E, F);
tuple_impl!(A1, B1, C1, D1, E1, F1, G1; A, B, C, D, E, F, G);

#[macro_export]
macro_rules! cast_impl {
Expand Down
72 changes: 64 additions & 8 deletions crates/formality-core/src/collections.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,10 @@

use std::collections::{BTreeMap, BTreeSet};

use crate::cast::{DowncastTo, Upcast, UpcastFrom, Upcasted};
use crate::{
cast::{DowncastTo, Upcast, UpcastFrom, Upcasted},
Downcast,
};

pub type Map<K, V> = BTreeMap<K, V>;
pub type Set<E> = BTreeSet<E>;
Expand Down Expand Up @@ -136,6 +139,31 @@ impl<T> DowncastTo<()> for Set<T> {
}
}

impl<A, T> DowncastTo<(A,)> for Set<T>
where
T: DowncastTo<A> + Ord,
{
fn downcast_to(&self) -> Option<(A,)> {
if self.len() == 1 {
let a: A = self.first().unwrap().downcast()?;
Some((a,))
} else {
None
}
}
}

impl<A, T> UpcastFrom<(A,)> for Set<T>
where
A: Clone + Upcast<T>,
T: Ord + Clone,
{
fn upcast_from(term: (A,)) -> Self {
let (a,) = term;
set![a.upcast()]
}
}

impl<T> DowncastTo<()> for Vec<T> {
fn downcast_to(&self) -> Option<()> {
if self.is_empty() {
Expand All @@ -146,6 +174,31 @@ impl<T> DowncastTo<()> for Vec<T> {
}
}

impl<A, T> DowncastTo<(A,)> for Vec<T>
where
T: DowncastTo<A>,
{
fn downcast_to(&self) -> Option<(A,)> {
if self.len() == 1 {
let a: A = self.first().unwrap().downcast()?;
Some((a,))
} else {
None
}
}
}

impl<A, T> UpcastFrom<(A,)> for Vec<T>
where
A: Clone + Upcast<T>,
T: Clone,
{
fn upcast_from(term: (A,)) -> Self {
let (a,) = term;
vec![a.upcast()]
}
}

macro_rules! tuple_upcast {
($coll:ident : $($name:ident),*) => {
/// Upcast from a tuple of iterable things into a collection.
Expand Down Expand Up @@ -187,13 +240,15 @@ tuple_upcast!(Vec: A, B, C, D);
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Cons<T, C>(pub T, pub C);

impl<T, U> UpcastFrom<Cons<U, Set<T>>> for Set<T>
impl<T, U, V> UpcastFrom<Cons<U, V>> for Set<T>
where
T: Ord + Clone,
U: Upcast<T>,
V: Upcast<Set<T>>,
{
fn upcast_from(term: Cons<U, Set<T>>) -> Self {
let Cons(elem, mut set) = term;
fn upcast_from(term: Cons<U, V>) -> Self {
let Cons(elem, set) = term;
let mut set: Set<T> = set.upcast();
set.insert(elem.upcast());
set
}
Expand All @@ -214,13 +269,14 @@ where
}
}

impl<T, U> UpcastFrom<Cons<U, Vec<T>>> for Vec<T>
impl<T, U, V> UpcastFrom<Cons<U, V>> for Vec<T>
where
T: Ord + Clone,
U: Upcast<T>,
V: Upcast<Vec<T>>,
{
fn upcast_from(term: Cons<U, Vec<T>>) -> Self {
let Cons(elem, mut vec) = term;
fn upcast_from(term: Cons<U, V>) -> Self {
let Cons(elem, vec) = term;
let mut vec: Vec<T> = vec.upcast();
vec.insert(0, elem.upcast());
vec
}
Expand Down
84 changes: 70 additions & 14 deletions crates/formality-core/src/judgment.rs
Original file line number Diff line number Diff line change
@@ -1,10 +1,14 @@
use std::cell::RefCell;

use crate::{fixed_point::FixedPointStack, Set};
use crate::{fixed_point::FixedPointStack, Fallible, Set};

mod assertion;
pub use assertion::JudgmentAssertion;

mod proven_set;
pub use proven_set::{FailedJudgment, FailedRule, ProvenSet, RuleFailureCause, TryIntoIter};

mod test_fallible;
mod test_filtered;
mod test_reachable;

Expand Down Expand Up @@ -76,8 +80,18 @@ macro_rules! judgment_fn {
$(let $input_name: $input_ty = $crate::Upcast::upcast($input_name);)*

$(
// Assertions are preconditions
assert!($assert_expr);
// Assertions are preconditions.
//
// NB: we can't use `$crate` in this expression because of the `respan!` call,
// which messes up `$crate` resolution. But we need the respan call for track_caller to properly
// assign the span of the panic to the assertion expression and not the invocation of the judgment_fn
// macro. Annoying! But our proc macros already reference `formality_core` so that seems ok.
$crate::respan!(
$assert_expr
(
formality_core::judgment::JudgmentAssertion::assert($assert_expr, stringify!($assert_expr));
)
);
)*

$(
Expand Down Expand Up @@ -290,14 +304,21 @@ macro_rules! push_rules {
// output of this rule, once all the conditions are evaluated.

(@body $args:tt; $inputs:tt; $step_index:expr; (if let $p:pat = $e:expr) $($m:tt)*) => {
let value = &$e;
if let $p = Clone::clone(value) {
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
} else {
$crate::push_rules!(@record_failure $inputs; $step_index, $e; $crate::judgment::RuleFailureCause::IfLetDidNotMatch {
pattern: stringify!($p).to_string(),
value: format!("{:?}", value),
});
match $crate::judgment::try_catch(|| Ok($e)) {
Ok(value) => {
if let $p = Clone::clone(&value) {
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
} else {
$crate::push_rules!(@record_failure $inputs; $step_index, $e; $crate::judgment::RuleFailureCause::IfLetDidNotMatch {
pattern: stringify!($p).to_string(),
value: format!("{:?}", value),
});
}
}

Err(e) => {
$crate::push_rules!(@record_failure $inputs; $step_index, $e; e);
}
}
};

Expand Down Expand Up @@ -402,10 +423,31 @@ macro_rules! push_rules {
}
};

(@body $args:tt; $inputs:tt; $step_index:expr; (let $p:ident /*[1]*/: $t:ty = $i:expr) $($m:tt)*) => {
// [1] I'd prefer to have `$p:pat` but the follow-set rules don't allow for it.
// That's dumb.
match $crate::judgment::try_catch::<$t>(|| Ok($i)) {
Ok(p) => {
let $p = p;
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
}

Err(e) => {
$crate::push_rules!(@record_failure $inputs; $step_index, $i; e);
}
}
};

(@body $args:tt; $inputs:tt; $step_index:expr; (let $p:pat = $i:expr) $($m:tt)*) => {
{
let $p = $i;
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
match $crate::judgment::try_catch(|| Ok($i)) {
Ok(p) => {
let $p = p; // this enforces that `$p` is infalliblr
$crate::push_rules!(@body $args; $inputs; $step_index + 1; $($m)*);
}

Err(e) => {
$crate::push_rules!(@record_failure $inputs; $step_index, $i; e);
}
}
};

Expand Down Expand Up @@ -455,3 +497,17 @@ macro_rules! push_rules {
}
}
}

/// Helper function that just calls `f` and returns the value.
/// Used for implementing `judgement_fn` macro to allow expressions to include `?`.
pub fn try_catch<R>(f: impl FnOnce() -> Fallible<R>) -> Result<R, RuleFailureCause> {
match f() {
Ok(v) => Ok(v),

// Kind of dumb that `Inapplicable` only includes a `String` and not an `anyhow::Error`
// but it's super annoying to package one of those up in the way we want.
Err(e) => Err(RuleFailureCause::Inapplicable {
reason: e.to_string(),
}),
}
}
25 changes: 25 additions & 0 deletions crates/formality-core/src/judgment/assertion.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
use std::fmt::Debug;

/// Helper trait for assertions in judgments.
/// For each `assert(x)`, we will invoke `JudgmentAssertion::assert`.
/// This allows us to support both booleans and results.
pub trait JudgmentAssertion {
fn assert(self, expr: &str);
}

impl<E: Debug> JudgmentAssertion for Result<(), E> {
#[track_caller]
fn assert(self, expr: &str) {
match self {
Ok(()) => (),
Err(e) => panic!("judgment assertion failed: `{expr}` got {e:?}"),
}
}
}

impl JudgmentAssertion for bool {
#[track_caller]
fn assert(self, expr: &str) {
assert!(self, "judgment assertion failed: `{expr}` is false");
}
}
Loading
Loading