Skip to content

type inference: fix bug in union-bound algorithm #289

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 14, 2025

Conversation

apoelstra
Copy link
Collaborator

We have implemented the "halving" variant of the union-bound algorithm, but my code to do so was wrong in a way that sometimes causes types to "come ununified", i.e. a single type bound is split into two, with different instances in the program more-or-less randomly assigned to each side of the split.

This causes all sorts of weird behavior.

Fixes #286

We have logic to output `...` when a type gets too deep and to stop
recursing at that point. Most notably this is useful when outputting
an occurs-check-failing type. But the max depth is 64, so you can make a
type of size 2^64 without hitting this limit, which will still blow out
your memory and effectively stall.

This can create pretty confusing symptoms when fuzzing for pathological
programs -- you will get an apparent stall with no output, and no matter
how many debug printlns you add it will appear that they all execute
before the stall -- and it turns out that in fact all your code was
working correctly up to the point of the final panic where it tries to
*display* an occurs-check error.

Cut off the display at 10000 nodes. If somebody really needs to output a
text representation of such a thing they can write their own logic.
@apoelstra apoelstra force-pushed the 2025-04--invalid-encode branch from c61d31f to 0a3fe2a Compare April 14, 2025 15:04
In the union-bound algorithm we track our unified type variables as
sets, which are represented by trees of pointers, where the roots of the
trees are used as set representatives for determining equality etc. When
unifying variables, we move one's tree into the tree of the other, and
then copy all its bounds to the root of the new unified tree.

Most operations within union-bound are given an arbitrary set element
and immediately locate its root before doing anything. We obtain better
asymptotics by using the "halving" variant of union-bound, wherein on
each root lookup, as we follow the path from the element to its root, we
update the links to skip every other one. Basically, if node x points to
its parent which points to a grandparent, we update x to point to the
grandparent directly.

*However* our existing code, rather than making x point to the
grandparent, would just copy the grandparent over the parent. I'm not
sure what I was thinking here -- I think I got tangled up trying to deal
with Rust's multiple-aliasing rules without deadlocking and lost track
of the actual algorithm, or maybe I misread Russell's C code which I was
mostly copying from -- but the result is both weird and incorrect. In
particular, if the grandparent is a root, then by copying its data into
parent, we duplicate the root, effectively un-unifying the variable at
some hard-to-predict point.

Once the un-unification happens, then future unifications will appear to
only get half-applied, the resulting type inference will be slightly
wrong, we will produce programs which fail type-checking or
sharing-checking after being encoded, even though they were successfully
constructed.

The fix, amusingly, actually reduces the amount of locking, which might
provide a slight speedup, and will certainly make it easier to convince
yourself that this code is deadlock-free. (Probably we should drop all
these mutexes and just use raw pointers and unsafe code, which would be
easier to follow; the mutexes serve no purpose other than to mollify the
Rust typechecker; we assure type-safety by locking the entire slab, and
locking individual set elements is neither necessary nor sufficient to
get a thread-safe type inference algorithm.)

Thanks to Russell O'Connor for identifying this bug, and some thanks to
Claude 3.7-sonnet for producing weird not-quite-sensible "fixes" that
would change observed behavior on our bad test cases, and pointed us
to roughly the correct location.
You can rebase this PR to move the unit tests in front of the fix, and
you will see that they both fail.
…code/decode roundtrip

This construction algorithm is a bit fragile and will often fail but the
fuzzer should be able to make some progress with it. In particular if,
during construction, it encounters any type inference errors, it will
fail.

I used this to generate the unit tests in this PR.
@apoelstra apoelstra force-pushed the 2025-04--invalid-encode branch from 0a3fe2a to f821e22 Compare April 14, 2025 15:17
Copy link
Collaborator Author

@apoelstra apoelstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On f821e22 successfully ran local tests

Copy link

@roconnor-blockstream roconnor-blockstream left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

@apoelstra
Copy link
Collaborator Author

@uncomputable can you ack (or utack) this?

Copy link
Collaborator

@uncomputable uncomputable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK f821e22

@uncomputable uncomputable merged commit 5c3ff26 into master Apr 14, 2025
35 checks passed
@uncomputable uncomputable deleted the 2025-04--invalid-encode branch April 14, 2025 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encoded program cannot be decoded
3 participants