proof of concept python extension for frcw #1

Open · wants to merge 1 commit into main

Conversation

msarahan
Contributor

One of the things I discussed with @InnovativeInventor in mggg/GerryChain#379 was making it easier to use frcw from Python. As I mentioned there, I have some experience with wrapping Rust to make it available to Python.

This PR is meant to create discussion around this wrapping idea:

  • Is it worthwhile?
  • Does it avoid needing pcompress to go between GerryChain and frcw? Saving any trips to/from disk can be a big win.
  • What is the minimum viable amount of wrapping to make this useful?
  • What would a "dream/complete" wrapping look like?
  • How do we avoid code duplication (and perhaps also functionality divergence) between here and GerryChain, while also not requiring people to write Rust to test their new ideas?

Note that I changed the error handling for graph-related errors. This is hopefully a nice change from the string errors, which are more annoying to catch and handle. It also makes error handling for the PyO3 wrapping easier.
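
To illustrate the direction (a rough sketch using snafu, which this PR adds to Cargo.toml; the variants and the PyValueError mapping below are illustrative stand-ins, not the exact code in the diff):

use snafu::Snafu;

// Typed graph errors instead of bare strings; these variants are invented for illustration.
#[derive(Debug, Snafu)]
pub enum GraphError {
    #[snafu(display("node {} is out of bounds (graph has {} nodes)", index, size))]
    NodeOutOfBounds { index: usize, size: usize },
    #[snafu(display("edge ({}, {}) references a missing node", src, dst))]
    MissingEdgeNode { src: usize, dst: usize },
}

// Mapping onto a Python exception keeps the PyO3 layer thin.
impl From<GraphError> for pyo3::PyErr {
    fn from(e: GraphError) -> Self {
        pyo3::exceptions::PyValueError::new_err(e.to_string())
    }
}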

There are a lot of nitty-gritty details around mutability, ownership/references/copies, etc. that I haven't spent time examining yet. My Rust is rusty, and was never great to begin with, so I've mostly been trying to get a simple PoC working.

You can test this from the frcw.rs root folder:

pip install maturin
maturin develop
python -c "import frcw; g=frcw.Graph(10); g.edges_start"
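
For anyone who hasn't opened the diff, the wrapping is roughly of this shape (a simplified sketch only, assuming PyO3; the real Graph has more fields, and the placeholder constructor below just mirrors the frcw.Graph(10) example):

use pyo3::prelude::*;

// Stripped-down stand-in for frcw's Graph, just enough to mirror the one-liner above.
#[pyclass]
struct Graph {
    #[pyo3(get)]
    edges_start: Vec<usize>,
}

#[pymethods]
impl Graph {
    #[new]
    fn new(n: usize) -> Self {
        Graph { edges_start: vec![0; n] }
    }
}

// The module name must match what Python imports: `import frcw`.
#[pymodule]
fn frcw(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Graph>()?;
    Ok(())
}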

Cargo.toml:
ndarray = { version = "0.14", optional = true }
ndarray-linalg = { version = "0.13", features = ["openblas-system"], optional = true }
pcompress = "1.0.6"
petgraph = "0.6.0"
Contributor Author


I sorted these, and the IDE bumped some of them. I'm happy to revert if this fluff bugs you. The important new additions are pyo3 and snafu.

@InnovativeInventor
Contributor

> does it avoid needing pcompress to go between GerryChain and frcw? Saving any trips to/from disk can be a big win.

pcompress does not need to go to/from disk: the Python wrapper around pcompress can read from any executable that emits assignment vectors, and the pcompress executable itself is really a tool for compressing Unix pipe streams of assignment vectors.

However, I'm hoping we can figure out a tighter integration between the Rust and Python code than just passing assignment vectors (e.g. some way of prescoring updaters in Rust), while also allowing the graph to be used as a native networkx object in Python. This looks like an excellent start!

@msarahan
Contributor Author

I'm not certain, but if you're relying on string output from any executable, you're paying a decent price for serialization/deserialization. Avoiding that can help a lot, but it'd be hard to say how much without profiling it.

I think anything is possible, but I don't have a good sense of how hard things might be. Dream away, and I'll give it a try.

@InnovativeInventor
Contributor

Of course -- you're totally right. Passing Python objects like you're doing from Rust would be awesome and probably much faster. I'm not sure how much this matters, though (an unoptimized pcompress parser/compressor that acts on the streams has reasonable performance even reading from disk, and there is some low-hanging fruit if we need to go faster).

Ultimately, I think the slowest part of our workflow (as I'm seeing in GerryChain Python and GerryChain Julia) is the sheer number of updaters people like to run in their analysis. For example, this is a fairly standard set of updaters (we usually run every single updater we have an implementation for, with some exceptions). The other big slowdown is when people run optimization runs (e.g. trying to optimize for VRA effectiveness or some fairness metric). In these two areas, the non-Python GerryChain implementations have a long way to go before adoption, which is the reason most people still stick to GerryChain Python.

I was working on adding pyo3 to the pcompress Rust code (the idea was to initialize the Partition object in Rust and pre-calculate as much as possible in parallel before sending it and exposing it as a PyIterator). This could speed things up even more (and achieve the goal of allowing GerryChain Python users to stay with Python). Unfortunately, I was having some difficulty with this, and it's still a work-in-progress.
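
Roughly, the skeleton I have in mind looks something like this (a sketch only; AssignmentIter and the Vec<u32> payload are placeholders, and it assumes a recent PyO3 where the iterator dunders can live directly in #[pymethods]):

use pyo3::prelude::*;

// Hypothetical iterator over precomputed assignment vectors, exposed to Python.
#[pyclass]
struct AssignmentIter {
    steps: std::vec::IntoIter<Vec<u32>>,
}

#[pymethods]
impl AssignmentIter {
    fn __iter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
        slf
    }
    // Returning None raises StopIteration on the Python side.
    fn __next__(mut slf: PyRefMut<'_, Self>) -> Option<Vec<u32>> {
        slf.steps.next()
    }
}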

@msarahan
Contributor Author

msarahan commented Jan 10, 2022 via email

@pjrule
Owner

pjrule commented Jan 10, 2022

@msarahan Thanks for jumping in on this—extremely exciting that someone other than me is excited about hacking on frcw! 😁 I particularly appreciate the tweaks to the error handling, which are of independent interest. (If it's not too much trouble, maybe one of us should create a dedicated PR for that?)

As @InnovativeInventor mentioned, running "raw chains" is just part of the Python GerryChain overhead. For a Python/Rust integration to be useful to most of our end users, we probably need to precompute the values of at least some updaters (e.g. tallies) on the Rust side. There's also the question of constraints and acceptance functions, which tend to be in tight inner loops and inherently non-parallelizable in the MCMC setting. I'm working on a highly experimental (and probably overkill...) extension to GerryChain that aims to compile operations over tallies, etc. down to a computation graph that can be mapped to vectorized operations on the Rust side. (This approach is heavily inspired by projects like TorchScript, JAX, tf.autograph, and Zygote.jl.) Needless to say, this is complicated, and I'm largely undertaking this effort to further my personal interest in the new wave of domain-specific abstract interpretation/JIT work in the MLSys community.

I could imagine something simpler in the short term—e.g. users declare their updaters and simple acceptance functions in a purely declarative markup language like the JSON format @InnovativeInventor linked to in the post above. The Rust engine then computes tallies, etc. specified in this format and exposes an object to Python with these values (with roughly the effect of prepopulating the _cache field in the GerryChain Partition object, though the implementation details may differ depending on how we leverage PyO3).
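
As a strawman, the Rust side might deserialize such a spec into something like this (field names invented, assuming serde/serde_json; the real format would follow whatever JSON schema we settle on):

use serde::Deserialize;

// Invented shape for a declarative chain spec; real field names would differ.
#[derive(Debug, Deserialize)]
struct ChainSpec {
    // Column names to tally per district at every step.
    tallies: Vec<String>,
    // Optional population-deviation bound applied as a constraint.
    pop_tolerance: Option<f64>,
}

fn parse_spec(json: &str) -> Result<ChainSpec, serde_json::Error> {
    serde_json::from_str(json)
}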

A meta-note: if you're willing, it might be productive to set up a Zoom call soon for the three of us to talk some of these ideas out! :)

@msarahan
Contributor Author

For the error handling, it will no doubt quickly balloon into a monster PR. Perhaps it is best to work through it one file/module at a time, so that reviews can be done with more care.

I'm out of my depth on the GerryOpt stuff. With the constraints/acceptance functions, you could maybe go faster by evaluating several proposals at once and keeping only the first that is valid. That way you are not waiting on a bad step to finish before trying another proposal. It's wasteful, of course, but if you have idle cores, perhaps helpful.
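
As a toy illustration of that "evaluate several, keep the first valid" idea (assuming rayon; propose and is_valid are stand-ins for frcw's real proposal and constraint code):

use rayon::prelude::*;

// Stand-ins for frcw's real proposal generation and constraint checks.
fn propose(seed: u64) -> u64 {
    seed.wrapping_mul(6364136223846793005).wrapping_add(1)
}
fn is_valid(p: &u64) -> bool {
    p % 7 == 0
}

fn first_valid(seeds: &[u64]) -> Option<u64> {
    // find_any returns whichever valid proposal a worker hits first,
    // so no thread sits waiting on a bad step before trying another seed.
    seeds.par_iter().map(|&s| propose(s)).find_any(|p| is_valid(p))
}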

A Zoom call would be good. Slack would be good too. I'm on the very old VRDI and hackathon Slacks, but not on any Slack that appears active. msarahan at gmail.com
