Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: More Explicit/User-Friendly Python Bindings #195

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

mabruzzo
Copy link
Collaborator

Overview

While playing around with the test-suite, I was recently reminded that I find Grackle's existing python bindings to be a little unsatisfactory. This PR proposes a newer set of python bindings (these are largely influenced by wrapper functions that I've previously written around the existing python-bindings, and the C++ bindings that I've written in Enzo-E).

My primary goal here is to gauge receptiveness to this new proposed API. The implementation is mostly a proof-of-concept. If people are receptive, I would be very happy to refine the implementation (and maybe add documentation). With this in mind, I would strongly encourage the reviewer to focus first on the overall design before looking at the code-changes.

I focus a lot on the advantages of the proposed python-bindings for users that have never directly interacted with Grackle before (i.e. there only prior experience may have been running a simulation with Grackle).1 When I first started with Grackle, I was very much in this camp and found that the existing python-bindings required me to understand a lot more about the underlying C API.

Description

The changes revolve around a new class called GrackleSolver. For a quick calculation, this is the only class that a user needs to know about (rather than chemistry_data & FluidContainer).

I will break the description into 2 sections: (i) initializing grackle & querying configuration and (ii) performing grackle-calculations. In each section, I first cover the proposed API and then compare it to the existing API

Initializing Grackle and Querying Configuration

Proposed API

The logic for initializing GrackleSolver looks like the following:

chemistry_parameters = { # vals for chemistry_data struct
    'use_grackle' : 1,
    'with_radiative_cooling' : 1,
    'primordial_chemistry' : 0,
    'metal_cooling' : 1,
    'UVbackground' : 1,
    'grackle_data_file' : 'path/to/CloudyData_UVB=HM2012.h5'
}
code_units = {           # vals for code_units struct
    'comoving_coordinates' : 0,
    'a_units' : 1.0,
    'a_value' : 1.0,
    'density_units' : 1.67e-24, # ~ 1 Hydrogen mass per cc
    'length_units' : 3.09e24,   # ~ 1 Mpc
    'time_units'   : 3.16e12    # ~ 1 Gyr
}

grackle_solver = GrackleSolver(chemistry_parameters,
							   code_units)

At this point, grackle_solver is fully initialized and is ready to be used.

The user might subsequently want to query the solver's configuration. The properties, grackle_solver.parameters, grackle_solver.rates, & grackle_solver.code_units return objects that provide access to the data stored by the chemistry_data, code_units, & chemistry_data_storage structs. The following snippets show some examples.

print(grackle_solver.parameters.primordial_chemistry)
print(grackle_solver.rates.k13)
print(grackle_solver.code_units.a_value)

In more detail, these properties return instances of a new class called _DataProxy. These instances provide access to a subset of the attributes/properties of the underlying pygrackle.grackle_wrapper.chemistry_data extension-type (that is wrapped by GrackleSolver.

At the moment, the object returned by grackle_solver.code_units is NOT mutable. With that said, it's still possible to initially configure grackle_solver with one choice of a_value and then subsequently perform calculations using a different a_value. Consequently, I would argue that there isn't any other need for a user to modify the code_units attributes.2

Currently, the object returned by grackle_solver.parameters is mutable. But, I'm very tempted to make it immutable. In my opinion, there should never be any reason for a user to need to modify these parameters after initializing grackle (if they ever want different parameters, then they should create a new GrackleSolver instance). I talk a little more about this down below.

It is possible to mutate rate values accessed through grackle_solver.rates.

Comparison with current API

Based on my description so far, GrackleSolver serves a similar purpose to chemistry_data, but there are 2 aspect clear differences:

  1. GrackleSolver's constructor produces a fully configured/initialized object. If there is a problem with initialization, an exception is raised (that provides as much info as possible). In contrast, chemistry_data's constructor creates an object that must be further configured and initialized as a separate step.
  2. GrackleSolver explicitly organizes access to Grackle's configuration/state data based on whether a given values is considered to be a parameter, related to units, or is a rate. chemistry_data provides access to configuration/state data via uncategorized attributes.

The fact thatGrackleSolver's constructor always creates a fully initialized object is unequivocally an improvement (the lack of a separate initialization step means that there is one less thing for users to do).

In my opinion, the fact that GrackleSolver categorizes attribute access (and initialization) is also an improvement.

  • Previously, the only way of knowing the difference in attributes came entirely from looking at the documentation. If you tried to access/modify an attribute at an inappropriate time (e.g. a rate before initializing chemistry_data or a parameter/unit-related quantity after initialization), there weren't any really guardrails in place. The python bindings might accept such a change and subsequent calculations could silently provide nonsensical answers. Alternatively, some could loudly fail, but it's unclear what went wrong.
  • Due to the sheer number of initialization-parameters, the user is still required to look at the documentation, but GrackleSolver at least puts more guardrails in place (it's harder to do something incorrect). The resulting code also ends up being a little more explicit.

Performing Calculations

Proposed API

This section assumes that we have a GrackleSolver instance called grackle_solver. When I describe these functions, I make a point of including default kwarg values in the function signatures

To retrieve the species density fields required by a given grackle configuration, (including metal and dust densities) one can call grackle_solver.expected_species_density_fields().

The "calculation-methods" all share some common features:

  • the first argument is called data. It is a dict-like object passed via an argument called that contains numpy arrays (an error is raised if the arrays don't share a common shape). These entries are explicitly used to populate the grackle_field_data struct. The intention is for the calculation method to raise errors if unknown key-value pairs are specified (I don't think the implementation is being very strict about that right now). Errors are also currently raised if a contained value is not a 1D numpy array (I would like to add support for non-numpy arrays and other dimensionalities)
  • they always have a kwarg argument called inplace, with a default value of False. When inplace is False, copies are explicitly made of all entries in data. When True, the wrapper tries to directly make use of these arrays in the calculation (an error is raised if there is an invalid data layout or dtype).
  • they always have a kwarg called current_a_value, with a default value of None.
    • For a GrackleSolver configured without comoving coordinates, this must always be None
    • For a GrackleSolver configured with comoving coordinates, a value of None indicates that the a_value from initialization should be used during this calculation. Alternatively, this kwarg is a way to specify that a different a_value for the current calculation (the updated code_units struct is automatically computed by the bindings)
  • they always return the result of the calculation

Solving Chemistry:

grackle_solver.solve_chemistry(data, dt, current_a_value = None,
                               inplace = False, grid_dx = None)

In more detail, dt specifies the timestep to evolve the chemistry over. Vurrently, grid_dx is a placeholder argument for when support for 3D arrays is added. When inplace is True, the values of data are updated in-place.

This method always returns a dict-like object with all of the fields used by Grackle; you should be able to pass that object as the data argument in subsequent solve_chemistry calls. Note: the value of inplace has no impact on this behavior, it just determines whether fresh arrays are allocated.

Calculating Derived Properties:

The following methods all return a newly allocated array holding computed derived values:

grackle_solver.calculate_cooling_time(data, current_a_value = None,
                                      inplace = False, grid_dx = None)
grackle_solver.calculate_dust_temperature(data, current_a_value = None,
                                          inplace = False)
grackle_solver.calculate_pressure(data, current_a_value = None,
                                  inplace = False)
grackle_solver.calculate_temperature(data, current_a_value = None,
                                     inplace = False)

As a point of clarification, the inplace kwarg only controls whether the arrays specified by the data arg are directly used or are copied when constructing a grackle_field_data struct. Arrays specified in the data arg are never used to hold the result of the calculation.

I'm tempted to make an optional kwarg called out that can be used to specify the array for holding the output data.

NOTE: Like before, the grid_dx keyword argument passed to calculate_cooling_time is mostly just a placeholder (until we add support for 3D arrays).

Comparison with current API

I recognize that most of the "benefits" of using GrackleSolver over FluidContainer are fairly subjective (and are mostly a matter of taste).

In my opinion, the GrackleSolver's single biggest advantage is that it's much more explicit than FluidContainer. In other words, the proposed API is much more obvious about what the inputs and outputs are for each of the calculations (without needing to be very familiar with the C functions).

For the FluidContainer object:

  • it was originally, very unclear to me what the inputs and outputs were for each calculation (I had to go look at the interfaces for the C functions to figure it all out). The fact that FluidContainer pre-allocates fields to hold derived quantities somewhat hinders this clarity.
  • For example, without knowing anything about the C-bindings, one could think that calculating temperature might involve calculating pressure (and that both fields might be filled).
  • Another example: without knowing anything about the C-bindings, it's not totally obvious which quantities are need/updated by a call to solve_chemistry and which quantities are out-of-date after that call

There are also some other less important (subjective) advantages of GrackleSolver over FluidContainer, but I'll skip those for now.

The primary disadvantage of GrackleSolver is speed:

  • GrackleSolver is currently a wrapper around FluidContainer (but that was just to make a proof-of-concept, I plan to fix that if this PR is well-received). GrackleSolver will also always be a little slower when inplace = False, but that's also fine.
  • I'm most concerned about when inplace = True. In this case, the GrackleSolver's calculation-methods incur additional overhead from the array-checks that ensure that the input arrays can be directly passed directly to the solver.
  • However, this is addressable:
    • This PR introduces a internal class called pygrackle.utilities.arraymap.ArrayMap that is a dict-like type with additional restrictions. These additional restrictions explicitly enforce invariants that satisfy the aforementioned array-checks.
    • Currently, we just use this class internally. But, if we let users use this class, then we can allow them to bypass array-checks when they pass instances of this class to the calculation-method.

Note on Implementation

Currently, the GrackleSolver object is implemented in terms of chemistry_data and FluidContainer. I tried to implement the new API without making too many changes to the cython file. This meant that there is quite a bit of ugly python-magic going on while determining arguments for _DataProxy that are related to chemistry_data's rate-related attributes.

If we decide that we like this new API, I'm happy to clean all of this up. The internals would still use chemistry_data (but not FluidContainer). I would also probably move the implementation FluidContainer into the cython file to reduce overhead).

Highlighted Areas for Feedback

Beyond thoughts on the overall python API, I wanted to highlight some additional areas for feedback:

  • Thoughts on naming of GrackleSolver?
  • Any obvious missing functionality?
  • Thoughts on immutability of the object returned by the grackle_solver.parameters property? I recognize that my preference for immutability is very subjective, but in my opinion it makes reasoning about the program a lot easier. In this case, it would make it a lot more difficult for users to unintentionally put Grackle in an invalid state. (Furthermore, this is python after all. If somebody really wants to mutate a value, they can probably find a way).

Future Ideas

  • Maybe polish & expose the (currently) private _wrapped_c_chemistry_data type as a part of the API and make similar types that wrap the code_units & chemistry_data_storage
    • there are 2 locations in the API where this may be relevant
      • we could support passing these objects to GrackleSolver's constructor (as an alternative of the dict instances that are currently expected)
      • we could have the grackle_solver.parameters, grackle_solver.code_units, & grackle_solver.rates properties directly returns these instances (or we could wrap them if we desire immutability -- which would be cleaner than what we currently support)
    • I would want to refactor things. Currently, _wrapped_c_chemistry_data acts like a dict (that was just the easiest thing to do when I implemented it). But, since the stored entries have fixed typing and arbitrary keys can't be added, I think it's more pythonic to have it act more like a dataclass or the typing.NamedTuple class.
  • Adding support for 2D and 3D calculations (so that we can actually run tests on self-shielding)
    • This is easy to do! A relevant question is should we explicitly tie this to internal support for multidimensional grids? Presumably, yes... (especially for testing purposes). Alternatively, we could add python code to reshape the input arrays... (I tend to think this is something we could tack on later... if it became necessary)
    • It would also be nice to add support for passing dictionaries of array-like and scalar quantities to the calculation-methods
  • Maybe add a configuration option in GrackleSolver's constructor to optionally enable unyt support? I'm not exactly sure what this would look like:
    • In the simplest case, we would allow people to pass unyt.unyt_array instances as arguments to the calculation functions. Then, we would return arrays with attached units. (I think we would need to require inplace=False for this scenario to properly work)
    • In a more fully integrated approach, one could imagine attaching a unit-registry instance to the object... (I would be a little more hesitant to take this approach)...
  • Maybe add a configuration option in GrackleSolver's constructor and/or add an option to the calculation-method to control how strict the calculation-methods are about the keys in the array. Currently the intention is to be very intolerant of unnecessary keys, by default. But, it could be useful to override that behavior.
  • Maybe add an out kwarg to the derived-property calculation methods to let users pre-allocate an array to hold the outputs of a calculation.
  • add methods to GrackleSolver in order to query the expected code units at a given cosmological scale factor.

Footnotes

  1. As I've mentioned before, I think that Grackle's python bindings is actually a major selling point for people in this position! To my knowledge, no individual simulation with adopting an advanced chemistry/heating/cooling solution provides this type of solution. I have personally found this functionality to be extremely useful for interpreting simulation results (and understanding the underlying physical results).

  2. If the fields are not comoving, then the code_units values can't change at all. When using the comoving coordinates, the a_value member of the code_units struct is the only value that is allowed to freely change. While the length_units and density_units also change, there is no flexibility; their values are entirely specified by the original code_units struct and the current a_value. All of the GrackleSolver calculation-methods allow the caller to specify a new current_a_value kwarg and will internally compute the new code_units struct with this information. It's my intention to propose similar functionality in the actual c library (but that's beyond the scope of this PR)

@mabruzzo mabruzzo force-pushed the pygrackle-GrackleSolver branch from e6373d4 to da1e663 Compare June 23, 2024 13:34
@mabruzzo mabruzzo force-pushed the pygrackle-GrackleSolver branch from 9d4e692 to ea3a364 Compare June 24, 2024 14:44
@brittonsmith
Copy link
Contributor

I think you've got a lot of really good ideas in here. I am very in favor of an overhaul to the current Python API to something that looks more like the library API, namely, the splitting of parameters and units back into their own structures.

My main thought up front is more pragmatic. Personally, I'd like to see this implemented totally independently of the existing Python interface and the FluidContainer class. From a user perspective, this will be a totally new experience. I think it's worth having the two frameworks coexist for a short period of time with the older one being deprecated and eventually removed. From a development perspective, I think it will be much easier to manage a gradual transition of the testing infrastructure that relies on the FluidContainer rather than everything going into a single PR. This would also make my life easier as I continue to work on PR #215, which is touching a lot of the same Python code.

Some minor comments on the points you raise:

  • I'm in favor of grackle_solver.parameters and grackle_solver.code_units being mutable for the reason that the Python interface is the easiest place to experiment with things. As well, we don't know exactly what kind of workflows users have developed that may rely on these being mutable. Additionally, making density and length units immutable mandates a different way to handle comoving units than we currently do elsewhere. I'd prefer it if the expected user behavior is as similar as we can make it between the Python and library API. I realize we may be about to rethink some of how units are handled, but I would prefer to have that discussion separately and for the results to propagate outward into future code versions.
  • I'm a bit wary of adding extra keyword arguments to the grackle_solver function calls for things like current_a_value and grid_dx. In my experience, this tends toward functions with lots of possible keyword arguments. As well, we end up with an inconsistent set of possible keywords across all the functions, which I think will ultimately be confusing. I'm still of the opinion that the object should contain everything but the field data, but I'm open to continue the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal python Relates to the python bindings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants