Replies: 8 comments 9 replies
-
for our production tasks db cluster: this has direct cost and performance implications (avg doc size)
-
Some tests I was running to benchmark the resources needed to run a multilevel sort on DeltaLake tables (a flavor of parquet dataset, data lake stuff) for TaskDoc and CoreTaskDoc, quoting myself from #1232:
-
The replacement for #1174 is here and I'll put that up as a PR once I feel like it's matured enough. We have a good example of how working with MP while developing/revising a workflow helps structure the data schema: when the Approx / NEB workflows were being ported from atomate to atomate2, I was working to structure the job and workflow output in a way that's more in line with this current format (less nested data, fewer complex type unions, better validation of data). Those models aren't perfect and will probably need some revisions as we fully transition to parquet, but this is the best case, I think. For the phonon data, we've had to migrate a legacy schema for the static DFPT phonon collection to a more cloud-ready format, and also ensure that the new format is mostly compatible with the atomate2 phonon schema. The schema migration from atomate2 to emmet will be more problematic because it may also require workflow changes to accommodate schema changes.
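As a rough illustration of the "less nested data, fewer complex type unions, better validation" style described above, here is a minimal pydantic sketch. The model and field names are invented for illustration and are not the actual atomate2/emmet NEB models:

```python
# Hypothetical flat, validated job-output model: concrete types,
# no deep nesting, no broad unions, constraints enforced at parse time.
from pydantic import BaseModel, Field


class NebResult(BaseModel):
    """Illustrative flat workflow output (not the real emmet model)."""
    formula: str
    barrier_ev: float = Field(ge=0.0, description="Forward migration barrier")
    n_images: int = Field(gt=0)
    image_energies_ev: list[float]


result = NebResult(
    formula="LiFePO4",
    barrier_ev=0.27,
    n_images=5,
    image_energies_ev=[0.0, 0.11, 0.27, 0.12, 0.0],
)
print(result.model_dump())
```

Because every field has a concrete, non-union type, a table of these documents maps directly onto an arrow/parquet schema without custom serializer shims.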
-
I don't have a ton to add here, but I will note that any way we can drop the memory footprint would be appreciated. I was thinking of purchasing access to a MongoDB Atlas cluster today, and the problem is that the storage costs are just far too high without a clear way of reducing the footprint in Atomate2 or Jobflow.
-
I believe the pymatgen PhaseDiagram paints a good picture of what I am talking about re: a lack of data modeling being considered when developing codes/data structures in some of the MP ecosystem's libs. It starts on line 6 of this gist for the arrow schema of emmet's PhaseDiagramDoc: https://gist.github.com/tsmathis/14ec8f215e8d4e3316d643c93a0656ba fields like You should also note that the
-
You might also note things like: if pyarrow ever implements support for union types, we could potentially automagically handle this rather nasty union: but each of the energy adjustments also has variants... This one is just a mess in general.
-
User requests regarding atomate2 data models have so far mostly concerned questions about ontologies, self-documenting workflows, and reproducibility. I think there is certainly a need to rethink the data models for all ML potential applications from my side, though (e.g. Trajectories). Re: getting involved: I can only justify this if it is of immediate benefit for our scientific work or related to other (internal) database work that we have to do for data management.
-
I appreciate hearing more about the "behind the scenes" importance of these kinds of changes, and at a high level, streamlining and more strictly structuring MP data makes sense to me. Having said that, as someone who came from outside materials science and continues to do more "low to medium throughput" research, I've always found the convenience and logic behind the legacy data structures appealing and important from a training / learning perspective. I would guess that only a small fraction of our users are doing truly high throughput work (although that may be increasing in the age of ML), so I think it's important to preserve the convenience aspect. I don't know much about My hope would be that we could improve performance by
Again, I'm a little out of my technical depth here. Main point is that convenience is important too!
-
tagging staff: @esoteric-ephemera, @tschaume, @kbuma
tagging recent MPSF members/MP ecosystem devs that would likely be interested: @rkingsbury, @Andrew-S-Rosen, @JaGeo, @utf, @mkhorton, @davidwaroquiers, @gpetretto
Splitting off from the pyarrow PR (#1243) for a more open-ended discussion re: data practices in the MP ecosystem.
Brief context (enter soapbox): moving MP's data products/build pipelines to the cloud has exposed a number of data management/modeling shortcomings in the broader Materials Project ecosystem. It was an understandable choice to prioritize convenience and flexibility (mongo, python, and using supercomputing facilities for everything) when the Materials Project was a fledgling ecosystem, but the infrastructure team has been, and still is, trying to reduce that entropy. We have a very real need, namely a monthly AWS bill, that we have to keep in check while serving MP's (growing) userbase and trying to be prepared to accept new types of contributions of various scales (GBs to TBs) from researchers/workflow developers.
Getting off the soapbox, the reason behind me trying to incorporate pyarrow into emmet was the prohibitive cost of using non-cloud-native data formats (json and its ilk) for running builders in the cloud. Running the materials builder with parquet source data vs. json source data cut the runtime from 12h45min to just 55 min. That's a $65 compute bill vs. a $1.83 bill. The electronic structure builder is currently estimated at >$500 to run from scratch with json inputs due to the beefy VMs required to not OOM during deserialization. I expect a similar decrease in cost once I get all the band structures and DOS migrated to parquet (fingers crossed of course, but intuition points that way).
Which leads to the core issue of this discussion: getting MP's data to be compatible with parquet was a chore due to the amount of (de)serialization logic I had to inject into pydantic internals to get all of the pymatgen objects used in MP to be fully structured (cf. #1243, serialization_adapters). This is partly due to some shortcomings in pyarrow, but the flexibility of pymatgen's classes/objects is antithetical to structured data. The electrode objects and models are a good example of the hoops required to get a structured data type: electrode_adapter.py | InsertionElectrodeDoc field_serializer
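The general shape of those injected shims is a pydantic `field_serializer` that flattens a loosely typed object into something arrow can infer a concrete type for. A hedged sketch, using a stand-in class rather than a real pymatgen object (emmet's actual adapters live in serialization_adapters):

```python
# Sketch of a serializer shim: flatten a flexible python object into
# a fixed-width list of floats so pyarrow sees list<double>, not a
# python-object blob. SimpleLattice is a stand-in, not a pymatgen class.
from pydantic import BaseModel, field_serializer


class SimpleLattice:
    """Stand-in for a flexible pymatgen-style object."""
    def __init__(self, matrix):
        self.matrix = matrix  # 3x3 nested list


class LatticeDoc(BaseModel):
    model_config = {"arbitrary_types_allowed": True}
    lattice: SimpleLattice

    @field_serializer("lattice")
    def _flatten_lattice(self, value: SimpleLattice):
        # Row-major flatten: arrow-friendly, schema-stable
        return [float(x) for row in value.matrix for x in row]


doc = LatticeDoc(lattice=SimpleLattice([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
print(doc.model_dump()["lattice"])
```

Multiply this by every pymatgen object that appears in an MP document model and the maintenance burden being described becomes clear, which is why replacing the shims with performance-oriented pymatgen objects (#1174) is appealing.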
Related issues have been raised in regards to the complexity of property access in the tasks endpoint (#840), as well as data management issues related to trajectories (#872). The tasks endpoint (w/ trajectories, etc.) has been a big pain point for the infrastructure team, and @esoteric-ephemera has recently been making efforts to get a proper "trajectories data product" fleshed out so we can remove the trajectories from the tasks collection in our production mongodb (#1206, #1257, #1260) and make them more easily accessible for users (and more manageable for us for future scaling). I am in the process of taking that further to remove calcs_reversed entirely from the tasks: #1232, which should help a bit with #840 (@davidwaroquiers). I'll post some numbers below on the benefits that remodeling the tasks collection has had/will have. The infrastructure team can only do so much with our models in emmet, and we have been circling around the issue of making performance-oriented versions of pymatgen objects (#1174, @esoteric-ephemera I think you have a version of this still going, right? Hoping this could replace all the injected pyarrow serde funcs in the future.)
Which brings me to the main point of discussion: Do you (ppl tagged, plus any MP users that come across this) care about this? Are data format/model issues affecting you, and the users of your repos, in your day-to-day? Are you willing to contribute expertise/experience to making this better across the Materials Project ecosystem?
TLDR: data modeling is hard and the MP ecosystem hasn't made an actual effort on this front (from my POV) in its 10+ years. Dealing with it now sucks, can the MP ecosystem make its own future easier to manage?
I'll post various benchmarks and some reference object schemas to emphasize various points.