Replies: 8 comments 9 replies
-
for our production tasks db cluster: this has direct cost and performance implications (avg doc size)
-
Some tests I was running to benchmark the resources needed to run a multilevel sort on DeltaLake tables (a flavor of parquet dataset, data lake stuff) for TaskDoc and CoreTaskDoc, quoting myself from #1232:
-
The replacement for #1174 is here and I'll put that up as a PR once I feel like it's matured enough. We have a good example of how working with MP while developing/revising a workflow helps structure the data schema: when the Approx / NEB workflows were being ported from atomate to atomate2, I was working to structure the job and workflow output in a way that's more in line with this current format (less nested data, fewer complex type unions, better validation of data). Those models aren't perfect and will probably need some revisions as we fully transition to parquet, but this is the best case, I think. For the phonon data, we've had to migrate a legacy schema for the static DFPT phonon collection to a more cloud-ready format, and also ensure that the new format is mostly compatible with the atomate2 phonon schema. The schema migration from atomate2 to emmet will be more problematic because it may also require workflow changes to accommodate schema changes.
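As a rough illustration of the "less nested data, fewer complex type unions, better validation" style described above, here is a minimal pydantic sketch. The model and field names are invented for illustration and are not the actual atomate2/emmet NEB models:

```python
# Hypothetical flat, validated job-output model: concrete types,
# no deep nesting, no broad unions, constraints enforced at parse time.
from pydantic import BaseModel, Field


class NebResult(BaseModel):
    """Illustrative flat workflow output (not the real emmet model)."""
    formula: str
    barrier_ev: float = Field(ge=0.0, description="Forward migration barrier")
    n_images: int = Field(gt=0)
    image_energies_ev: list[float]


result = NebResult(
    formula="LiFePO4",
    barrier_ev=0.27,
    n_images=5,
    image_energies_ev=[0.0, 0.11, 0.27, 0.12, 0.0],
)
print(result.model_dump())
```

Because every field has a concrete, non-union type, a table of these documents maps directly onto an arrow/parquet schema without custom serializer shims.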
-
I don't have a ton to add here, but I will note that any way we can drop the memory footprint would be appreciated. I was thinking of purchasing access to a MongoDB Atlas cluster today, and the problem is that the storage costs are just far too high without a clear way of reducing the footprint in Atomate2 or Jobflow.
-
I believe the pymatgen PhaseDiagram paints a good picture of what I am talking about re: a lack of data modeling being considered when developing codes/data structures in some of the MP ecosystem's libs. It starts on line 6 of this gist for the arrow schema of emmet's PhaseDiagramDoc: https://gist.github.com/tsmathis/14ec8f215e8d4e3316d643c93a0656ba fields like You should also note that the
-
You might also note things like: if pyarrow ever implements support for union types, we could potentially automagically handle this rather nasty union: but each of the energy adjustments also has variants... This one is just a mess in general.
-
User requests regarding atomate2 data models have so far mostly concerned questions about ontologies, self-documenting workflows, and reproducibility. I think there is certainly a need to rethink the data models for all ML potential applications from my side, though (e.g. Trajectories). Re: getting involved: I can only justify this if it is of immediate benefit for our scientific work or related to other (internal) database work that we have to do for data management.
-
I appreciate hearing more about the "behind the scenes" importance of these kinds of changes, and at a high level, streamlining and more strictly structuring MP data makes sense to me. Having said that, as someone who came from outside materials science and continues to do more "low to medium throughput" research, I've always found the convenience and logic behind the legacy data structures appealing and important from a training / learning perspective. I would guess that only a small fraction of our users are doing truly high throughput work (although that may be increasing in the age of ML), so I think it's important to preserve the convenience aspect. I don't know much about My hope would be that we could improve performance by
Again, I'm a little out of my technical depth here. Main point is that convenience is important too!
-
tagging staff: @esoteric-ephemera, @tschaume, @kbuma
tagging recent MPSF members/MP ecosystem devs that would likely be interested: @rkingsbury, @Andrew-S-Rosen, @JaGeo, @utf, @mkhorton, @davidwaroquiers, @gpetretto
Splitting off from the pyarrow PR (#1243) for a more open-ended discussion re: data practices in the MP ecosystem.
Brief context (enter soapbox): moving MP's data products/build pipelines to the cloud has exposed a number of data management/modeling shortcomings in the broader Materials Project ecosystem. It was an understandable choice to prioritize convenience and flexibility (mongo, python, and using supercomputing facilities for everything) when the Materials Project was a fledgling ecosystem, but the infrastructure team has been, and still is, trying to reduce that entropy. We have a very real need, namely a monthly AWS bill, that we have to keep in check while serving MP's (growing) userbase and trying to be prepared to accept new types of contributions of various scales (GBs to TBs) from researchers/workflow developers.
Getting off the soapbox, the reason behind me trying to incorporate pyarrow into emmet was the prohibitive cost of using non-cloud-native data formats (json and its ilk) for running builders in the cloud. Running the materials builder with parquet source data vs. json source data cut the runtime from 12h45min to just 55 min. That's a $65 compute bill vs. a $1.83 bill. The electronic structure builder is currently estimated at >$500 to run from scratch with json inputs due to the beefy VMs required to not OOM during deserialization. I expect a similar decrease in cost once I get all the band structures and DOS migrated to parquet (fingers crossed of course, but intuition points that way).
Which leads to the core issue of this discussion: getting MP's data to be compatible with parquet was a chore due to the amount of (de)serialization logic I had to inject into pydantic internals to get all of the pymatgen objects used in MP to be fully structured (cf. #1243, serialization_adapters). This is partly due to some shortcomings in pyarrow, but the flexibility of pymatgen's classes/objects is antithetical to structured data. The electrode objects and models are a good example of the hoops required to get a structured data type: electrode_adapter.py | InsertionElectrodeDoc field_serializer
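The general shape of those injected shims is a pydantic `field_serializer` that flattens a loosely typed object into something arrow can infer a concrete type for. A hedged sketch, using a stand-in class rather than a real pymatgen object (emmet's actual adapters live in serialization_adapters):

```python
# Sketch of a serializer shim: flatten a flexible python object into
# a fixed-width list of floats so pyarrow sees list<double>, not a
# python-object blob. SimpleLattice is a stand-in, not a pymatgen class.
from pydantic import BaseModel, field_serializer


class SimpleLattice:
    """Stand-in for a flexible pymatgen-style object."""
    def __init__(self, matrix):
        self.matrix = matrix  # 3x3 nested list


class LatticeDoc(BaseModel):
    model_config = {"arbitrary_types_allowed": True}
    lattice: SimpleLattice

    @field_serializer("lattice")
    def _flatten_lattice(self, value: SimpleLattice):
        # Row-major flatten: arrow-friendly, schema-stable
        return [float(x) for row in value.matrix for x in row]


doc = LatticeDoc(lattice=SimpleLattice([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
print(doc.model_dump()["lattice"])
```

Multiply this by every pymatgen object that appears in an MP document model and the maintenance burden being described becomes clear, which is why replacing the shims with performance-oriented pymatgen objects (#1174) is appealing.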
Related issues have been raised in regards to the complexity of property access in the tasks endpoint (#840), as well as data management issues related to trajectories (#872). The tasks endpoint (w/ trajectories, etc.) has been a big pain point for the infrastructure team, and @esoteric-ephemera has recently been making efforts to get a proper "trajectories data product" fleshed out so we can remove the trajectories from the tasks collection in our production mongodb (#1206, #1257, #1260) and make them more easily accessible for users (and more manageable for us for future scaling). I am in the process of taking that further to remove calcs_reversed entirely from the tasks: #1232, which should help a bit with #840 (@davidwaroquiers). I'll post some numbers below on the benefits that remodeling the tasks collection has had/will have. The infrastructure team can only do so much with our models in emmet, and we have been circling around the issue of making performance-oriented versions of pymatgen objects (#1174, @esoteric-ephemera I think you have a version of this still going, right? Hoping this could replace all the injected pyarrow serde funcs in the future.)
Which brings me to the main point of discussion: Do you (ppl tagged, plus any MP users that come across this) care about this? Are data format/model issues affecting you, and the users of your repos, in your day-to-day? Are you willing to contribute expertise/experience to making this better across the Materials Project ecosystem?
TLDR: data modeling is hard and the MP ecosystem hasn't made an actual effort on this front (from my POV) in its 10+ years. Dealing with it now sucks, can the MP ecosystem make its own future easier to manage?
I'll post various benchmarks and some reference object schemas to emphasize various points.