Improve performance of SDAP-517#11
Open
RKuttruff wants to merge 8 commits into
added 4 commits
July 17, 2024 07:47
May end up able to walk back some of these. The dask update did what I wanted, but I updated a bunch of other deps while trying to find out. Xarray deps are somewhat complicated, so it may be best to leave the deps as-is unless something is breaking.
For Spark, ensure the dataset is opened before saving it to HDFS
added 4 commits
July 22, 2024 11:48
# Conflicts: # CHANGELOG.md
# Conflicts: # poetry.lock
Try to minimize time spent in `xarray.open_zarr` after SDAP startup. We should run `open_zarr` for each new dataset in the webapp driver at least once upon discovery to ensure validity. This is an especially prevalent issue with Spark. The Spark workers would open ALL datasets every time they were given a task, which could introduce severe performance penalties.

This PR will:
`open_zarr`. Either make it lazy (open on use), or run it asynchronously (use a future)

Note 1 - Spark Algs
Despite efforts to do so, I could not find a way to make this behavior automatic. There are some manual steps that need to be taken in the Spark algorithm definition. These are fortunately fairly simple and, if done incorrectly or if something goes wrong, the old behavior should be used as a fallback.
1. Call `NexusTileService.save_to_spark` using the `SparkContext` object from the `NexusCalcSparkHandler` or the SDAP `webservice.nexus_tornado.app_builders.SparkContextBuilder.SparkContextBuilder.get_spark_context()` method and all the datasets that will be worked with.
2. Get the `NexusTileService` instance from the provided `tile_service_factory` with the kwargs `spark=True, collections=[...]` where the `collections` kwarg is a list of all the dataset names saved in step 1.
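
Going back to the `open_zarr` item in the PR goals, the two options named there (open lazily on first use, or open asynchronously through a future) can be sketched roughly as below. This is only an illustration of the two patterns, not the code in this PR: `LazyDataset`, `AsyncDataset`, and the `opener` parameter are hypothetical names, and the opener is assumed to be something like `xarray.open_zarr` passed in by the caller.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class LazyDataset:
    """Hypothetical wrapper: defers the expensive open until first use."""

    def __init__(self, path: str, opener: Callable[[str], Any]):
        self._path = path
        self._opener = opener  # assumption: e.g. xarray.open_zarr
        self._dataset: Any = None
        self._opened = False
        self._lock = threading.Lock()

    def get(self) -> Any:
        # Double-checked locking so concurrent callers open the store once.
        if not self._opened:
            with self._lock:
                if not self._opened:
                    self._dataset = self._opener(self._path)
                    self._opened = True
        return self._dataset


class AsyncDataset:
    """Hypothetical wrapper: starts the open in the background right away."""

    def __init__(
        self,
        path: str,
        opener: Callable[[str], Any],
        executor: ThreadPoolExecutor,
    ):
        # The open begins immediately on a worker thread of the executor.
        self._future: Future = executor.submit(opener, path)

    def get(self, timeout: Optional[float] = None) -> Any:
        # Blocks only if the open has not finished yet; any exception raised
        # by the opener is re-raised here, at the point of use.
        return self._future.result(timeout=timeout)
```

With the lazy variant, startup pays nothing and the first caller pays the full open cost. With the future variant, the open overlaps with the rest of startup, and a failed open surfaces when `get()` is called rather than at construction time.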