rebalance_coarsest question and possible issue #256

Closed
PhilipDeegan opened this issue Jan 19, 2024 · 5 comments

@PhilipDeegan
Contributor

Hello,

We are trying to update our project to support coarsest level rebalancing.

I have a question and a potential issue I am hoping you can help me with.

Firstly, the bool rebalance_coarsest, which is passed to the TimeRefinementIntegrator::advance function, does not appear to consult the load balancer to check whether the coarsest level should be rebalanced.
Is this the case? And if so, are we supposed to query the load balancers ourselves to decide whether we want to rebalance the coarsest level?
The reason I ask is that if it is always true, we see immediate data re-initialization on the first advance, which I would think is unnecessary given that our load is generally balanced on init.

Second, we seem to be hitting a segfault during schedule creation when rebalance_coarsest is True.

We are using this function: https://github.com/LLNL/SAMRAI/blob/develop/source/SAMRAI/xfer/RefineAlgorithm.cpp#L495
It appears to support a null src_level.
However, later in RefineSchedule::createCoarseInterpPatchLevel there is no null check when has_cached_connectors is false.

I am also seeing a message like this which I would guess is related to this has_cached_connectors check failure.

At :/path/to/samrai/source/SAMRAI/hier/PersistentOverlapConnectors.cpp line :473 message: PersistentOverlapConnectors::findConnector is resorting
to a global search to find overlaps between 0x5555566edd40 and 0x5555566edd40.
This relies on unscalable data or triggers unscalable operations.
Number of implicit global searches: 1

At :/path/to/samrai/source/SAMRAI/hier/PersistentOverlapConnectors.cpp line :473 message: PersistentOverlapConnectors::findConnector is resorting
to a global search to find overlaps between 0x555556809560 and 0x5555566edd40.
This relies on unscalable data or triggers unscalable operations.
Number of implicit global searches: 2

At :/path/to/samrai/source/SAMRAI/hier/PersistentOverlapConnectors.cpp line :473 message: PersistentOverlapConnectors::findConnector is resorting
to a global search to find overlaps between 0x5555566edd40 and 0x555556809560.
This relies on unscalable data or triggers unscalable operations.
Number of implicit global searches: 3

We only see these messages when rebalance_coarsest is True.

I fully expect this to be an issue with our misuse of SAMRAI, but any help or pointers would be appreciated.

Segfault: https://github.com/LLNL/SAMRAI/blob/develop/source/SAMRAI/xfer/RefineSchedule.cpp#L1705

cleaned up stacktrace

#0  0x00007ffff21f3280 in std::__shared_ptr<SAMRAI::hier::BoxLevel, (__gnu_cxx::_Lock_policy)2>::get (this=0x18) at /usr/include/c++/12/bits/shared_ptr_base.h:1666
#1  0x00007ffff21f3442 in std::__shared_ptr_access<SAMRAI::hier::BoxLevel, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get (this=0x18) at /usr/include/c++/12/bits/shared_ptr_base.h:1363
#2  0x00007ffff21f25e6 in std::__shared_ptr_access<SAMRAI::hier::BoxLevel, (__gnu_cxx::_Lock_policy)2, false, false>::operator* (this=0x18) at /usr/include/c++/12/bits/shared_ptr_base.h:1350
#3  0x00007ffff23894a2 in SAMRAI::xfer::RefineSchedule::createCoarseInterpPatchLevel (this=0x5555569b1840, coarse_interp_level=std::shared_ptr<SAMRAI::hier::PatchLevel> (empty) = {...}, coarse_interp_box_level=std::shared_ptr<SAMRAI::hier::BoxLevel> (use count 1, weak count 0) = {...}, coarse_interp_to_hiercoarse=std::shared_ptr<SAMRAI::hier::Connector> (empty) = {...}, next_coarser_ln=0, hierarchy=warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<PHARE::amr::DimHierarchy<1ul>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
warning: RTTI symbol not found for class 'std::_Sp_counted_ptr_inplace<PHARE::amr::DimHierarchy<1ul>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>'
std::shared_ptr<SAMRAI::hier::PatchHierarchy> (use count 4, weak count 0) = {...}, dst_to_src=..., dst_to_coarse_interp=..., dst_level=std::shared_ptr<SAMRAI::hier::PatchLevel> (use count 138, weak count 0) = {...}) at /path/to/samrai/source/SAMRAI/xfer/RefineSchedule.cpp:1706
#4  0x00007ffff2385652 in SAMRAI::xfer::RefineSchedule::finishScheduleConstruction at /path/to/samrai/source/SAMRAI/xfer/RefineSchedule.cpp:918
#5  0x00007ffff23812d1 in SAMRAI::xfer::RefineSchedule::RefineSchedule at /path/to/samrai/source/SAMRAI/xfer/RefineSchedule.cpp:454
#6  ... at /usr/include/c++/12/bits/stl_construct.h:119
#7  ... at /usr/include/c++/12/bits/alloc_traits.h:635
#8  ... at /usr/include/c++/12/bits/shared_ptr_base.h:604
#9  ... at /usr/include/c++/12/bits/shared_ptr_base.h:971
#10 ... at /usr/include/c++/12/bits/shared_ptr_base.h:1712
#11 ... at /usr/include/c++/12/bits/shared_ptr.h:464
#12 ... at /usr/include/c++/12/bits/shared_ptr.h:1010
#13 0x00007ffff2370c34 in SAMRAI::xfer::RefineAlgorithm::createSchedule at /path/to/samrai/source/SAMRAI/xfer/RefineAlgorithm.cpp:542

full stack https://gist.github.com/PhilipDeegan/6b8385056df94526baa4c01fb399fc40

thanks

@PhilipDeegan
Contributor Author

I have avoided the stack trace by passing the current level for all levels > 0.

Can you confirm that L0 will be rebalanced by load balancers when the rebalance_coarsest bool to TimeRefinementIntegrator::advance is always false?

My guess is that the rebalance_coarsest bool is not intended for general use, but rather for verifying that remaking L0 for load balancing works before it actually happens via threshold triggers, etc.

@PhilipDeegan
Contributor Author

We are hitting an assertion (or segfault in release) in hier/OverlapConnectorAlgorithm.cpp#L685

This happens whenever we rebalance L0, but only when there are two levels; for some reason, the assertion is not hit with three levels.

@nselliott
Collaborator

nselliott commented Jan 30, 2024

Could you try rebuilding with GriddingAlgorithm.cpp from branch bugfix/nselliott/rebalance-connector? This may fix all of your assertions and crashes. Connector is a class that holds a distributed graph representation of adjacency and overlap relationships between patches in different PatchLevels. The Connectors between levels 0 and 1 get built during hierarchy initialization and modified during regrids, when level 1 changes while level 0 is constant, but in the current state of the code, they get cleared when level 0 is rebalanced. Other parts of the code will rebuild those Connectors if they can't find them, but that's not something users can rely on to happen in all cases, so I consider this a SAMRAI bug. It looks likely to me that all of the failure modes you report here are based on those Connectors being expected but not available. The patch I added in the branch rebuilds them immediately.
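
As a rough illustration of the idea described above, here is a toy 1D sketch of an overlap "connector" between two levels. It is not SAMRAI's Connector implementation: ToyBox, ToyLevel, ToyConnector and build_overlap_connector are hypothetical names, the search is serial brute force rather than a distributed graph, and both levels are assumed to share one index space (no refinement ratio).

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Toy 1D patch box: closed index range [lo, hi] owned by one rank.
// Hypothetical stand-in; this is NOT SAMRAI's hier::Box.
struct ToyBox {
    int lo;
    int hi;
    int owner_rank;
};

using ToyLevel = std::vector<ToyBox>;

// Maps each box index in the "base" level to the indices of the
// boxes in the "head" level that it overlaps. Boxes with no overlap
// have no entry.
using ToyConnector = std::map<std::size_t, std::vector<std::size_t>>;

// Brute-force overlap search over every pair of boxes.
// Assumption: both levels are expressed in the same index space.
ToyConnector build_overlap_connector(const ToyLevel& base, const ToyLevel& head)
{
    ToyConnector connector;
    for (std::size_t i = 0; i < base.size(); ++i) {
        for (std::size_t j = 0; j < head.size(); ++j) {
            const bool overlaps =
                base[i].lo <= head[j].hi && head[j].lo <= base[i].hi;
            if (overlaps) {
                connector[i].push_back(j);
            }
        }
    }
    return connector;
}
```

In this toy model, re-partitioning the base level changes which box indices overlap the head level, so a connector built before the re-partition no longer describes the new boxes and has to be built again. That mirrors, in a much simplified form, why the level 0 to level 1 Connectors need rebuilding after level 0 is rebalanced.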

Regarding the rebalance_coarsest boolean, if true it will always cause the rebalancing of level 0 to execute; there is no pre-analysis to determine if the rebalancing will be beneficial, and it can be the case that the rebalancing will end up with no changes to the load balance. If you have it always true in advanceHierarchy, it will rebalance every timestep advance, which is likely much too often. So I would recommend using logic in your code to choose which boolean value to use, either by analyzing the data you are using for your workload variable, or by choosing an interval to rebalance every N time steps.
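
One way the two suggestions above (an interval of N time steps and an analysis of workload data) could be combined is sketched below. This is only an illustration under stated assumptions: load_imbalance and should_rebalance_coarsest are hypothetical helper names, not part of SAMRAI, and the per-rank loads are assumed to have been gathered by the application beforehand.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Ratio of the largest per-rank load to the mean load.
// 1.0 means perfectly balanced; larger values mean more imbalance.
// Hypothetical helper: how rank_loads is gathered is up to the caller.
double load_imbalance(const std::vector<double>& rank_loads)
{
    if (rank_loads.empty()) {
        return 1.0;
    }
    const double total =
        std::accumulate(rank_loads.begin(), rank_loads.end(), 0.0);
    if (total <= 0.0) {
        return 1.0;
    }
    const double mean = total / static_cast<double>(rank_loads.size());
    return *std::max_element(rank_loads.begin(), rank_loads.end()) / mean;
}

// Hypothetical policy: consider rebalancing only every `interval`
// steps, never at step 0 (the load is assumed balanced on init), and
// only when the measured imbalance exceeds `threshold`.
bool should_rebalance_coarsest(int step,
                               int interval,
                               const std::vector<double>& rank_loads,
                               double threshold)
{
    if (interval <= 0 || step <= 0 || step % interval != 0) {
        return false;
    }
    return load_imbalance(rank_loads) > threshold;
}
```

The returned value would then be used as the rebalance_coarsest argument for that time step. The interval and threshold are tuning parameters that depend on the application.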

@PhilipDeegan
Contributor Author

> Could you try rebuilding with GriddingAlgorithm.cpp from branch bugfix/nselliott/rebalance-connector?

This seems to solve the issue. Thanks for the clarifications. It does seem like Level 0 is a special case, which I guess makes sense, as copying the entire level might be substantial, or even impossible if there are not enough resources available.

@PhilipDeegan
Contributor Author

Thanks for the quick turnaround
