-
Notifications
You must be signed in to change notification settings - Fork 77
[ZH] Fix crashing within AIPathfind due to inadequate cleanup of pathfindinfo resources. #994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Not all replays appear to show a mismatch before the crash occurred or at the point of the crash. But the mismatch being right near the end of the replay is a good indication that it was a pathing issue that caused the crash So there seem to be issue still with cells dangling which the prependCells check was working around, but we still need to find the cause. |
Aircraft goal cells and certain cells marked with terrain flags appear to prevent the normal cleanup of pathfindinfo. The pathfinding works by parenting cells to the previous cell in the chain from the starting cell to the goal cell. This then lead to the crash within prepend cells. As the last cell in the list would be invalid and beyond the start cell and pointing to a cell without any pathing info. |
631b78b
to
94b6629
Compare
So with the recent change i believe we have resolved the biggest issues with the leaking cells. It appears that some functions were not cleaning up after themselves when returning early in some places. this was then leading to cells that still had pathfindinfo connected to them. Resulting in the dangling parent cells that we were seeing before. |
Most recent update, it appears that the |
b3ffce0
to
12782b5
Compare
The pathfinding system is based on the A* algorithm—great explanation here. In this system, potential paths are created by assigning end cells (green-colored in the video) a reference to a preceding cell, known as the "parent." Each parent cell links to another parent, forming a chain that ultimately traces back to the starting cell, where the unit begins. Additionally, cells are categorized into one of two lists: A sorted open list, containing cells to be evaluated and a an unsorted closed list, holding processed cells. Instead of using a separate list structure, the system utilizes cell references. Each cell directly points to the next in sequence, forming a linked list. The list and parent chains are maintained within a single object ( Bugs stem from improper clearing of cell references, leading to obsolete data lingering in the system. The primary issues were:
The implemented fix ensures:
The results are
|
Extra checks and cleanup code added after a recent replay was found to crash within some code that we didn't expect it to crash within. |
151ef8e
to
e52ef44
Compare
Just rebased with recent Main. |
Here is a replay that mismatches with this PR. It mismatches at 09:47. Without it runs until the end at 20:33 |
If you can upload the others, it would be good to know when they mismatch. Since in many of the crashing replays we have for the pathfinding problems, they crash within 30 seconds - 2 minutes after a mismatch was signaled when run using the fixed code. Only one mismatched nearly 10 minutes before the crash and this replay ran all the way to the end without the usual unit lockups etc. |
defc17d
to
05a6312
Compare
Updated with an initial retail and a fully fixed code pathway. The game will initially run in the retail compatible mode till a failure is caught, the code will then cleanup and recover before enabling the fully fixed pathfinding codepath. |
9ef44c8
to
976d9f6
Compare
So we had our fist test last night, we managed to get the retail clients to crash out while SH clients carried on (mostly). We did have some other issues such as my client crashed before the pathfinding crash, but this was for a different reason. We will be performing some larger scale testing with mixed retail and SH clients today, then perform some similar testing with just SH clients to check the stability. |
976d9f6
to
ac94ca3
Compare
Just rebased with recent main, still need to do a little more testing still to make sure it's all working properly when it fails over. |
…urce cleanup and null pathing info within some cells. Co-authored-by: Bart Roossien <[email protected]>
…il some conditions Co-authored-by: Bart Roossien <[email protected]>
…idual cells and add clearing of parent cell pointers when releasing pathfind cell info Co-authored-by: Bart Roossien <[email protected]>
…th() and processHierarchicalCell() Co-authored-by: Bart Roossien <[email protected]>
…n crashing areas and resetting to use fixed pathfinding
ac94ca3
to
07b5161
Compare
Just a rebase with main, still needs further testing. |
Pushed a new commit that helps avoid a game crash within the retail pathfinding codepath, this can occur due to the processed cell having a dangling link to another cell. this faux parent cell then has no pathfinding info associated with it which then results in a crash due to a function access on a null pointer. |
These issues are likely also fixed by this, it does not appear to be an issue with moving units on mass but a cascade failure in the pathfinding that can finally get triggered by moving units.
This PR fixes the crashing within AI pathing. Which is being caused by a lack of cleanup of pathfinding resources in some return paths. This then results in resource exhaustion and dangling parent cell information that points to invalid cells when a new path is calculated.
It can appear to be triggered when units get stuck or there are mass unit movements. But the pathing can start to fail long before the point of the actual crash in some circumstances.
I am keeping it as a draft for now as better ways of fixing this issue may crop up as more time is spent on it.
UPDATE1: So it appears the crashes within
prependCells
is being caused by a cell having dangling information to a parent cell that does not contain any pathing info. These may be getting into the pathing early on in the game according to some recent logging by roos. It could also be due to the starting and goal cell having dangling info on them when they are selected. They are then pointing to an invalid parent cell at this point.UPDATE2: So i have removed all of the workarounds that we considered before as i believe i came across the place which was causing the majority of the issues for some of the replays. The
checkPathCost
function andpathDestination
functions had returns that failed to cleanup the open and closed cell lists. This was likely the major cause of the problems overall.UPDATE3: It looks as though the
checkPathCost
andpathDestination
functions were the last and main cause of the issue for some of the maps. The replays now mismatch at some point halfway through or 3/4 of the way through where the cascading failure starts in the pathfinding. But the replay appears to continue normally till the score screen appears.UPDATE4: We noticed that pathing parent cell pointers were not being cleared after pathing. So we have also added cleanup for this when pathfinding info is deallocated. This helps to prevent cells that did not make it into the final path having a dangling pointer to an invalid parent cell. These invalid dangling pointers were part of the reason for the failures within prepend cells.
UPDATE5: A new crashing replay revealed that some code could crash within
internal_findHierarchicalPath
when trying to allocate path finding cell information. So added checks and cleanup code around this.UPDATE6: The entire pathfinding has been altered to have two modes of operation, the initial is a retail compatible mode which it starts within, Then if the game hits a retail crash site, we recover by catching the failure mode, cleaning up the pathfinding information, then set a flag so all pathfinding from then on uses the fully fixed codepath.
Having tested the changes with replays that don't crash due to pathfinding, i have not come across any mismatches yet, which is a good sign that the changes only affect games where the circumstances that trigger the pathfinding failure occurs. But will need further testing with more replays, some may possibly mismatch before the end, but these games could have failed if they continued and did not end.
N.b For games where the pathfinding does hit a failure path, we have noticed that the failure can start anywhere from 30 seconds to a few minutes before the crash occurs. This shows as a mismatch in the replay but the replay continues to play as normal after.
So far it minimises the amount of crashing by making sure pathing resources are cleaned properly when pathing fails, along with making sure that resources are properly released on successful paths. This resolves 2/5 of the crashes such as with the Legi Han Dynasty replay.
This also helped to alleviate some crashes that were occurring when the open and closed lists were cleaned up due to dangling cells being on the list without any pathing information attached to them.
In some circumstances we still run out of pathing resources resulting in crashes within
prependCells
, this is alleviated by checking that the cell being used has pathing information associated with it. But i am still investigating if we can prevent cells without pathing information getting this far in the chain.So far i have attempted to alter the pathfinding code to exit early if no pathing info is assigned to new cells, this ended up mismatching very early in some games. This also caused the golden replay to mismatch as well. This would have been the best solution, fail pathing early if we detect that pathing resources have been exhausted.
I still have to test if the above fixes coupled with increasing available pathing resources helps. Previous testing by increasing pathing resources without the above didn't appear to make a difference. But the above fixes help alleviate dangling pathing resources.
In one replay there is still a crash in some instances within
checkPathCost
There is some mis-ordering in which the parent cell is put onto the closed path before it has been checked to see if it has matched the goal cell. I managed to reorder this without issue but it has not prevented crashing, although it is the correct order of operation at this point.There are also blocks within
checkPathCost
that appear to have been placed there to stop pathing if the path cost has exceeded 500 cells. It appears that there was a copying mistake as within the code it simply callscontinue
.Overall i don't think this is going to be an easy one to fix 100% till we drop compat without having some null checks in places.