
[ZH] Fix crashing within AIPathfind due to inadequate cleanup of pathfindinfo resources. #994


Draft: wants to merge 7 commits into base: main

Conversation


@Mauller Mauller commented Jun 1, 2025

These issues are likely also fixed by this. It does not appear to be an issue with moving units en masse, but a cascade failure in the pathfinding that can eventually be triggered by moving units.


This PR fixes the crashing within AI pathing, which is caused by a lack of cleanup of pathfinding resources in some return paths. This results in resource exhaustion and in dangling parent cell information that points to invalid cells when a new path is calculated.

It can appear to be triggered when units get stuck or when there are mass unit movements, but in some circumstances the pathing can start to fail long before the actual crash.

I am keeping it as a draft for now as better ways of fixing this issue may crop up as more time is spent on it.


UPDATE1: It appears the crashes within prependCells are caused by a cell having a dangling reference to a parent cell that does not contain any pathing info. According to some recent logging by roos, these may be getting into the pathing early on in the game. It could also be due to the starting and goal cells having dangling info on them when they are selected, so that they point to an invalid parent cell at that point.

UPDATE2: I have removed all of the workarounds that we considered before, as I believe I came across the place that was causing the majority of the issues for some of the replays. The checkPathCost and pathDestination functions had returns that failed to clean up the open and closed cell lists. This was likely the major cause of the problems overall.
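One way to make sure every return path cleans up is a scope guard. The following is only a minimal sketch under stated assumptions: `Cell`, `Lists`, `ListCleanupGuard` and `checkCost` are hypothetical stand-ins, not the engine's real types or the code in this PR, and it uses modern C++ for brevity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins; not the engine's real types.
struct Cell {
    bool onOpen = false;
    bool onClosed = false;
};

struct Lists {
    std::vector<Cell*> open;
    std::vector<Cell*> closed;

    // Clears the per-cell flags and empties both lists.
    void cleanup() {
        for (Cell* c : open) c->onOpen = false;
        for (Cell* c : closed) c->onClosed = false;
        open.clear();
        closed.clear();
    }
};

// Runs Lists::cleanup() when the scope exits, whichever return is taken.
class ListCleanupGuard {
public:
    explicit ListCleanupGuard(Lists& lists) : m_lists(lists) {}
    ~ListCleanupGuard() { m_lists.cleanup(); }
    ListCleanupGuard(const ListCleanupGuard&) = delete;
    ListCleanupGuard& operator=(const ListCleanupGuard&) = delete;
private:
    Lists& m_lists;
};

// Toy cost check with an early return (assumed shape, for illustration only).
bool checkCost(Lists& lists, Cell& start, Cell& goal, bool failEarly) {
    assert(lists.open.empty() && lists.closed.empty()); // must start clean
    ListCleanupGuard guard(lists);

    start.onOpen = true;
    lists.open.push_back(&start);
    if (failEarly) {
        return false; // the guard still empties both lists here
    }

    // Move start from the open list to the closed list, then open the goal.
    start.onOpen = false;
    lists.open.clear();
    start.onClosed = true;
    lists.closed.push_back(&start);
    goal.onOpen = true;
    lists.open.push_back(&goal);
    return true; // and here
}
```

With this shape, a newly added early return cannot forget the cleanup, because the destructor runs on every exit from `checkCost`.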

UPDATE3: It looks as though the checkPathCost and pathDestination functions were the last and main cause of the issue for some of the maps. The replays now mismatch at some point halfway or three quarters of the way through, where the cascading failure starts in the pathfinding, but the replay appears to continue normally until the score screen appears.

UPDATE4: We noticed that pathing parent cell pointers were not being cleared after pathing, so we have also added cleanup for this when pathfinding info is deallocated. This helps to prevent cells that did not make it into the final path from having a dangling pointer to an invalid parent cell. These dangling pointers were part of the reason for the failures within prependCells.
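To illustrate why clearing the parent pointer on release matters, here is a small sketch of a fixed-size pool that recycles per-cell records. All names (`CellInfo`, `InfoPool`, the pool size of 2) are assumptions for illustration; the engine's actual allocator is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>

struct Cell;

// Hypothetical per-cell search record, loosely modelled on the m_info
// object mentioned in this thread.
struct CellInfo {
    Cell* parent = nullptr;       // previous cell in the path chain
    CellInfo* nextFree = nullptr; // free-list link while unallocated
    int costSoFar = 0;
};

struct Cell {
    CellInfo* info = nullptr;
};

class InfoPool {
public:
    InfoPool() {
        for (std::size_t i = 0; i < kSize; ++i) {
            m_storage[i].nextFree = m_free;
            m_free = &m_storage[i];
        }
    }

    // Returns false when the pool is exhausted; callers must check this.
    bool allocate(Cell& cell) {
        if (cell.info) return true;
        if (!m_free) return false;
        cell.info = m_free;
        m_free = m_free->nextFree;
        cell.info->nextFree = nullptr;
        return true;
    }

    // Resets the record before recycling it, so the next owner never
    // inherits a stale parent pointer.
    void release(Cell& cell) {
        if (!cell.info) return;
        cell.info->parent = nullptr;
        cell.info->costSoFar = 0;
        cell.info->nextFree = m_free;
        m_free = cell.info;
        cell.info = nullptr;
    }

private:
    static constexpr std::size_t kSize = 2; // tiny on purpose
    CellInfo m_storage[kSize];
    CellInfo* m_free = nullptr;
};
```

Without the reset in `release`, a recycled record would still point at whatever parent its previous owner had, which is the kind of dangling link described above.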

UPDATE5: A new crashing replay revealed that some code could crash within internal_findHierarchicalPath when trying to allocate pathfinding cell information, so checks and cleanup code were added around this.

UPDATE6: The entire pathfinding has been altered to have two modes of operation. It starts in a retail-compatible mode. If the game then hits a retail crash site, we recover by catching the failure mode and cleaning up the pathfinding information, then set a flag so that all pathfinding from then on uses the fully fixed codepath.
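The failover described here can be pictured roughly as follows. This is a sketch, not the PR's code: `Pathfinder`, `PathResult` and the callbacks are hypothetical, and whether the request that hit the failure is retried is not stated in this thread, so the sketch simply reports no path for it.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class PathResult { Found, NotFound, InternalFailure };

// Hypothetical wrapper around two search implementations.
class Pathfinder {
public:
    using Search = std::function<PathResult()>;
    using Cleanup = std::function<void()>;

    Pathfinder(Search retail, Search fixed, Cleanup cleanup)
        : m_retail(std::move(retail)),
          m_fixed(std::move(fixed)),
          m_cleanup(std::move(cleanup)) {
        assert(m_retail && m_fixed && m_cleanup);
    }

    PathResult findPath() {
        if (m_useFixed) {
            return m_fixed();
        }
        PathResult result = m_retail();
        if (result != PathResult::InternalFailure) {
            return result;
        }
        // A retail crash site was hit: clean up, then switch for good.
        m_cleanup();
        m_useFixed = true;
        return PathResult::NotFound; // assumption: the failed request is not retried
    }

    bool usingFixedPath() const { return m_useFixed; }

private:
    Search m_retail;
    Search m_fixed;
    Cleanup m_cleanup;
    bool m_useFixed = false; // one-way flag, never reset during a game
};
```

The flag only ever moves from retail to fixed, which matches the description that all pathfinding after the first caught failure uses the fixed codepath.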

Having tested the changes with replays that don't crash due to pathfinding, I have not come across any mismatches yet, which is a good sign that the changes only affect games where the circumstances that trigger the pathfinding failure occur. This will need further testing with more replays. Some may mismatch before the end, but those games could have failed had they continued rather than ended.

N.B. For games where the pathfinding does hit a failure path, we have noticed that the failure can start anywhere from 30 seconds to a few minutes before the crash occurs. This shows as a mismatch in the replay, but the replay continues to play as normal afterwards.


So far it minimises the amount of crashing by making sure pathing resources are cleaned up properly when pathing fails, and that resources are properly released on successful paths. This resolves 2/5 of the crashes, such as the one in the Legi Han Dynasty replay.

This also helped to alleviate some crashes that were occurring when the open and closed lists were cleaned up due to dangling cells being on the list without any pathing information attached to them.

In some circumstances we still run out of pathing resources, resulting in crashes within prependCells. This is alleviated by checking that the cell being used has pathing information associated with it, but I am still investigating whether we can prevent cells without pathing information from getting this far in the chain.

So far I have attempted to alter the pathfinding code to exit early if no pathing info is assigned to new cells. This ended up mismatching very early in some games and also caused the golden replay to mismatch. It would have been the best solution: fail pathing early if we detect that pathing resources have been exhausted.

I still have to test if the above fixes coupled with increasing available pathing resources helps. Previous testing by increasing pathing resources without the above didn't appear to make a difference. But the above fixes help alleviate dangling pathing resources.

In one replay there is still a crash in some instances within checkPathCost. There is some mis-ordering in which the parent cell is put onto the closed list before it has been checked against the goal cell. I managed to reorder this without issue, but it has not prevented the crashing, although it is the correct order of operations at this point.
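The ordering in question can be sketched like this: the cell taken off the open list is compared with the goal first, and only moved to the closed list if it is not the goal. This is a simplified illustration with hypothetical names; it uses a plain FIFO open list rather than the cost-sorted list the game uses.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical graph cell; not the engine's grid cell.
struct Cell {
    bool queued = false;
    bool closed = false;
    std::vector<Cell*> neighbours;
};

// Returns true if goal is reachable from start. Cells that were fully
// processed are appended to closedList.
bool search(Cell& start, Cell& goal, std::vector<Cell*>& closedList) {
    std::deque<Cell*> open;
    start.queued = true;
    open.push_back(&start);

    while (!open.empty()) {
        Cell* current = open.front();
        open.pop_front();

        // 1. Goal check first: the goal never lands on the closed list.
        if (current == &goal) {
            return true;
        }

        // 2. Only then mark the cell as processed.
        assert(!current->closed); // each cell is closed at most once
        current->closed = true;
        closedList.push_back(current);

        for (Cell* n : current->neighbours) {
            if (!n->queued) {
                n->queued = true;
                open.push_back(n);
            }
        }
    }
    return false;
}
```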

There are also blocks within checkPathCost that appear to have been placed there to stop pathing if the path cost has exceeded 500 cells. It appears that there was a copying mistake, as the code simply calls continue at that point.

Overall I don't think this is going to be an easy one to fix 100% without some null checks in places until we drop compat.
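The difference between the two statements is easy to see in isolation. The loop below is a hypothetical stand-in, not the real checkPathCost: `costs` represents the cost of each cell as it is taken off the open list, and `limit` plays the role of the 500-cell threshold.

```cpp
#include <cassert>
#include <vector>

struct LoopStats {
    int visited = 0;  // cells taken off the open list
    int expanded = 0; // cells whose cost was within the limit
};

// Hypothetical loop shape, for illustration only.
LoopStats runCostLoop(const std::vector<int>& costs, int limit, bool stopAtLimit) {
    assert(limit >= 0);
    LoopStats stats;
    for (int cost : costs) {
        ++stats.visited;
        if (cost > limit) {
            if (stopAtLimit) {
                break;  // ends the whole search
            }
            continue;   // only skips this one cell, the search goes on
        }
        ++stats.expanded;
    }
    return stats;
}
```

For costs of 100, 600, 200, 700 and 300 with a limit of 500, the `continue` variant still visits all five cells, while the `break` variant stops at the second.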

@roossienb roossienb added the Critical, Gen, ZH and Fix labels Jun 1, 2025
@roossienb roossienb added this to the Code foundation build up milestone Jun 1, 2025
@Mauller Mauller added the Crash label and removed the Fix label Jun 2, 2025
@roossienb

This will cause a mismatch about 20 to 30 frames before the end of the replay in replays that crashed because of pathfinding issues. Examples are the Han dynasty replay in issue #968 or golden replay #2.

@Mauller
Author

Mauller commented Jun 3, 2025

Not all replays appear to show a mismatch before the crash occurred or at the point of the crash. But the mismatch being right near the end of the replay is a good indication that it was a pathing issue that caused the crash.

So there still seem to be issues with dangling cells, which the prependCells check was working around, but we still need to find the cause.

@Mauller
Author

Mauller commented Jun 4, 2025

Aircraft goal cells and certain cells marked with terrain flags appear to prevent the normal cleanup of pathfindinfo.
This meant they could have dangling pointers to a parent cell. So if they were used as a starting cell they could point to a cell that was not meant to be part of the pathing chain.

The pathfinding works by parenting each cell to the previous cell in the chain, from the starting cell to the goal cell.
The path is then optimised from the goal cell back to the starting cell within prependCells.

This then led to the crash within prependCells, as the last cell in the list would be invalid: it lay beyond the start cell and pointed to a cell without any pathing info.
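The walk back along the parent chain, and the kind of guard that avoids the null access, can be sketched as below. The types and `buildPath` are hypothetical and much simpler than prependCells; stopping explicitly at the start cell is a choice made for this sketch, and cycles in the chain are not handled.

```cpp
#include <cassert>
#include <vector>

struct Cell;

// Hypothetical per-cell record holding the parent link.
struct CellInfo {
    Cell* parent = nullptr;
};

struct Cell {
    int id = 0;
    CellInfo* info = nullptr; // null when the cell has no pathing info
};

// Walks goal -> start through the parent links and writes the cell ids
// start-first into out. Returns false if the chain is broken.
bool buildPath(const Cell* start, const Cell* goal, std::vector<int>& out) {
    assert(start && goal);
    out.clear();
    for (const Cell* c = goal; c != nullptr; c = c->info->parent) {
        if (c->info == nullptr) {
            out.clear();
            return false; // dangling link: this cell has no pathing info
        }
        out.insert(out.begin(), c->id); // prepend
        if (c == start) {
            return true; // never follow a stale parent beyond the start
        }
    }
    out.clear();
    return false; // chain ended without reaching the start cell
}
```

If a cell in the chain carries a stale parent that points at a cell with no info, the walk reports failure instead of dereferencing a null pointer.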

@Mauller Mauller force-pushed the reduce-pathfinding-crashes branch from 631b78b to 94b6629 Compare June 4, 2025 18:45
@Mauller
Author

Mauller commented Jun 4, 2025

So with the recent change I believe we have resolved the biggest issues with the leaking cells.

It appears that some functions were not cleaning up after themselves when returning early in some places. This left cells with pathfindinfo still connected to them, resulting in the dangling parent cells that we were seeing before.

@Mauller
Author

Mauller commented Jun 5, 2025

Most recent update: it appears that the checkPathCost and pathDestination functions were some of the major causes of the crashes in some of the replays. The paths that lacked proper cleanup must have been hit quite early on in some circumstances, which then triggered the cascading failure in the pathfinding.

@Mauller Mauller changed the title from "[ZH] Reduce crashing within AIPathfind due to dangling pathing resources and null pathing info within some cells." to "[ZH] Fix crashing within AIPathfind due to inadequate cleanup of pathfindinfo resources." Jun 5, 2025
@Mauller Mauller force-pushed the reduce-pathfinding-crashes branch 2 times, most recently from b3ffce0 to 12782b5 Compare June 6, 2025 06:25
@roossienb

roossienb commented Jun 6, 2025

The pathfinding system is based on the A* algorithm—great explanation here.

In this system, potential paths are created by assigning end cells (green-colored in the video) a reference to a preceding cell, known as the "parent." Each parent cell links to another parent, forming a chain that ultimately traces back to the starting cell, where the unit begins.

Additionally, cells are categorized into one of two lists: a sorted open list, containing cells still to be evaluated, and an unsorted closed list, holding processed cells.

Instead of using a separate list structure, the system utilizes cell references. Each cell directly points to the next in sequence, forming a linked list. The list and parent chains are maintained within a single object (m_info).
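A list threaded through the cells themselves might look like the sketch below. The field and function names are assumptions made for illustration; only the idea of keeping the parent link and the list link together in one per-cell record comes from the description above.

```cpp
#include <cassert>

struct Cell;

// Hypothetical per-cell record: parent chain and open-list link side by side.
struct CellInfo {
    Cell* parent = nullptr;   // previous cell on the candidate path
    Cell* nextOpen = nullptr; // next cell on the open list
    int totalCost = 0;
    bool open = false;
};

struct Cell {
    CellInfo info;
};

// Inserts cell so the list stays sorted by ascending totalCost.
// Returns the (possibly new) head of the list.
Cell* putOnSortedOpenList(Cell* head, Cell* cell) {
    assert(cell && !cell->info.open);
    cell->info.open = true;
    if (!head || cell->info.totalCost < head->info.totalCost) {
        cell->info.nextOpen = head;
        return cell;
    }
    Cell* cur = head;
    while (cur->info.nextOpen &&
           cur->info.nextOpen->info.totalCost <= cell->info.totalCost) {
        cur = cur->info.nextOpen;
    }
    cell->info.nextOpen = cur->info.nextOpen;
    cur->info.nextOpen = cell;
    return head;
}

// Removes and returns the cheapest cell, or nullptr if the list is empty.
Cell* popOpen(Cell*& head) {
    Cell* cell = head;
    if (cell) {
        head = cell->info.nextOpen;
        cell->info.nextOpen = nullptr; // leave no stale link behind
        cell->info.open = false;
    }
    return cell;
}
```

Because the links live inside the cells, a cell that is not reset when it leaves a list keeps pointing into that list, which is how stale references can survive from one search to the next.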

Bugs stem from improper clearing of cell references, leading to obsolete data lingering in the system. The primary issues were:

  • Some areas of the source code failed to clear cells, leaving outdated parent and list references.
  • Given the interconnected chains, clearing cells in the wrong order caused certain cells to remain uncleared, leading to dangling references.
  • While m_info contains other essential data, only certain references should be cleared. However, parent cell references remained uncleared, causing errors.

The implemented fix ensures:

  • Complete clearing of all cells when exiting the pathfinding function.
  • Correct order of clearing, preserving chain integrity.
  • Removal of all parent cell references, preventing invalid paths.
  • Correct initialization of cells.
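The point about clearing order can be shown with a small sketch of releasing an intrusive list: the link to the next cell has to be read before the current cell is reset, otherwise the walk stops after the first cell and the rest stay uncleared. `Cell` and `releaseList` are hypothetical names, not the code in this PR.

```cpp
#include <cassert>

// Hypothetical cell with both links stored in place.
struct Cell {
    Cell* parent = nullptr;     // path chain
    Cell* nextInList = nullptr; // open or closed list chain
    bool inList = false;
};

// Clears every cell on the list and returns how many were released.
int releaseList(Cell*& head) {
    int released = 0;
    Cell* cur = head;
    while (cur) {
        assert(cur->inList);           // every listed cell should be flagged
        Cell* next = cur->nextInList;  // read the link BEFORE clearing it
        cur->parent = nullptr;         // drop the parent reference too
        cur->nextInList = nullptr;
        cur->inList = false;
        ++released;
        cur = next;
    }
    head = nullptr;
    return released;
}
```

Calling this for both the open and the closed list on every exit from the search leaves no cell with a parent or list reference from a previous run.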

The results are:

  • Games and replays no longer crash.
  • Replays that ended with a crash due to pathfinding may mismatch about 1 to 120 seconds before the actual crash would occur. If the replay is continued, it does not crash and ends with the score screen.
  • Games with combined SH and Original game versions will mismatch rather than crash.

@Mauller
Author

Mauller commented Jun 8, 2025

Extra checks and cleanup code added after a recent replay was found to crash within some code that we didn't expect it to crash within.

@Mauller Mauller force-pushed the reduce-pathfinding-crashes branch from 151ef8e to e52ef44 Compare June 9, 2025 18:42
@Mauller
Author

Mauller commented Jun 9, 2025

Just rebased with recent main.

@helmutbuhler

Here is a replay that mismatches with this PR. It mismatches at 09:47. Without the PR it runs until the end at 20:33.
00-41-30_2v2_Nic_BOMD2MAS_HardAI_HardAI.zip
Out of 300 replays tested, 7 mismatched. I can upload more if desired.

@Mauller
Author

Mauller commented Jun 10, 2025

> Here is a replay that mismatches with this PR. It mismatches at 09:47. Without it runs until the end at 20:33 00-41-30_2v2_Nic_BOMD2MAS_HardAI_HardAI.zip Out of 300 tested ones I got 7 mismatching replays, I can upload more if desired.

If you can upload the others, it would be good to know when they mismatch. In many of the crashing replays we have for the pathfinding problems, the original crash comes 30 seconds to 2 minutes after the point where the fixed code signals a mismatch.

Only one mismatched nearly 10 minutes before the crash and this replay ran all the way to the end without the usual unit lockups etc.

@Mauller Mauller force-pushed the reduce-pathfinding-crashes branch 4 times, most recently from defc17d to 05a6312 Compare June 13, 2025 18:51
@Mauller
Author

Mauller commented Jun 13, 2025

Updated with an initial retail-compatible code pathway and a fully fixed one.

The game initially runs in the retail-compatible mode until a failure is caught. The code then cleans up and recovers before enabling the fully fixed pathfinding codepath.

@Mauller Mauller force-pushed the reduce-pathfinding-crashes branch 3 times, most recently from 9ef44c8 to 976d9f6 Compare June 15, 2025 16:05
@Mauller
Author

Mauller commented Jun 18, 2025

So we had our first test last night. We managed to get the retail clients to crash out while the SH clients carried on (mostly).

We did have some other issues: my client crashed before the pathfinding crash, but this was for a different reason.
One of the other guys then crashed out for a different reason as well. But the retail-compatible fixed pathfinding clients appeared to fail over and continue working as expected.

We will be performing some larger scale testing with mixed retail and SH clients today, then perform some similar testing with just SH clients to check the stability.

@Mauller Mauller force-pushed the reduce-pathfinding-crashes branch from 976d9f6 to ac94ca3 Compare June 19, 2025 18:40
@Mauller
Author

Mauller commented Jun 19, 2025

Just rebased with recent main. I still need to do a little more testing to make sure it's all working properly when it fails over.

Mauller and others added 6 commits June 21, 2025 10:30
…urce cleanup and null pathing info within some cells.

Co-authored-by: Bart Roossien <[email protected]>
…idual cells and add clearing of parent cell pointers when releasing pathfind cell info

Co-authored-by: Bart Roossien <[email protected]>
…th() and processHierarchicalCell()

Co-authored-by: Bart Roossien <[email protected]>
…n crashing areas and resetting to use fixed pathfinding
@Mauller Mauller force-pushed the reduce-pathfinding-crashes branch from ac94ca3 to 07b5161 Compare June 21, 2025 09:30
@Mauller
Author

Mauller commented Jun 21, 2025

Just a rebase with main, still needs further testing.

@Mauller
Author

Mauller commented Jun 22, 2025

Pushed a new commit that helps avoid a game crash within the retail pathfinding codepath. This can occur when the processed cell has a dangling link to another cell. This faux parent cell has no pathfinding info associated with it, which results in a crash due to a function access on a null pointer.

Labels
Crash This is a crash, very bad Critical Severity: Minor < Major < Critical < Blocker Gen Relates to Generals ZH Relates to Zero Hour
4 participants