forked from Syndica/sig
fix(shred-network): various bugs (Syndica#481)
This PR was originally motivated by errors that occur throughout the application due to memory corruption. They occur regardless of which mode the validator is run in, as long as the shred network is running. The bug was introduced when the retransmit service was combined with the shred collector. In the process of debugging this issue I also fixed various other problems in shred-network.

### Widespread Memory Corruption

In `shred_network.service.start`, the `retransmit_channel` was being deinitialized before use because `defer retransmit_channel.deinit()` was present in the start function. The start function returns immediately, so the channels cannot be deinitialized there. This is a use-after-free that triggers memory corruption throughout the entire application, resulting in errors like segmentation faults in gossip and any other running services.

The fix is to not use a `defer` statement within the start function, and instead deinit the channel only when the service itself is deinitialized. This is currently accomplished by calling `service_manager.deferCall`. It may be worth considering another design that avoids this footgun, since it is easy to add an ordinary `defer` to the start function if you're not familiar with the ServiceManager pattern. See https://github.com/orgs/Syndica/projects/2/views/10?pane=issue&itemId=92733572

### RPC Client Panics

In the shred-collector command, one instance of the RPC Client was being used in multiple threads despite not being thread safe. This was causing various panics in code called by the RPC Client. I fixed this in `cmd.zig` by moving the sole main-thread call to the client so that it occurs before the client is passed to another thread.

### Leaks in Shred Network

Channel pointers that are created for Shred Network were not being destroyed. The Channel's internal resources were being cleaned up, but the Channel pointer itself was not. I added a `destroy` method to complement the Channel's `create` method, and made use of it in `shred_network.start`.

### Thread Safety in SharedWindowPointer

The `discard_buf` could potentially be used by multiple threads at once, leading to memory corruption. I did not observe any errors due to this, but it was a flaw in the general thread safety of the struct. Now, instead of mutating the shared buffer from any thread, an atomic pointer ensures that only a single thread at a time has access to the buffer. If another thread needs a buffer at the same time, it will alloc/free a dedicated buffer. This is unlikely to occur with the current way the struct is used, but it's there to ensure the struct is fully thread safe.

### Leak in SharedWindowPointer

The `discard_buf` and its contents were not being freed by SharedWindowPointer. I added a test to reveal this, and fixed the leak in the deinit function.

### RpcEpochContextService issues

RpcEpochContextService had various minor problems that I fixed.

It would acquire a context outside the loop, then loop over the same value and try to acquire the same context again. This can be fully handled by the loop itself, so I removed the logic from outside the loop.

There was also a potential memory leak on error, which needed an `errdefer`.

The intended design of the context manager is to return an error only if the *current* epoch's context cannot be acquired. Silent failure should be allowed only for other contexts. By accident, it was returning an error only if there was a failure retrieving the *last* epoch context.

I added a skipped test that you can unskip to manually run the RpcEpochContextManager. It helped me understand some of these bugs. Ideally this test can be unskipped in the future, but that will require some changes to the RPC client to allow mocking of the RPC endpoint.
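As a rough illustration of the use-after-free and the channel leak described above, here is a minimal Zig sketch. `Channel`, `Manager`, `deferDestroy`, and this `start` are hypothetical, simplified stand-ins and not the actual sig types or the real `service_manager.deferCall` API. The sketch only shows the ownership idea: cleanup is registered with a longer-lived owner instead of being attached to `start` with `defer`.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

/// Hypothetical stand-in for a channel: a heap-allocated object owning a buffer.
const Channel = struct {
    allocator: Allocator,
    buf: []u8,

    fn create(allocator: Allocator) !*Channel {
        const self = try allocator.create(Channel);
        errdefer allocator.destroy(self); // avoid leaking `self` if the next alloc fails
        const buf = try allocator.alloc(u8, 64);
        self.* = .{ .allocator = allocator, .buf = buf };
        return self;
    }

    /// Frees internal resources only; the pointer itself stays allocated.
    fn deinit(self: *Channel) void {
        self.allocator.free(self.buf);
    }

    /// Complements `create`: frees internal resources and the pointer itself.
    fn destroy(self: *Channel) void {
        const allocator = self.allocator;
        self.deinit();
        allocator.destroy(self);
    }
};

/// Hypothetical, simplified owner that runs cleanup when it is deinitialized.
const Manager = struct {
    channels: [8]*Channel = undefined,
    len: usize = 0,

    fn deferDestroy(self: *Manager, channel: *Channel) !void {
        if (self.len == self.channels.len) return error.TooManyCleanups;
        self.channels[self.len] = channel;
        self.len += 1;
    }

    fn deinit(self: *Manager) void {
        // Destroy in reverse order of registration, like a stack of defers.
        while (self.len > 0) {
            self.len -= 1;
            self.channels[self.len].destroy();
        }
    }
};

/// Returns immediately, like a service start function would.
fn start(allocator: Allocator, manager: *Manager) !*Channel {
    const channel = try Channel.create(allocator);
    // A plain `defer channel.destroy();` here would free the channel as soon
    // as `start` returns, while the caller still holds the pointer.
    errdefer channel.destroy();
    try manager.deferDestroy(channel);
    return channel;
}

test "channel stays valid after start returns" {
    var manager = Manager{};
    defer manager.deinit(); // the channel is freed here, not inside start
    const channel = try start(std.testing.allocator, &manager);
    channel.buf[0] = 42;
    try std.testing.expectEqual(@as(u8, 42), channel.buf[0]);
}
```

Running this with `zig test` under `std.testing.allocator` would also report a leak if `destroy` freed only the buffer and not the pointer.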
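The atomic-pointer approach described for `discard_buf` could look roughly like the following sketch. The names `DiscardBuffer`, `Lease`, `acquire`, and `release` are assumptions made for illustration, and the actual SharedWindowPointer code may be structured differently. The idea shown is that a thread claims the shared buffer by atomically swapping the pointer to null, and any thread that finds null falls back to a dedicated allocation.

```zig
const std = @import("std");
const Allocator = std.mem.Allocator;

const Buf = [256]u8;

/// Hypothetical sketch of a shared scratch buffer guarded by an atomic pointer.
const DiscardBuffer = struct {
    allocator: Allocator,
    /// Points at the shared buffer while it is free; null while a thread holds it.
    slot: std.atomic.Value(?*Buf),

    const Lease = struct { buf: *Buf, shared: bool };

    fn init(allocator: Allocator) !DiscardBuffer {
        const buf = try allocator.create(Buf);
        return .{ .allocator = allocator, .slot = std.atomic.Value(?*Buf).init(buf) };
    }

    /// Call only after every lease has been released.
    fn deinit(self: *DiscardBuffer) void {
        if (self.slot.swap(null, .acquire)) |buf| self.allocator.destroy(buf);
    }

    /// Claims the shared buffer if it is free, otherwise allocates a dedicated one.
    fn acquire(self: *DiscardBuffer) !Lease {
        if (self.slot.swap(null, .acquire)) |buf| {
            return .{ .buf = buf, .shared = true };
        }
        return .{ .buf = try self.allocator.create(Buf), .shared = false };
    }

    /// Returns the shared buffer to the slot, or frees a dedicated buffer.
    fn release(self: *DiscardBuffer, lease: Lease) void {
        if (lease.shared) {
            self.slot.store(lease.buf, .release);
        } else {
            self.allocator.destroy(lease.buf);
        }
    }
};

test "a second concurrent user gets a dedicated buffer" {
    var discard = try DiscardBuffer.init(std.testing.allocator);
    defer discard.deinit();

    const first = try discard.acquire();
    const second = try discard.acquire();
    try std.testing.expect(first.shared);
    try std.testing.expect(!second.shared);
    try std.testing.expect(first.buf != second.buf);

    discard.release(second);
    discard.release(first);
}
```

The fallback allocation keeps the common single-user path allocation-free while still being safe if two threads ever overlap.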
Showing 6 changed files with 132 additions and 62 deletions.