Hi BenchMARL team,
First of all, thank you for this amazing and highly modular library!
I am currently working on a custom MARL environment where agents need to process spatial topology as well as retain temporal memory. To achieve this, I am combining a GNN and a GRU using `SequenceModelConfig`, alongside the MASAC (off-policy) algorithm.
My model configuration looks like this:

```python
model_config = SequenceModelConfig(
    model_configs=[gnn_config, gru_config],
    intermediate_sizes=[256],
)
```
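For context, the two sub-configs are built roughly like this (I am loading the defaults from YAML; the `GnnConfig`/`GruConfig` class names are my understanding of the current `benchmarl.models` package, so please correct me if they have changed):

```python
from benchmarl.models import GnnConfig, GruConfig, SequenceModelConfig

# NOTE: loading defaults from YAML; any tweaks to the GNN topology or
# GRU hidden size are done on top of these. Exact field names should be
# checked against the current model dataclasses.
gnn_config = GnnConfig.get_from_yaml()  # spatial message passing over agents
gru_config = GruConfig.get_from_yaml()  # temporal memory

model_config = SequenceModelConfig(
    model_configs=[gnn_config, gru_config],
    intermediate_sizes=[256],  # GNN output size, fed into the GRU
)
```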
Since MASAC is an off-policy algorithm, I have a few theoretical and implementation questions about how it handles the replay buffer and optimization:
- **Replay Buffer Sampling & BPTT:** When using an RNN (GRU/LSTM) in an off-policy setting like MASAC, does the BenchMARL replay buffer automatically sample intact trajectories/sequences instead of independent random transitions? If so, how is the sequence (BPTT) length defined and controlled?
- **Hidden State Management during Training:** During the off-policy optimizer loop (`_optimizer_loop`), does the framework use the stale hidden states stored in the replay buffer, or does it perform a "burn-in" pass to recompute the hidden states with the most up-to-date network weights?
- **Out-of-the-box Compatibility:** Is the combination `SequenceModelConfig(GNN, GRU)` + MASAC fully supported out of the box? Are there any specific `ExperimentConfig` parameters I need to tweak (e.g., batching rules for off-policy RNNs) to ensure the recurrent gradients backpropagate correctly?
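To make the first question concrete, here is a library-agnostic sketch (pure Python, no BenchMARL APIs) of what I mean by sampling intact fixed-length sequences rather than independent transitions:

```python
import random

def sample_sequences(episodes, seq_len, batch_size, rng=random):
    """Sample `batch_size` contiguous windows of length `seq_len`,
    each taken from a single episode (never crossing episode boundaries)."""
    eligible = [ep for ep in episodes if len(ep) >= seq_len]
    batch = []
    for _ in range(batch_size):
        ep = rng.choice(eligible)
        start = rng.randrange(len(ep) - seq_len + 1)
        batch.append(ep[start:start + seq_len])  # intact trajectory slice
    return batch

# Toy "episodes" of transitions, represented as consecutive integers
episodes = [list(range(10)), list(range(100, 108))]
batch = sample_sequences(episodes, seq_len=4, batch_size=3)
assert all(len(seq) == 4 for seq in batch)  # fixed BPTT window
```

My question is whether BenchMARL does something equivalent internally when a recurrent model is present, and where that window length is configured.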
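And for the second question, this is the "burn-in" behavior I am asking about, sketched with a toy recurrence standing in for a GRU cell (again, illustrative pseudocode in my own words, not BenchMARL internals):

```python
def gru_like_step(h, x, w=0.9):
    # toy scalar recurrence standing in for a GRU cell
    return w * h + (1 - w) * x

def recompute_hidden_with_burn_in(sequence, burn_in, h0=0.0):
    """Replay the first `burn_in` inputs through the *current* recurrence
    (no gradients needed) to refresh the stale stored hidden state, then
    return the state used to start BPTT on the remainder of the window."""
    h = h0
    for x in sequence[:burn_in]:
        h = gru_like_step(h, x)
    return h, sequence[burn_in:]

seq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
h_start, train_part = recompute_hidden_with_burn_in(seq, burn_in=2)
assert train_part == [3.0, 4.0, 5.0, 6.0]  # BPTT runs only on this suffix
```

The alternative, which I suspect is simpler but more biased, would be to initialize from the hidden state that was stored in the buffer at collection time.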
Any guidance or pointers to relevant parts of the codebase would be highly appreciated. Thanks again for your time and for maintaining this great project!