
Add agent simulation learning workflow#30

Merged
yichao-liang merged 71 commits into master from sim-learning on May 5, 2026
yichao-liang (Collaborator) commented May 4, 2026

Summary

  • The sim_learning approach with oracle rules and learned parameters can solve boil tasks.
  • Add an agent simulation-learning approach, training utilities, and bilevel sketch/tooling support.
  • Add simulator parameter-fitting support across PyBullet environments, including boil ground-truth simulator integration and related configuration updates.
  • Add focused tests for agent simulation learning, code simulation learning, and the bilevel explorer path.

Delegate option execution to option_model.get_next_state_and_num_actions
instead of duplicating its termination logic (stuck detection, Wait
atom-change checks) and directly accessing its simulator.
…inement

Extract the duplicated backtracking loop from run_low_level_search (SeSamE)
and _refine_sketch (agent bilevel) into a single run_backtracking_refinement
function in planning.py. Both callers now delegate to it with their own
sample_fn and validate_fn callbacks, eliminating ~80 lines of duplicated
loop/backtracking logic.
Replace 60 lines of manual option-model execution with a call to
run_backtracking_refinement using max_tries=[1] and a sample_fn that
returns the pre-grounded options. Remove unused Any import.
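The shared loop can be pictured as follows. This is a minimal sketch, not the code in planning.py: the signature `run_backtracking_refinement(init_state, num_steps, max_tries, sample_fn, validate_fn, rng)` and the callback shapes are assumptions. `sample_fn` proposes an option for step `i`, and `validate_fn` returns the next state, or `None` to reject the option. Setting `max_tries=[1] * n` with a `sample_fn` that returns pre-grounded options turns the search into a single forward rollout.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

S = TypeVar("S")  # state type
O = TypeVar("O")  # option type


def run_backtracking_refinement(
    init_state: S,
    num_steps: int,
    max_tries: Sequence[int],
    sample_fn: Callable[[int, S, random.Random], O],
    validate_fn: Callable[[int, S, O], Optional[S]],
    rng: random.Random,
) -> Optional[List[O]]:
    """Hypothetical sketch: return a refined plan, or None when exhausted."""
    plan: List[O] = []
    states: List[S] = [init_state]
    tries = [0] * num_steps
    idx = 0
    while idx < num_steps:
        if tries[idx] >= max_tries[idx]:
            # This step's budget is spent: reset it and backtrack one step.
            if idx == 0:
                return None
            tries[idx] = 0
            idx -= 1
            continue
        tries[idx] += 1
        option = sample_fn(idx, states[idx], rng)
        next_state = validate_fn(idx, states[idx], option)
        if next_state is None:
            continue  # rejected: resample the same step
        # Drop anything after this step, then record the new choice.
        del plan[idx:]
        del states[idx + 1:]
        plan.append(option)
        states.append(next_state)
        idx += 1
    return plan


# Toy problem: three steps, each adds 1 or 2, and the final state must be 4.
def sample(i: int, s: int, rng: random.Random) -> int:
    return rng.choice([1, 2])


def validate(i: int, s: int, o: int) -> Optional[int]:
    nxt = s + o
    if i == 2 and nxt != 4:
        return None
    return nxt


plan = run_backtracking_refinement(0, 3, [10, 10, 10], sample, validate,
                                   random.Random(0))

# Pre-grounded replay: one try per step, options fixed in advance.
fixed = [2, 1, 1]
replay = run_backtracking_refinement(0, 3, [1, 1, 1],
                                     lambda i, s, rng: fixed[i], validate,
                                     random.Random(0))
```

Passing the sampler and validator as callbacks keeps the loop itself identical for the SeSamE search and the sketch-refinement caller.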
Move the _current_observation assignment into _reset_state so callers
don't need to remember the two-step pattern.  Clarify the relationship
between _current_observation (backing field) and _current_state (typed
read accessor) in docstrings and comments.
Adds agent_bilevel_plan_sketch_file setting that, when set to a file
path, loads the plan sketch directly from that file, bypassing the
foundation model query. Includes test data files and a unit test.
Extract repeated wait-termination check into _check_wait_termination helper
and unify the three _terminal branches into a single definition with
config checks inside the function body.
- Remove dead/commented-out code and stale self-question comments
- Add _VIRTUAL_OBJECT_TYPES constant to replace hardcoded type-name
  skip lists in _set_state and _get_state
- Move env-specific _get_robot_state_dict branches to subclass overrides
  in pybullet_cover and pybullet_blocks
- Extract _get_camera_matrices helper to deduplicate render methods
- Extract _get_object_state_dict from _get_state for per-object logic
- Move create_pybullet_block/sphere to pybullet_helpers/objects.py
- Merge _create_task_specific_objects into _set_domain_specific_state
- Rename: _reset_state -> _set_state,
  _reset_custom_env_state -> _set_domain_specific_state,
  _extract_feature -> _get_domain_specific_feature
- Add docstrings explaining where each method is called from
Reorganize methods into labeled sections (Setup, Public API, Core Loop,
State Write/Read, Grasp Management, Action Helpers, Rendering, Utilities)
so related functions are adjacent. Update module docstring to document
the main public API and state synchronization methods.
Add _step_base() and _domain_specific_step() to PyBulletEnv base class.
step() now calls _step_base (robot control, physics, grasp) then
_domain_specific_step (water filling, heating, etc.), gated by
_skip_domain_specific_dynamics flag for kinematics-only mode.

Migrate all 15 domain envs to override _domain_specific_step() instead
of step(). Envs with pre-step logic (coffee, switch, blocks, cover)
still override step() for the pre-step part only.
Document the step_base → domain_specific_step → get_observation flow,
_skip_domain_specific_dynamics flag, and _domain_specific_step as an
optional override.
Replace direct access to private _skip_domain_specific_dynamics
attribute with a public constructor parameter, so callers declare
kinematics-only mode at creation time instead of mutating internal
state after construction.
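The layering of step() can be sketched with a toy env. This is illustrative only. `ToyPyBulletEnv`, `ToyBoilEnv`, and the constructor argument name `skip_domain_specific_dynamics` are hypothetical, since the commit does not name the public parameter. The base `step()` always runs `_step_base`, and runs `_domain_specific_step` only when the env was not built in kinematics-only mode.

```python
from typing import Dict


class ToyPyBulletEnv:
    """Hypothetical stand-in for PyBulletEnv; only the step() layering."""

    def __init__(self, skip_domain_specific_dynamics: bool = False) -> None:
        # Declared at construction time instead of mutated afterwards.
        self._skip_domain_specific_dynamics = skip_domain_specific_dynamics
        self.state: Dict[str, float] = {"robot_x": 0.0}

    def step(self, action: float) -> Dict[str, float]:
        self._step_base(action)  # robot control, physics, grasp
        if not self._skip_domain_specific_dynamics:
            self._domain_specific_step()  # water filling, heating, ...
        return self._get_observation()

    def _step_base(self, action: float) -> None:
        self.state["robot_x"] += action

    def _domain_specific_step(self) -> None:
        """Optional override; the base class adds no extra dynamics."""

    def _get_observation(self) -> Dict[str, float]:
        return dict(self.state)


class ToyBoilEnv(ToyPyBulletEnv):
    """Overrides only the domain-specific part, not step() itself."""

    def __init__(self, skip_domain_specific_dynamics: bool = False) -> None:
        super().__init__(skip_domain_specific_dynamics)
        self.state["jug_water"] = 0.0

    def _domain_specific_step(self) -> None:
        # Hypothetical process dynamics: the jug fills a little each step.
        self.state["jug_water"] += 0.1
```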
…ging

Both AgentSessionMixin and AgentExplorer had near-identical wrappers that
ran session.query() synchronously via nest_asyncio or asyncio.run. Move
that logic into a module-level run_query_sync helper in session_manager
and have both callers delegate to it.
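A helper of this kind might look like the sketch below. It is an assumption-laden version, not the session_manager code. The PR uses nest_asyncio when a loop is already running. To stay stdlib-only, this sketch swaps in a different technique: it runs a fresh event loop on a worker thread and blocks on the result. `fake_query` is a hypothetical stand-in for `session.query()`.

```python
import asyncio
import concurrent.futures
from typing import Any, Callable, Coroutine, TypeVar

T = TypeVar("T")


def run_query_sync(make_coro: Callable[[], Coroutine[Any, Any, T]]) -> T:
    """Run an async query to completion from synchronous code (sketch)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop in this thread, so asyncio.run is safe.
        return asyncio.run(make_coro())
    # A loop is already running (e.g. in a notebook). Instead of patching it
    # with nest_asyncio, run a fresh loop on a worker thread and wait for it.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, make_coro()).result()


async def fake_query(prompt: str) -> str:
    """Hypothetical stand-in for session.query()."""
    await asyncio.sleep(0)
    return prompt.upper()
```

Both the mixin and the explorer would then call `run_query_sync(lambda: session.query(...))` rather than keeping their own copies of this branching.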
Distinguishes the grounded-plan explorer from upcoming bilevel variants.
AgentExplorer -> AgentPlanExplorer, get_name() 'agent' -> 'agent_plan',
file moved to agent_plan_explorer.py, and all callers / docstrings /
YAML config examples updated accordingly.
The mixin is pure agent-session plumbing (session creation, lifecycle,
explorer factory) and has no approach-specific logic, so it belongs
next to session_manager.py, tools.py, and the sandbox managers rather
than in approaches/.
The explorer asks a Claude agent for a plan sketch, refines it against
the approach's current (possibly learned) option model, and rolls the
refined plan out in the real env. When the mental model disagrees with
reality — e.g. the sketch expects JugFilled after a Wait but the mental
model's process dynamics can't produce it — the explorer truncates the
plan at the deepest unsatisfiable subgoal (inclusive) so the real-env
rollout ends exactly where the disagreement occurs, maximising signal
per experiment.
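The truncation rule can be illustrated with a small recorder. The `on_step_fail` callback is named in the PR, but `DeepestSubgoalFailure`, `experiment_plan`, and the option names below are hypothetical. The recorder keeps the prefix that ends at the deepest step whose subgoal could not be satisfied, including that step, and uses it only when full refinement fails.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class DeepestSubgoalFailure:
    """Hypothetical on_step_fail recorder: keeps the deepest failing prefix."""

    depth: int = -1
    prefix: List[str] = field(default_factory=list)

    def on_step_fail(self, step_idx: int, plan_so_far: Sequence[str]) -> None:
        # Inclusive truncation: keep the option whose subgoal just failed,
        # so the real-env rollout ends exactly at the disagreement.
        if step_idx > self.depth:
            self.depth = step_idx
            self.prefix = list(plan_so_far[:step_idx + 1])


def experiment_plan(refined: Optional[List[str]],
                    tracker: DeepestSubgoalFailure) -> List[str]:
    """Use the full refinement if it succeeded, else the deepest prefix."""
    if refined is not None:
        return refined
    return tracker.prefix


tracker = DeepestSubgoalFailure()
# Failures arrive in whatever order backtracking visits them.
tracker.on_step_fail(1, ["PickJug", "PlaceUnderFaucet"])
tracker.on_step_fail(2, ["PickJug", "PlaceUnderFaucet", "Wait"])
tracker.on_step_fail(0, ["PickJug"])
truncated = experiment_plan(None, tracker)
```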

Key pieces:

- predicators/agent_sdk/bilevel_sketch.py: extracted the sketch build
  / parse / refine helpers from AgentBilevelApproach as module-level
  functions so both the approach (solve path) and the new explorer
  (exploration path) can share them. refine_sketch gains
  truncate_on_subgoal_fail: the on_step_fail callback snapshots the
  deepest subgoal failure seen during backtracking, and on exhaustion
  the captured prefix is returned as the experiment plan.

- predicators/explorers/agent_bilevel_explorer.py: new explorer.
  Reads option_model from tool_context (synced by the approach),
  builds the sketch prompt via bilevel_sketch, runs refine_sketch with
  check_subgoals=True, check_final_goal=False and
  truncate_on_subgoal_fail=True, wraps the result in an
  option_plan_to_policy that converts
  OptionExecutionFailure into RequestActPolicyFailure so the episode
  cleanly terminates at the point of real-env divergence. Stashes the
  sketch subgoals/options on ToolContext for downstream diffing by
  the learning approach.

- predicators/approaches/agent_bilevel_approach.py: shim methods over
  bilevel_sketch; behaviour unchanged.

- predicators/approaches/agent_planner_approach.py: _create_explorer
  dispatches both "agent_plan" and "agent_bilevel" through the agent
  factory path and forwards CFG.explorer as the name.

- predicators/explorers/__init__.py: factory branch merged for the
  two agent-session-backed explorers.

- predicators/agent_sdk/tools.py: ToolContext gains
  last_sketch_subgoals / last_sketch_options fields, populated by the
  explorer and marked TODO for the learning approach to consume.

- tests/explorers/test_agent_bilevel_explorer.py: happy-path, fallback,
  wait-memory-injection, and deepest-subgoal-failure truncation tests.
- New setting agent_bilevel_explorer_max_samples_per_step (default 50),
  separate from the solve-path budget, so the explorer's backtracking
  cost is independently tunable.
- Log the actual experiment plan (option names, objects, params) after
  refinement so the explorer's output is visible alongside the
  existing sketch/truncation log lines.
- Test config updated to set both budgets explicitly.
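The failure conversion in the explorer's policy can be sketched as a wrapper. This is a simplified illustration. The two exception classes are local stand-ins for the predicators classes of the same names. Here `option_plan_to_policy` is a toy version in which each "option" emits exactly one action, which is not how the real helper works.

```python
from typing import Callable, Iterator, List

Policy = Callable[[float], float]


class OptionExecutionFailure(Exception):
    """Stand-in for predicators' exception of the same name."""


class RequestActPolicyFailure(Exception):
    """Stand-in for predicators' exception of the same name."""


def option_plan_to_policy(plan: List[Policy]) -> Policy:
    """Toy version: each 'option' maps the state to a single action."""
    remaining: Iterator[Policy] = iter(plan)

    def _policy(state: float) -> float:
        try:
            option = next(remaining)
        except StopIteration:
            raise OptionExecutionFailure("plan exhausted") from None
        return option(state)

    return _policy


def convert_failures(policy: Policy) -> Policy:
    """Turn option failures into a clean 'end this episode' signal."""

    def _wrapped(state: float) -> float:
        try:
            return policy(state)
        except OptionExecutionFailure as e:
            raise RequestActPolicyFailure(str(e)) from e

    return _wrapped


def diverging_option(state: float) -> float:
    # Mimics a real-env divergence, e.g. the jug never reaches JugFilled.
    raise OptionExecutionFailure("JugFilled not reached")


policy = convert_failures(
    option_plan_to_policy([lambda s: s + 1.0, diverging_option]))
```

Raising a policy-level failure rather than letting the option error escape lets the episode stop at the point of divergence without being treated as a crash.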
AgentSimLearningApproach extends AgentBilevelApproach to learn process
dynamics online. Each cycle: the agent synthesizes parameterized
process rules via Claude (using run_python / evaluate_simulator /
test_simulator MCP tools), parameters are fitted via emcee MCMC, and
the learned dynamics are composed with a kinematics-only PyBullet
oracle into a combined option model for plan refinement.

Key pieces:
- predicators/approaches/agent_sim_learning_approach.py: the approach.
  Initialises with a kinematics-only option model (so
  AgentBilevelExplorer sees disagreements at process-dynamic subgoals
  like JugFilled/Boiled), and replaces it with the kin+learned model
  after each successful synthesis cycle.
- predicators/agent_sdk/tools.py: create_synthesis_tools() builds the
  three MCP tools the synthesis agent uses; extra_mcp_tools field and
  get_allowed_tool_list(extra_names=) plumbing lets the approach
  inject them into the session.
- predicators/code_sim_learning/: ParamSpec, fit_params (emcee MCMC),
  compute_mse, LearnedSimulator.
- predicators/ground_truth_models/boil/gt_simulator.py: ground-truth
  process-dynamics simulator for the boil environment.
- tests/: approach and param-fitting tests.
- agents.yaml: comment out agent_bilevel preset, add agent_sim_learning
  with explorer=agent_bilevel and skip_test_until_last_ite_or_early_stopping.
- common.yaml: disable failure/test video recording, set
  num_online_learning_cycles=1 for faster iteration.
Simulation primitives (code_sim_learning/utils.py):
- apply_rules(state, rules, params) → ProcessUpdate
- merge_updates(base_state, updates, process_features) → State
- simulate_step(state, action, base_env, rules, params, features) → State
These replace _build_fitted_step_fn, merge_process_updates,
_sim_fn_from_rules, and the body of _build_combined_simulator.
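The composition of the three primitives might look like the sketch below. Several details are assumptions. A state is modeled as a flat dict keyed by (object, feature). A `ProcessUpdate` holds additive deltas. The base env is replaced by a plain `base_step` callable, and the `features` argument is omitted. `kinematic_step` and `fill_rule` are hypothetical examples.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (object name, feature name)
State = Dict[Key, float]
Params = Dict[str, float]
ProcessUpdate = Dict[Key, float]  # assumed: additive deltas per feature
Rule = Callable[[State, Params], ProcessUpdate]


def apply_rules(state: State, rules: List[Rule],
                params: Params) -> ProcessUpdate:
    """Sum the deltas every rule proposes from the pre-step state."""
    update: ProcessUpdate = {}
    for rule in rules:
        for key, delta in rule(state, params).items():
            update[key] = update.get(key, 0.0) + delta
    return update


def merge_updates(base_state: State, updates: List[ProcessUpdate]) -> State:
    """Overlay process deltas on the base simulator's next state."""
    merged = dict(base_state)
    for update in updates:
        for key, delta in update.items():
            merged[key] = merged.get(key, 0.0) + delta
    return merged


def simulate_step(state: State, action: float,
                  base_step: Callable[[State, float], State],
                  rules: List[Rule], params: Params) -> State:
    """Base (kinematic) step, then learned process dynamics on top."""
    base_next = base_step(state, action)
    return merge_updates(base_next, [apply_rules(state, rules, params)])


def kinematic_step(state: State, action: float) -> State:
    nxt = dict(state)
    nxt[("robot", "x")] += action
    return nxt


def fill_rule(state: State, params: Params) -> ProcessUpdate:
    # Hypothetical rule: water rises at fill_rate while the faucet is on.
    return {("jug", "water"): params["fill_rate"] * state[("faucet", "on")]}


s0: State = {("robot", "x"): 0.0, ("jug", "water"): 0.0, ("faucet", "on"): 1.0}
s1 = simulate_step(s0, 0.5, kinematic_step, [fill_rule], {"fill_rate": 0.1})
```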

GT simulator factory (ground_truth_models):
- GroundTruthSimulatorFactory ABC + get_gt_simulator(env_name) discovery,
  following the existing get_gt_options / get_gt_nsrts pattern.
- PyBulletBoilGroundTruthSimulatorFactory registered in boil/.
- Replaces the hardcoded _load_oracle_simulator in the approach.
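Subclass-based discovery of this kind can be sketched as follows. The ABC's method names (`get_env_names`, `get_rules`), the `pybullet_boil` env name, and the toy factory are assumptions. For brevity, `get_gt_simulator` here returns only a rule list, although a later commit makes the real one return a triple. Concrete factories register themselves just by subclassing.

```python
import abc
import inspect
from typing import Callable, Dict, List, Set, Type

RuleFn = Callable[..., Dict[str, float]]


class GroundTruthSimulatorFactory(abc.ABC):
    """Sketch of the ABC; the real one in ground_truth_models may differ."""

    @classmethod
    @abc.abstractmethod
    def get_env_names(cls) -> Set[str]:
        raise NotImplementedError

    @abc.abstractmethod
    def get_rules(self) -> List[RuleFn]:
        raise NotImplementedError


def _all_subclasses(cls: Type) -> List[Type]:
    out: List[Type] = []
    for sub in cls.__subclasses__():
        out.append(sub)
        out.extend(_all_subclasses(sub))
    return out


def get_gt_simulator(env_name: str) -> List[RuleFn]:
    """Find the concrete factory that claims env_name and return its rules."""
    for factory_cls in _all_subclasses(GroundTruthSimulatorFactory):
        if inspect.isabstract(factory_cls):
            continue
        if env_name in factory_cls.get_env_names():
            return factory_cls().get_rules()
    raise NotImplementedError(f"No ground-truth simulator for {env_name!r}")


class ToyBoilGroundTruthSimulatorFactory(GroundTruthSimulatorFactory):
    """Hypothetical concrete factory; defining it is enough to register it."""

    @classmethod
    def get_env_names(cls) -> Set[str]:
        return {"pybullet_boil"}

    def get_rules(self) -> List[RuleFn]:
        return [lambda state, params: {"jug.water": params["fill_rate"]}]
```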

Oracle ablation flags (settings.py):
- agent_sim_learn_oracle_sim_program: load GT rules, skip synthesis.
- agent_sim_learn_oracle_sim_params: use GT param values, skip MCMC.

Also: kin_env → base_env rename throughout, redundant self._types
assignment removed, process_features computed once in __init__.
- yapf + isort autoformatting applied to all touched files.
- pylint: fix logging-not-lazy in agent_bilevel_explorer, add
  broad-except and reimported disables in agent_sim_learning_approach.
- mypy: fix base/env variable name collision, add type: ignore on
  lambda inference, add return type annotations to GT factory methods.
Use utils.abstract to evaluate expected atoms in low-level search so
that DerivedPredicates — which require a Set[GroundAtom] rather than a
State — are handled correctly alongside regular predicates.
When sequential simulate calls differ only in process features (as in
the combined kinematic+learned simulator), reapplying joint positions
and tearing down/recreating grasp constraints causes visible arm
jitter. Compare robot poses first and skip the kinematic reset path
when they already match.
Factor simulator synthesis into a shared _learn_simulator helper so
that both learn_from_offline_dataset and learn_from_interaction_results
can trigger it on their respective trajectory sources. Also create a
separate headless env for parameter fitting so MCMC's thousands of
_set_state calls don't thrash the GUI env during training.
Converts _build_combined_simulator to an instance method so it can
capture self, recreate the base env on pybullet.error, and retry once.
Also catches pybullet.error in the oracle option model alongside
OptionExecutionFailure. Updates agents.yaml config for testing.
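The retry-once behavior is easier to see in isolation. This sketch makes several assumptions. `CombinedSimulator`, `make_env`, and `num_recreations` are hypothetical names. `SimulatorCrash` stands in for `pybullet.error` so the code runs without PyBullet. It shows why an instance method helps: the method can replace `self._env` and retry exactly once, and a second failure propagates to the caller.

```python
from typing import Any, Callable, Dict, TypeVar

T = TypeVar("T")
Env = Dict[str, Any]


class SimulatorCrash(Exception):
    """Stand-in for pybullet.error so the sketch runs without PyBullet."""


class CombinedSimulator:
    """Hypothetical wrapper that rebuilds its base env once on a crash."""

    def __init__(self, make_env: Callable[[], Env]) -> None:
        self._make_env = make_env
        self._env = make_env()
        self.num_recreations = 0

    def simulate(self, fn: Callable[[Env], T]) -> T:
        try:
            return fn(self._env)
        except SimulatorCrash:
            # Physics server went away: rebuild the env and retry once.
            self._env = self._make_env()
            self.num_recreations += 1
            return fn(self._env)


def step(env: Env) -> str:
    if not env["alive"]:
        raise SimulatorCrash("physics server disconnected")
    return "ok"


envs = iter([{"alive": False}, {"alive": True}])
sim = CombinedSimulator(lambda: next(envs))
result = sim.simulate(step)
```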
… moves

Switch the fitting loss from per-feature MSE to total SSE (drop the /count
in compute_sse) so the Gaussian log-likelihood -0.5*SSE/sigma^2 is in its
correct iid form. The previous MSE form silently rescaled per-observation
noise by sqrt(count), making walker proposals indistinguishable from each
other. Pair this with a wider walker initialization (0.5 * prior_sigma
instead of 1% of init_value) so the swarm covers the prior support and
emcee stretch moves can actually explore.
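The rescaling argument can be checked numerically. In the sketch below, plugging an MSE into $-0.5\,\text{loss}/\sigma^2$ gives the same value as using the SSE with noise $\sigma\sqrt{n}$. The walker initializer mirrors the "0.5 * prior_sigma" spread described above. The function names are hypothetical, and the code does not call emcee.

```python
import math
import random
from typing import List, Sequence


def compute_sse(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Total squared error (no division by the number of observations)."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs))


def gaussian_log_likelihood(sse: float, sigma: float) -> float:
    """iid Gaussian log-likelihood, up to an additive constant."""
    return -0.5 * sse / sigma ** 2


def init_walkers(init_values: Sequence[float], prior_sigmas: Sequence[float],
                 num_walkers: int, rng: random.Random) -> List[List[float]]:
    """Scatter walkers by 0.5 * prior_sigma around the initial guess."""
    return [[v + 0.5 * s * rng.gauss(0.0, 1.0)
             for v, s in zip(init_values, prior_sigmas)]
            for _ in range(num_walkers)]


pred = [1.0, 2.0, 3.0, 4.0]
obs = [1.5, 2.0, 2.0, 4.5]
sse = compute_sse(pred, obs)  # 0.25 + 0 + 1 + 0.25
# Old form: MSE with sigma ...
mse_ll = gaussian_log_likelihood(sse / len(obs), sigma=0.1)
# ... equals SSE with noise inflated by sqrt(n).
inflated_ll = gaussian_log_likelihood(sse, sigma=0.1 * math.sqrt(len(obs)))
walkers = init_walkers([1.0, 5.0], [0.2, 2.0], 32, random.Random(0))
```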
Unifies oracle and agent-synthesized simulators behind one loader:
read_simulator_components pulls PROCESS_RULES, PARAM_SPECS, and
PROCESS_FEATURES out of any namespace (module dict for oracle,
exec_ns for agent), and get_gt_simulator now returns the triple
including features. merge_updates no longer takes process_features
since the rule producer owns that scope.
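One loader can serve both sources because both sources end up as plain dicts. An oracle module can be read through `vars(module)`, and agent code through the namespace passed to `exec`. The sketch below assumes the three names are module-level globals. It represents `PARAM_SPECS` as simple tuples for illustration, although the real `ParamSpec` is a class. The agent program is hypothetical.

```python
from typing import Any, Dict, List, Tuple

_REQUIRED = ("PROCESS_RULES", "PARAM_SPECS", "PROCESS_FEATURES")


def read_simulator_components(
        namespace: Dict[str, Any]) -> Tuple[List[Any], List[Any], Any]:
    """Pull rules, param specs and features from a module dict or exec_ns."""
    missing = [name for name in _REQUIRED if name not in namespace]
    if missing:
        raise ValueError(f"Simulator program is missing: {', '.join(missing)}")
    return (list(namespace["PROCESS_RULES"]), list(namespace["PARAM_SPECS"]),
            namespace["PROCESS_FEATURES"])


AGENT_PROGRAM = '''
def fill(state, params):
    return {("jug", "water"): params["fill_rate"]}

PROCESS_RULES = [fill]
PARAM_SPECS = [("fill_rate", 0.1)]
PROCESS_FEATURES = {("jug", "water")}
'''

exec_ns: Dict[str, Any] = {}
exec(AGENT_PROGRAM, exec_ns)  # agent path; the oracle path would pass vars(module)
rules, specs, features = read_simulator_components(exec_ns)
```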
Replaces hard ``dist < threshold`` indicators in the boil rules with
sigmoid-smoothed gates of width ``_SOFT_EPS``. Without smoothing, the
LM finite-difference Jacobian is ~zero almost everywhere, and the
Hessian identifiability diagnostic is uninformative; emcee also gets
a non-flat likelihood as a side effect. State-dependent gates
(faucet on/off, jug held) stay hard since they don't enter the
parameter likelihood.
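A finite-difference probe shows why the hard gate starves the Jacobian. The sketch treats the distance threshold as the fitted parameter, which is an assumption. `eps=0.05` is an arbitrary value, not the repository's `_SOFT_EPS`. Away from the threshold, the hard gate's derivative is exactly zero, while the sigmoid gate still gives a usable slope.

```python
import math
from typing import Callable


def hard_gate(dist: float, threshold: float) -> float:
    """1.0 when dist < threshold, else 0.0 (piecewise constant)."""
    return 1.0 if dist < threshold else 0.0


def soft_gate(dist: float, threshold: float, eps: float) -> float:
    """Sigmoid approximation of dist < threshold with transition width eps."""
    z = (threshold - dist) / eps
    # Numerically stable logistic: never exponentiate a large positive z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fd_derivative(f: Callable[[float], float], x: float,
                  h: float = 1e-4) -> float:
    """Central finite difference, as an LM-style Jacobian would compute."""
    return (f(x + h) - f(x - h)) / (2 * h)


dist = 0.12  # e.g. jug-to-faucet distance, just outside the threshold
hard_slope = fd_derivative(lambda thr: hard_gate(dist, thr), 0.10)
soft_slope = fd_derivative(lambda thr: soft_gate(dist, thr, eps=0.05), 0.10)
```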
Adds fit_map_lm (Levenberg-Marquardt MAP estimate via SciPy TRF) and
log_hessian_identifiability (eigendecompose J^T J/sigma^2 + prior
precision to flag sloppy parameter directions). Both run as a single
LM pass before MCMC; fit_params now centers walkers on theta_map
when code_sim_learning_warm_start_with_lm is set, and short-circuits
to it directly when num_mcmc_steps == 0. Also adds compute_residuals
(per-feature residual vector LM consumes) and log_sse_breakdown
(per-(type, feature) SSE so we can see which features dominate the
loss). Two CFG flags gate the new behavior:
warm_start_with_lm (default True), log_hessian_identifiability
(default False).
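The identifiability check can be illustrated for two parameters, where the eigendecomposition has a closed form. This is a sketch of the diagnostic's idea, not `log_hessian_identifiability` itself. It does not run the LM fit. The function names, the `ratio_tol` threshold, and the toy models are hypothetical. The product model can only observe `rate * gain`, so one eigenvalue of $J^\top J/\sigma^2 + \mathrm{diag}(1/\sigma_{\text{prior}}^2)$ collapses to roughly the prior precision.

```python
import math
from typing import Callable, List, Sequence, Tuple

Residuals = Callable[[Sequence[float]], List[float]]


def fd_jacobian(residual_fn: Residuals, theta: Sequence[float],
                h: float = 1e-6) -> List[List[float]]:
    """Central-difference Jacobian: rows = residuals, columns = parameters."""
    columns = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += h
        down[j] -= h
        columns.append([(a - b) / (2 * h)
                        for a, b in zip(residual_fn(up), residual_fn(down))])
    return [list(row) for row in zip(*columns)]


def posterior_precision(jac: List[List[float]], sigma: float,
                        prior_sigmas: Sequence[float]) -> List[List[float]]:
    """Gauss-Newton approximation J^T J / sigma^2 + diag(1 / prior_sigma^2)."""
    n = len(prior_sigmas)
    return [[sum(r[i] * r[k] for r in jac) / sigma ** 2 +
             (1.0 / prior_sigmas[i] ** 2 if i == k else 0.0)
             for k in range(n)] for i in range(n)]


def eig_sym_2x2(m: List[List[float]]) -> Tuple[float, float]:
    """Eigenvalues (min, max) of a symmetric 2x2 matrix."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mid = 0.5 * (a + c)
    rad = math.hypot(0.5 * (a - c), b)
    return mid - rad, mid + rad


def is_sloppy(residual_fn: Residuals, theta: Sequence[float], sigma: float,
              prior_sigmas: Sequence[float], ratio_tol: float = 1e-3) -> bool:
    """Flag a direction whose curvature is tiny relative to the stiffest."""
    jac = fd_jacobian(residual_fn, theta)
    lo, hi = eig_sym_2x2(posterior_precision(jac, sigma, prior_sigmas))
    return lo / hi < ratio_tol


XS = [1.0, 2.0, 3.0]
YS = [2.0 * x for x in XS]


def product_model(th: Sequence[float]) -> List[float]:
    # Only rate * gain is observable, so one direction is sloppy.
    return [th[0] * th[1] * x - y for x, y in zip(XS, YS)]


def linear_model(th: Sequence[float]) -> List[float]:
    # Slope and offset are separately identifiable.
    return [th[0] * x + th[1] - y for x, y in zip(XS, YS)]
```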
The agent now declares its own PROCESS_FEATURES alongside
PROCESS_RULES and PARAM_SPECS, and the loss is scoped to that
declaration (instead of every feature on every type). Before
synthesis, the approach runs the base sim on each transition and
flags (type, feat) pairs whose prediction diverges from the
observation on at least min_hits triples; this set is sent to the
agent as a starting hint and used as the eval/test scope until the
agent overrides it. The base-sim prediction is precomputed once
into base_pred_triples so MCMC's inner loop only evaluates the
cheap apply_rules step. create_synthesis_tools now takes the
precomputed triples plus the inferred hint, drops the live base_env,
and reads PROCESS_FEATURES from exec_ns each call (falling back to
the hint when undeclared).
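The hint inference can be sketched as a divergence count. The key layout (type, object, feature), the `atol` parameter, and the helper names are assumptions. The base sim runs once per transition to build the precomputed triples. A (type, feature) pair enters the hint when at least `min_hits` transitions show a mismatch between the base prediction and the observation.

```python
from typing import Callable, Dict, List, Set, Tuple

Key = Tuple[str, str, str]  # (type name, object name, feature name)
State = Dict[Key, float]
Triple = Tuple[State, float, State]  # (state, action, observed next state)


def precompute_base_predictions(
        triples: List[Triple],
        base_step: Callable[[State, float], State]
) -> List[Tuple[State, State, State]]:
    """Run the base sim once per transition: (state, base_pred, observed)."""
    return [(s, base_step(s, a), nxt) for s, a, nxt in triples]


def infer_process_features(base_pred_triples: List[Tuple[State, State, State]],
                           atol: float, min_hits: int) -> Set[Tuple[str, str]]:
    """(type, feature) pairs mispredicted on at least min_hits transitions."""
    hits: Dict[Tuple[str, str], int] = {}
    for _, pred, obs in base_pred_triples:
        diverged = {(t, f) for (t, o, f), v in obs.items()
                    if abs(pred[(t, o, f)] - v) > atol}
        for pair in diverged:  # count each pair at most once per transition
            hits[pair] = hits.get(pair, 0) + 1
    return {pair for pair, n in hits.items() if n >= min_hits}


def base_step(state: State, action: float) -> State:
    # Hypothetical kinematics-only base sim: moves the robot, nothing else.
    nxt = dict(state)
    nxt[("robot", "robot0", "x")] += action
    return nxt


s0 = {("robot", "robot0", "x"): 0.0, ("jug", "jug0", "water"): 0.0}
s1 = {("robot", "robot0", "x"): 0.5, ("jug", "jug0", "water"): 0.1}
s2 = {("robot", "robot0", "x"): 0.5, ("jug", "jug0", "water"): 0.2}
base_preds = precompute_base_predictions([(s0, 0.5, s1), (s1, 0.0, s2)],
                                         base_step)
hint = infer_process_features(base_preds, atol=1e-6, min_hits=2)
```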
LM warm start alone matches the parameter fit for the current
boil oracle program; emcee's MAP-of-walkers cannot improve on it
within a 500-step budget and routinely lands at higher SSE. Setting
num_mcmc_steps to 0 with warm_start_with_lm enabled therefore returns
the LM theta_map directly.
Cleans up line-wrap and docstring drift across the sim-learning
branch so the autoformat CI check is satisfied. Bundles the
formatting-only changes for cogman, pybullet_boil, and utils that
earlier branch commits left behind, plus minor wraps across the
new sim-learning code.
``BaseEnv`` doesn't declare ``_physics_client_id`` (only PyBullet
subclasses do), and ``_recreate_base_env`` reads it best-effort
inside a try block. Bind to a local with type:ignore so mypy stops
flagging the access without affecting runtime.
The simulator callback signature must match StepSimulatorFn's
(state, action, params) shape even though apply_rules doesn't use
the action. Renaming to _action signals intent and silences pylint's
unused-argument check.
Replace the all-or-nothing kinematic-match gate with a per-component
diff: robot pose, each object pose, and held-object identity are each
compared against the live PyBullet world and only re-written when they
actually differ. _robot_matches_state now compares at the joint level
(the prior EE-quaternion path hard-coded roll=0, which spuriously
mismatched whenever the wrist had any roll and forced a full reset on
every simulate() call). reset_state honors caller-provided
joint_positions only when they reconstruct the requested EE pose,
falling back to IK otherwise.
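The per-component diff amounts to deciding which writes are needed before touching PyBullet. In the sketch, `WorldSnapshot` and `plan_state_writes` are hypothetical. The robot is compared at the joint level, as the commit describes. The default `atol=1e-3` reflects the tolerance adopted in a later commit on this branch, not the value at this point.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

Pose = Tuple[float, float, float]


@dataclass
class WorldSnapshot:
    """Hypothetical summary of either the requested State or live PyBullet."""
    joints: Sequence[float]
    object_poses: Dict[str, Pose]
    held_object: Optional[str]


def _close(a: Sequence[float], b: Sequence[float], atol: float) -> bool:
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=atol) for x, y in zip(a, b))


def plan_state_writes(requested: WorldSnapshot, live: WorldSnapshot,
                      atol: float = 1e-3) -> List[str]:
    """Return only the components that actually need re-writing."""
    writes: List[str] = []
    if not _close(requested.joints, live.joints, atol):
        writes.append("robot")
    for name, pose in requested.object_poses.items():
        live_pose = live.object_poses.get(name)
        if live_pose is None or not _close(pose, live_pose, atol):
            writes.append(f"object:{name}")
    if requested.held_object != live.held_object:
        writes.append("grasp")
    return writes


live = WorldSnapshot([0.0, 0.5], {"jug": (0.1, 0.2, 0.0),
                                  "cup": (0.3, 0.3, 0.0)}, None)
moved_cup = WorldSnapshot([0.0, 0.5], {"jug": (0.1, 0.2, 0.0),
                                       "cup": (0.35, 0.3, 0.0)}, None)
```

Skipping unchanged components is what removes the jitter: the robot and grasp constraint are left alone when only an object moved.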
_remake_cups creates fresh PyBullet bodies that need to be teleported
to their state-specified poses; the per-component diff in _set_state
now skips objects whose pose already matches PyBullet, so the explicit
_reset_single_object calls ensure freshly-recreated bodies land in the
right place. Same treatment for plugs when coffee_machine_has_plug.
The lambda used to capture predicates at __init__ time, which missed
predicates invented later (grammar search) and broke subclasses whose
_get_current_predicates depends on attributes not yet set during
super().__init__().
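Both failure modes of the eager lambda can be reproduced in a few lines. The classes are hypothetical and the real lambda's target is not shown in the PR. An eagerly computed snapshot misses predicates added later. It also runs a subclass override before that subclass has finished `__init__`. Deferring the call into the lambda body fixes both.

```python
from typing import Callable, FrozenSet, Set


class Approach:
    """Hypothetical base approach whose predicate set can change later."""

    def __init__(self, eager: bool) -> None:
        self._predicates: Set[str] = {"Holding", "JugFilled"}
        self.predicates_fn: Callable[[], FrozenSet[str]]
        if eager:
            # Buggy: evaluated now, so the set is frozen, and an overridden
            # method runs before the subclass has finished its __init__.
            snapshot = frozenset(self._get_current_predicates())
            self.predicates_fn = lambda: snapshot
        else:
            # Fixed: the method is called each time predicates_fn() is.
            self.predicates_fn = lambda: frozenset(
                self._get_current_predicates())

    def _get_current_predicates(self) -> Set[str]:
        return self._predicates


class LearnedPredicatesApproach(Approach):
    """Subclass whose override needs an attribute set after super()."""

    def __init__(self, eager: bool) -> None:
        super().__init__(eager)
        self._invented: Set[str] = set()

    def _get_current_predicates(self) -> Set[str]:
        return self._predicates | self._invented
```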
Terminology cleanup to match how skip_process_dynamics is described
elsewhere; the env wraps the full base sim, not just kinematics.
The fast-path joint-match check used atol=1e-2, which let a caller's
initial_joint_positions hint be silently treated as "already there"
when live joints were within 1e-2 of initial — leaving the EE pose up
to ~3e-3 off the requested state. State.allclose compares features at
1e-3, so the test then failed reconstruction. Match the State.allclose
tolerance.

Also pick up trailing yapf reformatting in two approach files.
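The tolerance mismatch is easy to reproduce with a joint-level fast-path check. `robot_matches_state` below is a simplified stand-in for `_robot_matches_state`, and the joint values are made up. A uniform 5e-3 drift passes at `atol=1e-2` and so is wrongly treated as "already there". At 1e-3, the tolerance the message attributes to `State.allclose`, it is caught.

```python
import math
from typing import Sequence

STATE_ALLCLOSE_ATOL = 1e-3  # per the commit message above


def robot_matches_state(live_joints: Sequence[float],
                        target_joints: Sequence[float],
                        atol: float) -> bool:
    """Fast-path check: is the live robot already at the target joints?"""
    return len(live_joints) == len(target_joints) and all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=atol)
        for a, b in zip(live_joints, target_joints))


target = [0.0, 0.8, -0.4]
drifted = [j + 5e-3 for j in target]  # ~5e-3 drift, as in the new test
```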
…st-split

Both tests pass on master and in isolation but fail on shards 6/8 of CI
on this branch. The branch's new tests shifted pytest-split's
least_duration distribution so existing tests landed in different shards
than on master, exposing pre-existing fragility:

- test_glib_explorer[Holding]: score_fn returned 0 (not -inf) for
  non-target goals, so they weren't filtered. With cover's 7-atom
  dynamic universe and 10 babbles, ~3.5% of seeds sample no Holding goal
  and the explorer falls through to a Covers goal, leaving the final
  state without Holding. Bumped glib_num_babbles to 100 and switched
  the test's score_fn to return -inf for non-target so the explorer
  never plans toward an off-target predicate.

- test_demo_dataset_loading[10-True-oracle-...]: _ensure_cover_demo_data_exists
  only checked file existence. test_demo_dataset's
  max_initial_demos block writes a 3-trajectory dataset under the
  cover__demo__oracle__7__... name; the [10-...] case then loaded 3 +
  generated 3 = 6, expected 10. Added a trajectory-count check so the
  helper regenerates partial files.
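The fix to the demo helper comes down to checking contents, not just existence. The sketch stores demos as JSON for simplicity. `ensure_demo_data`, `fake_generate`, and the file name are hypothetical. The real helper and its on-disk format may differ. A stale 3-trajectory file is regenerated when 10 trajectories are requested.

```python
import json
import os
import tempfile
from typing import Callable, List


def ensure_demo_data(path: str, num_demos: int,
                     generate: Callable[[int], List[list]]) -> List[list]:
    """Load demos from path, regenerating if missing or the wrong size."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            demos = json.load(f)
        if len(demos) == num_demos:  # existence alone is not enough
            return demos
    demos = generate(num_demos)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demos, f)
    return demos


def fake_generate(n: int) -> List[list]:
    return [[f"traj{i}"] for i in range(n)]


tmp_path = os.path.join(tempfile.mkdtemp(), "cover__demo__oracle.json")
partial = ensure_demo_data(tmp_path, 3, fake_generate)  # writes 3
full = ensure_demo_data(tmp_path, 10, fake_generate)  # stale 3 -> regenerate
```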
…ects

- test_robot_matches_state_atol_forces_reset_on_small_drift: locks in
  the 1e-3 atol regression. A ~5e-3 joint drift (within the previous
  1e-2 tolerance, outside the new 1e-3) must NOT be treated as
  "already there" by the fast-path; _set_state must move the robot
  back to the requested EE pose at State.allclose precision.

- tests/pybullet_helpers/test_objects.py (new): coverage for
  sample_collision_free_2d_positions, used by 3 PyBullet envs but
  previously without direct tests. Covers no-overlap (circles and
  rectangles), bounds, reproducibility across seeds, RuntimeError on
  impossible packing, and ValueError on unknown shape_type.
@yichao-liang yichao-liang marked this pull request as ready for review May 5, 2026 09:29
@yichao-liang yichao-liang merged commit e551684 into master May 5, 2026
14 checks passed
@yichao-liang yichao-liang deleted the sim-learning branch May 5, 2026 09:37