Skip to content

V-JEPA2-AC in Isaac Lab / Omniverse: Franka arm stops above cube without grasping #169

@sinansoulier

Description

@sinansoulier

Dear all,

I am trying to integrate V-JEPA2-AC into an Isaac Lab / Omniverse manipulation simulation with a Franka arm and cubes.

The robot uses V-JEPA2-AC as the only action source. There are no scripted pick/place phases.

Setup

Simulator: Isaac Lab / Omniverse
Robot: Franka Panda
Task: Isaac-Stack-Cube-Franka-IK-Rel-Blueprint-v0
Control mode: IK relative action task
Camera: external RGB camera, droid_exterior_image_1_left
Policy source: V-JEPA2-AC only

Hyperparameters

rollout: 2
samples: 64
topk: 5
cem_steps: 10
momentum_mean: 0.15
momentum_mean_gripper: 0.15
momentum_std: 0.25
momentum_std_gripper: 0.15
maxnorm: 0.075
verbose: false

Goal

Image

Simulation screenshot

Image

Issue

The gripper moves toward the cube, but consistently stops a few centimeters above it. It does not descend far enough to make contact or grasp the cube.

I tried using:

  1. A target image where the cube has been moved
  2. A target image where the gripper is already holding the cube.

In both cases, the behavior is similar: the robot approaches the cube but does not complete the final descent / grasp.

Expected Behavior

Given a target image where the cube has been moved, I would expect the model/controller to produce actions that:

  1. Move the gripper above the cube,
  2. Descend to the cube,
  3. Close the gripper,
  4. Lift or move the cube toward the target state.

Actual behavior

The arm moves toward the cube, but the end-effector hovers slightly above or next to it and does not complete the grasp.

The failure mode is very consistent: the approach motion works, but the final downward/contact/grasp phase does not happen. Additionally, the gripper does not open and close consistently.

What I tried

  • Increasing rollout up to 4.
  • Increasing number of samples up to 256.
  • Increasing number of CEM steps up to 30

Current integration

Inference

I call V-JEPA 2-AC as follows:

...
with torch.no_grad():
    z_current = self.encode_single_frame(current_rgb)

    s = torch.tensor(
        ee_state7.reshape(1, 1, 7),
        dtype=torch.float32,
        device=self.device,
    )

    action = self.world_model.infer_next_action(
        z_current,
        s,
        self.z_goal,
    )

    action_np = action.detach().float().cpu().numpy().reshape(-1)

The returned action is mapped directly to the Isaac Lab action:

actions[env_id, 0:3] = pos_gain * vjepa_action.delta_xyz
actions[env_id, 3:6] = rot_gain * vjepa_action.delta_rpy
actions[env_id, -1]  = gripper_gain * vjepa_action.gripper

with:

pos_gain = 1.0
rot_gain = 1.0
gripper_gain = 1.0

Minimal controller code

@dataclass
class VJepaAction:
    delta_xyz: np.ndarray
    delta_rpy: np.ndarray
    gripper: float

class VJEPAMPC:
    def __init__(self, ...):
        ...

    def encode_single_frame(self, rgb: np.ndarray):
        ...

    def infer_action(
        self,
        current_rgb: np.ndarray,
        ee_state7: np.ndarray,
    ) -> VJepaAction:
        if not self.available:
            raise RuntimeError("V-JEPA2-AC is unavailable.")

        current_rgb = to_uint8_rgb(current_rgb)

        with torch.no_grad():
            ...

        return parse_vjepa_action(action_np)

class VJepaIsaacLabActionController:
    def __init__(self, ...):
        ...

    def _make_empty_actions(self):
        ...

    def _get_camera_rgb_batch(self):
        ...

    def _get_ee_state7_batch(self):
        ...

    def forward(self):
        if not self.vjepa.available:
            raise RuntimeError("V-JEPA2-AC is unavailable.")

        rgb_batch = self._get_camera_rgb_batch()
        ee_state7_batch = self._get_ee_state7_batch()

        actions = self._make_empty_actions()

        for env_id in range(self.num_envs):
            vjepa_action = self.vjepa.infer_action(
                current_rgb=rgb_batch[env_id],
                ee_state7=ee_state7_batch[env_id],
            )

            delta_xyz = np.asarray(
                vjepa_action.delta_xyz,
                dtype=np.float32,
            ).reshape(3)

            delta_rpy = np.asarray(
                vjepa_action.delta_rpy,
                dtype=np.float32,
            ).reshape(3)

            gripper = float(vjepa_action.gripper)

            actions[env_id, 0:3] = torch.as_tensor(
                self.pos_gain * delta_xyz,
                device=self.device,
                dtype=torch.float32,
            )

            actions[env_id, 3:6] = torch.as_tensor(
                self.rot_gain * delta_rpy,
                device=self.device,
                dtype=torch.float32,
            )

            actions[env_id, -1] = self.gripper_gain * gripper

        return actions

Questions

  • Could this behavior be caused by unsuitable inference/planning hyperparameters (e.g. planning horizon, number of candidate actions, etc.)?
  • Could this behavior be caused by an integration convention mismatch?
  • Is this expected out-of-distribution behavior? If so, would you recommend post-training the model?
  • I also tried a target image where the gripper is already holding the cube. Is this a supported / recommended type of target for V-JEPA2-AC? Or should the target image show only the final object configuration, without requiring the model to infer grasp contact from the image?
  • Does V-JEPA2-AC require a specific camera viewpoint, and are the original DROID camera conventions important for successful grasping?

Thank you very much in advance for you help and insights!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions