Dear all,
I am trying to integrate V-JEPA2-AC into an Isaac Lab / Omniverse manipulation simulation with a Franka arm and cubes.
The robot uses V-JEPA2-AC as the only action source. There are no scripted pick/place phases.
Setup
Simulator: Isaac Lab / Omniverse
Robot: Franka Panda
Task: Isaac-Stack-Cube-Franka-IK-Rel-Blueprint-v0
Control mode: IK relative action task
Camera: external RGB camera, droid_exterior_image_1_left
Policy source: V-JEPA2-AC only
Hyperparameters
rollout: 2
samples: 64
topk: 5
cem_steps: 10
momentum_mean: 0.15
momentum_mean_gripper: 0.15
momentum_std: 0.25
momentum_std_gripper: 0.15
maxnorm: 0.075
verbose: false
Goal
Simulation screenshot
Issue
The gripper moves toward the cube, but consistently stops a few centimeters above it. It does not descend far enough to make contact or grasp the cube.
I tried using:
- A target image where the cube has been moved
- A target image where the gripper is already holding the cube.
In both cases, the behavior is similar: the robot approaches the cube but does not complete the final descent / grasp.
Expected Behavior
Given a target image where the cube has been moved, I would expect the model/controller to produce actions that:
- Move the gripper above the cube,
- Descend to the cube,
- Close the gripper,
- Lift or move the cube toward the target state.
Actual behavior
The arm moves toward the cube, but the end-effector hovers slightly above or next to it and does not complete the grasp.
The failure mode is very consistent: the approach motion works, but the final downward/contact/grasp phase does not happen. Additionally, the gripper does not open and close consistently.
What I tried
- Increasing rollout up to
4.
- Increasing number of samples up to
256.
- Increasing number of CEM steps up to
30
Current integration
Inference
I call V-JEPA 2-AC as follows:
...
with torch.no_grad():
z_current = self.encode_single_frame(current_rgb)
s = torch.tensor(
ee_state7.reshape(1, 1, 7),
dtype=torch.float32,
device=self.device,
)
action = self.world_model.infer_next_action(
z_current,
s,
self.z_goal,
)
action_np = action.detach().float().cpu().numpy().reshape(-1)
The returned action is mapped directly to the Isaac Lab action:
actions[env_id, 0:3] = pos_gain * vjepa_action.delta_xyz
actions[env_id, 3:6] = rot_gain * vjepa_action.delta_rpy
actions[env_id, -1] = gripper_gain * vjepa_action.gripper
with:
pos_gain = 1.0
rot_gain = 1.0
gripper_gain = 1.0
Minimal controller code
@dataclass
class VJepaAction:
delta_xyz: np.ndarray
delta_rpy: np.ndarray
gripper: float
class VJEPAMPC:
def __init__(self, ...):
...
def encode_single_frame(self, rgb: np.ndarray):
...
def infer_action(
self,
current_rgb: np.ndarray,
ee_state7: np.ndarray,
) -> VJepaAction:
if not self.available:
raise RuntimeError("V-JEPA2-AC is unavailable.")
current_rgb = to_uint8_rgb(current_rgb)
with torch.no_grad():
...
return parse_vjepa_action(action_np)
class VJepaIsaacLabActionController:
def __init__(self, ...):
...
def _make_empty_actions(self):
...
def _get_camera_rgb_batch(self):
...
def _get_ee_state7_batch(self):
...
def forward(self):
if not self.vjepa.available:
raise RuntimeError("V-JEPA2-AC is unavailable.")
rgb_batch = self._get_camera_rgb_batch()
ee_state7_batch = self._get_ee_state7_batch()
actions = self._make_empty_actions()
for env_id in range(self.num_envs):
vjepa_action = self.vjepa.infer_action(
current_rgb=rgb_batch[env_id],
ee_state7=ee_state7_batch[env_id],
)
delta_xyz = np.asarray(
vjepa_action.delta_xyz,
dtype=np.float32,
).reshape(3)
delta_rpy = np.asarray(
vjepa_action.delta_rpy,
dtype=np.float32,
).reshape(3)
gripper = float(vjepa_action.gripper)
actions[env_id, 0:3] = torch.as_tensor(
self.pos_gain * delta_xyz,
device=self.device,
dtype=torch.float32,
)
actions[env_id, 3:6] = torch.as_tensor(
self.rot_gain * delta_rpy,
device=self.device,
dtype=torch.float32,
)
actions[env_id, -1] = self.gripper_gain * gripper
return actions
Questions
- Could this behavior be caused by unsuitable inference/planning hyperparameters (e.g. planning horizon, number of candidate actions, etc.)?
- Could this behavior be caused by an integration convention mismatch?
- Is this expected out-of-distribution behavior? If so, would you recommend post-training the model?
- I also tried a target image where the gripper is already holding the cube. Is this a supported / recommended type of target for V-JEPA2-AC? Or should the target image show only the final object configuration, without requiring the model to infer grasp contact from the image?
- Does V-JEPA2-AC require a specific camera viewpoint, and are the original DROID camera conventions important for successful grasping?
Thank you very much in advance for you help and insights!
Dear all,
I am trying to integrate V-JEPA2-AC into an Isaac Lab / Omniverse manipulation simulation with a Franka arm and cubes.
The robot uses V-JEPA2-AC as the only action source. There are no scripted pick/place phases.
Setup
Simulator: Isaac Lab / Omniverse
Robot: Franka Panda
Task: Isaac-Stack-Cube-Franka-IK-Rel-Blueprint-v0
Control mode: IK relative action task
Camera: external RGB camera, droid_exterior_image_1_left
Policy source: V-JEPA2-AC only
Hyperparameters
Goal
Simulation screenshot
Issue
The gripper moves toward the cube, but consistently stops a few centimeters above it. It does not descend far enough to make contact or grasp the cube.
I tried using:
In both cases, the behavior is similar: the robot approaches the cube but does not complete the final descent / grasp.
Expected Behavior
Given a target image where the cube has been moved, I would expect the model/controller to produce actions that:
Actual behavior
The arm moves toward the cube, but the end-effector hovers slightly above or next to it and does not complete the grasp.
The failure mode is very consistent: the approach motion works, but the final downward/contact/grasp phase does not happen. Additionally, the gripper does not open and close consistently.
What I tried
4.256.30Current integration
Inference
I call V-JEPA 2-AC as follows:
The returned action is mapped directly to the Isaac Lab action:
with:
Minimal controller code
Questions
Thank you very much in advance for you help and insights!