Add rollout trace logging with trackio #1360

Open
abidlabs wants to merge 8 commits into areal-project:main from abidlabs:add-trackio-trace-logging

Conversation

@abidlabs

@abidlabs abidlabs commented May 21, 2026

Hi folks! This PR adds trace logging via Trackio, the free, local-first experiment tracking library from Hugging Face 🤗

AReaL already has a Trackio metrics backend, so this PR extends it to log traces as well. Specifically, this PR:

  • added logging rollout trajectories as trackio.Trace records when stats_logger.trackio.mode is enabled
  • added logging evaluation rollout trajectories as Trackio traces from the eval rollout path
  • decoded tensor trajectories into prompt/completion chat messages with reward, step, sequence length, prompt length, and version metadata
  • added stats_logger.trackio.max_rollout_traces_per_step to cap trace volume per step
  • documented Trackio trace logging and added mocked Trackio tests

Here's what it looks like:

image
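
For readers without access to the diff, the decoding step described in the list above could look roughly like the sketch below. It is not the code from this PR: `build_trace_record`, `toy_decode`, and `VOCAB` are hypothetical names, plain lists stand in for tensors, and the prompt is assumed to be the leading run of tokens whose loss mask is 0. Version metadata is left out for brevity. The `messages` layout mirrors the user/assistant pair passed to `trackio.Trace` in the review snippet further down.

```python
from typing import Callable


def build_trace_record(
    input_ids: list[int],
    loss_mask: list[int],
    reward: float,
    step: int,
    decode: Callable[[list[int]], str],
) -> dict:
    """Split one trajectory into prompt/completion text and attach metadata."""
    seqlen = len(input_ids)
    # Assumption: the prompt is the leading run of tokens with loss_mask == 0.
    prompt_len = next((i for i, m in enumerate(loss_mask) if m), seqlen)
    return {
        "messages": [
            {"role": "user", "content": decode(input_ids[:prompt_len])},
            {"role": "assistant", "content": decode(input_ids[prompt_len:])},
        ],
        "metadata": {
            "reward": float(reward),
            "step": step,
            "seqlen": seqlen,
            "prompt_len": prompt_len,
        },
    }


# Toy vocabulary standing in for a real tokenizer (hypothetical).
VOCAB = {1: "What", 2: "is", 3: "2+2?", 4: "It", 5: "is", 6: "4."}


def toy_decode(ids: list[int]) -> str:
    return " ".join(VOCAB[i] for i in ids)


record = build_trace_record(
    input_ids=[1, 2, 3, 4, 5, 6],
    loss_mask=[0, 0, 0, 1, 1, 1],
    reward=1.0,
    step=7,
    decode=toy_decode,
)
print(record["messages"][0]["content"])  # What is 2+2?
print(record["messages"][1]["content"])  # It is 4.
```

In a real run the `decode` argument would presumably be the tokenizer's decode method, and the resulting messages and metadata would be handed to the Trackio trace object.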

AI assistance was used to prepare this PR.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements rollout and evaluation trace logging for the Trackio backend, allowing tensor trajectories to be decoded into human-readable traces with associated metadata. Key changes include the addition of the max_rollout_traces_per_step configuration, implementation of trace decoding in StatsLogger, and integration into the RLTrainer training and evaluation loops. Feedback focuses on optimizing performance by moving GPU-to-CPU tensor transfers for input IDs, masks, rewards, and versions outside of the per-sample processing loop to reduce synchronization overhead.

Comment thread areal/utils/stats_logger.py Outdated
Comment thread areal/utils/stats_logger.py Outdated
Comment thread areal/utils/stats_logger.py Outdated
metadata["reward"] = float(rewards[sample_index].item())
if versions is not None:
    sample_versions = (
        versions[sample_index, :seqlen].detach().cpu().tolist()
Contributor

medium

The versions tensor should also be moved to CPU outside the loop to avoid repeated GPU-to-CPU transfers for each sample.
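
A minimal sketch of the suggested change, under stated assumptions: `CountingTensor` is a hypothetical stand-in that mimics only the `detach()`, `cpu()`, `tolist()`, and row-indexing calls used here, so the example runs without PyTorch and can count simulated device-to-host copies. The two helper functions are illustrative and are not taken from `stats_logger.py`.

```python
class CountingTensor:
    """Stand-in for a 2-D device tensor that counts simulated host copies."""

    cpu_calls = 0  # class-wide counter of .cpu() invocations

    def __init__(self, data):
        self.data = data

    def __getitem__(self, row):
        return CountingTensor(self.data[row])

    def detach(self):
        return self

    def cpu(self):
        CountingTensor.cpu_calls += 1
        return self

    def tolist(self):
        return self.data


def versions_per_sample_slow(versions, seqlens):
    # One transfer per sample: every .cpu() call inside the loop is a copy.
    return [versions[i].detach().cpu().tolist()[:n] for i, n in enumerate(seqlens)]


def versions_per_sample_fast(versions, seqlens):
    # One transfer for the whole batch, then plain Python slicing per sample.
    host_rows = versions.detach().cpu().tolist()
    return [host_rows[i][:n] for i, n in enumerate(seqlens)]


versions = CountingTensor([[0, 0, 1, 1], [2, 2, 2, 0], [3, 3, 0, 0]])
seqlens = [4, 3, 2]

CountingTensor.cpu_calls = 0
slow = versions_per_sample_slow(versions, seqlens)
slow_calls = CountingTensor.cpu_calls

CountingTensor.cpu_calls = 0
fast = versions_per_sample_fast(versions, seqlens)
fast_calls = CountingTensor.cpu_calls

print(slow == fast, slow_calls, fast_calls)  # True 3 1
```

Both variants return the same per-sample lists; the second performs a single copy regardless of batch size, which is the pattern the review asks for.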

Author

Done

@abidlabs abidlabs changed the title Add Trackio rollout trace logging Add rollout trace logging with trackio May 21, 2026
@abidlabs abidlabs marked this pull request as ready for review May 21, 2026 22:34
Comment thread areal/trainer/rl_trainer.py
@sitabulaixizawaluduo
Collaborator

Thanks for your contribution! Please run 'pre-commit' before you submit.

@abidlabs
Author

abidlabs commented May 25, 2026

Ran pre-commit run --all-files and pushed the generated updates. The second pre-commit run passes cleanly. Thanks @sitabulaixizawaluduo!

Comment thread areal/utils/stats_logger.py Outdated
Comment thread areal/utils/stats_logger.py Outdated
trackio.Trace(
messages=[
{"role": "user", "content": prompt},
{"role": "assistant", "content": completion},
Collaborator

Is there a good way to support multi-turn traces?

Author

I added multi-turn/tool trace support by using structured messages when available and reconstructing user/assistant/tool spans from loss_mask otherwise. Here's what it looks like:

image
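
The loss_mask fallback mentioned above can be illustrated with a small sketch. This is an assumption about the general idea, not the PR's implementation: `spans_from_loss_mask` and `messages_from_spans` are hypothetical names, tokens are plain strings, and every masked-out run is labelled "user" because the mask alone cannot tell a user turn from a tool result.

```python
from itertools import groupby


def spans_from_loss_mask(loss_mask):
    """Return (role, start, end) for each contiguous run in loss_mask."""
    spans = []
    position = 0
    for value, run in groupby(loss_mask):
        length = len(list(run))
        # Assumption: trained-on tokens (mask 1) are assistant output;
        # everything else is treated as user context.
        role = "assistant" if value else "user"
        spans.append((role, position, position + length))
        position += length
    return spans


def messages_from_spans(tokens, loss_mask):
    """Build chat-style messages, one per contiguous mask run."""
    return [
        {"role": role, "content": " ".join(tokens[start:end])}
        for role, start, end in spans_from_loss_mask(loss_mask)
    ]


tokens = ["hi", "there", "hello", "!", "and", "now", "bye"]
loss_mask = [0, 0, 1, 1, 0, 0, 1]
messages = messages_from_spans(tokens, loss_mask)
for message in messages:
    print(message["role"], "->", message["content"])
```

Telling tool spans apart from user spans would need extra information, such as the structured messages the reply says are preferred when available.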

@abidlabs
Author

Thanks for the review @sitabulaixizawaluduo @PrometheusComing! Addressed all of the comments and reran the pre-commit. All changes have been pushed.

TaoZex previously approved these changes May 30, 2026
Collaborator

@TaoZex TaoZex left a comment

LGTM

Collaborator

@PrometheusComing PrometheusComing left a comment

LGTM

@abidlabs
Author

abidlabs commented Jun 1, 2026

Thanks folks! @sitabulaixizawaluduo ok to merge?

Comment thread areal/api/cli_args.py Outdated
space_id: str | None = None
"""HF Space ID for remote dashboard deployment (e.g. "user/my-space").
When set, metrics are also pushed to the specified Hugging Face Space."""
max_rollout_traces_per_step: int = 32
Collaborator

The default should leave trace logging disabled, so that behavior does not change for existing users.

@abidlabs abidlabs dismissed stale reviews from PrometheusComing and TaoZex via feeb891 June 5, 2026 14:56
@abidlabs abidlabs force-pushed the add-trackio-trace-logging branch from feeb891 to 008932a Compare June 5, 2026 15:08
@abidlabs
Author

abidlabs commented Jun 5, 2026

Thanks @sitabulaixizawaluduo! Trace logging is now disabled by default, and I have updated the docs to reflect that.
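
One way an opt-in cap can behave is sketched below. The thread does not say how the disabled state is encoded, so this example assumes a value of 0 (or less) turns trace logging off; `TraceLoggingConfig` and `select_trace_indices` are hypothetical names, and only the field name `max_rollout_traces_per_step` comes from the PR.

```python
from dataclasses import dataclass


@dataclass
class TraceLoggingConfig:
    """Hypothetical stand-in for the Trackio section of the logger config."""

    # Assumption: 0 means "log no traces", so existing setups are unaffected.
    max_rollout_traces_per_step: int = 0


def select_trace_indices(batch_size: int, config: TraceLoggingConfig) -> list[int]:
    """Pick which samples in a batch get logged as traces this step."""
    cap = config.max_rollout_traces_per_step
    if cap <= 0:
        return []  # disabled: nothing is decoded or logged
    return list(range(min(batch_size, cap)))


print(select_trace_indices(8, TraceLoggingConfig()))  # []
print(select_trace_indices(8, TraceLoggingConfig(max_rollout_traces_per_step=3)))  # [0, 1, 2]
```

Returning early when the cap is not positive also skips the decoding work entirely, so users who never enable traces pay no extra cost per step.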


4 participants