Stage ID is always 0 (NO_STAGE) in all parsed training data #40

Description

@ScavieFae

Bug

Every game in parsed-v2/games/ appears to have stage=0 (NO_STAGE). A random sample of 200 games came back 200/200 with stage 0, which points to 100% of the training data being affected.

Impact

The world model's stage embedding is useless. It receives the same stage=0 token for every game, so it can't distinguish between stages. This means:

  • No platform awareness: Model sees players standing at y=27 (Battlefield platforms) and y=0 (FD ground) and has no signal for why. It learns a blurred average of all stage geometries.
  • No stage-specific blast zones: Yoshi's Story's horizontal blast zones are tiny (-175 to 173), while Dreamland's are huge (-255 to 255). The model can't learn stage-specific KO thresholds.
  • Randall noise: Yoshi's Story has a moving cloud platform (Randall). Without stage context, players standing on it look like they're hovering in mid-air.
  • Platform tech patterns: BF/Dreamland tech chase patterns don't exist on FD, but the model mixes them all together.

Despite this, the model reaches 97.6% action accuracy and 0.84 average position error on teacher-forced evaluation. That is impressive, but some of the remaining position error likely comes from stage confusion.
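To make the "useless embedding" point concrete, here is a minimal sketch in plain Python. The table size and embedding width are made-up values (the real model's dimensions aren't stated here), and the non-zero stage IDs are placeholders rather than real stage IDs. It only illustrates that a lookup with a constant index returns the same vector for every game, so the stage input behaves like a fixed bias.

```python
import random

# Hypothetical sizes: the real stage vocabulary and embedding width
# are not given in this issue.
NUM_STAGES = 32
EMBED_DIM = 4

rng = random.Random(0)
# One row of (would-be learnable) weights per stage ID.
stage_table = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_STAGES)
]

def embed(stage_ids):
    """Look up one embedding row per game."""
    return [stage_table[s] for s in stage_ids]

# What the model sees today: every game carries stage=0.
batch = embed([0, 0, 0, 0])
distinct_rows = {tuple(row) for row in batch}
print(len(distinct_rows))  # 1: the stage input is the same vector for every game

# With real stage IDs, games on different stages would get different rows.
# The IDs below are placeholders, not real stage IDs.
mixed = embed([3, 7, 3, 11])
distinct_mixed = {tuple(row) for row in mixed}
print(len(distinct_mixed))  # 3
```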

Reproduction

from worldmodel.data.parse import load_game
from pathlib import Path
from collections import Counter
import random

games_dir = Path("~/claude-projects/nojohns-training/data/parsed-v2/games").expanduser()
all_games = [g for g in games_dir.iterdir() if not g.name.startswith('.')]
sample = random.sample(all_games, 200)
stage_counts = Counter()
for gf in sample:
    g = load_game(str(gf))
    stage_counts[g.stage] += 1

print(stage_counts)
# Counter({0: 200}): all 200 sampled games have stage 0
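A sample can't strictly prove that every game is affected, so it may help to quantify what 200/200 does rule out. The sketch below is a back-of-the-envelope bound, not part of the project: it treats the draws as independent (random.sample draws without replacement, which only makes the true bound slightly tighter) and solves (1 - p)^n = 1 - confidence for p. The function name is made up for illustration.

```python
def max_undetected_fraction(sample_size: int, confidence: float = 0.95) -> float:
    """Largest fraction p of games with stage != 0 for which an all-zero
    sample of this size still has probability >= 1 - confidence.

    Solves (1 - p) ** sample_size = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)

bound = max_undetected_fraction(200)
print(f"{bound:.2%}")  # 1.49%
```

In other words, at 95% confidence no more than about 1.5% of games can have a valid stage ID, which matches the "rule of three" approximation 3/n.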

Root cause

load_game() in worldmodel/data/parse.py:236 reads:

stage = root.field("stage")[0].as_py()

The data is genuinely 0 in the parquet files — the upstream Slippi parser that produced parsed-v2/ didn't extract the stage ID from the replay metadata. The stage field exists in the schema but was written as 0 for every game.
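Since the placeholder was written silently, a coverage check at dataset-build or load time would probably have caught this earlier. The following is a hypothetical guard, not existing project code: `check_stage_coverage` and its threshold parameter are invented names, and it only assumes a Counter of stage IDs like the one built in the reproduction script.

```python
from collections import Counter

NO_STAGE = 0  # the placeholder value described in this issue

def check_stage_coverage(stage_counts: Counter, max_no_stage_fraction: float = 0.0) -> None:
    """Raise ValueError if too many games carry the NO_STAGE placeholder."""
    total = sum(stage_counts.values())
    if total == 0:
        raise ValueError("no games to check")
    missing = stage_counts.get(NO_STAGE, 0)
    fraction = missing / total
    if fraction > max_no_stage_fraction:
        raise ValueError(
            f"{missing}/{total} games ({fraction:.1%}) have stage={NO_STAGE}"
        )
```

Calling `check_stage_coverage(stage_counts)` on the Counter from the reproduction script would raise, reporting 200/200 games with stage=0.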

Fix

Need to either:

  1. Fix the Slippi→parquet parser to extract game_start.stage from replay metadata, then re-parse all 22K+ games
  2. Or backfill: read stage from raw .slp files and patch the existing parquet data

Option 1 is a re-parse of the full dataset and option 2 still has to touch every game, so either way it's a meaningful operation. The fix itself should be straightforward once we find where stage extraction was dropped.
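For option 2, the backfill could be split into a read-only planning pass followed by the actual parquet patch. Below is a rough sketch of the planning pass only. Everything here is an assumption: `plan_stage_backfill` is a made-up name, `read_stage` stands in for whatever Slippi replay reader ends up being used, and matching replays to parsed games by file stem is a guess about the naming scheme.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

def plan_stage_backfill(
    slp_paths: Iterable[Path],
    read_stage: Callable[[Path], int],  # hypothetical: returns the stage ID from one .slp
    no_stage: int = 0,
) -> Tuple[Dict[str, int], List[str]]:
    """Collect {game_id: stage} patches without modifying any data.

    Assumes a parsed game can be matched to its replay by file stem.
    """
    patches: Dict[str, int] = {}
    failures: List[str] = []
    for path in slp_paths:
        try:
            stage = read_stage(path)
        except Exception as exc:  # a corrupt replay shouldn't abort the whole pass
            failures.append(f"{path.name}: {exc}")
            continue
        if stage == no_stage:
            failures.append(f"{path.name}: replay itself reports stage={no_stage}")
            continue
        patches[path.stem] = stage
    return patches, failures
```

Keeping the plan separate from the write means the stage distribution can be inspected, and the failures reviewed, before any of the 22K+ games are touched.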

Discovery

Found during demo work — the viewer was showing "Stage 0" with fallback geometry for every game. Investigated expecting a viewer bug, turned out to be training data.
