Questions on real-time performance and chunk definition #3

@komkmm

Hi, thanks for the great work!

I have a question regarding the real-time performance claim.

From the paper, I saw the following:

  • The Generator and Refiner each take ~700 ms, and the VAE requires ~180 ms.
  • The dual 17B DiT backbones run at ~0.35 s per chunk (1-NFE) on a single GPU.

Given these timings, does the full pipeline (Generator + Refiner + VAE) achieve real-time streaming on a single GPU in practice, or does it rely on multiple GPUs?
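
To make the question concrete, here is a rough sketch of how I am adding up the numbers. It assumes the three stages either run strictly one after another on one GPU, or are fully overlapped across separate GPUs; both scheduling models and all names in the snippet are my own assumptions, not something stated in the paper.

```python
# Per-chunk stage latencies quoted from the paper (milliseconds).
GENERATOR_MS = 700
REFINER_MS = 700
VAE_MS = 180

# One chunk covers 1 second of video, so this is the real-time budget.
CHUNK_MS = 1000

stages_ms = [GENERATOR_MS, REFINER_MS, VAE_MS]

# Assumption A: all stages run back to back on a single GPU.
sequential_ms = sum(stages_ms)

# Assumption B: each stage runs on its own GPU and chunks are pipelined,
# so steady-state throughput is limited by the slowest stage only.
pipelined_ms = max(stages_ms)


def is_real_time(per_chunk_ms: int, budget_ms: int = CHUNK_MS) -> bool:
    """A pipeline keeps up if it emits one chunk within one chunk duration."""
    return per_chunk_ms <= budget_ms


print("sequential:", sequential_ms, "ms ->", is_real_time(sequential_ms))
print("pipelined: ", pipelined_ms, "ms ->", is_real_time(pipelined_ms))
```

Under the sequential assumption the total is 1580 ms per 1-second chunk, which would not keep up, while the pipelined assumption gives 700 ms per chunk, which would. That gap is why I am asking which setup the real-time claim refers to.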

Also, the paper mentions that autoregressive generation is discretized into fixed 1-second chunks (24 fps). Does one chunk correspond to a latent length of 7 (i.e., roughly equivalent to 25 pixel frames), or am I misunderstanding this mapping?
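
For reference, this is the mapping I have in mind. It is only a sketch: the 4x temporal stride and the convention that the first frame gets its own latent are my assumptions (borrowed from how some causal video VAEs work), and the function names are hypothetical.

```python
def latent_length(pixel_frames: int, temporal_stride: int = 4) -> int:
    """Latent frames for a clip, assuming the first pixel frame maps to its
    own latent and every following group of `temporal_stride` frames maps
    to one more latent."""
    if pixel_frames < 1 or (pixel_frames - 1) % temporal_stride != 0:
        raise ValueError("pixel_frames must be of the form 1 + k * stride")
    return 1 + (pixel_frames - 1) // temporal_stride


def pixel_length(latent_frames: int, temporal_stride: int = 4) -> int:
    """Inverse of latent_length under the same assumed convention."""
    if latent_frames < 1:
        raise ValueError("latent_frames must be at least 1")
    return 1 + (latent_frames - 1) * temporal_stride


print(latent_length(25))  # 7 latents for 25 pixel frames
print(pixel_length(7))    # 25 pixel frames for 7 latents
print(24 // 4)            # 6 latents if a chunk is exactly 24 frames, no extra first frame
```

So under this assumed convention a latent length of 7 gives 25 frames, whereas a plain 24-frame chunk at stride 4 would give 6 latents. I am not sure which of the two the paper uses.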

Thanks!
