
Question about Creative Video Generation (Prompt Stream) #94

@insoo-best

Description

Hi,

Thank you for your impressive work. The "Error-Recycling Fine-Tuning (ERFT)" method is a very insightful approach to long video generation.

I have a clarification question regarding the Creative Video Generation (Streaming Storylines) capabilities mentioned in the paper.

In Appendix D, it is stated that "the proposed SVI-Shot and SVI-Film are trained with the MixKit Dataset consisting of 6K videos." However, looking at the publicly available MixKit dataset on its Hugging Face page, most videos are annotated with a single prompt covering the entire clip.

This leads to a question about how the "Prompt Stream" capability was actually trained:

  1. Multi-Prompt Training Data? Did you use a specially re-annotated version of the MixKit dataset, in which a single video has multiple prompts mapped to different timestamps, to explicitly train scene transitions? (A rough sketch of the annotation format I have in mind follows this list.)

  2. Emergent Capability via ERFT? Or was SVI-Film trained only on single-prompt clips, with the ERFT mechanism naturally enabling prompt transitions at inference time? Specifically, when a new prompt is provided in the "Creative" setting, does the model treat the visual mismatch between the previous frames and the new prompt as a "historical error" and use its learned correction capability to transition to the new scene? (A minimal sketch of this reading is also shown below.)
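
To make question (1) concrete, here is a rough sketch of the kind of re-annotated entry I am imagining. Everything in it is hypothetical: the file name, field names, timestamps, and prompts are mine, purely to illustrate a timestamped multi-prompt mapping, and are not taken from the paper or the released dataset.

```python
# Hypothetical multi-prompt annotation for one MixKit clip (illustrative only;
# none of these fields or values come from the paper or the released dataset).
multi_prompt_entry = {
    "video": "mixkit_example.mp4",  # placeholder file name
    "segments": [
        # Each segment maps a time range to its own prompt, so the boundary
        # at 4.0 s would directly supervise a scene transition.
        {"start_sec": 0.0, "end_sec": 4.0,
         "prompt": "a sailboat drifting on a calm sea at dawn"},
        {"start_sec": 4.0, "end_sec": 8.0,
         "prompt": "storm clouds rolling in as the waves turn rough"},
    ],
}
```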
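
And here is a minimal sketch of reading (2), i.e. prompt transitions emerging at inference without any multi-prompt supervision. The `model.generate_chunk` call and its parameters are my assumption about a chunk-wise streaming interface, not the repository's actual API:

```python
# Minimal sketch of hypothesis (2): a model trained on single-prompt clips,
# run chunk-by-chunk at inference. `model.generate_chunk` is a hypothetical
# interface, used only to illustrate the mechanism being asked about.
def stream_generate(model, prompt_schedule, num_chunks):
    history = None  # previously generated frames, fed back as context
    frames = []
    for step in range(num_chunks):
        prompt = prompt_schedule[step]  # may switch mid-stream ("Creative" setting)
        # When the prompt changes, `history` no longer matches it. Under this
        # reading, the ERFT-learned correction would absorb that mismatch as
        # if it were an accumulated historical error and steer the next chunk
        # toward the new prompt.
        chunk = model.generate_chunk(prompt=prompt, history=history)
        frames.extend(chunk)
        history = chunk  # recycle the latest frames as context
    return frames
```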

I would appreciate any insights you could share on this. Thank you!
