Description
Hi,
Thank you for your impressive work. The "Error-Recycling Fine-Tuning (ERFT)" method is very insightful for long video generation.
I have a clarification question regarding the Creative Video Generation (Streaming Storylines) capabilities mentioned in the paper.
In Appendix D, it is stated that "the proposed SVI-Shot and SVI-Film are trained with the MixKit Dataset consisting of 6K videos." However, looking at the publicly available MixKit dataset on the Hugging Face page, most videos are annotated with a single prompt covering the entire clip.
This leads to a question about how the "Prompt Stream" capability was actually trained:
- Multi-prompt training data? Did you use a specially re-annotated version of the MixKit dataset, in which a single video is mapped to multiple prompts at different timestamps, to explicitly train scene transitions?
- Emergent capability via ERFT? Or was SVI-Film trained on single-prompt clips, with the ERFT mechanism naturally enabling prompt transitions at inference time? Specifically, when a new prompt is provided in a "Creative" setting, does the model treat the visual mismatch between the previous frames and the new prompt as a "historical error" and use its learned correction capability to transition to the new scene?
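To make the two hypotheses concrete, here is a toy sketch of what I have in mind. All names (`TimedPrompt`, `erft_transition_step`, the correction rate) are my own illustration and assumptions, not the actual SVI-Film data format or API:

```python
from dataclasses import dataclass

# (a) Multi-prompt re-annotation hypothesis: one clip mapped to several
# timestamped prompts, so scene transitions are supervised explicitly.
@dataclass
class TimedPrompt:
    start_sec: float
    end_sec: float
    text: str

clip_annotation = [
    TimedPrompt(0.0, 4.0, "a cat sits on a windowsill"),
    TimedPrompt(4.0, 8.0, "the cat jumps down onto the floor"),
]

# (b) Emergent-transition hypothesis: the new prompt's conditioning disagrees
# with the generated history, and the model treats that gap as a "historical
# error" to be corrected over the next few generation steps.
def erft_transition_step(history_feat, target_feat, correction_rate=0.25):
    """One toy correction step: move the history features toward the target."""
    return [h + correction_rate * (t - h) for h, t in zip(history_feat, target_feat)]

history = [0.0, 0.0, 0.0]   # stand-in for features of previously generated frames
target = [1.0, 1.0, 1.0]    # stand-in for features implied by the new prompt
for _ in range(8):
    history = erft_transition_step(history, target)
# After several steps the mismatch shrinks, i.e. the scene drifts toward
# the new prompt's content.
```

My question is essentially whether (a) was used during training, or whether (b) falls out of the error-correction behavior that ERFT instills.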
I would appreciate any insights you could share on this. Thank you!