Description
Hi,
Thank you for your impressive work. The "Error-Recycling Fine-Tuning (ERFT)" method is very insightful for long video generation.
I have a clarification question regarding the Creative Video Generation (Streaming Storylines) capabilities mentioned in the paper.
In Appendix D, it is stated that "the proposed SVI-Shot and SVI-Film are trained with the MixKit Dataset consisting of 6K videos." However, looking at the publicly available MixKit dataset on the Hugging Face page, most videos are annotated with a single prompt covering the entire clip.
This leads to a question about how the "Prompt Stream" capability was actually trained:
- Multi-prompt training data? Did you use a specially re-annotated version of the MixKit dataset, in which a single video is mapped to multiple prompts at different timestamps, to explicitly train scene transitions?
- Emergent capability via ERFT? Or was SVI-Film trained on single-prompt clips, with the ERFT mechanism naturally enabling prompt transitions at inference time? Specifically, when a new prompt is provided in a "Creative" setting, does the model treat the visual mismatch between the previous frames and the new prompt as a "historical error" and use its learned correction capability to transition to the new scene?
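To make the two hypotheses concrete, here is a toy sketch of what I have in mind. All names (`TimedPrompt`, `erft_transition_step`, the correction rate) are my own illustration and assumptions, not the actual SVI-Film data format or API:

```python
from dataclasses import dataclass

# (a) Multi-prompt re-annotation hypothesis: one clip mapped to several
# timestamped prompts, so scene transitions are supervised explicitly.
@dataclass
class TimedPrompt:
    start_sec: float
    end_sec: float
    text: str

clip_annotation = [
    TimedPrompt(0.0, 4.0, "a cat sits on a windowsill"),
    TimedPrompt(4.0, 8.0, "the cat jumps down onto the floor"),
]

# (b) Emergent-transition hypothesis: the new prompt's conditioning disagrees
# with the generated history, and the model treats that gap as a "historical
# error" to be corrected over the next few generation steps.
def erft_transition_step(history_feat, target_feat, correction_rate=0.25):
    """One toy correction step: move the history features toward the target."""
    return [h + correction_rate * (t - h) for h, t in zip(history_feat, target_feat)]

history = [0.0, 0.0, 0.0]   # stand-in for features of previously generated frames
target = [1.0, 1.0, 1.0]    # stand-in for features implied by the new prompt
for _ in range(8):
    history = erft_transition_step(history, target)
# After several steps the mismatch shrinks, i.e. the scene drifts toward
# the new prompt's content.
```

My question is essentially whether (a) was used during training, or whether (b) falls out of the error-correction behavior that ERFT instills.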
I would appreciate any insights you could share on this. Thank you!