diff --git a/index.html b/index.html
index 646f7e3..844f9bb 100644
--- a/index.html
+++ b/index.html
@@ -105,8 +105,17 @@
- Recently, integrating video foundation models and large language models to build a video understanding system overcoming the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, the computation complexity, memory cost, and long-term temporal connection are the remaining challenges. Inspired by Atkinson-Shiffrin memory model, we develop an memory mechanism including a rapidly updated short-term memory and a compact thus sustained long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performace in long video understanding.
-
+ Recently, integrating video foundation models and large language models to build a video
+understanding system can overcome the limitations of specific pre-defined vision tasks.
+Yet, existing systems can only handle videos with very few frames. For long videos,
+the computation complexity, memory cost, and long-term temporal connection impose
+additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with
+tokens in Transformers employed as the carriers of memory in combination with our
+specially designed memory mechanism, we propose MovieChat to overcome these
+challenges. MovieChat achieves state-of-the-art performance in long video understanding,
+along with the released MovieChat-1K benchmark with 1K long videos and 14K manual
+annotations for validating the effectiveness of our method.
+