Hi Authors,
Thanks for your excellent work! I am a little confused about the [SEG] token design in the script llava_sam.py.
1. Regarding the invalid [SEG] token case: when the seg token is invalid, why do you need to add the number 5?
2. If I understand correctly, you put 5 sampled video frames into the input prompt as tokens, and the model is supposed to generate 1 [SEG] token per data entry. However, I observed that the code extracts the last 5 indices of the hidden states; why not 1? Additionally, for batch_size = 2, if frame_per_batch = [5, 5], then seg_token_counts is [5, 5] instead of [1, 1] with the current model. Since self.seg_token_idx is a single integer, are these five [SEG] tokens the same?
I have also observed many cases where, even though frame_per_batch = [5, 5], seg_token_counts is [3, 0]. How can only 3 [SEG] tokens be generated instead of 5, and how do you deal with the resulting alignment issue?
Thanks a lot for your clarification!
Thanks,
Ruining
The 5 in the first question does not actually mean 5 frames; it means there are 5 instances in one set of image/video data. That is to say, 5 [SEG] tokens will generate 5 instance masks. The code you mentioned mainly provides an empty (dummy) forward pass so that ZeRO-3 works during training, keeping different GPUs executing the same code.
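The gather step behind seg_token_counts can be sketched as follows. This is a minimal illustration of the idea only; the function and tensor names (extract_seg_hidden_states, seg_token_idx) are my own choices and not the exact llava_sam.py implementation:

```python
import torch

def extract_seg_hidden_states(hidden_states, input_ids, seg_token_idx):
    """Gather the hidden state at every [SEG] token position.

    hidden_states: (batch, seq_len, hidden_dim) from the LLM.
    input_ids:     (batch, seq_len) token ids.
    seg_token_idx: the single vocabulary id assigned to [SEG].
    """
    # Boolean mask marking every [SEG] occurrence in each sequence.
    seg_mask = input_ids == seg_token_idx          # (batch, seq_len)
    # Per-sample counts: e.g. [5, 5] for 5 instances each, or [3, 0]
    # when a sample contains fewer (or no) [SEG] tokens.
    seg_token_counts = seg_mask.sum(dim=-1)        # (batch,)
    # One hidden vector per [SEG] occurrence, concatenated over the batch.
    seg_hidden = hidden_states[seg_mask]           # (total_seg, hidden_dim)
    return seg_hidden, seg_token_counts
```

Because seg_token_counts can be zero for some samples, a dummy pass over placeholder embeddings keeps every GPU running the same graph under ZeRO-3.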
During training, the input texts look like this:
<image>
<user>
Please segment obj1.
<assistant>
It is [SEG].
<user>
Please segment obj2.
<assistant>
It is [SEG].
<user>
Please segment obj3.
<assistant>
It is [SEG].
...
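To see why each instance contributes exactly one [SEG] token, the template above can be assembled programmatically. build_training_text is a hypothetical helper for illustration only, not the project's dataset code:

```python
def build_training_text(num_instances):
    """Assemble the multi-turn prompt shown above: one user/assistant
    round per instance, each answer containing a single [SEG] token."""
    turns = []
    for i in range(1, num_instances + 1):
        turns.append(f"<user>\nPlease segment obj{i}.\n<assistant>\nIt is [SEG].")
    return "<image>\n" + "\n".join(turns)

text = build_training_text(5)
# 5 instances -> 5 [SEG] tokens -> 5 instance masks
assert text.count("[SEG]") == 5
```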