Possible error when using mask for attention. #153

Open
TropComplique opened this issue Jun 5, 2024 · 1 comment

@TropComplique

Hi!
I believe that the mask in window self-attention must be used only when the windows are shifted.
But here

```python
attn_windows = self.attn(x_windows, mask=self.calculate_mask(x_size).to(x.device))
```

the mask is used all the time, for both shifted and non-shifted windows.

This might introduce errors at the bottom edge or the right edge of an image.
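
A minimal sketch of the guard I have in mind, assuming the surrounding `forward()` of `SwinTransformerBlock` in `network_swinir.py` (variable names taken from that method):

```python
# sketch: inside SwinTransformerBlock.forward(), after window partitioning
if self.input_resolution == x_size:
    # training resolution: use the mask precomputed in __init__
    # (it is already None for non-shifted blocks)
    attn_windows = self.attn(x_windows, mask=self.attn_mask)
else:
    # other resolutions: recompute a mask only for shifted blocks;
    # non-shifted blocks (shift_size == 0) should attend without a mask
    mask = self.calculate_mask(x_size).to(x.device) if self.shift_size > 0 else None
    attn_windows = self.attn(x_windows, mask=mask)
```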

@bigpieit

@TropComplique I think you are right. Taking the 'gray_dn' task as an example, I added a print inside the `if mask is not None:` branch of WindowAttention's forward(). It printed 36 times, since there are 6 RSTB blocks and 6 STL layers in each RSTB. This means that every STL window attention is using the mask, i.e. the shifted-window one.
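The same check can be reproduced without editing the model, e.g. with forward pre-hooks; a sketch, assuming the 'gray_dn' constructor arguments from `main_test_swinir.py` and PyTorch >= 2.0:

```python
import torch
from models.network_swinir import SwinIR  # module path as in the SwinIR repo

# 'gray_dn'-style configuration (copied from main_test_swinir.py; treat as an assumption)
model = SwinIR(upscale=1, in_chans=1, img_size=128, window_size=8, img_range=1.,
               depths=[6, 6, 6, 6, 6, 6], embed_dim=180, num_heads=[6, 6, 6, 6, 6, 6],
               mlp_ratio=2, upsampler='', resi_connection='1conv')
model.eval()

counts = {'masked': 0, 'unmasked': 0}

def count_mask(module, args, kwargs):
    # the block calls self.attn(x_windows, mask=...), so the mask arrives as a kwarg
    counts['masked' if kwargs.get('mask') is not None else 'unmasked'] += 1

for m in model.modules():
    if type(m).__name__ == 'WindowAttention':
        m.register_forward_pre_hook(count_mask, with_kwargs=True)

with torch.no_grad():
    model(torch.randn(1, 1, 136, 136))  # any size other than the 128 training resolution

print(counts)  # I would expect {'masked': 36, 'unmasked': 0} here
```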

Here is my preliminary analysis. For 'gray_dn', SwinIR is initialized with img_size 128x128. At initialization time, even layers get `attn_mask = None` and odd layers get `attn_mask = self.calculate_mask(self.input_resolution)`.
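
The construction-time logic, paraphrased from `SwinTransformerBlock.__init__` in `network_swinir.py`:

```python
# paraphrased: only shifted blocks precompute a mask, and only for the
# training resolution (self.input_resolution)
if self.shift_size > 0:
    attn_mask = self.calculate_mask(self.input_resolution)
else:
    attn_mask = None
self.register_buffer("attn_mask", attn_mask)
```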

[screenshot: the `elif args.task == 'gray_dn':` model-setup branch]

However, at inference time the precomputed `attn_mask` is only used when the input matches the initialization size. If the inference-time image size `x_size` is no longer 128, then the mask is recalculated for every Swin transformer layer, and every Swin transformer layer ends up using masked attention. It seems the comment in the closed issue #13 is not correct.
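
The branch in question, paraphrased from `SwinTransformerBlock.forward()`:

```python
# paraphrased: at the training resolution the precomputed mask is used
# (None for non-shifted blocks), but at any other resolution a mask is
# recalculated and passed to every block, shifted or not
if self.input_resolution == x_size:
    attn_windows = self.attn(x_windows, mask=self.attn_mask)
else:
    attn_windows = self.attn(x_windows, mask=self.calculate_mask(x_size).to(x.device))
```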


@JingyunLiang can you please help comment? I am concerned and curious about the difference in results between training time and inference time. If the training size is 128, then at training time each RSTB alternates between non-shifted attention and shifted & masked attention, but inference on a different image size always uses the shifted & masked attention. How would the results match...?
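
For reference, the alternation I mean, paraphrased from `BasicLayer.__init__` in `network_swinir.py`:

```python
# paraphrased: STL blocks alternate between W-MSA (shift_size = 0)
# and SW-MSA (shift_size = window_size // 2)
self.blocks = nn.ModuleList([
    SwinTransformerBlock(dim=dim,
                         input_resolution=input_resolution,
                         num_heads=num_heads,
                         window_size=window_size,
                         shift_size=0 if (i % 2 == 0) else window_size // 2)
    for i in range(depth)])
```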
