Lesson 268: https://www.udemy.com/course/pytorch-for-deep-learning/learn/lecture/34049500#questions
RuntimeError: Error(s) in loading state_dict for VisionTransformer:
Missing key(s) in state_dict: "heads.head.weight", "heads.head.bias".
Unexpected key(s) in state_dict: "heads.weight", "heads.bias".
The `heads` structure of the ViT model after transfer learning differs from the original ViT model: the original model wraps a `Linear` layer (named `head`) inside a `Sequential` container (named `heads`), while the transfer-learning model replaces `heads` with a bare `Linear` layer, so the state_dict keys no longer match.
ViT model after transfer learning
============================================================================================================================================
Layer (type (var_name)) Input Shape Output Shape Param # Trainable
============================================================================================================================================
VisionTransformer (VisionTransformer) [1, 3, 224, 224] [1, 3] 768 Partial
├─Conv2d (conv_proj) [1, 3, 224, 224] [1, 768, 14, 14] (590,592) False
├─Encoder (encoder) [1, 197, 768] [1, 197, 768] 151,296 False
│ └─Dropout (dropout) [1, 197, 768] [1, 197, 768] -- --
│ └─Sequential (layers) [1, 197, 768] [1, 197, 768] -- False
│ │ └─EncoderBlock (encoder_layer_0) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_1) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_2) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_3) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_4) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_5) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_6) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_7) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_8) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_9) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_10) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ │ └─EncoderBlock (encoder_layer_11) [1, 197, 768] [1, 197, 768] (7,087,872) False
│ └─LayerNorm (ln) [1, 197, 768] [1, 197, 768] (1,536) False
├─Linear (heads) [1, 768] [1, 3] 2,307 True
============================================================================================================================================
Total params: 85,800,963
Trainable params: 2,307
Non-trainable params: 85,798,656
Total mult-adds (Units.MEGABYTES): 172.47
============================================================================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 104.09
Params size (MB): 229.20
Estimated Total Size (MB): 333.89
============================================================================================================================================
Original ViT model
============================================================================================================================================
Layer (type (var_name)) Input Shape Output Shape Param # Trainable
============================================================================================================================================
VisionTransformer (VisionTransformer) [1, 3, 224, 224] [1, 1000] 768 True
├─Conv2d (conv_proj) [1, 3, 224, 224] [1, 768, 14, 14] 590,592 True
├─Encoder (encoder) [1, 197, 768] [1, 197, 768] 151,296 True
│ └─Dropout (dropout) [1, 197, 768] [1, 197, 768] -- --
│ └─Sequential (layers) [1, 197, 768] [1, 197, 768] -- True
│ │ └─EncoderBlock (encoder_layer_0) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_1) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_2) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_3) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_4) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_5) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_6) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_7) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_8) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_9) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_10) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ │ └─EncoderBlock (encoder_layer_11) [1, 197, 768] [1, 197, 768] 7,087,872 True
│ └─LayerNorm (ln) [1, 197, 768] [1, 197, 768] 1,536 True
├─Sequential (heads) [1, 768] [1, 1000] -- True
│ └─Linear (head) [1, 768] [1, 1000] 769,000 True
============================================================================================================================================
Total params: 86,567,656
...
Input size (MB): 0.60
Forward/backward pass size (MB): 104.09
Params size (MB): 232.27
Estimated Total Size (MB): 336.96
============================================================================================================================================
How should the ViT `heads` be structured after transfer learning so that the model's state_dict can be saved and loaded properly?