Computer vision is one of the most active areas of deep learning, with many model families for segmentation such as Mask R-CNN, U-Net, and so on. This project introduces a state-of-the-art semantic segmentation model, DeepLabV3+. Atrous convolution was introduced in DeepLab as a tool to adjust and control the effective field-of-view of the convolution.
The picture above shows how atrous convolution works. Compared with a regular convolution, the kernel skips pixels according to a dilation rate, so it covers a wider field-of-view with the same number of parameters and lets the encoder capture more context from the image.
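As a minimal sketch, this is how a dilated (atrous) convolution can be declared in TensorFlow; the filter count and dilation rate below are illustrative, not the project's exact settings:

```python
import tensorflow as tf

# A regular 3x3 convolution sees a 3x3 neighborhood of the input.
regular = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same")

# The same 3x3 kernel with dilation_rate=2 skips every other pixel,
# covering a 5x5 neighborhood with no extra parameters.
atrous = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same",
                                dilation_rate=2)

x = tf.random.normal((1, 128, 128, 3))  # dummy 128 x 128 RGB input
print(regular(x).shape)  # (1, 128, 128, 64)
print(atrous(x).shape)   # (1, 128, 128, 64): same size, wider field-of-view
```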
The picture below shows the entire model architecture; note that atrous convolution is used in the Atrous Spatial Pyramid Pooling (ASPP) module.
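As a rough sketch of that module, an ASPP block runs parallel atrous convolutions at several rates plus an image-level pooling branch and fuses the results. The rates 6/12/18 follow the DeepLab papers; the filter count and the omission of batch norm and activations are simplifications, so this is not necessarily the exact block used here:

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp_block(inputs, filters=256):
    """Simplified Atrous Spatial Pyramid Pooling block."""
    # 1x1 convolution branch
    b0 = layers.Conv2D(filters, 1, padding="same", use_bias=False)(inputs)
    # Parallel atrous 3x3 branches with increasing dilation rates
    b1 = layers.Conv2D(filters, 3, padding="same", dilation_rate=6,
                       use_bias=False)(inputs)
    b2 = layers.Conv2D(filters, 3, padding="same", dilation_rate=12,
                       use_bias=False)(inputs)
    b3 = layers.Conv2D(filters, 3, padding="same", dilation_rate=18,
                       use_bias=False)(inputs)
    # Image-level pooling branch (assumes a fixed spatial size)
    h, w, c = inputs.shape[1], inputs.shape[2], inputs.shape[3]
    b4 = layers.GlobalAveragePooling2D()(inputs)   # (batch, c)
    b4 = layers.Reshape((1, 1, c))(b4)             # (batch, 1, 1, c)
    b4 = layers.Conv2D(filters, 1, padding="same", use_bias=False)(b4)
    b4 = layers.UpSampling2D(size=(h, w), interpolation="bilinear")(b4)
    # Concatenate all branches and project back to `filters` channels
    x = layers.Concatenate()([b0, b1, b2, b3, b4])
    return layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
```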
Python >= 3.8
TensorFlow >= 2.0
Assumed data folder layout:
DeepLab
----> Train_Images (for training)
----> Train_Masks (for training)
----> Val_Images (for validation)
----> Val_Masks (for validation)
----> Test_Images (for testing)
----> Test_Masks (for testing)
----> Output (for saving models)
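Given that layout, a minimal data-loading sketch could look like the following; the PNG extension, normalization, and batch size are assumptions, not necessarily what this repo uses:

```python
import os
from glob import glob
import tensorflow as tf

IMG_SIZE = 128  # the results below were obtained with 128 x 128 images

def load_pair(image_path, mask_path):
    """Read and resize one image/mask pair."""
    image = tf.io.decode_png(tf.io.read_file(image_path), channels=3)
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE)) / 255.0
    mask = tf.io.decode_png(tf.io.read_file(mask_path), channels=1)
    mask = tf.image.resize(mask, (IMG_SIZE, IMG_SIZE), method="nearest")
    return image, mask

def make_dataset(image_dir, mask_dir, batch_size=8):
    """Pair sorted image/mask files from the folders above."""
    images = sorted(glob(os.path.join(image_dir, "*.png")))
    masks = sorted(glob(os.path.join(mask_dir, "*.png")))
    ds = tf.data.Dataset.from_tensor_slices((images, masks))
    return ds.map(load_pair).batch(batch_size).prefetch(tf.data.AUTOTUNE)

train_ds = make_dataset("DeepLab/Train_Images", "DeepLab/Train_Masks")
val_ds = make_dataset("DeepLab/Val_Images", "DeepLab/Val_Masks")
```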
Image Resources: Link
deeplabv3.mp4
After 20 epochs of training, DeepLabV3+ predicts these images with high accuracy.
| Train Loss | Dice Coef | IoU | Val Loss | Val Dice Coef | Val IoU |
|---|---|---|---|---|---|
| 0.0730 | 0.9270 | 0.8643 | 0.1173 | 0.8827 | 0.7906 |
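Note that Train Loss + Dice Coef sums to 1.0 in the table, which is consistent with a Dice loss of 1 - dice. A common formulation of both metrics, with a small smoothing term (an assumption about the exact metric code used here), is:

```python
import tensorflow as tf

def dice_coef(y_true, y_pred, smooth=1e-6):
    """Dice = 2 * intersection / (sum of both masks); loss = 1 - dice."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def iou(y_true, y_pred, smooth=1e-6):
    """IoU = intersection / union."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    return (intersection + smooth) / (union + smooth)
```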
DeepLabV3+ finishes training in approximately 30 seconds with 128 x 128 images, and GPU memory consumption is around 8 GB. This is largely because I used ResNet50 as the backbone, so selecting a different backbone would change these results.
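For example, the encoder can be built from Keras Applications like this; the tapped layer names are common choices for ResNet50 in DeepLabV3+ implementations, not necessarily the ones used in this repo:

```python
import tensorflow as tf

# ResNet50 encoder without the classification head; swapping in another
# tf.keras.applications backbone changes speed, memory use, and accuracy.
backbone = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(128, 128, 3))

# Feature maps commonly tapped for DeepLabV3+ (layer names are assumptions):
high_level = backbone.get_layer("conv4_block6_2_relu").output  # feeds ASPP
low_level = backbone.get_layer("conv2_block3_2_relu").output   # feeds decoder

encoder = tf.keras.Model(inputs=backbone.input,
                         outputs=[low_level, high_level])
```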
I noticed one issue with DeepLabV3+: the model struggles to differentiate between black and white regions. For example, it cannot correctly segment a person standing with arms akimbo. This might be related to the image size, so I plan to try larger images in the future.