A Streamlined Attention Mechanism for Image Classification and Fine-Grained Visual Recognition

Dakshayani D Himabindu; Praveen S Kumar

doi:10.13164/mendel.2021.2.059

Dakshayani D Himabindu, Department of CSE, GIT, GITAM University
Praveen S Kumar, Department of CSE, GIT, GITAM University

Keywords: Visual attention, spatial attention, channel attention, fine-grained visual recognition, image classification, deep learning.

Abstract

Attention mechanisms have played a vital role in recent deep learning advances, improving results across computer vision tasks. Attention has been applied to image classification, fine-grained visual recognition, image captioning, video captioning, and object detection and recognition. Global and local attention are two attention-based mechanisms that help in interpreting the attended parts of the input. Along the same lines, there are channel attention and spatial attention: channel attention identifies the most informative channels among those produced by a block, whereas spatial attention identifies which spatial regions should be focused on. We propose a streamlined attention block module that enhances feature learning with few additional layers, namely a global average pooling (GAP) layer followed by a linear layer, with global second-order pooling (GSoP) incorporated after every layer of the encoder. In our experiments, this mechanism captured long-range dependencies better. We evaluated our model on the CIFAR-10, CIFAR-100, and FGVC-Aircraft datasets, targeting fine-grained visual recognition. We achieved state-of-the-art results on FGVC-Aircraft, with an accuracy of 97%.
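
To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows a GAP layer followed by a linear layer used to rescale channels, and a channel covariance matrix as one common form of second-order pooling. The function names, the sigmoid gating, the `(C, H, W)` layout, and the random weights are all assumptions made for illustration; the abstract does not specify how the GSoP statistics are combined with the GAP branch, so the two are computed separately here.

```python
import numpy as np


def global_average_pool(x):
    """Average each channel of a (C, H, W) feature map over its spatial grid."""
    return x.mean(axis=(1, 2))


def channel_covariance(x):
    """Second-order statistic: (C, C) covariance between the channels of x."""
    c = x.shape[0]
    flat = x.reshape(c, -1)                              # (C, H*W)
    centered = flat - flat.mean(axis=1, keepdims=True)   # remove per-channel mean
    return centered @ centered.T / flat.shape[1]


def gap_linear_attention(x, weight, bias):
    """Rescale channels with gates from GAP -> linear -> sigmoid.

    The sigmoid is an assumption; the abstract only names GAP and a linear layer.
    """
    pooled = global_average_pool(x)                      # (C,)
    logits = weight @ pooled + bias                      # (C,)
    gates = 1.0 / (1.0 + np.exp(-logits))                # each gate lies in (0, 1)
    return x * gates[:, None, None], gates


# Hypothetical example: 8 channels on a 4x4 grid with small random weights.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))
weight = rng.standard_normal((8, 8)) * 0.1
bias = np.zeros(8)

attended, gates = gap_linear_attention(features, weight, bias)
cov = channel_covariance(features)
```

The gated output keeps the input shape, so a block of this kind could in principle be placed after any encoder layer without changing the layers that follow it.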
