WSEAS Transactions on Systems and Control (ISSN 1991-8763)
Journal DOI: 10.37394/23203, http://wseas.org/wseas/cms.action?id=407
Volume 15, 2020 (DOI: 10.37394/23203.2020.15, http://wseas.org/wseas/cms.action?id=23195)

Title: A Survey on Different Deep Learning Architectures for Image Captioning
Authors: M. Nivedita and Y. Asnath Victy Phamila
Affiliation: Vellore Institute of Technology, Chennai, 600127, India

Abstract: Vision plays an important role in how we look at the world and perceive information about our surroundings. A human perceives a scene by looking at an object, or at the surroundings as a whole, mapping visual features and attributes, and summarizing those features to describe what is seen. How the human brain accomplishes this is still largely a mystery. For a machine, the analogous task is called image captioning. The machine is fed images from which it learns to extract features, such as pixel information, object positions, and geometry, and it then maps those features, word by word or as a whole, to a sentence that summarizes the content of the image. Owing to recent advances in computer vision methods and deep learning architectures, computers are now able to accurately summarize the images fed to them. In this paper, we present a survey of these new architectures and of the datasets used to train them. Furthermore, we discuss future methods that could be implemented.

Published: 20 November 2020
Pages: 635-646
Article DOI: 10.37394/23203.2020.15.63
Full text: https://www.wseas.org/multimedia/journals/control/2020/b265103-1005.pdf
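To make the pipeline described in the abstract concrete, the sketch below shows a minimal encoder-decoder captioning model in PyTorch. It is an illustration only, not the method of any particular surveyed paper: the CNN-encoder and LSTM-decoder choices, the class names, and all sizes are assumptions picked for brevity.

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    # Encode an image into a fixed-length feature vector with a CNN
    # backbone (a ResNet here; any pretrained backbone would do).
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=None)  # in practice, load pretrained weights
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():                 # encoder kept frozen in this sketch
            feats = self.backbone(images)     # (B, 2048, 1, 1)
        return self.fc(feats.flatten(1))      # (B, embed_size)

class DecoderRNN(nn.Module):
    # Generate the caption word by word with an LSTM conditioned on
    # the image feature (teacher forcing during training).
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first step of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                # per-step vocabulary logits

# Shape check only; a real pipeline needs data, a tokenizer, and training.
enc, dec = EncoderCNN(256), DecoderRNN(256, 512, vocab_size=10000)
images = torch.randn(4, 3, 224, 224)          # batch of 4 RGB images
captions = torch.randint(0, 10000, (4, 15))   # batch of 4 token sequences
logits = dec(enc(images), captions)           # (4, 16, 10000)

Attention-based and transformer models replace the single image vector with weighted sums over spatial features at each decoding step, but the feature-to-sentence mapping shown here is the common core of the architectures the survey covers.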