Controllable image captioning with feature refinement and multilayer fusion
Image captioning is the task of automatically generating a description of an image. Traditional image captioning models tend to generate a sentence describing the most conspicuous objects, but fail to describe a desired region or object as a human would. To generate sentences conditioned on a given target, understanding the relationships between particular objects and describing them accurately is central to this task. To this end, a controllable captioning model, IANR, is proposed. Specifically, an information-augmented embedding adds prior information to each object, and a new Multi-Relational Weighted Graph Convolutional Network (MR-WGCN) fuses the features of adjacent objects. A dynamic attention decoder module then selectively focuses on particular objects or semantic contents, and the model is further optimized with a similarity loss. Experiments on MSCOCO Entities demonstrate that IANR obtains, to date, the best published CIDEr score of 124.52% on the Karpathy test split. Extensive experiments and ablations on both MSCOCO Entities and Flickr30k Entities demonstrate the effectiveness of each module. Meanwhile, IANR achieves better accuracy and controllability than state-of-the-art models under the widely used evaluation metrics.
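To make the graph-fusion idea concrete, the sketch below shows a minimal weighted graph convolution over detected object features. It is an illustrative assumption, not the authors' implementation: the paper's MR-WGCN handles multiple relation types, whereas this example uses a single relation with non-negative edge weights; the class name WeightedGraphConv, the PyTorch framework, and the feature dimensions are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


class WeightedGraphConv(nn.Module):
    """Hypothetical single-relation weighted graph convolution.

    Each object aggregates a weighted average of its neighbours'
    projected features and combines it with its own projection,
    loosely mirroring how a weighted GCN fuses adjacent objects.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.neigh_proj = nn.Linear(in_dim, out_dim)  # transform neighbour features
        self.self_proj = nn.Linear(in_dim, out_dim)   # transform the node's own features

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (batch, num_objects, in_dim)            object region features
        # adj: (batch, num_objects, num_objects)       non-negative edge weights
        # Normalise edge weights so each node takes a weighted average of neighbours.
        norm = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        neighbour_msg = torch.bmm(norm, self.neigh_proj(x))
        return torch.relu(self.self_proj(x) + neighbour_msg)


if __name__ == "__main__":
    layer = WeightedGraphConv(in_dim=2048, out_dim=512)
    feats = torch.randn(2, 10, 2048)    # e.g. 10 detected objects per image
    weights = torch.rand(2, 10, 10)     # e.g. pairwise relation scores
    fused = layer(feats, weights)
    print(fused.shape)                  # torch.Size([2, 10, 512])
```

A multi-relational variant would keep one such aggregation per relation type and sum (or gate) the resulting messages before the nonlinearity.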
Funding
This research was supported by the NSFC No. 61771386, by the Key Research and Development Program of Shaanxi No. 2020SF-359, by the Research and development of manufacturing information system platform supporting product lifecycle management No. 2018GY-030, by the Doctoral Research Fund of Xi’an University of Technology, China, under Grant Program No. 103-451119003, by the Natural Science Foundation of Shaanxi Province No. 2021JQ-487, by the Xi’an Science and Technology Foundation No. 2019217814GXRC014CG015-GXYD14.11, and by the Natural Science Foundation of Shaanxi Province No. 2023-JC-YB-550.
Author affiliation
School of Computing and Mathematical Sciences, University of Leicester
Version
- VoR (Version of Record)