TY - JOUR
T1 - Automatically generating natural language descriptions of images by a deep hierarchical framework
AU - Huo, Lin
AU - Bai, Lin
AU - Zhou, Shang Ming
N1 - © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting /republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
PY - 2022/8
Y1 - 2022/8
N2 - Automatically generating an accurate and meaningful description of an image is very challenging. However, the recent scheme of generating an image caption by maximizing the likelihood of target sentences lacks the capacity of recognizing the human-object interaction (HOI) and semantic relationship between HOIs and scenes, which are the essential parts of an image caption. This article proposes a novel two-phase framework to generate an image caption by addressing the above challenges: 1) a hybrid deep learning and 2) an image description generation. In the hybrid deep-learning phase, a novel factored three-way interaction machine was proposed to learn the relational features of the human-object pairs hierarchically. In this way, the image recognition problem is transformed into a latent structured labeling task. In the image description generation phase, a lexicalized probabilistic context-free tree growing scheme is innovatively integrated with a description generator to transform the descriptions generation task into a syntactic-tree generation process. Extensively comparing state-of-the-art image captioning methods on benchmark datasets, we demonstrated that our proposed framework outperformed the existing captioning methods in different ways, such as significantly improving the performance of the HOI and relationships between HOIs and scenes (RHIS) predictions, and quality of generated image captions in a semantically and structurally coherent manner.
AB - Automatically generating an accurate and meaningful description of an image is very challenging. However, the recent scheme of generating an image caption by maximizing the likelihood of target sentences lacks the capacity of recognizing the human-object interaction (HOI) and semantic relationship between HOIs and scenes, which are the essential parts of an image caption. This article proposes a novel two-phase framework to generate an image caption by addressing the above challenges: 1) a hybrid deep learning and 2) an image description generation. In the hybrid deep-learning phase, a novel factored three-way interaction machine was proposed to learn the relational features of the human-object pairs hierarchically. In this way, the image recognition problem is transformed into a latent structured labeling task. In the image description generation phase, a lexicalized probabilistic context-free tree growing scheme is innovatively integrated with a description generator to transform the descriptions generation task into a syntactic-tree generation process. Extensively comparing state-of-the-art image captioning methods on benchmark datasets, we demonstrated that our proposed framework outperformed the existing captioning methods in different ways, such as significantly improving the performance of the HOI and relationships between HOIs and scenes (RHIS) predictions, and quality of generated image captions in a semantically and structurally coherent manner.
KW - context modeling
KW - generators
KW - human-object interaction (HOI)
KW - hybrid deep learning
KW - hybrid power systems
KW - image captioning
KW - image context
KW - image recognition
KW - natural language processing
KW - solid modeling
KW - task analysis
KW - visualization
UR - http://www.scopus.com/inward/record.url?scp=85099219570&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2020.3041595
DO - 10.1109/TCYB.2020.3041595
M3 - Article
AN - SCOPUS:85099219570
SN - 2168-2267
VL - 52
SP - 7441
EP - 7452
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 8
ER -