University of Leicester

A Triple-Branch Hybrid Dynamic-Static Alignment Strategy for Vision-Language Tasks

journal contribution
posted on 2025-07-25, 15:16 authored by Q Shen, C Chen, D Han, Y Xu, X Wang, Huiyu Zhou
<p dir="ltr">Multimodal alignment plays a critical role in vision-language tasks, directly influencing a model’s ability to understand and reason about the semantic relationships between images and texts. Existing static alignment methods model modality relations through predefined alignment structures; they are stable but cannot adapt their alignment strategy to task-specific demands, which limits model flexibility. In contrast, dynamic alignment approaches respond to complex semantics through interactive modeling mechanisms but often become unstable under high noise or inconsistent modality distributions. To balance the flexibility and stability of alignment strategies, this paper proposes a novel Triple-Branch Hybrid Dynamic-Static Alignment (TriHDSA) strategy comprising three branches. The hybrid alignment branch incorporates a dynamic capsule attention network that performs hierarchical, fine-grained reasoning over statically aligned features, generating learnable decision-guiding information and dynamic routing weights. The elastic adjustment branch applies an adaptive Top-K feature selection mechanism centered on the routing weights and uses backpropagation to adjust the weight distribution dynamically, strengthening the robustness of the dynamic reasoning process and mitigating instability caused by noisy data. The adaptive balancing branch measures the discrepancy between the output distributions of the other two branches with a Kullback-Leibler divergence loss, promoting consistency between the two alignment strategies and further improving the model’s generalization across diverse tasks. Extensive experiments on six public benchmark datasets spanning three classic vision-language tasks demonstrate that TriHDSA consistently outperforms most existing state-of-the-art methods, validating its effectiveness and generalization capability. The code will be available at: https://github.com/shenxiang-vqa/TriHDSA.</p>
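The two mechanisms named in the abstract — routing-weight-driven Top-K feature selection and a KL divergence consistency loss between branch outputs — can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the function names, the fixed k, and the use of plain lists instead of learned tensors are illustrative assumptions, and the backpropagation-based weight adjustment is omitted.

```python
import math

def softmax(xs):
    """Convert raw scores into a probability distribution (illustrative
    stand-in for the dynamic routing weights produced by the hybrid
    alignment branch)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_select(features, routing_weights, k):
    """Keep only the k features with the largest routing weights — a
    minimal stand-in for the elastic adjustment branch's adaptive Top-K
    selection (the paper adjusts the weight distribution via
    backpropagation, which is omitted here)."""
    order = sorted(range(len(routing_weights)),
                   key=lambda i: routing_weights[i], reverse=True)[:k]
    return [features[i] for i in sorted(order)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, the quantity the
    adaptive balancing branch uses to align the output distributions of
    the other two branches."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))
```

For example, `top_k_select(["f0", "f1", "f2", "f3"], softmax([2.0, 0.1, 1.5, -1.0]), 2)` keeps the two features with the highest routing weight, and `kl_divergence(p, p)` is zero, so minimizing this loss pushes the two branches toward agreement.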

Funding

Natural Science Foundation of Shanghai under Grant 25ZR1401156

Shanghai Maritime University’s Top Innovative Talent Training Program for Graduate Students under Grant 2023YBR017

History

Author affiliation

College of Science & Engineering, Computing & Mathematical Sciences

Version

  • AM (Accepted Manuscript)

Published in

Neural Networks

Volume

191

Publisher

Elsevier

ISSN

0893-6080

eISSN

1879-2782

Copyright date

2025

Available date

2025-07-25

Language

en

Deposited by

Professor Huiyu Zhou

Deposit date

2025-07-12
