Posted on 2025-07-25, 15:16, authored by Q Shen, C Chen, D Han, Y Xu, X Wang, Huiyu Zhou
<p dir="ltr">Multimodal alignment plays a critical role in vision-language tasks, directly influencing a model’s ability to understand and reason about the semantic relationships between images and texts. Existing static alignment methods model modality relations through predefined alignment structures; they are stable but cannot adapt their alignment strategy to task-specific demands, which limits model flexibility. In contrast, dynamic alignment approaches respond better to complex semantics through interactive modeling mechanisms but often become unstable under high noise or inconsistent modality distributions. To balance the flexibility and stability of alignment strategies, this paper proposes a novel Triple-Branch Hybrid Dynamic-Static Alignment (TriHDSA) strategy. The framework comprises three key branches. The hybrid alignment branch incorporates a dynamic capsule attention network to perform hierarchical, fine-grained reasoning over static-aligned features, generating learnable decision-guiding information and dynamic routing weights. The elastic adjustment branch applies an adaptive Top-K feature selection mechanism centered on the routing weights and leverages backpropagation to dynamically adjust the weight distributions, making the dynamic reasoning process more robust and mitigating instability caused by noisy data. The adaptive balancing branch measures the discrepancy between the output distributions of the hybrid alignment and elastic adjustment branches with a Kullback-Leibler divergence loss, promoting consistency and integration between the two alignment strategies and further improving the model’s generalization across diverse tasks.
Extensive experiments on six public benchmark datasets across three classic vision-language tasks demonstrate that TriHDSA consistently outperforms most existing state-of-the-art methods, validating its effectiveness and generalization capability. The code will be available at: https://github.com/shenxiang-vqa/TriHDSA.</p>
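<p dir="ltr">The two generic building blocks named in the abstract, Top-K selection over routing weights and a Kullback-Leibler consistency term between two distributions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example scores, the renormalization step, and the direction of the divergence are all assumptions made for illustration, and the sketch uses plain Python rather than a deep learning framework.</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mask(weights, k):
    """Keep the k largest weights, zero the rest, and renormalize to sum to 1.

    Assumes the weights are positive (e.g. a softmax output).
    """
    if not 1 <= k <= len(weights):
        raise ValueError("k must be between 1 and len(weights)")
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:k])
    kept = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(kept)
    return [w / total for w in kept]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Hypothetical routing scores for four features (illustrative values only).
routing_scores = [2.0, 0.5, 1.0, -1.0]
routing_weights = softmax(routing_scores)

# Assumed "elastic" view: only the two strongest routes survive.
selected = top_k_mask(routing_weights, k=2)

# Assumed consistency term: how far the pruned distribution is from the full one.
consistency_loss = kl_divergence(selected, routing_weights)
```

<p dir="ltr">In this toy setting the loss shrinks as more probability mass already sits on the selected routes, which is one way to read the abstract's goal of keeping the two branches consistent; how TriHDSA actually defines and weights this term is specified in the paper and code, not here.</p>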
Funding
Natural Science Foundation of Shanghai under Grant 25ZR1401156
Shanghai Maritime University’s Top Innovative Talent Training Program for Graduate Students under Grant 2023YBR017
Author affiliation
College of Science & Engineering
Comp' & Math' Sciences