University of Leicester

MIP-CLIP: Multimodal Independent Prompt CLIP for Action Recognition

journal contribution
posted on 2025-04-04, 09:30 authored by X Gao, Z Chang, D Kong, Huiyu Zhou, Y Lu
Recently, the Contrastive Language-Image Pre-training (CLIP) model has shown significant generalizability by optimizing the distance between visual and text features. Mainstream CLIP-based action recognition methods mitigate the poor “zero-shot” generalization of the 1-of-N paradigm, but they also significantly degrade supervised performance. Strong supervised performance and competitive “zero-shot” generalization therefore need to be balanced effectively. In this work, a Multimodal Independent Prompt CLIP (MIP-CLIP) model is proposed to address this challenge. On the visual side, we propose a novel Video Motion Prompt (VMP) that gives the visual encoder motion perception by modelling short- and long-term motion via a temporal difference operation. Next, a visual classification branch is introduced to improve the discriminability of the visual features. In this way, the temporal difference and visual classification operations of the 1-of-N paradigm are extended to CLIP to meet the need for strong supervised performance. On the text side, we design a Class-Agnostic text prompt Template (CAT), constrained by a Semantic Alignment (SA) module, to solve the label semantic dependency problem. Finally, a Dual-branch Feature Reconstruction (DFR) module, which takes the class confidence of the visual classification branch as input, is proposed to complete cross-modal interactions for better feature matching. Experiments are conducted on four widely used benchmarks (HMDB-51, UCF-101, Jester, and Kinetics-400). The results demonstrate that our method achieves excellent supervised performance while preserving competitive generalizability.

History

Author affiliation

College of Science & Engineering, Computing & Mathematical Sciences

Version

  • AM (Accepted Manuscript)

Published in

IEEE Transactions on Multimedia

Publisher

Institute of Electrical and Electronics Engineers

issn

1520-9210

eissn

1941-0077

Copyright date

2025

Available date

2025-10-30

Language

en

Deposited by

Professor Huiyu Zhou

Deposit date

2025-04-03
