Abstract: Adapting CLIP to multimodal sentiment analysis has largely relied on optimizing textual prompts, yet such single-branch tuning fails to account for the evolving interplay between visual and ...