Conference — COLM — 2026

CEKALA: CKA-Guided Layer Selection for Efficient Vision-Language Adaptation

Tasnimul Hossain Tomal, Md Fahim, Mir Sazzat Hossain, Md Farhad Alam Bhuiyan

Submitted to the Conference on Language Modeling (COLM), 2026.

Abstract

Pre-trained vision-language models (VLMs) such as CLIP offer strong generalization but face challenges in few-shot adaptation, particularly in identifying which layers to adapt and how to align cross-modal representations effectively. Existing multimodal adaptation methods uniformly apply adapters across fixed layers, assuming homogeneous layer importance and implicit depth-wise alignment between vision and text encoders. This assumption neglects layer-wise heterogeneity and cross-modal semantic misalignment. To overcome these limitations, we propose Centered Kernel Alignment based Layer Adapter (CEKALA), a representation measurement framework that leverages CKA to guide selective layer adaptation and cross-modal alignment. CEKALA first computes layer-wise CKA scores to quantify each layer's contribution to downstream performance, then uses these scores to identify semantically aligned vision-text layer pairs. Shared cross-modal adapters are injected only into aligned layer pairs, while unpaired layers receive modality-specific adapters, ensuring both semantic consistency and efficient parameter usage. CEKALA enables fine-grained, interpretable, and performance-aware layer selection for vision-language models. Empirical results demonstrate that CEKALA improves few-shot generalization and cross-modal alignment while maintaining strong parameter efficiency.
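
To make the CKA-guided pairing step concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state which CKA variant is used or how aligned pairs are chosen, so the sketch assumes linear CKA and a greedy one-to-one matching. The function names `linear_cka` and `pair_layers`, and the `threshold` parameter, are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    The two matrices must share the number of rows (the same inputs),
    but their feature dimensions may differ.
    """
    # Center each feature column so the score reflects covariance structure.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def pair_layers(vision_feats, text_feats, threshold=0.5):
    """Greedily match vision and text layers by descending CKA score.

    Assumption: each layer joins at most one pair, and a pair is kept only
    if its score reaches `threshold` (a hypothetical cut-off).
    Returns (pairs, scores), where pairs is a list of
    (vision_idx, text_idx) tuples and scores is the full CKA matrix.
    """
    scores = np.array(
        [[linear_cka(v, t) for t in text_feats] for v in vision_feats]
    )
    n_text = scores.shape[1]
    pairs, used_v, used_t = [], set(), set()
    # Visit candidate pairs from the highest CKA score to the lowest.
    for flat_idx in np.argsort(scores, axis=None)[::-1]:
        i, j = divmod(int(flat_idx), n_text)
        if scores[i, j] < threshold:
            break  # all remaining candidates score lower still
        if i in used_v or j in used_t:
            continue
        pairs.append((i, j))
        used_v.add(i)
        used_t.add(j)
    return pairs, scores
```

Under these assumptions, layers that appear in `pairs` would receive a shared cross-modal adapter, and the remaining layers would receive modality-specific adapters, mirroring the split described in the abstract.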

Cite

@inproceedings{tomal-etal-2026-cekala,
    title = {CEKALA: CKA-Guided Layer Selection for Efficient Vision-Language Adaptation},
    author = {Tasnimul Hossain Tomal and Md Fahim and Mir Sazzat Hossain and Md Farhad Alam Bhuiyan},
    booktitle = {Conference on Language Modeling (COLM)},
    year = {2026},
    note = {Submitted}
}