KoRA is a novel Parameter-Efficient Fine-Tuning (PEFT) strategy that introduces inter-adapter communication to learn robust, generalizable representations that transfer across domains — addressing a key limitation in methods like LoRA.
- 🎯 The Problem: The Brittleness of Specialization
- 💡 The KoRA Solution: From Specialists to a Coordinated Team
- 🔧 Architecture: How It Works
- 📊 Experimental Results
- 🚀 Getting Started
- 🗺️ Roadmap & Future Work
Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have been revolutionary — allowing us to adapt massive pre-trained models to specific tasks at a fraction of the cost.
However, this efficiency comes with a hidden price: brittleness.
LoRA injects small, independent adapters into each layer.
While highly effective for a single task, these adapters often overfit, learning features that don’t generalize well.
When a LoRA adapter trained on one task is transferred to another (especially across domains), performance drops significantly.
| Method | Source Task (CIFAR-100) | Transfer Task (Tiny ImageNet) |
|---|---|---|
| LoRA | 92.48% (Excellent Specialization) | 71.04% (Poor Generalization) |
| KoRA | 83.96% (Controlled Specialization) | 97.37% (Superior Generalization) |
❓ Can we create an adapter that learns fundamental, transferable knowledge, even if it sacrifices a few points on the source task?
The key limitation of LoRA is isolation — the query, key, and value adapters in a transformer block never communicate.
KoRA changes this paradigm.
Inspired by the Kolmogorov–Arnold Representation Theorem (which states that complex functions can be decomposed into compositions of simpler ones), KoRA introduces a learnable CompositionBlock.
- LoRA: A team of three brilliant specialists (Query, Key, Value) who never talk.
- KoRA: The same team, but they discuss their findings with a manager (the CompositionBlock), who synthesizes their insights into a unified decision.
This CompositionBlock creates functional dependency between adapters, forcing them to learn a shared, compositional representation of the task.
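For reference, the theorem's statement (whose inner ψ and outer Φ functions presumably inspired the "Couplings" and "Composers" in the diagram further below) is:

$$
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \psi_{q,p}(x_p)\right)
$$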
The architecture is a simple but powerful extension of LoRA.
For input x, the standard LoRA adapters first compute low-rank deltas for Query, Key, and Value: δ_q = B_q A_q x, δ_k = B_k A_k x, δ_v = B_v A_v x.
Instead of applying them directly, KoRA concatenates the three deltas and passes them through the CompositionBlock (an MLP): δ_comp = MLP([δ_q; δ_k; δ_v]).
This single composed delta is then applied to the value projection through a learnable gate g: V_new = V_original + g · δ_comp.
The gate g starts at zero, ensuring optimization stability and allowing the model to gradually "turn up the volume" on the compositional signal.
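For concreteness, here is a minimal PyTorch sketch of this mechanism. The class name `KoRAAdapter`, the ReLU inside the CompositionBlock, and the use of `d_comp * r` as the MLP's hidden width are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class KoRAAdapter(nn.Module):
    """Illustrative KoRA adapter for one attention block (not the official code).

    Three low-rank (LoRA-style) branches for Q, K, and V feed a shared
    CompositionBlock MLP; the composed delta is scaled by a learnable gate
    that starts at zero.
    """
    def __init__(self, d_model: int, r: int = 8, d_comp: int = 4):
        super().__init__()
        # One (A, B) low-rank pair per projection, exactly as in LoRA.
        self.A = nn.ModuleDict({k: nn.Linear(d_model, r, bias=False) for k in "qkv"})
        self.B = nn.ModuleDict({k: nn.Linear(r, d_model, bias=False) for k in "qkv"})
        for k in "qkv":
            nn.init.zeros_(self.B[k].weight)       # LoRA convention: B starts at zero
        # CompositionBlock: mixes the three deltas into a single update.
        hidden = d_comp * r                        # assumption about what d_comp controls
        self.compose = nn.Sequential(
            nn.Linear(3 * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d_model),
        )
        self.gate = nn.Parameter(torch.zeros(1))   # g = 0 at initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model), the input captured before the QKV projection.
        deltas = [self.B[k](self.A[k](x)) for k in "qkv"]      # δ_q, δ_k, δ_v
        delta_comp = self.compose(torch.cat(deltas, dim=-1))   # δ_comp
        return self.gate * delta_comp                          # caller adds this to V_original
```

Only the low-rank pairs, the CompositionBlock, and the gate are trained; the backbone stays frozen. The first diagram below traces this data flow inside one adapter module.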
flowchart TB
subgraph KoRA_Adapter_Module
direction TB
IN([Input x]) --> Q_ADAPTER["Q-Adapter (LoRA)"]
IN --> K_ADAPTER["K-Adapter (LoRA)"]
IN --> V_ADAPTER["V-Adapter (LoRA)"]
Q_ADAPTER --> DELTA_Q("δ_q")
K_ADAPTER --> DELTA_K("δ_k")
V_ADAPTER --> DELTA_V("δ_v")
DELTA_Q --> CONCAT["Concat [δ_q, δ_k, δ_v]"]
DELTA_K --> CONCAT
DELTA_V --> CONCAT
CONCAT --> COMPOSE_BLOCK["CompositionBlock (MLP)"]
COMPOSE_BLOCK --> DELTA_COMP("δ_comp")
end
subgraph Applying_Update
direction TB
DELTA_COMP --> GATE_MULT["⊗ (Multiply)"]
GATE["g (Learnable Gate)"] --> GATE_MULT
V_ORIG["V_original"] --> ADD["⊕ (Add)"]
GATE_MULT --> ADD
ADD --> V_NEW["V_new"]
end
IN --> V_ORIG
The second diagram contrasts standard LoRA (a) with the KoRA pipeline (b) at the block level:
flowchart LR
subgraph LORA["a) Low-Rank Adaptation (LoRA)"]
direction TB
IN_L["Input Embeddings (tokens x d)"] --> ORIG_L["Original Weights (Linear Layer)"]
ORIG_L --> OUT_L["Output Embeddings (tokens x d)"]
IN_L --> A_LoRA["Down Projection: A [r x d]"]
A_LoRA --> B_LoRA["Up Projection: B [d x r]"]
B_LoRA --> DELTA_L["Delta (tokens x d)"]
DELTA_L --> SCALE_L["Scale by alpha (α)"]
SCALE_L --> ADD_L["Elementwise Add (⊕)"]
ADD_L --> OUT_L
end
subgraph KORA["b) KoRA (this work)"]
direction TB
IN_K["Input Embeddings (tokens x d)"] --> BACKBONE["ViT Backbone (Frozen)"]
BACKBONE --> OUT_K["Backbone Outputs (tokens x d)"]
BACKBONE -.-> CAPTURE["Captured Inputs (via forward hooks)"]
CAPTURE --> ADAPTERS["Adapter Banks {Ai, Bi} (low-rank)"]
ADAPTERS --> PROJ["Adapter Projections (compose to r–d representations)"]
PROJ --> PSI["Psi (Ψ): Couplings"]
PROJ --> PHI["Phi (Φ): Composers"]
PSI --> COMPOSED["Composed Delta (Delta_compose)"]
PHI --> COMPOSED
COMPOSED --> SCALE_K["Scale by alpha (α)"]
SCALE_K --> ADD_K["Add to Backbone Outputs (⊕)"]
ADD_K --> OUT_K
end
LORA --- KORA
Fine-tuned for 5 epochs on CIFAR-100.
| Tuning Method | Params Tuned (%) | CKA Sim. | Accuracy (%) | F1 Score |
|---|---|---|---|---|
| LoRA (r=8) | 1.45 | 0.73 | 92.48 | 0.924 |
| Adapter Fusion | 1.45 | 0.71 | 92.22 | 0.922 |
| KoRA (d_comp=4) | 1.80 | 0.76 | 83.96 | 0.840 |
| KoRA (d_comp=8) | 2.18 | 0.76 | 84.19 | 0.842 |
🧠 Analysis:
LoRA dominates on single-task performance.
KoRA shows higher CKA similarity, suggesting more structured representations, even though source-task accuracy drops by several points, a deliberate trade-off in favor of generalization.
Models pre-trained on CIFAR-100, fine-tuned for one epoch on 1% of Tiny ImageNet.
| Tuning Method | Accuracy (%) | F1 Score |
|---|---|---|
| LoRA (r=8) | 71.04 | 0.8307 |
| Adapter Fusion | 46.67 | 0.6364 |
| KoRA (d_comp=4) | 97.37 | 0.9867 |
| KoRA (d_comp=8) | 98.24 | 0.9911 |
🚀 Analysis:
KoRA’s compositional approach achieves near-perfect transfer — dramatically outperforming LoRA.
This suggests that KoRA learns more robust, transferable features.
Preliminary CKA (Centered Kernel Alignment) shows that KoRA learns more structured, cross-layer dependencies.
Figure 2: CKA similarity matrices for KoRA (left) and LoRA (right) between blocks 0, 6, and 11.
KoRA maintains higher similarity between distant layers (0.50 vs 0.45), indicating a more unified feature hierarchy.
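The block-to-block similarities in Figure 2 can be reproduced with standard linear CKA (Kornblith et al., 2019). The function below is a generic sketch, not the exact evaluation protocol used for the figure:

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, features)."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = torch.linalg.norm(y.T @ x) ** 2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()

# e.g. compare [CLS]-token activations captured from blocks 0 and 11:
# similarity = linear_cka(acts_block0, acts_block11)
```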
Models pre-trained on CIFAR-100, fine-tuned for one epoch on 1% of NYU Depth V2 (monocular depth estimation).
| Method | RMSE ↓ | AbsRel ↓ |
|---|---|---|
| LoRA | 0.2800 | 0.5629 |
| KoRA | 0.3327 | 0.6271 |
📉 Analysis:
LoRA performs better here — showing KoRA’s current bias towards classification tasks.
Future work will adapt KoRA for dense spatial reasoning.
- Python 3.8+
- PyTorch 2.0+
- Transformers
- timm
git clone https://github.com/onepunchmonk/kora.git
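After cloning, a minimal training setup might look like the sketch below, which reuses the illustrative `KoRAAdapter` from the Architecture section and injects it into timm's fused `qkv` projection via forward hooks (as in the KoRA diagram). Every name here is an assumption for illustration, not the repository's actual API:

```python
import timm
import torch

# KoRAAdapter is the illustrative class sketched in the Architecture section above.

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=100)
for p in model.parameters():
    p.requires_grad = False               # freeze the ViT backbone
for p in model.head.parameters():
    p.requires_grad = True                # keep the classifier head trainable

dim = model.embed_dim
adapters = torch.nn.ModuleList()

def attach_kora(block):
    adapter = KoRAAdapter(dim, r=8, d_comp=4)
    adapters.append(adapter)

    def hook(module, inputs, output):
        # timm's ViT uses one fused qkv Linear; add the gated composed delta
        # to the value slice only, matching V_new = V_original + g * δ_comp.
        delta = adapter(inputs[0])                      # (batch, tokens, dim)
        zeros = torch.zeros_like(delta)
        return output + torch.cat([zeros, zeros, delta], dim=-1)

    block.attn.qkv.register_forward_hook(hook)

for blk in model.blocks:
    attach_kora(blk)

# Only the KoRA parameters and the head are optimized.
optimizer = torch.optim.AdamW(
    [{"params": adapters.parameters()}, {"params": model.head.parameters()}], lr=1e-3
)
```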
The goal is to validate and refine KoRA to establish compositional adaptation as a superior approach for out-of-domain generalization.
- Explore attention-based inter-adapter communication
- Experiment with different basis functions for adapter projections
- Extend evaluation to Object Detection (MS COCO)
- Extend evaluation to Semantic Segmentation (MS COCO & PASCAL VOC)
- Conduct ablation studies on CompositionBlock complexity and gate impact
- Investigate dense prediction performance further
- Perform full-layer CKA analysis
- Study the Kolmogorov complexity connection by comparing parameter compressibility of KoRA vs. LoRA (a rough proxy is sketched below)
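One crude way to probe that last item is to compare how well the trained adapter weights compress. The sketch below (zlib over float16-serialized parameters) is only a rough proxy for compressibility, not an established protocol:

```python
import io
import zlib
import torch

def compression_ratio(module: torch.nn.Module) -> float:
    """Compressed size / raw size of a module's parameters (lower = more compressible)."""
    flat = torch.cat([p.detach().flatten() for p in module.parameters()])
    buf = io.BytesIO()
    torch.save(flat.to(torch.float16), buf)   # float16 trims mantissa noise
    raw = buf.getvalue()
    return len(zlib.compress(raw, level=9)) / len(raw)

# e.g. compare trained adapter banks:
# print(compression_ratio(kora_adapters), compression_ratio(lora_adapters))
```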
🧭 Summary:
KoRA reframes fine-tuning as a cooperative, compositional process rather than a set of isolated perturbations, trading a few points of source-task accuracy for out-of-domain transfer well beyond current PEFT baselines.



