LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts large pre-trained models to specific tasks while using only a small fraction of the computational resources required for full fine-tuning.

### Core Concept

![[Pasted image 20250505162740.png]]

While pre-trained models have large weight matrices, the rank of the update matrices (the changes made during fine-tuning) can be much lower. LoRA exploits this as follows:

1. Freezes the original pre-trained model weights
2. Injects trainable low-rank decomposition matrices into model layers
3. Represents weight updates as the product of these smaller matrices

Mathematically, instead of directly learning a weight update matrix $ΔW$ to add to the pre-trained weight matrix $W$, LoRA approximates this update using two smaller matrices:

$ΔW = BA$

Where:

- A is a matrix of size r × d (r << d)
- B is a matrix of size d × r (r << d)
- r is the rank parameter (typically 4-256)

During inference, the effective weight matrix becomes:

$W' = W + ΔW = W + BA$

LoRA is typically applied to selected weight matrices within transformer architectures:

- Most commonly applied to the attention layers (query, key, value, and output projections)
- Can also be applied to the feed-forward network layers
- A is usually initialized with random Gaussian values and B with zeros, so the update $BA$ is zero at the start and training begins exactly from the pre-trained model

An additional scaling parameter α controls the magnitude of the updates:

$W' = W + α(BA)/r$

A minimal PyTorch implementation of these pieces (the frozen base layer, the A and B matrices, the α/r scaling, and the merge step) is sketched at the end of this note.

## Advantages

1. Parameter Efficiency: Reduces trainable parameters by up to 10,000 times compared to full fine-tuning for LLMs.
2. Memory Efficiency: Reduces GPU memory requirements by approximately 3 times compared to full fine-tuning.
3. No Inference Latency: LoRA matrices can be merged with the original weights after training, resulting in no additional latency during inference.
4. Comparable or Better Performance: Often performs on par with or better than full fine-tuning despite having far fewer trainable parameters.
5. Task Switching: Makes it easier to switch between tasks by swapping only the LoRA weights instead of entire model checkpoints.

## Key Hyperparameters

1. Rank (r): Controls the capacity of the adaptation. Higher ranks provide more flexibility but increase the parameter count. Typical values are 4-256, with diminishing returns beyond 64 or 128. Smaller models can use a lower `r`; larger models generally benefit from a higher one.
2. Alpha (α): Scaling factor for the LoRA updates. A common heuristic is to set alpha to twice the rank value.
3. Target Modules: Which layers of the model to apply LoRA to. Applying LoRA to more layers generally improves performance but increases memory usage.
4. Dropout: Can be applied to the LoRA matrices to prevent overfitting.

In practice these are usually set through a configuration object; see the PEFT sketch at the end of this note.

## Variants

1. QLoRA: Quantizes the base model to 4-bit precision to further reduce memory requirements (see the QLoRA sketch at the end of this note).
2. Layer-wise Optimal Rank Adaptation: Assigns different ranks to different layers based on their importance.
3. AdaLoRA: Adaptively allocates the parameter budget across layers during training.
4. VeRA (Vector-based Random Matrix Adaptation): Uses frozen random low-rank matrices shared across layers and trains only small scaling vectors.

## Links

- [Original Paper: LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- [QLoRA Paper: QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [AdaLoRA Paper: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.10512)
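## Code Sketches

### Minimal LoRA layer (PyTorch)

A minimal sketch of a LoRA-wrapped linear layer, showing the frozen base weights, the Gaussian-initialized A and zero-initialized B, the α/r scaling, and the weight merge for inference. This is illustrative rather than a reference implementation: the class name `LoRALinear`, the 0.01 initialization scale, and the default `r` and `alpha` values are arbitrary choices, not values prescribed by the paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # 1. freeze the pre-trained weights

        d_out, d_in = base.weight.shape
        # 2. inject low-rank matrices: A is r x d_in, B is d_out x r
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # Gaussian init (scale is arbitrary here)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zeros, so BA = 0 at the start
        self.scale = alpha / r                                # the α/r scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 3. weight update as a product of the small matrices: W'x = Wx + (α/r)·BAx
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold BA into the base weights so inference has no extra latency."""
        self.base.weight += self.scale * (self.B @ self.A)
        return self.base


# Usage: wrap a projection layer, train only A and B, merge for deployment.
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16.0)
x = torch.randn(2, 512)
assert torch.allclose(layer(x), layer.base(x))  # B starts at zero, so the output is unchanged
```

Only `A` and `B` receive gradients, which is where the parameter and memory savings come from; after training, `merge()` produces a plain `nn.Linear` with the adapted weights.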
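### Setting the hyperparameters with PEFT

In practice the rank, alpha, target modules, and dropout are usually configured through a library such as Hugging Face PEFT rather than implemented by hand. A hedged sketch follows; the model checkpoint and the specific `r`, `lora_alpha`, `target_modules`, and `lora_dropout` values are placeholder choices, not recommendations.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; swap in whatever checkpoint you are adapting.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # α ≈ 2·r heuristic from above
    target_modules=["q_proj", "v_proj"],  # attention projections are the usual targets
    lora_dropout=0.05,                    # dropout on the LoRA path
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()        # reports how few parameters are actually trained
```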
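### QLoRA: 4-bit base model

A sketch of how the QLoRA variant is typically set up: the frozen base model is loaded in 4-bit precision and LoRA adapters are trained on top of it. This assumes a CUDA environment with `bitsandbytes` installed; the checkpoint name and hyperparameter values are again placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4-bit NF4 to cut memory further.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters themselves stay in higher precision and remain trainable.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```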