Adapters in 2D and 3D Vision Foundation Models

Introduction

Foundation models are large, pre-trained models designed to capture general representations from vast datasets, which can then be adapted to specific downstream tasks. In 2D vision, examples include Vision Transformers (ViTs) like CLIP or DINO, while in 3D vision, models like PointNet handle point clouds for tasks such as shape recognition. Fully retraining these models for new tasks is computationally expensive due to their size and complexity. Adapters offer an efficient solution by introducing a small number of trainable parameters to adapt the model, while keeping most of the pre-trained parameters frozen.

This document covers:

  1. Basics of LoRA (Low-Rank Adaptation)
  2. Other Adapter Types
  3. Implementation Details
  4. Pros and Cons of Adapters
  5. Possible Optimizations

1. Basics of LoRA (Low-Rank Adaptation)

What is LoRA?

LoRA, or Low-Rank Adaptation, is a technique that adapts pre-trained models by adding low-rank updates to specific weight matrices, typically in the attention layers of transformer-based models. Instead of updating the entire weight matrix W during fine-tuning, LoRA approximates the weight update as the product of two low-rank matrices A and B, so that W' = W + AB. Here, A has shape d × r and B has shape r × d, with r (the rank) being much smaller than d, the original dimension of W. During training, only A and B are updated, while W remains frozen.

In vision transformers, LoRA is commonly applied to the query (W_q) and value (W_v) projection matrices in the self-attention mechanism:

  • Query: W_q' = W_q + A_q B_q
  • Value: W_v' = W_v + A_v B_v

This approach reduces the number of trainable parameters significantly. For a d × d matrix, full fine-tuning requires d² parameters, whereas LoRA requires only 2dr, where r << d.
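To make the savings concrete, here is a quick back-of-the-envelope check in Python for a single 768 × 768 projection (a typical ViT-Base dimension, used here purely for illustration):

d, r = 768, 8
full_finetune_params = d * d               # 589,824 parameters if the full matrix is trained
lora_params = 2 * d * r                    # 12,288 parameters for A (768x8) and B (8x768)
print(full_finetune_params / lora_params)  # 48.0, i.e. ~48x fewer trainable parameters for this layer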

Why Use LoRA?

LoRA adapts large foundation models efficiently and applies to both 2D vision (e.g., the attention layers of ViTs) and 3D vision (e.g., the MLPs in PointNet), since both architectures are dominated by the linear projections that LoRA attaches to.

  • Research Insight: The paper "LoRA: Low-Rank Adaptation of Large Language Models" demonstrates that LoRA can reduce trainable parameters by up to 10,000 times compared to full fine-tuning, with comparable or better performance.
  • Implementation Example: For vision models, repositories like "LoRA-ViT" show its application, achieving 1.8x to 1.9x faster training on M1 Pro hardware.

2. Other Adapter Types

Beyond LoRA, several adapter types are used to adapt 2D and 3D vision foundation models, offering flexibility based on task and architecture.

Adapter Modules

Adapter modules are small neural networks inserted between the layers of a pre-trained model. In transformers, they are typically placed after the attention or feed-forward layers. A common design includes:

  • A down-projection (reducing the dimension from d to a smaller bottleneck dimension d_adapter),
  • A non-linearity (e.g., ReLU),
  • An up-projection (returning to dimension d).

The output is combined with the input via a residual connection: output = x + up(ReLU(down(x))).

  • 2D Vision Example: The "ViT-Adapter" paper proposes a composition of three modules—spatial prior module, spatial feature injector, and multi-scale feature extractor—enhancing ViT for dense prediction tasks like segmentation and detection. The GitHub repository provides implementations showing competitive performance.
  • 3D Vision Example: Adapter modules can be added after MLPs in models like PointNet or operate on vertex/edge features in mesh processing models like MeshCNN.

Prompt-based Adapters

Prompt tuning involves adding learnable tokens or "prompts" to the input sequence of the model. In 2D vision transformers, this might mean appending extra patch embeddings to the sequence of image patches.

  • 2D Vision Example: The "Visual Prompt Tuning" paper introduces VPT, adding less than 1% of model parameters in the input space while freezing the backbone, achieving significant performance gains over full fine-tuning, especially in low-data regimes.
  • 3D Vision Example: Prompts can be added as extra point features in point cloud inputs, as seen in the "Point-PEFT" framework.

Adapters in 3D Vision

For 3D vision models, specialized adapters handle point clouds or meshes.

  • Point-PEFT: The "Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models" paper introduces a PEFT method for point clouds, including:
    • Point-prior Prompt: Learnable prompt tokens enhanced by a memory bank with domain-specific knowledge.
    • Geometry-aware Adapter: Inserted after the self-attention and FFN blocks, aggregating local geometric information via farthest point sampling (FPS), k-NN grouping, pooling, and feature propagation (a rough sketch follows after this list).
    This framework achieves high accuracy on benchmarks like ModelNet40 (94.2% with 0.8M parameters) and ScanObjectNN (89.1% with 0.7M parameters), outperforming other PEFT methods with fewer parameters.
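
To make the geometry-aware design more concrete, the following PyTorch sketch combines a bottleneck MLP with FPS, k-NN grouping, max pooling, and nearest-center propagation. It is not the Point-PEFT implementation: the names (GeometryAwareAdapter, farthest_point_sample), the default hyperparameters, and the exact aggregation scheme are assumptions made for illustration, so consult the official repository for the real code.

import torch
import torch.nn as nn

def farthest_point_sample(xyz, n_sample):
    # Greedy FPS over coordinates xyz of shape (B, N, 3); returns indices of shape (B, n_sample).
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_sample, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    batch = torch.arange(B, device=xyz.device)
    for i in range(n_sample):
        idx[:, i] = farthest
        centroid = xyz[batch, farthest].unsqueeze(1)                # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))
        farthest = dist.argmax(-1)
    return idx

class GeometryAwareAdapter(nn.Module):
    # Hypothetical sketch: bottleneck MLP plus local aggregation over k-NN groups.
    def __init__(self, d, d_adapter=16, n_groups=32, k=16):
        super().__init__()
        self.n_groups, self.k = n_groups, k
        self.down = nn.Linear(d, d_adapter)
        self.up = nn.Linear(d_adapter, d)
        self.act = nn.GELU()

    def forward(self, tokens, xyz):
        # tokens: (B, N, d) point token features; xyz: (B, N, 3) point coordinates.
        h = self.act(self.down(tokens))                             # (B, N, d_adapter)
        batch = torch.arange(xyz.size(0), device=xyz.device).unsqueeze(1)  # (B, 1)
        centers = farthest_point_sample(xyz, self.n_groups)         # (B, n_groups)
        center_xyz = xyz[batch, centers]                            # (B, n_groups, 3)
        # k-NN grouping: each FPS center gathers its k nearest points, then max-pools them.
        knn = torch.cdist(center_xyz, xyz).topk(self.k, largest=False).indices
        grouped = h[batch.unsqueeze(-1), knn]                       # (B, n_groups, k, d_adapter)
        pooled = grouped.max(dim=2).values                          # (B, n_groups, d_adapter)
        # Propagation: every point copies back the feature of its nearest center.
        nearest = torch.cdist(xyz, center_xyz).argmin(-1)           # (B, N)
        propagated = pooled[batch, nearest]                         # (B, N, d_adapter)
        return tokens + self.up(h + propagated)                     # residual connection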

3. Implementation Details

Implementing adapters requires slight modifications to the foundation model's architecture. Below are examples for each type, with a focus on 2D and 3D vision contexts.

LoRA Implementation

In a 2D Vision Transformer, LoRA modifies the linear layers in the attention mechanism. Here's a simplified PyTorch example:

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank):
        super().__init__()
        self.original_layer = original_layer
        # Freeze the pre-trained weights; only A and B are trained.
        for p in self.original_layer.parameters():
            p.requires_grad = False
        # A gets small random values and B is zero-initialized, so the update AB
        # starts at zero and the adapted model initially matches the pre-trained one.
        self.A = nn.Parameter(torch.randn(original_layer.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, original_layer.out_features))
    
    def forward(self, x):
        original_output = self.original_layer(x)  # frozen path: x W
        lora_update = x @ self.A @ self.B         # low-rank path: x A B
        return original_output + lora_update

# Usage in a ViT
original_layer = nn.Linear(768, 768)  # Example dimension from ViT
lora_layer = LoRALayer(original_layer, rank=8)
  • 3D Vision Note: For models like PointNet, similar modifications can be applied to linear layers.
  • Practical Guide: The Hugging Face PEFT documentation provides a guide for LoRA in image classification using ViT, with settings like r=16, lora_alpha=16, and target modules ["query", "value"], reducing trainable parameters to 0.77% of the original model (a configuration sketch follows below).
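
For reference, the snippet below sketches that configuration with the transformers and peft libraries. The checkpoint name ("google/vit-base-patch16-224-in21k") and the number of labels are assumptions made for illustration; the guide itself contains the complete training recipe.

from transformers import AutoModelForImageClassification
from peft import LoraConfig, get_peft_model

# Load a pre-trained ViT backbone (checkpoint name chosen for illustration).
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10,
    ignore_mismatched_sizes=True,  # the new classification head differs from the checkpoint's
)

# LoRA settings mirroring the Hugging Face PEFT image-classification guide.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],  # keep the classification head trainable alongside LoRA
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports roughly 0.77% of parameters as trainable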

Adapter Module Implementation

For adapter modules inserted after a transformer layer:

class Adapter(nn.Module):
    def __init__(self, d, d_adapter):
        super().__init__()
        self.down = nn.Linear(d, d_adapter)  # down-projection to the bottleneck dimension
        self.relu = nn.ReLU()
        self.up = nn.Linear(d_adapter, d)    # up-projection back to the model dimension
    
    def forward(self, x):
        # Residual connection: output = x + up(ReLU(down(x)))
        return x + self.up(self.relu(self.down(x)))

# Usage
adapter = Adapter(d=768, d_adapter=64)  # Bottleneck size of 64
  • 3D Vision Note: This can be placed after PointNet's per-point MLPs.
  • Implementation Example: The ViT-Adapter GitHub repository includes Colab notebooks for detection and segmentation.

Prompt-based Adapter Implementation

For prompt tuning in a 2D Vision Transformer:

class PromptedViT(nn.Module):
    def __init__(self, vit_model, num_prompts, d):
        super().__init__()
        self.vit_model = vit_model                                 # frozen pre-trained backbone
        self.prompts = nn.Parameter(torch.randn(num_prompts, d))   # learnable prompt tokens
    
    def forward(self, x):
        # x is assumed to already be a sequence of patch embeddings of shape
        # (batch, num_patches, d); the backbone must accept embeddings rather than raw images.
        batch_size = x.size(0)
        prompts = self.prompts.unsqueeze(0).repeat(batch_size, 1, 1)
        x = torch.cat([prompts, x], dim=1)  # prepend prompts to the patch embeddings
        return self.vit_model(x)

# Usage
prompted_model = PromptedViT(vit_model, num_prompts=10, d=768)
  • 3D Vision Note: In 3D, prompts can be added as extra point features, as in Point-PEFT, where the optimal prompt length is 10.

Point-PEFT for 3D Models

Point-PEFT's implementation includes:

  • Point-prior Prompt MLP inner dimension: 8.
  • Geometry-aware Adapter first MLP inner dimension: 16; last MLP dimensions: (8,16,16) for ScanObjectNN, (16,16,16) for ModelNet40.
  • k-NN settings: N,k as (32,16) for Point-BERT, (32,8) for Point-M2AE, (64,16) for Point-MAE.
  • Code Availability: Point-PEFT GitHub Repository

4. Pros and Cons of Adapters

Adapters offer significant benefits but also come with trade-offs. Here's a detailed breakdown:

  • Parameter Efficiency
    • Pro: Adds only a small fraction of trainable parameters (e.g., 0.77% for LoRA in ViT).
    • Con: May require maintaining multiple adapters for different tasks, increasing management overhead.
  • Computational Efficiency
    • Pro: Reduces memory and compute requirements by freezing most parameters.
    • Con: Performance may not match full fine-tuning, especially for tasks far from the pre-training domain.
  • Modularity
    • Pro: Separate adapters can be trained and swapped for different tasks without altering the core model.
    • Con: Implementation complexity increases, since the architecture must be modified.
  • Reduced Overfitting
    • Pro: Fewer trainable parameters lower the risk of overfitting, especially on small datasets.
    • Con: Hyperparameter tuning (e.g., rank, bottleneck size) adds complexity to the process.
  • Research Insight: Point-PEFT achieves 94.2% accuracy on ModelNet40 with 0.8M parameters versus 93.2% with 22.1M for full fine-tuning, highlighting adapters' potential to outperform full fine-tuning in some cases.

5. Possible Optimizations

To enhance the effectiveness and efficiency of adapters, consider the following optimizations:

  • LoRA Optimizations:
    • Rank Selection: Choose an optimal rank based on task complexity; for example, the Hugging Face guide uses r=16 for ViT.
    • Efficient Computation: Use optimized matrix multiplication libraries to speed up the low-rank product x A B.
  • Adapter Module Optimizations:
    • Bottleneck Size: Minimize the bottleneck dimension d_adapter to reduce parameters while retaining expressiveness.
    • Layer Sharing: Share adapter weights across multiple layers to further cut parameter counts (see the sketch after this list).
  • Prompt-based Optimizations:
    • Initialization: Initialize prompts using statistics from pre-training data to improve convergence.
    • Prompt Count: Experiment with the number of prompts; Point-PEFT finds length 10 optimal.
  • General Optimizations:
    • Mixed Precision Training: Use 16-bit precision to reduce memory usage during adapter training.
    • Gradient Checkpointing: Trade computation for memory by recomputing intermediate activations.
    • Sparse Structures: In 3D vision, leverage sparsity (e.g., in point clouds) to design adapters focusing on local neighborhoods.
  • 3D-Specific Optimizations:
    • Exploit spatial structure in point clouds or meshes by designing adapters that operate on local feature groups, enhancing geometric awareness.
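
As a minimal sketch of the layer-sharing idea above, the snippet below reuses a single Adapter instance (from Section 3) after every transformer block, so its parameters are counted only once. The attribute vit_model.blocks is an assumption about the backbone's structure, made purely for illustration.

# One Adapter instance shared across all blocks, so its parameters are counted once.
shared_adapter = Adapter(d=768, d_adapter=64)

class BlockWithSharedAdapter(nn.Module):
    def __init__(self, block, adapter):
        super().__init__()
        self.block = block      # a frozen pre-trained transformer block
        self.adapter = adapter  # the same Adapter object is passed to every block

    def forward(self, x):
        return self.adapter(self.block(x))

# Hypothetical wrapping of a ViT whose blocks live in vit_model.blocks:
# vit_model.blocks = nn.ModuleList(
#     [BlockWithSharedAdapter(b, shared_adapter) for b in vit_model.blocks]
# )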

Conclusion

This document provides a comprehensive overview of adapters in 2D and 3D vision foundation models. Starting with LoRA, we explored its low-rank approach, then covered adapter modules, prompt-based methods, and 3D-specific adapters like Point-PEFT. Implementation details, pros, cons, and optimizations offer a holistic view, enabling efficient adaptation of large models to new tasks with minimal computational overhead.