CLIP (Contrastive Language-Image Pre-training) is a neural network developed by OpenAI that efficiently learns visual concepts from natural language supervision. It is a dual-encoder model that jointly trains an image encoder and a text encoder to produce similar embeddings for matching image-text pairs. Its strong zero-shot capabilities allow it to perform many tasks without being explicitly trained for them.
### Architecture
* Image Encoder: a Vision Transformer (ViT) or ResNet that takes images and outputs fixed-dimensional image embeddings (e.g., 512 or 768 dimensions)
* Text Encoder: a Transformer that processes text and outputs fixed-dimensional text embeddings; the maximum context length is 77 tokens
* Both encoders project their outputs into a shared embedding space; embeddings are L2-normalized to unit length, and similarity is computed with a dot product (equivalent to cosine similarity for normalized vectors), as in the sketch after this list
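As a concrete illustration, here is a minimal sketch of encoding one image and one caption with the Hugging Face `transformers` CLIP API and comparing them in the shared embedding space; the image path and caption are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path for any local image
inputs = processor(text=["a photo of a cat"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# L2-normalize so that a dot product equals cosine similarity
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T  # shape (1, 1)
print(similarity.item())
```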
### Training
CLIP uses contrastive learning on large-scale image-text pairs:
- A batch contains N image-text pairs
- Both images and texts are encoded into the shared embedding space
- Compute similarity matrix between all N×N possible image-text combinations
- Maximize similarity for correct pairs (diagonal) and minimize it for incorrect pairs (off-diagonal) using a symmetric contrastive loss (see the sketch after this list)
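A minimal sketch of this symmetric contrastive (InfoNCE-style) loss for one batch; it assumes the image and text embeddings are already L2-normalized and uses a fixed temperature instead of CLIP's learnable one.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embs: torch.Tensor,
                          text_embs: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # N x N cosine-similarity matrix (embeddings assumed L2-normalized),
    # scaled by a temperature (learnable in CLIP, fixed here for simplicity)
    logits = image_embs @ text_embs.T / temperature
    targets = torch.arange(len(image_embs), device=logits.device)  # diagonal = correct pairs
    loss_i2t = F.cross_entropy(logits, targets)      # each image vs. all texts
    loss_t2i = F.cross_entropy(logits.T, targets)    # each text vs. all images
    return (loss_i2t + loss_t2i) / 2
```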
### Zero-shot classification
- For a given image classification task, create a descriptive text prompt for each class (see the sketch after this list).
- Pass the input image through CLIP's Image Encoder to get its image embedding.
- Pass each class prompt through CLIP's Text Encoder to get text embeddings for each class.
- Compute the cosine similarity between the image embedding and each class prompt embedding.
- The class corresponding to the prompt with the highest similarity to the image embedding is chosen as the prediction.
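A minimal zero-shot classification sketch with the Hugging Face CLIP model; the class names, prompt template, and image path are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "bird"]                       # illustrative class names
prompts = [f"a photo of a {c}" for c in classes]       # one prompt per class
image = Image.open("animal.jpg")                       # placeholder input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities of the image vs. each prompt
probs = outputs.logits_per_image.softmax(dim=-1)
print(classes[probs.argmax().item()])
```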
### Fine-tuning CLIP
- Full fine-tuning - unfreeze and update all parameters of both encoders. Requires significant computational resources but typically gives the best performance.
- Linear Probing - freeze both encoders and train only a linear classifier on top of the embeddings (sketched after this list). Much cheaper, but usually underperforms full fine-tuning.
- Parameter-Efficient Fine-tuning: [[LoRA]], Adapter layers, prompt/prefix tuning.
- Few-shot learning with task-specific prompts
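Below is a minimal linear-probing sketch: the CLIP encoders stay frozen and only a small linear head is trained on image features. The number of classes and the training-loop wiring are placeholders.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad = False                            # both encoders stay frozen

num_classes = 10                                       # illustrative
head = nn.Linear(clip.config.projection_dim, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def training_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():
        feats = clip.get_image_features(pixel_values=pixel_values)  # frozen features
    logits = head(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```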
### Practical considerations
- Handling text longer than 77 tokens: truncation; chunking and averaging/pooling the chunk embeddings (sketched below); hierarchical encoding (use another model to process the sequence of chunk embeddings); a sliding window with attention; or compressing/summarizing the text first.
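A minimal sketch of the chunk-and-average strategy, assuming the Hugging Face CLIP tokenizer and model; the chunk size and mean pooling are illustrative choices.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def encode_long_text(text: str, chunk_size: int = 75) -> torch.Tensor:
    # Tokenize without special tokens, then split into chunks that fit in 77 tokens
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    embs = []
    for chunk in chunks:
        chunk_text = tokenizer.decode(chunk)
        inputs = tokenizer(chunk_text, return_tensors="pt",
                           truncation=True, max_length=77)
        with torch.no_grad():
            emb = model.get_text_features(**inputs)
        embs.append(emb / emb.norm(dim=-1, keepdim=True))
    pooled = torch.stack(embs).mean(dim=0)             # average the chunk embeddings
    return pooled / pooled.norm(dim=-1, keepdim=True)  # re-normalize the pooled vector
```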
### Links
- [Original CLIP Paper](https://arxiv.org/abs/2103.00020)
- [OpenAI CLIP GitHub Repository](https://github.com/openai/CLIP)
- [Hugging Face CLIP Documentation](https://huggingface.co/docs/transformers/model_doc/clip)