CLIP (Contrastive Language-Image Pre-training) is a neural network developed by OpenAI that efficiently learns visual concepts from natural language supervision. It is a dual-encoder model that jointly trains an image encoder and a text encoder to produce similar embeddings for matching image-text pairs. Its strong zero-shot capabilities allow it to handle many tasks without being explicitly trained for them.

Architecture

  • Image Encoder: a Vision Transformer (ViT) or ResNet that takes an image and outputs a fixed-dimensional image embedding (e.g., 512 or 768 dimensions, depending on the variant)
  • Text Encoder: a Transformer that processes text and outputs a fixed-dimensional text embedding. The maximum context length is 77 tokens
  • Both encoders project their outputs into a shared embedding space. Embeddings are L2-normalized to unit length, and similarity is computed as the dot product (equivalent to cosine similarity for normalized vectors)
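The normalize-then-dot-product step can be sketched as follows (a minimal numpy sketch with random stand-in embeddings, since the real encoders require model weights):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row vector to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(4, 512)))  # mock image embeddings
text_emb = l2_normalize(rng.normal(size=(4, 512)))   # mock text embeddings

# For unit vectors the dot product equals cosine similarity,
# so all pairwise similarities are one matrix multiplication.
sim = image_emb @ text_emb.T  # shape (4, 4), values in [-1, 1]
```

Because the vectors are unit-length, no separate cosine computation is needed; the matrix product already gives cosine similarities.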

Training

CLIP uses contrastive learning on large-scale image-text pairs:

  • A batch contains N image-text pairs
  • Both images and texts are encoded into the shared embedding space
  • Compute similarity matrix between all N×N possible image-text combinations
  • Maximize similarity for correct pairs (diagonal) and minimize for incorrect pairs (off-diagonal) using contrastive loss
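The training objective above can be sketched as a symmetric cross-entropy over the N×N similarity matrix (a minimal numpy sketch; the function name and the fixed temperature of 0.07 are illustrative choices, not CLIP's exact implementation):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss.
    Embeddings are assumed L2-normalized; row i of each matrix is a
    matching image-text pair, so the diagonal holds the correct pairs."""
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]
    diag = np.arange(n)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()  # target = diagonal entries

    # Average the image->text and text->image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Minimizing this loss pushes diagonal (matching) similarities up and off-diagonal (mismatched) similarities down in both directions at once.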

Zero-shot classification

  • For a given image classification task, create a descriptive text prompt for each class (e.g., “a photo of a {class}”).
  • Pass the input image through CLIP’s Image Encoder to get its image embedding.
  • Pass each class prompt through CLIP’s Text Encoder to get text embeddings for each class.
  • Compute the cosine similarity between the image embedding and each class prompt embedding.
  • The class corresponding to the prompt with the highest similarity to the image embedding is chosen as the prediction.
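The steps above reduce to an argmax over similarities. A minimal sketch with mock embeddings standing in for the encoder outputs (real usage would call CLIP's image and text encoders instead):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Return the class whose prompt embedding is most similar to the image.
    All embeddings are assumed L2-normalized, so dot product = cosine sim."""
    sims = class_text_embs @ image_emb  # one similarity score per class
    return class_names[int(np.argmax(sims))]

# Mock embeddings: the image embedding is constructed to lie close to the
# "cat" prompt embedding, simulating a cat photo.
class_names = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
rng = np.random.default_rng(0)
class_text_embs = l2_normalize(rng.normal(size=(3, 128)))
image_emb = l2_normalize(class_text_embs[0] + 0.1 * rng.normal(size=128))

prediction = zero_shot_classify(image_emb, class_text_embs, class_names)
```

Note that no classifier is trained: the class prompts themselves act as the classification weights.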

Fine-tuning CLIP

  • Full fine-tuning - unfreeze and update all parameters of both encoders. Requires significant computational resources but typically provides the best performance.
  • Linear probing - freeze both encoders and train only a linear classifier on top of the frozen embeddings. Works well, though usually worse than full fine-tuning.
  • Parameter-efficient fine-tuning: LoRA, adapter layers, prompt/prefix tuning.
  • Few-shot learning with task-specific prompts
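The linear-probing option can be sketched as multinomial logistic regression on frozen embeddings (a minimal gradient-descent sketch on mock features; in practice one would typically use an off-the-shelf solver such as scikit-learn's LogisticRegression):

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=300):
    """Train a softmax classifier on frozen (precomputed) embeddings.
    The encoders are never updated; only W and b are learned."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = probs - onehot            # dL/dlogits for softmax + CE
        W -= lr * features.T @ grad / n
        b -= lr * grad.mean(axis=0)
    return W, b

def predict(features, W, b):
    return (features @ W + b).argmax(axis=1)
```

Because only a d×num_classes weight matrix is trained, this is cheap even when the backbone is large.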

Practical considerations

  • Handling text with more than 77 tokens:
      • Truncation
      • Chunking, then averaging/pooling the chunk embeddings
      • Hierarchical encoding (use a second model to process the sequence of chunk embeddings)
      • Sliding window with attention
      • Compressing/summarizing the text before encoding
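The chunk-and-pool strategy can be sketched as follows (a minimal sketch; `encode_chunk` is a hypothetical stand-in for CLIP's text encoder, and mean pooling is just one of the possible pooling choices):

```python
import numpy as np

MAX_TOKENS = 77  # CLIP's text encoder context length

def encode_long_text(token_ids, encode_chunk, max_len=MAX_TOKENS):
    """Split a long token sequence into chunks that fit the context
    window, encode each chunk, mean-pool the chunk embeddings, and
    re-normalize so the result is still comparable by dot product."""
    chunks = [token_ids[i:i + max_len]
              for i in range(0, len(token_ids), max_len)]
    embs = np.stack([encode_chunk(chunk) for chunk in chunks])
    pooled = embs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)
```

Re-normalizing after pooling matters: the mean of unit vectors is generally shorter than unit length, which would otherwise skew similarity scores.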