Two-tower

Two-tower architecture is a neural network approach used in recommendation systems for candidate generation and retrieval. It consists of two separate neural networks (towers) that independently encode users and items into dense embeddings in a shared vector space, where similarity between embeddings indicates relevance.

The core idea is to learn separate representations for users and items that can be efficiently compared using vector similarity (typically cosine similarity or dot product). This separation allows for pre-computation of item embeddings and efficient real-time retrieval using approximate nearest neighbor search.

Architecture and Variations

Basic Two-Tower Architecture

flowchart TB
    subgraph User Tower
        UF[User Features] --> UE1[Dense Layer]
        UE1 --> UE2[Dense Layer]
        UE2 --> UEmb[User Embedding]
    end
    
    subgraph Item Tower
        IF[Item Features] --> IE1[Dense Layer]
        IE1 --> IE2[Dense Layer]
        IE2 --> IEmb[Item Embedding]
    end
    
    UEmb --> Sim[Similarity Score]
    IEmb --> Sim
    Sim --> Loss[Contrastive Loss]

Both towers typically consist of:

Input Layer: Feature concatenation or embedding lookup
Hidden Layers: Multiple fully connected layers with activation functions (ReLU, GELU)
Output Layer: Dense embedding of fixed dimension (64-512 dimensions)
Normalization: L2 normalization for cosine similarity
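
As a rough illustration of the components listed above, the following NumPy sketch runs a forward pass through two small towers and scores every user against every item. The layer sizes, the random (untrained) weights, and the helper names `init_tower` and `tower_forward` are assumptions made for this example, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tower(in_dim, hidden_dim, out_dim):
    """Randomly initialise a small MLP tower (untrained, for illustration only)."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def tower_forward(params, x):
    """Hidden layer with ReLU, linear output layer, then L2 normalisation."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)
    emb = h @ params["W2"] + params["b2"]
    norm = np.linalg.norm(emb, axis=-1, keepdims=True)
    return emb / np.maximum(norm, 1e-12)

# The towers may have different input sizes but must share the output dimension.
user_tower = init_tower(in_dim=10, hidden_dim=32, out_dim=64)
item_tower = init_tower(in_dim=20, hidden_dim=32, out_dim=64)

user_features = rng.normal(size=(4, 10))  # 4 users
item_features = rng.normal(size=(6, 20))  # 6 items

user_emb = tower_forward(user_tower, user_features)
item_emb = tower_forward(item_tower, item_features)

# With L2-normalised embeddings the dot product equals cosine similarity.
scores = user_emb @ item_emb.T  # shape (4, 6)
```

Because each tower only sees its own features, `item_emb` can be computed once and reused for every user.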

Architecture Variations

Asymmetric Towers: different architectures for the user and item towers - simpler for users (focus on behavior), more complex for items (processing rich features)
Hierarchical Towers: multi-level embeddings (category, subcategory, item)
Attention-Enhanced Towers: self-attention within towers for sequential data or cross-attention between user history and item features (rarely used)

Input Data and Preparation

User Tower Input

Static Features: demographics and profile information
Behavioral Features: historical interactions, aggregated statistics, temporal patterns
Sequential Features: recent interaction history, session-level interactions, time-based features (recency, seasonality)
Contextual Features: current session context, search queries, current browsing context

Item Tower Input

Content Features: text descriptions, titles, keywords, image embeddings
Metadata, popularity metrics, creator/source metrics

Note that neither tower can take user-item interaction (cross) features as input, because the towers must remain independent of each other.

Data Preprocessing

Feature Engineering: text and image embedding extraction, embedding layers for high-cardinality features, numerical normalization for continuous features
Sequence Processing: user history aggregation (average pooling, attention-weighted pooling, RNN encoding), truncation/padding (fixed-length sequences for batch processing), temporal encoding (position embeddings, time decay factors)
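
The sequence-processing steps can be sketched in NumPy as follows. This is a minimal example under stated assumptions: item id 0 is reserved for padding, histories are ordered oldest to newest, and the half-life form of the time-decay weight is just one possible choice. The helper names are hypothetical.

```python
import numpy as np

def pad_or_truncate(history, max_len, pad_id=0):
    """Keep the most recent max_len item ids and right-pad with pad_id."""
    recent = history[-max_len:]
    return recent + [pad_id] * (max_len - len(recent))

def masked_mean_pool(item_emb, ids, pad_id=0):
    """Average the embeddings of the non-padding ids in each row."""
    mask = (ids != pad_id).astype(float)[..., None]   # (B, T, 1)
    summed = (item_emb[ids] * mask).sum(axis=1)       # (B, D)
    counts = np.maximum(mask.sum(axis=1), 1.0)        # avoid division by zero
    return summed / counts

def time_decay_pool(item_emb, ids, ages_days, half_life_days=7.0, pad_id=0):
    """Weighted average; an interaction's weight halves every half_life_days."""
    weights = 0.5 ** (ages_days / half_life_days) * (ids != pad_id)
    weights = weights[..., None]
    total = np.maximum(weights.sum(axis=1), 1e-12)
    return (item_emb[ids] * weights).sum(axis=1) / total

# Toy embedding table: 6 rows of dimension 2, row 0 is the padding row.
item_emb = np.arange(12, dtype=float).reshape(6, 2)
histories = [[1, 2, 3, 4, 5], [2]]
ids = np.array([pad_or_truncate(h, max_len=3) for h in histories])
ages = np.array([[14.0, 7.0, 0.0], [0.0, 0.0, 0.0]])

pooled = masked_mean_pool(item_emb, ids)
decayed = time_decay_pool(item_emb, ids, ages)
```

Attention-weighted pooling and RNN encoding follow the same masking pattern but replace the fixed weights with learned ones.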

Training

Contrastive Learning

$$L = -\log \frac{\exp(u \cdot i^{+} / \tau)}{\exp(u \cdot i^{+} / \tau) + \sum_{j} \exp(u \cdot i_{j}^{-} / \tau)}$$

Where:

$u$ is the user embedding
$i^{+}$ is the positive item embedding
$i_{j}^{-}$ are negative item embeddings
$\tau$ is the temperature parameter

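A hinge-like (margin) loss can be used instead: $\text{Loss} = \max(0, \text{margin} - u \cdot v_{\text{positive}} + u \cdot v_{\text{negative}})$, where $v_{\text{positive}}$ and $v_{\text{negative}}$ are the positive and negative item embeddings.
Contrastive learning produces high-quality embeddings but is computationally expensive.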
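
To make the two losses in this section concrete, here is a small NumPy sketch for a single user with explicitly supplied negatives. The function names, the default temperature, and the default margin are assumptions chosen for the example.

```python
import numpy as np

def contrastive_loss(u, pos, negs, tau=0.1):
    """Negative log softmax probability of the positive among positive + negatives."""
    logits = np.concatenate([[u @ pos], negs @ u]) / tau
    logits -= logits.max()  # subtract the max for numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

def hinge_loss(u, pos, neg, margin=0.2):
    """Zero once the positive outscores the negative by at least the margin."""
    return max(0.0, margin - u @ pos + u @ neg)

u = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
negs = np.array([[0.0, 1.0], [-1.0, 0.0]])

loss = contrastive_loss(u, pos, negs, tau=1.0)
easy = hinge_loss(u, pos, negs[0])                # negative is far away
hard = hinge_loss(u, pos, np.array([0.9, 0.0]))   # negative is close to the user
```

In this toy case the easy negative contributes no hinge loss, while the hard negative still violates the margin and produces a positive loss.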

Binary Cross-Entropy Loss (Pointwise)

Treat each (user, item) pair as a binary classification problem (interacted vs. not interacted). Requires explicit positive and negative labels.
$\text{Loss} = -[y \log \sigma(u \cdot v) + (1 - y) \log(1 - \sigma(u \cdot v))]$, where $y = 1$ for a positive pair, $y = 0$ for a negative pair, and $\sigma$ is the sigmoid function.
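
A minimal NumPy sketch of the pointwise loss follows. Averaging over the batch and clipping the probabilities with `eps` are assumptions added for the example; the formula above is stated per pair.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(user_emb, item_emb, labels, eps=1e-12):
    """Mean binary cross-entropy over a batch of (user, item, label) rows."""
    logits = np.sum(user_emb * item_emb, axis=1)     # row-wise dot product u . v
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)     # guard against log(0)
    per_pair = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(np.mean(per_pair))

user_emb = np.array([[1.0, 0.0], [1.0, 0.0]])
item_emb = np.array([[2.0, 0.0], [-2.0, 0.0]])
labels = np.array([1.0, 0.0])   # first pair interacted, second did not

loss = bce_loss(user_emb, item_emb, labels)
```

Both pairs here are classified correctly with the same confidence, so each contributes the same small loss.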

In-batch Softmax Loss (Sampled Softmax / Listwise)

For each (user, positive item) pair in a batch, treat all other items in that same batch (or a separate batch of negative samples) as negatives. Calculate the dot product of the user embedding with the positive item embedding and all negative item embeddings. Apply a softmax over these scores. The loss is then the negative log probability of the positive item.
$\text{Loss} = -\log \frac{\exp(u \cdot v_{\text{positive}} / \tau)}{\exp(u \cdot v_{\text{positive}} / \tau) + \sum_{j} \exp(u \cdot v_{\text{negative}, j} / \tau)}$
Very computationally efficient, but the result depends on which negatives happen to be in the batch, and it requires the logQ correction because popular items appear as in-batch negatives more often than rare ones.
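
The in-batch variant reduces to a single matrix product, which is where its efficiency comes from. The sketch below assumes that row k of `user_emb` and row k of `item_emb` form a positive pair, so the positives sit on the diagonal of the score matrix; the function name and default temperature are illustrative.

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb, tau=0.05):
    """Softmax cross-entropy where every other item in the batch is a negative."""
    logits = user_emb @ item_emb.T / tau                      # (N, N), diagonal = positives
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: three users whose embeddings match their positive items exactly.
emb = np.eye(3)
aligned = in_batch_softmax_loss(emb, emb, tau=1.0)

# Shifting the items by one position pairs every user with the wrong item.
misaligned = in_batch_softmax_loss(emb, np.roll(emb, 1, axis=0), tau=1.0)
```

The aligned batch yields the lower loss, as expected when each user is closest to its own positive item.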

Sampling Strategies

Positive Sampling:

Explicit Feedback: Direct interactions (clicks, purchases, likes)
Implicit Feedback: Views, time spent, completion rate
Temporal Weighting: Recent interactions weighted higher

Negative Sampling:

Random Sampling: Uniformly sample from all items
Popularity-based Sampling: Sample popular items more frequently
Hard Negative Mining: Sample items similar to positive items but not interacted with
In-batch Sampling: Use other items in the same batch as negatives

In-batch Negative Sampling: For a batch of N user-item pairs, each user’s positive item serves as a negative for every other user in the batch. This yields N×(N-1) negative pairs per batch.
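
The negative sampling strategies above might be sketched as follows. All helper names are hypothetical, and the exponent `power=0.75` used to flatten the popularity distribution is an assumption borrowed from common practice rather than something the text prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_negatives(num_items, k, exclude):
    """Sample k distinct item ids uniformly, skipping the user's positives."""
    candidates = np.setdiff1d(np.arange(num_items), list(exclude))
    return rng.choice(candidates, size=k, replace=False)

def popularity_negatives(item_counts, k, exclude, power=0.75):
    """Sample k distinct item ids with probability proportional to count ** power."""
    probs = np.asarray(item_counts, dtype=float) ** power
    if exclude:
        probs[list(exclude)] = 0.0
    probs /= probs.sum()
    return rng.choice(len(probs), size=k, replace=False, p=probs)

def hard_negatives(item_emb, pos_id, k, exclude):
    """Items most similar to the positive item that the user has not interacted with."""
    scores = item_emb @ item_emb[pos_id]
    scores[list(exclude | {pos_id})] = -np.inf
    return np.argsort(-scores)[:k]

def in_batch_negatives(batch_item_ids, row):
    """Every other positive item in the batch acts as a negative for this row."""
    return np.delete(batch_item_ids, row)

uniform = uniform_negatives(num_items=10, k=4, exclude={2, 5})
popular = popularity_negatives([100, 50, 10, 5, 1, 1, 1, 1, 1, 1], k=3, exclude={0})

batch = np.array([7, 3, 9, 1])
row_negs = in_batch_negatives(batch, row=1)
total_pairs = sum(len(in_batch_negatives(batch, r)) for r in range(len(batch)))
```

With N = 4 pairs in the batch, `total_pairs` comes out as 4 × 3 = 12, matching the N×(N-1) count.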

logQ Correction
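
As commonly formulated, the logQ correction subtracts the log of each item's sampling probability from its logit before the softmax: $s^{c}(u, i) = s(u, i) - \log Q(i)$, where $Q(i)$ is the probability that item $i$ appears as a negative. With in-batch negatives, $Q(i)$ is roughly proportional to the item's frequency in the training data, so popular items would otherwise be over-penalized. The sketch below is one way to apply it under these assumptions: $Q$ is estimated from raw interaction counts and the correction is applied to every column of the score matrix, including the positives. The function names are hypothetical.

```python
import numpy as np

def estimate_log_q(item_counts):
    """Estimate log Q(i) from how often each batch item occurs in the training data."""
    counts = np.asarray(item_counts, dtype=float)
    return np.log(counts / counts.sum())

def corrected_in_batch_loss(user_emb, item_emb, log_q, tau=0.05):
    """In-batch softmax loss with log Q(i) subtracted from each item's logit."""
    logits = user_emb @ item_emb.T / tau - log_q[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(4, 8))
item_emb = rng.normal(size=(4, 8))

plain = corrected_in_batch_loss(user_emb, item_emb, np.zeros(4), tau=1.0)
uniform = corrected_in_batch_loss(user_emb, item_emb, estimate_log_q([1, 1, 1, 1]), tau=1.0)
skewed = corrected_in_batch_loss(user_emb, item_emb, estimate_log_q([97, 1, 1, 1]), tau=1.0)
```

When all items are equally likely to be sampled, the correction shifts every logit by the same constant and leaves the loss unchanged; it only matters when the sampling distribution is skewed.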

Inference and Serving

Two-Phase Serving

Offline Phase:

Item Embedding Computation: Pre-compute embeddings for all items
Index Building: Build approximate nearest neighbor (ANN) index (FAISS, Annoy, ScaNN)
Caching: Store embeddings in fast retrieval systems

Online Phase:

User Embedding: Compute user embedding using user tower
Candidate Retrieval: Query ANN index to find top-k similar items
Post-processing: Apply business rules, diversity filters
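
The two phases can be illustrated with a small NumPy sketch. It uses exact brute-force search as a stand-in for an ANN index, since the goal is only to show the flow; in a real system the `index @ u` step would be replaced by a query against FAISS, Annoy, ScaNN, or similar. The function names and the `blocked_ids` filter are assumptions for the example.

```python
import numpy as np

def build_index(item_emb):
    """Offline: L2-normalise and store item embeddings (stand-in for an ANN index)."""
    norms = np.linalg.norm(item_emb, axis=1, keepdims=True)
    return item_emb / np.maximum(norms, 1e-12)

def retrieve_top_k(index, user_emb, k, blocked_ids=()):
    """Online: exact top-k by cosine similarity, then a simple business-rule filter."""
    u = user_emb / max(np.linalg.norm(user_emb), 1e-12)
    scores = index @ u
    if blocked_ids:
        scores[list(blocked_ids)] = -np.inf      # e.g. already purchased items
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k
    return top[np.argsort(-scores[top])]         # sort the k results by score

# Offline phase: item embeddings would come from the item tower.
item_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
index = build_index(item_emb)

# Online phase: the user embedding would come from the user tower at request time.
user_emb = np.array([1.0, 0.0])
top2 = retrieve_top_k(index, user_emb, k=2)
top2_filtered = retrieve_top_k(index, user_emb, k=2, blocked_ids={0})
```

Only the user tower runs at request time; the item side is a lookup against the precomputed index.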

Practical Considerations

Use dropout and batch normalization for regularization
Experiment with different embedding dimensions (64, 128, 256, 512)
Use mixed negative sampling strategies
Apply logQ correction when using biased sampling
Pre-compute and cache item embeddings

DSWoK — Data Science Well of Knowledge
