Q: How many parameters does a convolutional layer have (say, a 3x3 kernel, 3 input channels, and 3 output filters)?
A: (3∗3∗3+1)∗3 = 84: each of the 3 filters has a 3x3x3 kernel plus a single bias term.
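A quick way to verify the count, as a minimal PyTorch sketch (the layer sizes here are the ones assumed in the question):

import torch.nn as nn

# Assumed setup: 3 input channels, 3 output filters, 3x3 kernel
conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3)
num_params = sum(p.numel() for p in conv.parameters())
print(num_params)  # (3*3*3 + 1) * 3 = 84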
Q: We have a multiclass neural net. What should we change in it if we want to make it multilabel?
A: Change the output activation function from softmax to sigmoid and the loss function from categorical cross-entropy to binary cross-entropy.
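A minimal sketch of the change in PyTorch (the feature size and class count are illustrative assumptions):

import torch
import torch.nn as nn

num_classes = 10  # assumed number of classes/labels

# Multiclass: exactly one class per sample
multiclass_head = nn.Linear(512, num_classes)
multiclass_loss = nn.CrossEntropyLoss()      # applies softmax internally

# Multilabel: each class is an independent yes/no decision
multilabel_head = nn.Linear(512, num_classes)
multilabel_loss = nn.BCEWithLogitsLoss()     # applies sigmoid internally

logits = multilabel_head(torch.randn(4, 512))
targets = torch.randint(0, 2, (4, num_classes)).float()  # multi-hot targets
loss = multilabel_loss(logits, targets)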
Q: Here is a pseudocode for a neural net. Explain how it works, point out mistakes or inefficiencies in the architecture:
net = Sequential()
net.add(InputLayer([100, 100, 3]))
net.add(Conv2D(filters=512, kernel_size=(3, 3),
kernel_initializer=init.zeros()))
net.add(Conv2D(filters=128, kernel_size=(1, 1),
kernel_initializer=init.zeros()))
net.add(Activation('relu'))
net.add(Conv2D(filters=32, kernel_size=(3, 3),
kernel_initializer=init.zeros()))
net.add(Conv2D(filters=32, kernel_size=(1, 1),
kernel_initializer=init.zeros()))
net.add(Activation('relu'))
net.add(MaxPool2D(pool_size=(6, 6)))
net.add(Conv2D(filters=8, kernel_size=(10, 10),
kernel_initializer=init.normal()))
net.add(Activation('relu'))
net.add(Conv2D(filters=8, kernel_size=(10, 10),
kernel_initializer=init.normal()))
net.add(Activation('relu'))
net.add(MaxPool2D(pool_size=(3, 3)))
net.add(Flatten()) # convert 3d tensor to a vector of features
net.add(Dense(units=512))
net.add(Activation('softmax'))
net.add(Dropout(rate=0.5))
net.add(Dense(units=512))
net.add(Activation('softmax'))
net.add(Dense(units=10))
net.add(Activation('sigmoid'))
net.add(Dropout(rate=0.5))
A:
Initializing weights to zero is generally a bad practice as it can lead to symmetric weights and prevent the network from learning effectively.
The first layer has 512 filters immediately, which might be too many for an initial layer. The drop from 512 to 128 to 32 filters can be too drastic.
ReLU activations are missing after some convolutional layers, which could limit the network’s ability to learn non-linear patterns.
The first MaxPool2D layer has a large pool size of (6,6), which might reduce spatial information too aggressively.
The 10x10 kernel size in the later convolutional layers is unusually large; with the default valid (no) padding, the second 10x10 convolution does not even fit the roughly 7x7 feature map that remains after pooling.
Using softmax activation for intermediate dense layers is a mistake: softmax normalizes the hidden activations and severely limits what those layers can represent; ReLU (or another standard non-linearity) is the usual choice. The final layer, in turn, uses sigmoid over 10 units, which would normally be softmax for a single-label 10-class problem.
The dropout placement relative to ReLU matters little, but the last Dropout is applied after the output layer, where it serves no purpose and should be removed.
It could be a good idea to have pooling after each convolutional block so the spatial resolution shrinks more gradually.
We could optionally replace the last max pooling with adaptive pooling so that the network can handle images of varying size.
Three dense layers at the end are too many; usually one or two are used. A cleaner version of the architecture is sketched below.
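As a reference point, here is a minimal PyTorch sketch of a more conventional layout for the same task (assuming 100x100 RGB inputs and 10 classes; the exact layer sizes are illustrative, not the only valid choice):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),    # modest filter count first
    nn.MaxPool2d(2),                                           # gentle downsampling
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),   # filters grow with depth
    nn.AdaptiveAvgPool2d(1),                                   # works for varied input sizes
    nn.Flatten(),
    nn.Dropout(0.5),                                           # dropout on hidden features only
    nn.Linear(128, 10),                                        # single classification head
)

logits = net(torch.randn(2, 3, 100, 100))  # raw logits; softmax is applied inside the loss

The default PyTorch initialization (Kaiming uniform) also avoids the zero-initialization problem.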
Q: We have a linear layer with 30 neurons. How can we get/hack the weights if we don’t have a direct access to them?
A: When we input an identity matrix to a linear layer, each row activates exactly one input, so the output is the (transposed) weight matrix with the bias added to every row; subtracting the bias recovers the weights.
import torch

# linear_layer is the layer we probe; we assume we can call it but not read its parameters
num_inputs = 30

# To get the biases, pass a zero vector: the output is just the bias
zero_input = torch.zeros((1, num_inputs))
bias_output = linear_layer(zero_input)

# Create an identity matrix and pass it through the layer
identity = torch.eye(num_inputs)
output = linear_layer(identity)  # each row is one column of the weight matrix plus the bias

weights = output - bias_output  # subtract the broadcast bias to recover the (transposed) weights
Q: The same with 3x3x3 convolution.
A: For each position covered by the kernel and each input channel, create an input where only that single pixel is 1 and the rest are 0; the top-left output value, minus the bias, is then the kernel weight at that position. This is pseudocode:
def extract_cnn_weights(cnn_layer, input_shape=(32, 32, 3), kernel_size=(3, 3), num_filters=3):
    height, width, in_channels = input_shape
    k_h, k_w = kernel_size
    # The bias is added to every output value, so extract it first and subtract it below
    biases = extract_cnn_biases(cnn_layer, input_shape)
    weights = []
    # Only the positions covered by the kernel at the top-left output location matter
    for h in range(k_h):
        for w in range(k_w):
            for c in range(in_channels):
                input_tensor = np.zeros((1, height, width, in_channels))
                input_tensor[0, h, w, c] = 1
                output = cnn_layer(input_tensor)
                # The top-left output equals the kernel weight at (h, w, c) plus the bias
                weights.append(output[0, 0, 0, :] - biases)
    return np.array(weights).reshape(k_h, k_w, in_channels, num_filters)

def extract_cnn_biases(cnn_layer, input_shape=(32, 32, 3)):
    # A zero input produces only the bias at every output location
    zero_input = np.zeros((1,) + input_shape)
    return cnn_layer(zero_input)[0, 0, 0, :]
Q: We have a CNN model pre-trained on a certain domain. We want to fine-tune it on a different domain. At the beginning of the fine-tuning, the loss is huge. What to do?
A: Lower the learning rate, freeze the earlier layers and unfreeze them gradually, and use gradient clipping (e.g., clipping the gradient norm) and/or label smoothing.
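A minimal PyTorch sketch of these stabilization steps (the model attribute names, dataloader, and hyperparameter values are assumptions for illustration):

import torch
import torch.nn as nn

# model is the pre-trained CNN; assume it exposes `backbone` and `head` submodules
for param in model.backbone.parameters():
    param.requires_grad = False            # freeze earlier layers at first

optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-4)   # lower LR than training from scratch
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)              # label smoothing softens huge initial losses

for images, labels in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

# Later in training: unfreeze the backbone gradually and add its parameters
# to the optimizer with an even smaller learning rate.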
Q: We have a stream of 50 frames of the same person. We want to make a prediction with a face recognition model, but we have to select a single frame for it. What could be the criteria for it?
A: Select the frame with the best face visibility and image quality (sharpness, frontal pose, open eyes, no occlusion). Possibly train a small model to predict the most suitable frame.
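A minimal sketch of one simple heuristic, assuming the frames are OpenCV BGR images and using the variance of the Laplacian as a sharpness proxy (a face-detector confidence score could be combined with it in the same way):

import cv2
import numpy as np

def sharpness(frame):
    # Variance of the Laplacian: higher means sharper, lower means blurrier
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_best_frame(frames):
    # frames: list of ~50 BGR images of the same person
    scores = [sharpness(f) for f in frames]
    return frames[int(np.argmax(scores))]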
Q: Implement common decoding strategies (greedy search, top-k sampling, top-p/nucleus sampling, and beam search) for a simple decoder model.
A:
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange

class SimpleDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Assume the model architecture is already defined here

    def forward(self, input_ids):
        # Assume this method is already implemented
        # It should return logits of shape (batch_size, sequence_length, vocab_size)
        pass

    def greedy_search(self, input_ids, max_length, tokenizer):
        with torch.inference_mode():
            for _ in range(max_length):
                outputs = self(input_ids)  # (batch_size, sequence_length, vocab_size)
                next_token_logits = outputs[:, -1, :]  # (batch_size, vocab_size) logits for the next token for each sequence in the batch
                next_token = torch.argmax(next_token_logits, dim=-1)
                if next_token.item() == self.eos_token_id:
                    break
                # The following two lines give the same result, select the syntax you prefer:
                # input_ids = torch.cat([input_ids, rearrange(next_token, 'c -> 1 c')], dim=-1)
                input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)
            generated_text = tokenizer.decode(input_ids[0])
            return generated_text

    def top_k_sampling(self, input_ids, max_length, tokenizer, k=50, temperature=1.0):
        with torch.inference_mode():
            for _ in range(max_length):
                outputs = self(input_ids)
                next_token_logits = outputs[:, -1, :] / temperature
                # Get top k tokens
                top_k_logits, top_k_indices = torch.topk(next_token_logits, k)
                # Apply softmax to convert logits to probabilities
                probs = F.softmax(top_k_logits, dim=-1)
                # Sample from the top k
                next_token_index = torch.multinomial(probs, num_samples=1)
                # Convert back to vocabulary space
                next_token = torch.gather(top_k_indices, -1, next_token_index)
                # Check for EOS token
                if next_token.item() == self.eos_token_id:
                    break
                # Concatenate the next token to the input
                input_ids = torch.cat([input_ids, next_token], dim=-1)
            generated_text = tokenizer.decode(input_ids[0])
            return generated_text

    def top_p_sampling(self, input_ids, max_length, tokenizer, p=0.9, temperature=1.0):
        """Top-p (nucleus) sampling controls the cumulative probability of the generated tokens."""
        with torch.inference_mode():
            for _ in range(max_length):
                outputs = self(input_ids)
                next_token_logits = outputs[:, -1, :] / temperature
                # Sort logits in descending order
                sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                # Remove tokens with cumulative probability above the threshold
                sorted_indices_to_remove = cumulative_probs > p
                # Shift the mask right so the first token above the threshold is kept
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0
                # Scatter sorted tensors back to the original indexing
                indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
                next_token_logits[indices_to_remove] = float('-inf')
                # Sample from the filtered distribution
                probs = F.softmax(next_token_logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
                # Check for EOS token
                if next_token.item() == self.eos_token_id:
                    break
                # Concatenate the next token to the input
                input_ids = torch.cat([input_ids, next_token], dim=-1)
            generated_text = tokenizer.decode(input_ids[0])
            return generated_text

    def beam_search(self, input_ids, max_length, tokenizer, beam_size=5):
        device = input_ids.device
        beam_scores = torch.zeros(beam_size, device=device)
        beam_sequences = input_ids.repeat(beam_size, 1)
        active_beams = torch.ones(beam_size, dtype=torch.bool, device=device)
        for _ in range(max_length):
            with torch.no_grad():
                outputs = self(beam_sequences)
            next_token_logits = outputs[:, -1, :]
            # Calculate log probabilities (log-softmax, so beam scores can be summed)
            log_probs = F.log_softmax(next_token_logits, dim=-1)
            # Calculate scores for all possible next tokens
            vocab_size = log_probs.size(-1)
            next_scores = beam_scores.unsqueeze(-1) + log_probs
            next_scores = next_scores.view(-1)
            # Select top-k best scores
            top_scores, top_indices = next_scores.topk(beam_size, sorted=True)
            # Convert flat indices to beam indices and token indices
            beam_indices = top_indices // vocab_size
            token_indices = top_indices % vocab_size
            # Update sequences
            beam_sequences = torch.cat([
                beam_sequences[beam_indices],
                token_indices.unsqueeze(-1)
            ], dim=-1)
            # Update scores
            beam_scores = top_scores
            # Update active beams
            active_beams = token_indices != self.eos_token_id
            if not active_beams.any():
                break
        # Select the sequence with the highest score
        best_sequence = beam_sequences[beam_scores.argmax()]
        generated_text = tokenizer.decode(best_sequence)
        return generated_text
Q: What are the trade-offs between using weights in Cross-Entropy Loss vs using Focal Loss?
A: Class weights in Cross-Entropy Loss reweight classes statically based on their sample counts, so this approach focuses on minority classes. Focal Loss instead down-weights easy examples and focuses on hard-to-classify ones, but requires tuning an extra hyperparameter (the focusing parameter gamma).
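A minimal sketch of both options in PyTorch (the class weights and gamma value are illustrative assumptions):

import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, class_weights):
    # Static per-class weights, e.g. inverse class frequency
    return F.cross_entropy(logits, targets, weight=class_weights)

def focal_loss(logits, targets, gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t): down-weights well-classified examples
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1 - pt) ** gamma * log_pt).mean()

logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
class_weights = torch.tensor([1.0, 2.0, 2.0, 5.0, 10.0])  # assumed inverse-frequency weights
print(weighted_cross_entropy(logits, targets, class_weights), focal_loss(logits, targets))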