Q: How many parameters does a convolutional layer have?
A: For a 2-D convolution with a k_h × k_w kernel, C_in input channels and C_out filters: k_h · k_w · C_in · C_out weights plus C_out biases.
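This count is easy to verify in PyTorch (layer sizes here are arbitrary):

```python
import torch.nn as nn

# Hypothetical sizes: 3 input channels, 16 filters, 3x3 kernel
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
n_params = sum(p.numel() for p in conv.parameters())
# weights: 3 * 3 * 3 * 16 = 432, biases: 16, total: 448
```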

Q: We have a multiclass neural net, what should we change in it, if we want to make it multilabel?
A: Change the output activation function from softmax to sigmoid and the loss function from categorical cross-entropy to binary cross-entropy.
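In PyTorch the change amounts to switching the loss and the thresholding; a minimal sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

num_classes = 5  # arbitrary
logits = torch.randn(2, num_classes)                     # raw model outputs
targets = torch.randint(0, 2, (2, num_classes)).float()  # multi-hot labels

# Multilabel: sigmoid per class + binary cross-entropy
loss = nn.BCEWithLogitsLoss()(logits, targets)   # applies sigmoid internally
preds = (torch.sigmoid(logits) > 0.5).int()      # independent threshold per class
```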

Q: Here is pseudocode for a neural net. Explain how it works and point out mistakes or inefficiencies in the architecture:

net = Sequential() 
net.add(InputLayer([100, 100, 3])) 
net.add(Conv2D(filters=512, kernel_size=(3, 3),  
               kernel_initializer=init.zeros())) 
net.add(Conv2D(filters=128, kernel_size=(1, 1),  
               kernel_initializer=init.zeros())) 
net.add(Activation('relu')) 
net.add(Conv2D(filters=32, kernel_size=(3, 3),  
               kernel_initializer=init.zeros())) 
net.add(Conv2D(filters=32, kernel_size=(1, 1),  
               kernel_initializer=init.zeros())) 
net.add(Activation('relu')) 
net.add(MaxPool2D(pool_size=(6, 6))) 
net.add(Conv2D(filters=8, kernel_size=(10, 10),  
               kernel_initializer=init.normal())) 
net.add(Activation('relu')) 
net.add(Conv2D(filters=8, kernel_size=(10, 10),  
               kernel_initializer=init.normal())) 
net.add(Activation('relu')) 
net.add(MaxPool2D(pool_size=(3, 3))) 
net.add(Flatten()) # convert 3d tensor to a vector of features 
net.add(Dense(units=512)) 
net.add(Activation('softmax')) 
net.add(Dropout(rate=0.5)) 
net.add(Dense(units=512)) 
net.add(Activation('softmax')) 
net.add(Dense(units=10)) 
net.add(Activation('sigmoid')) 
net.add(Dropout(rate=0.5))

A:

  • Initializing weights to zero is a serious mistake: all neurons in a layer receive identical gradients, the symmetry is never broken, and the zero-initialized layers cannot learn.
  • Jumping straight to 512 filters in the first layer is excessive, and the drop from 512 to 128 to 32 is very abrupt; filter counts usually grow with depth, not shrink.
  • Several convolutional layers have no activation between them (e.g., the first two), so consecutive convolutions collapse into a single linear operation.
  • The first MaxPool2D layer has a large (6, 6) pool size, which discards spatial information very aggressively.
  • The 10x10 kernels are unusually large; worse, with the default 'valid' padding the feature map after the (6, 6) pool is about 16x16, the first 10x10 conv shrinks it to 7x7, and the second 10x10 conv cannot be applied at all.
  • Softmax as an intermediate activation is a mistake: it normalizes hidden features into a probability distribution and cripples gradient flow; ReLU should be used instead.
  • Dropout placed after the final sigmoid output randomly zeroes the predictions themselves; dropout belongs between hidden layers (typically after the activation in a Dense → Activation → Dropout block).
  • For 10 mutually exclusive classes the output activation should be softmax, not sigmoid; sigmoid fits the multilabel case.
  • It could be a good idea to pool after each convolutional block, and to replace the last max pooling with adaptive pooling so the net can handle images of varied size.
  • Three dense layers at the end is too much; usually one or two are used.
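Assuming 'valid' padding and stride 1 (the defaults), the spatial sizes can be traced step by step to show the second 10x10 conv cannot even fit:

```python
def valid_out(size, kernel, stride=1):
    # Output size of a conv/pool with 'valid' padding
    return (size - kernel) // stride + 1

s = 100
for k in (3, 1, 3, 1):          # the four convs before the first pool
    s = valid_out(s, k)         # -> 96
s = valid_out(s, 6, stride=6)   # MaxPool2D((6, 6)) -> 16
s = valid_out(s, 10)            # first 10x10 conv -> 7
# a second 10x10 conv needs at least a 10x10 input, but only 7x7 is left
```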

Q: We have a linear layer with 30 neurons. How can we get/hack the weights if we don’t have a direct access to them?
A: Feed the layer an identity matrix. For PyTorch's nn.Linear the output is x @ W.T + b, so each output row is a row of W.T plus the bias; subtracting the bias (obtained from a zero input) recovers the weights.

import torch

num_inputs = 30  # the layer has 30 neurons

# To get the biases, pass a zero vector: output = 0 @ W.T + b = b
zero_input = torch.zeros((1, num_inputs))
bias = linear_layer(zero_input)

# Pass an identity matrix through the layer: output = I @ W.T + b
identity = torch.eye(num_inputs)
output = linear_layer(identity)

# Remove the bias and transpose to recover W
weights = (output - bias).T
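When a reference layer is available (e.g., in a unit test), the trick can be verified end-to-end; `layer` here is a hypothetical stand-in for the inaccessible one:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(30, 30)  # stand-in for the inaccessible layer

bias = layer(torch.zeros(1, 30))   # zero input -> bias
out = layer(torch.eye(30))         # identity -> W.T + bias
recovered = (out - bias).T

assert torch.allclose(recovered, layer.weight, atol=1e-6)
```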

Q: The same with 3x3x3 convolution.
A: For each position in the kernel and each input channel, feed a kernel-sized input in which a single value is 1 and the rest are 0; with 'valid' padding the single output pixel equals that weight plus the bias, so the bias (obtained from a zero input) must be subtracted. This is pseudocode:

import numpy as np

def extract_cnn_biases(cnn_layer, input_shape=(3, 3, 3)):
    # A zero input produces only the bias at every output position
    zero_input = np.zeros((1,) + input_shape)
    return cnn_layer(zero_input)[0, 0, 0, :]

def extract_cnn_weights(cnn_layer, kernel_size=(3, 3), in_channels=3, num_filters=3):
    k_h, k_w = kernel_size
    bias = extract_cnn_biases(cnn_layer, (k_h, k_w, in_channels))

    weights = []
    for h in range(k_h):
        for w in range(k_w):
            for c in range(in_channels):
                # Kernel-sized one-hot input: the lone output pixel
                # equals w[h, w, c, :] + bias
                input_tensor = np.zeros((1, k_h, k_w, in_channels))
                input_tensor[0, h, w, c] = 1
                output = cnn_layer(input_tensor)
                weights.append(output[0, 0, 0, :] - bias)

    return np.array(weights).reshape(k_h, k_w, in_channels, num_filters)
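The same probe works in PyTorch (channels-first layout; `conv` is a hypothetical stand-in for the inaccessible layer, used here only to verify the result):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3)

with torch.no_grad():
    # Kernel-sized zero input: the single output pixel is the bias
    bias = conv(torch.zeros(1, 3, 3, 3))[0, :, 0, 0]

    weights = torch.zeros(3, 3, 3, 3)  # [out, in, k_h, k_w]
    for c in range(3):
        for h in range(3):
            for w in range(3):
                probe = torch.zeros(1, 3, 3, 3)
                probe[0, c, h, w] = 1.0
                weights[:, c, h, w] = conv(probe)[0, :, 0, 0] - bias

assert torch.allclose(weights, conv.weight, atol=1e-5)
```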
 

Q: We have a CNN model pre-trained on a certain domain. We want to fine-tune it on a different domain. At the beginning of the fine-tuning, the loss is huge. What to do?
A: Lower the learning rate; freeze earlier layers and unfreeze them gradually; clip gradients by norm; use label smoothing.
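A minimal sketch of these measures combined in PyTorch (the tiny model here is a hypothetical stand-in, with its first block playing the role of the pre-trained backbone):

```python
import torch
import torch.nn as nn

# Stand-in model: first block acts as the pre-trained backbone
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))

# Freeze the earlier layers first; unfreeze them gradually in later epochs
for p in model[0].parameters():
    p.requires_grad = False

# Use a much lower learning rate than during pre-training
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)

x, y = torch.randn(4, 10), torch.randint(0, 3, (4,))
loss = nn.CrossEntropyLoss(label_smoothing=0.1)(model(x), y)  # label smoothing
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)    # clip by norm
opt.step()
```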

Q: We have a stream of 50 frames of the same person. We want to make a prediction with a face recognition model, but we have to select a single frame for it. What could be the criteria for it?
A: Select the frame with the best face visibility and image quality: e.g., face-detection confidence, frontal pose, open eyes, sharpness, resolution, and lighting. Alternatively, train a small auxiliary model to score how suitable each frame is.
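One simple quality criterion is sharpness; a common no-reference proxy is the variance of the Laplacian, sketched here in plain NumPy (function names are illustrative):

```python
import numpy as np

def sharpness(gray: np.ndarray) -> float:
    # Variance of the Laplacian: blurry frames score low, sharp frames high
    lap = (
        -4 * gray[1:-1, 1:-1]
        + gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
    )
    return float(lap.var())

# frames: list of 2-D grayscale arrays; pick the sharpest one
# best_frame = max(frames, key=sharpness)
```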

Q: Write a basic PyTorch code for implementing sampling for a simple LLM decoder. Implement the following variations: greedy, top-k, top-p.
A: Inspired by https://medium.com/@pashashaik/natural-language-generation-from-scratch-in-large-language-models-with-pytorch-4d9379635316
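A minimal sketch of the three strategies, operating on a 1-D logits tensor for the next position (function and parameter names are illustrative):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, strategy="greedy", k=50, p=0.9, temperature=1.0):
    """Pick the next token id from a [vocab_size] logits tensor."""
    logits = logits / temperature
    if strategy == "greedy":
        # Deterministic: always take the most likely token
        return int(torch.argmax(logits))
    if strategy == "top_k":
        # Sample only among the k highest-scoring tokens
        topk_vals, topk_idx = torch.topk(logits, k)
        probs = F.softmax(topk_vals, dim=-1)
        return int(topk_idx[torch.multinomial(probs, 1)])
    if strategy == "top_p":
        # Nucleus sampling: keep the smallest prefix of tokens (sorted by
        # probability) whose cumulative probability reaches p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        cutoff = int((cum < p).sum()) + 1
        probs = F.softmax(sorted_logits[:cutoff], dim=-1)
        return int(sorted_idx[torch.multinomial(probs, 1)])
    raise ValueError(f"unknown strategy: {strategy}")
```

At generation time this is called once per step on the logits of the last position, and the sampled id is appended to the running sequence.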

Q: What are the trade-offs between using weights in Cross-Entropy Loss vs using Focal Loss?
A: Class weights in Cross-Entropy Loss reweight classes statically (typically by inverse class frequency), which helps minority classes but treats every sample of a class equally. Focal Loss instead down-weights easy, well-classified samples dynamically, focusing training on hard examples regardless of class, but requires tuning the focusing hyperparameter (gamma).
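A minimal multiclass focal-loss sketch, written via the identity p_t = exp(-CE):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Cross-entropy scaled by (1 - p_t)^gamma: confidently-correct (easy)
    # samples contribute less, hard samples dominate the loss
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()
```

With gamma = 0 it reduces exactly to plain cross-entropy; larger gamma focuses harder on misclassified samples.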