DSWoK — Data Science Well of Knowledge

#concept

26 notes · co-occurs with 14 tags · last updated Jun 22, 2026

Co-tags: #evaluation (8), #nlp (5), #recsys (4), #interview (4), #unsupervised (3), #embeddings (2), #topic-modeling (2), #fine-tuning (1), #pre-trained (1), #algorithm (1), #regularization (1), #loss (1), #llm (1), #retrieval (1)

Notes tagged #concept

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts large pre-trained models to specific tasks while keeping compute and memory requirements low.

Deep Learning

Negative sampling

Negative sampling trains a model by contrasting each observed positive with a small set of sampled alternatives instead of every item in the catalog.

Deep Learning
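
To make the negative sampling summary above concrete, here is a minimal sketch in plain Python. It assumes negatives are drawn uniformly without replacement and that the loss is a softmax cross-entropy restricted to the positive plus the sampled negatives; the function names `sample_negatives` and `sampled_softmax_loss` are hypothetical and not taken from the note.

```python
import math
import random

def sample_negatives(positive, catalog, k, rng):
    """Draw k distinct items uniformly from the catalog, excluding the positive."""
    candidates = [item for item in catalog if item != positive]
    return rng.sample(candidates, k)

def sampled_softmax_loss(pos_score, neg_scores):
    """Cross-entropy over the positive and the sampled negatives only,
    instead of over every item in the catalog."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - pos_score

rng = random.Random(0)
negatives = sample_negatives("item_a", ["item_a", "item_b", "item_c", "item_d"], 2, rng)
# With equal scores the positive is one of 4 equally likely candidates: loss = log(4)
loss = sampled_softmax_loss(0.0, [0.0, 0.0, 0.0])
```

The cost of one training step then depends on k rather than on the catalog size, which is the point of the technique.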

logQ correction

LogQ correction is a bias correction technique used in recommendation systems to account for non-uniform sampling during training.

Deep Learning

A/B testing (online controlled experimentation) randomly assigns units to a treatment or a control variant and compares aggregate outcomes, attributing the observed difference to the change.

Bias-Variance Trade-off

The bias-variance trade-off is a fundamental concept in machine learning that describes the balance between a model’s ability to fit the training data (low bias) and its ability to generalize to new, unseen data (low variance).

CUPED makes A/B tests detect smaller effects without needing more users.

Cold start is the problem of producing useful recommendations when a user, item, or market has little or no reliable interaction history.

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features (dimensions) in a dataset while preserving as much relevant information as possible.

Distance calculation

Cosine similarity measures the cosine of the angle between two vectors, effectively capturing their orientation similarity while ignoring their magnitude.
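
As a small illustration of the definition above, the following sketch computes cosine similarity in plain Python. The function name `cosine_similarity` and the choice to raise on zero vectors are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, different magnitude: similarity is approximately 1.0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# Perpendicular vectors: similarity is 0.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Scaling either vector leaves the result unchanged, which is what "ignoring their magnitude" means in practice.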

Multi-armed bandits

A multi-armed bandit is a sequential decision problem where a learner repeatedly chooses among k actions (arms), observes a stochastic reward for the chosen arm only, and adapts future choices to balance exploration (sampling under-tested arms to learn their value) against exploitation (sampling the...

Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function.
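
The "penalty term" in the summary above can be sketched as follows. This example assumes mean squared error as the base loss and an L2 (squared-weight) penalty; both are illustrative choices, and the names `regularized_loss` and `lam` are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(y_true, y_pred, weights, lam):
    """Data-fit term plus a penalty that grows with the size of the weights."""
    return mse(y_true, y_pred) + l2_penalty(weights, lam)

# MSE is (0 + 4) / 2 = 2.0; the penalty is 0.1 * (9 + 1) = 1.0
total = regularized_loss([1.0, 2.0], [1.0, 4.0], weights=[3.0, -1.0], lam=0.1)
```

Setting `lam` to zero recovers the unpenalized loss; larger values push the optimizer toward smaller weights.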

Time-series validation

Time-series validation is a type of model validation that deserves a separate explanation because of its sheer variability and complexity.

Training-serving skew

Training-serving skew is a mismatch between the data representation used to train and evaluate a model and the representation available at serving time.

Model validation is the process of assessing how well a trained machine learning model performs on unseen data.

Behavioral interviews

A behavioral interview is a stage of the job interview process in which candidates describe specific situations from their past experience to demonstrate their skills, abilities, and character traits.

Interview_preparation

Leetcode code templates

Two pointers: one input, opposite ends

    def fn(arr):
        left = ans = 0
        right = len(arr) - 1
        while left < right:
            # do some logic here with left and right
            if CONDITION:
                left += 1
            else:
                right -= 1
        return ans

Two pointers: two inputs, exhaust both

    def fn(arr1, arr2):
        i = j = ans = 0
        while i < len(arr1...

Interview_preparation

ML System design

The ML system design interview is a stage of the job interview process that assesses a candidate’s ability to design and implement machine learning systems at scale.

Interview_preparation

Questions to ask the interviewers

During the interview process, it is important not just to answer questions but also to ask your own questions.

Interview_preparation

A classifier is calibrated if its predicted probabilities match observed frequencies: among examples assigned a 0.7 score, roughly 70% should be positive.

Metrics and losses
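
The calibration definition above can be checked with a simple reliability table. The sketch below assumes scores lie in [0, 1] and uses equal-width bins; the function name `reliability_table` and the default bin count are choices made for this example.

```python
def reliability_table(scores, labels, n_bins=10):
    """Group predictions into equal-width score bins and compare the mean
    predicted score with the observed positive rate in each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        rows.append((idx, len(members), mean_score, positive_rate))
    return rows

# Ten examples scored 0.7, seven of them positive: mean score and positive rate agree
rows = reliability_table([0.7] * 10, [1] * 7 + [0] * 3)
```

For a well-calibrated classifier, the mean score and the positive rate are close in every bin; large gaps indicate over- or under-confidence in that score range.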

Confusion matrix

A confusion matrix is a table used to evaluate the performance of a classification model on a dataset for which the true labels are known.

Metrics and losses
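
As a sketch of the table described in the confusion matrix entry above, the following builds the matrix from true and predicted labels. It assumes the convention that rows are true labels and columns are predicted labels; the opposite orientation is also common, so this is an assumption rather than the note's stated layout.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts examples whose true label
    is labels[i] and whose predicted label is labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
# matrix == [[2, 1], [1, 2]]: 2 true negatives, 1 false positive,
# 1 false negative, 2 true positives
```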

Loss functions (also called objective functions or cost functions) are mathematical measures of the error between predicted and actual values.

Metrics and losses

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines an LLM’s generative abilities with real-time information retrieval from external knowledge sources.

NLP

Topic Modeling Methods

A survey of the main topic modeling methods, ordered roughly by historical development (matrix factorization → probabilistic generative models → neural → embedding-based).

NLP

Topic modeling is an unsupervised technique for discovering abstract themes in a document collection, where a document is whatever unit of text the project treats as one (article, review, tweet, paragraph, support ticket).

NLP

Word Embeddings

A word embedding is a representation of a word, usually as a vector of real numbers.

NLP

Recommendation system

This note covers how recommendation systems are designed, built, evaluated, deployed, and debugged in production.
