K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm used for both classification and regression tasks. It makes predictions for a new data point based on the K closest data points (neighbors).

Not to be confused with [[K-means clustering]]: KNN is a supervised algorithm for classification/regression that finds the nearest neighbors of a query point, while K-means is an unsupervised clustering algorithm that finds cluster centroids.

## How It Works

1. Choose the number K of neighbors
2. Calculate the distance between the query instance and all training samples
3. Sort the distances and determine the K nearest neighbors
4. For classification:
   - Aggregate the class labels of the K neighbors
   - Return the majority vote as the prediction
   $\hat{y} = \text{mode}(y_i), \ i \in K \text{ nearest neighbors}$
5. For regression:
   - Aggregate the values of the K neighbors
   - Return the mean value as the prediction
   $\hat{y} = \frac{1}{K} \sum_{i \in K \text{ nearest neighbors}} y_i$

Because KNN does not build a model during the training phase, it is called a lazy learner: the algorithm memorizes the entire training dataset and defers all computation to prediction time.

![[Pasted image 20240720174904.png]]

### Distance Metrics

1. Euclidean Distance: $d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$
2. Manhattan Distance: $d(x, y) = \sum_{i=1}^n |x_i - y_i|$
3. Minkowski Distance: $d(x, y) = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{\frac{1}{p}}$

### Computational Approaches

1. Brute Force: calculates distances to all points. Simple but inefficient for large datasets.
2. [K-D Tree](https://scikit-learn.org/stable/modules/neighbors.html#k-d-tree): space-partitioning data structure for organizing points in K-dimensional space: "The basic idea is that if point 𝐴 is very distant from point 𝐵, and point 𝐵 is very close to point 𝐶, then we know that points 𝐴 and 𝐶 are very distant, without having to explicitly calculate their distance". Efficient for low-dimensional data.
3. [Ball Tree](https://scikit-learn.org/stable/modules/neighbors.html#ball-tree): hierarchical data structure using hyper-spheres. Efficient for high-dimensional data.

## Advantages

1. Simple to implement
2. No assumptions about data distribution (non-parametric)
3. Can capture complex decision boundaries

## Disadvantages

1. Computationally expensive for large datasets
2. Sensitive to the scale of features
3. Curse of dimensionality: performance degrades with high-dimensional data

## Choosing K

1. Odd K for binary classification to avoid ties
2. Square root of N (where N is the number of samples) as a rule of thumb
3. Use cross-validation to find the optimal K (see the sketch after this list)
4. Elbow Method: calculate the error for different values of K, make a plot, and look for an "elbow" on it - where the error begins to level off.
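A minimal sketch of choosing K with cross-validation, assuming scikit-learn is available; the Iris dataset and the K range 1–30 are illustrative choices, not part of the original note:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Cross-validated error for each candidate K; features are scaled first
# because KNN is sensitive to feature scale (see Disadvantages above)
k_values = range(1, 31)
errors = []
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    errors.append(1 - scores.mean())

best_k = k_values[int(np.argmin(errors))]
print(f"Best K by 5-fold CV: {best_k}")
# Plotting `errors` against `k_values` gives the elbow curve described above.
```

The same loop works for regression by swapping in `KNeighborsRegressor` and a suitable error metric.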
![[Pasted image 20240720172840.png]]

> [!example]- Code example
> ```python
> import numpy as np
> from collections import Counter
>
> class KNN:
>     def __init__(self, k=3):
>         self.k = k
>
>     def fit(self, X, y):
>         # Lazy learner: just memorize the training data
>         self.X_train = X
>         self.y_train = y
>
>     def euclidean_distance(self, x1, x2):
>         return np.sqrt(np.sum((x1 - x2) ** 2))
>
>     def predict(self, X):
>         predictions = [self._predict(x) for x in X]
>         return np.array(predictions)
>
>     def _predict(self, x):
>         # Distances from the query point to every training sample
>         distances = [self.euclidean_distance(x, x_train) for x_train in self.X_train]
>         # Indices of the K nearest neighbors
>         k_indices = np.argsort(distances)[:self.k]
>         k_nearest_labels = [self.y_train[i] for i in k_indices]
>         # Majority vote among the K nearest labels
>         most_common = Counter(k_nearest_labels).most_common(1)
>         return most_common[0][0]
> ```

## Links

* [Sklearn documentation](https://scikit-learn.org/stable/modules/neighbors.html)
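For comparison with the class in the callout above, a minimal scikit-learn sketch; the toy 2-D dataset is made up purely for illustration, and the `algorithm` parameter switches between the brute-force, K-D tree, and ball tree approaches described earlier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D dataset, made up purely for illustration
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[2, 2], [6, 5]])

# algorithm can be 'brute', 'kd_tree', 'ball_tree', or 'auto'
clf = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
clf.fit(X_train, y_train)
print(clf.predict(X_test))  # -> [0 1]
```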