# KNN Classifier
## 🎯 Purpose
This QuickRef explains how to use the K-Nearest Neighbors (KNN) algorithm for classification tasks. It covers the fitting logic, distance metrics, the importance of feature scaling, and evaluation strategies.
## 📦 1. When to Use
| Condition | Use KNN? |
|---|---|
| Small to medium dataset | ✅ Yes |
| Predictors are numeric + scale-consistent | ✅ Yes |
| Need interpretable local decisions | ✅ Yes |
| High-dimensional or noisy data | ❌ Try trees or regularized models |
## 🧮 2. Core Logic
- KNN is a lazy learner — it stores training data and makes predictions at inference time based on proximity
- Uses a majority vote among the k closest training points to assign the class (see the sketch below)
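To make the decision rule concrete, here is a minimal sketch of the prediction step. The `knn_predict` helper is a toy illustration (not sklearn's implementation), and it assumes `X_train` and `y_train` are NumPy arrays:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Toy illustration of the KNN decision rule, not sklearn's implementation."""
    # Euclidean distance from the query point to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels decides the predicted class
    return Counter(y_train[nearest]).most_common(1)[0][0]
```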
## 📏 3. Distance Metrics
| Metric | Use When... |
|---|---|
| Euclidean (default) | Standard numeric data |
| Manhattan | Grid-like or sparse data |
| Minkowski | Generalized form (parameter p; p=1 gives Manhattan, p=2 gives Euclidean) |
| Cosine | Text embeddings, angular similarity |
✔️ Always scale features before fitting to avoid distance bias
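As a quick sketch of both points, the snippet below scales made-up features on very different scales and picks a metric at construction time (the feature values and labels are illustrative only):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: income (~10,000s) and age (~10s)
X = np.array([[52_000.0, 34.0], [48_000.0, 61.0], [110_000.0, 29.0]])
y = np.array([0, 1, 1])

# Without scaling, Euclidean distance is dominated almost entirely by income;
# StandardScaler puts both features on a comparable footing.
X_scaled = StandardScaler().fit_transform(X)

# The metric is chosen when the estimator is constructed.
# 'manhattan' is also accepted; 'cosine' typically requires the brute-force algorithm.
clf = KNeighborsClassifier(n_neighbors=3, metric="manhattan").fit(X_scaled, y)
```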
## 🛠️ 4. Fitting in sklearn
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# The pipeline ensures the scaler is fit only on training data (and training folds during CV)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
```
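Once fitted, prediction follows the usual estimator API. A short usage sketch, assuming `X_test` and `y_test` come from an earlier train/test split:

```python
y_pred = model.predict(X_test)
print("Test accuracy:", model.score(X_test, y_test))
```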
## 🔧 5. Key Hyperparameters
| Param | Description |
|---|---|
| `n_neighbors` | Number of nearest neighbors to use |
| `weights` | `'uniform'` (default) or `'distance'` (closer neighbors are weighted more heavily) |
| `metric` | Distance function (Euclidean, Manhattan, etc.) |
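These hyperparameters are commonly tuned together with a grid search. A sketch using `GridSearchCV` (the grid values are illustrative, not recommendations):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Illustrative grid; adjust ranges to the dataset size
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 11, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
```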
## 📊 6. Evaluation Tips
- Use cross-validation to tune k (sketched after this list)
- Use a confusion matrix, precision/recall, and AUC if needed
- Sensitive to class imbalance → consider stratified sampling
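A sketch combining these tips, reusing the `model` pipeline and train/test split assumed above:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep class proportions stable across splits,
# which matters when classes are imbalanced
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Confusion matrix on held-out data after a final fit
model.fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))
```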
## ✅ Modeling Checklist
- [ ] Features scaled before training (e.g., `StandardScaler`)
- [ ] `n_neighbors` tuned with a validation set or CV
- [ ] Distance metric chosen based on feature type
- [ ] Class imbalance reviewed
- [ ] Evaluation scores visualized across multiple k values (see the sketch below)
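One way to handle the last checklist item is to plot cross-validated accuracy against k. A sketch, assuming matplotlib is available and `X_train`/`y_train` are defined:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Mean CV accuracy for a range of odd k values (odd k avoids ties in binary problems)
ks = list(range(1, 31, 2))
scores = [
    cross_val_score(
        make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k)),
        X_train, y_train, cv=5,
    ).mean()
    for k in ks
]

plt.plot(ks, scores, marker="o")
plt.xlabel("k (n_neighbors)")
plt.ylabel("Mean CV accuracy")
plt.title("KNN accuracy vs. k")
plt.show()
```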
## 💡 Tip
“KNN makes no assumptions — but gives no explanations either.”