K-Nearest Neighbors (KNN) Algorithm
The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful supervised learning algorithm used for both classification and regression tasks. It is non-parametric and instance-based: rather than fitting an explicit model, it relies on the idea that similar data points tend to be found near each other in feature space.
Theory and Concepts
Here is a breakdown of the steps involved in the KNN algorithm:
- Choose K: Pick the number of neighbors to consider. K is a hyperparameter; for classification it is often a small odd number to reduce the chance of tied votes.
- Calculate Distances: Compute the distance between the test point and every point in the training dataset, commonly using Euclidean distance (the formula is given just below this list).
- Find Neighbors: Select the K closest training points based on the computed distances.
- Make a Decision:
  - Classification: Assign the most common label among the K nearest neighbors (a majority vote).
  - Regression: Take the average of the target values of the K nearest neighbors (a minimal sketch follows the implementation below).
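For reference, the Euclidean distance between two points a and b with n features is

d(a, b) = sqrt((a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2)

which is exactly what the euclidean_distance helper in the implementation below computes.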
Implementation
Below is the Python code that implements the KNN algorithm from scratch:
import numpy as np
from collections import Counter

# Euclidean distance between two points
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

class KNN:
    # Store the number of neighbors (k)
    def __init__(self, k=3):
        self.k = k

    # KNN is a lazy learner: fitting just stores the training data
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    # Predict labels for an array of test points
    def predict(self, X_test):
        predictions = [self._predict(x) for x in X_test]
        return np.array(predictions)

    # Predict the label for a single test point
    def _predict(self, x):
        # Distances from the test point to every training point
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        # Labels of the k nearest neighbors
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Majority vote: return the most common label
        return Counter(k_nearest_labels).most_common(1)[0][0]
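The class above implements the classification case via majority vote. For the regression case from the step list, a minimal sketch only needs to swap the vote for an average; the KNNRegressor name here is illustrative, not part of the original code:

# Minimal KNN regression sketch: same neighbor search,
# but the prediction is the mean of the neighbors' target values.
class KNNRegressor(KNN):
    def _predict(self, x):
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        k_indices = np.argsort(distances)[:self.k]
        # Average the neighbors' target values instead of voting
        return np.mean([self.y_train[i] for i in k_indices])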
Testing and Evaluation
We will use the popular Iris dataset to test the KNN implementation and evaluate its accuracy:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# Load Iris dataset
data = load_iris()
X, y = data.data, data.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the KNN model
knn = KNN(k=5)
knn.fit(X_train, y_train)
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
print("Accuracy:", accuracy_score(y_test, y_pred))
Output
Running the script above prints the accuracy of the KNN classifier on the test set:
Accuracy: 1.0
This means the classifier labeled every test point correctly. Iris is a small, well-separated dataset, so perfect accuracy on this particular split is plausible; performance will vary with the dataset, the train/test split, and the choice of k, as the quick sweep below illustrates.
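Here is a sweep over a few odd values of k (odd values reduce the chance of tied votes; the specific values are arbitrary):

# Try several values of k on the same train/test split
for k in [1, 3, 5, 7, 9]:
    model = KNN(k=k)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"k={k}: accuracy={acc:.3f}")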
Visualization of Results
We can also visualize the classification results in a 2D plot. For this, let's restrict the Iris dataset to its first two features (sepal length and sepal width) and plot the decision boundaries.
import matplotlib.pyplot as plt

# Keep only the first two features (sepal length, sepal width) for visualization
X_reduced = X[:, :2]

# Split the reduced dataset
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=42)

# Initialize and train the KNN model on the two features
knn = KNN(k=5)
knn.fit(X_train, y_train)

# Build a meshgrid covering the feature space
h = 0.02  # step size of the mesh
x_min, x_max = X_reduced[:, 0].min() - 1, X_reduced[:, 0].max() + 1
y_min, y_max = X_reduced[:, 1].min() - 1, X_reduced[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Classify every grid point and plot the resulting decision regions
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, edgecolors='k', marker='o', s=50, cmap=plt.cm.coolwarm)
plt.title("KNN Classification with k=5 (2D Features)")
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.show()
This visualization shows the decision boundaries the KNN algorithm produces from two features. Each colored region is the area the classifier assigns to one of the three Iris classes; points are colored by their true labels, and the region boundaries show where the predicted class changes.
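One practical caveat: the loop-based predict is slow on the dense meshgrid, since it computes distances one test point at a time. As an optional optimization (a sketch, not part of the original class), the distance computation can be vectorized with NumPy broadcasting, trading memory for speed:

# Vectorized prediction sketch: all pairwise distances at once.
# Trades memory (an n_test x n_train x n_features array) for speed.
def predict_vectorized(knn, X_test):
    # Broadcast to shape (n_test, n_train, n_features), then reduce
    diffs = X_test[:, np.newaxis, :] - knn.X_train[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k nearest training points for each test point
    k_indices = np.argsort(distances, axis=1)[:, :knn.k]
    neighbor_labels = knn.y_train[k_indices]
    # Majority vote per test point
    return np.array([Counter(row).most_common(1)[0][0] for row in neighbor_labels])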