Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining as much variance as possible. It is widely used in data preprocessing, visualization, and noise reduction.
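Before building PCA from scratch, it is worth seeing the end goal. As a point of reference, a minimal sketch using scikit-learn's built-in PCA looks like this (the random data here is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 100 samples with 4 features each
X = np.random.rand(100, 4)

# Keep the 2 directions of maximum variance
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)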
Theory and Background
PCA works by performing the following steps:
- Compute the Mean: Center the data by subtracting the mean of each feature.
- Calculate the Covariance Matrix: This matrix measures how different features vary with one another.
- Compute Eigenvalues and Eigenvectors: The eigenvectors of the covariance matrix define the new axes (the principal components), and the eigenvalues measure the variance along each of those axes.
- Select Top Principal Components: Keep the 'k' eigenvectors with the largest eigenvalues; these point in the directions of maximum variance.
- Transform the Data: Project the centered data onto the selected eigenvectors to obtain the lower-dimensional representation (see the matrix-form summary after this list).
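In matrix form, writing X_c for the mean-centered n-by-d data matrix (this notation is introduced here for brevity and does not appear in the code below), the steps amount to:

\[
C = \frac{1}{n-1} X_c^\top X_c, \qquad C\, v_i = \lambda_i v_i, \qquad X_{\text{reduced}} = X_c W_k,
\]

where W_k is the d-by-k matrix whose columns are the k eigenvectors with the largest eigenvalues. The 1/(n-1) factor matches NumPy's default covariance normalization used in the implementation below.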
Implementation
Below is the Python code that implements PCA from scratch:
import numpy as np

def pca(X, n_components):
    # Center the data by subtracting the mean of each feature
    X_meaned = X - np.mean(X, axis=0)

    # Compute the covariance matrix (rowvar=False treats columns as features)
    cov_matrix = np.cov(X_meaned, rowvar=False)

    # Compute eigenvalues and eigenvectors; eigh is appropriate here
    # because the covariance matrix is symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Sort eigenvectors by eigenvalue in descending order
    sorted_indices = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, sorted_indices]
    eigenvalues = eigenvalues[sorted_indices]

    # Select the top n_components eigenvectors
    selected_eigenvectors = eigenvectors[:, :n_components]

    # Project the centered data onto the new principal component axes
    X_reduced = np.dot(X_meaned, selected_eigenvectors)
    return X_reduced
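As a quick sanity check, the from-scratch result can be compared against scikit-learn's PCA. Eigenvectors are only defined up to a sign flip, so the projections should agree component by component up to sign (the random data below is purely illustrative and assumes distinct eigenvalues):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(50, 4)
ours = pca(X, 2)
theirs = PCA(n_components=2).fit_transform(X)

# Each column should match up to an overall sign flip
for i in range(2):
    match = np.allclose(ours[:, i], theirs[:, i]) or np.allclose(ours[:, i], -theirs[:, i])
    print(f"Component {i + 1} matches: {match}")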
Visualization and Application
We can apply the pca function above to the Iris dataset and visualize the transformed data in a 2D space:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Apply PCA to reduce the dataset to 2 principal components
X_pca = pca(X, 2)
# Plot the transformed data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.colorbar(label='Target Label')
plt.show()
Output Visualization
The plot visualizes the Iris dataset in 2D after applying PCA. The colors correspond to the three target labels (species), and each point is positioned by its scores on the first two principal components. PCA has reduced the data from 4 dimensions to 2, making it easy to visualize while preserving as much variance as possible.
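To quantify how much variance those two components actually preserve, the eigenvalues can be turned into explained-variance ratios. Below is a minimal sketch repeating the relevant steps from the pca function (the roughly 98% figure is the well-known result for Iris):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_meaned = X - np.mean(X, axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order
eigenvalues = np.linalg.eigvalsh(np.cov(X_meaned, rowvar=False))[::-1]
explained = eigenvalues / eigenvalues.sum()

print(explained[:2])        # per-component variance ratios
print(explained[:2].sum())  # roughly 0.98 for the Iris dataset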