Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining as much variance as possible. It is widely used in data preprocessing, visualization, and noise reduction.
Theory and Background
PCA works by performing the following steps:
- Compute the Mean: Center the data by subtracting the mean of each feature.
- Calculate the Covariance Matrix: This matrix measures how different features vary with one another.
- Compute Eigenvalues and Eigenvectors: Solve the eigenvalue problem for the covariance matrix; the eigenvectors define the new axes (the principal components).
- Select Top Principal Components: Keep the k eigenvectors with the largest eigenvalues; these point in the directions of maximum variance.
- Transform the Data: Project the original data onto the new principal component axes to reduce its dimensionality.
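The steps above can be sketched on a tiny synthetic dataset before turning to a full implementation. The example below (names and the data-generating setup are illustrative, not from the original) builds points spread mostly along the line y = 2x, so the first principal component should recover roughly the direction (1, 2):

```python
import numpy as np

# Illustrative synthetic dataset: points spread mostly along y = 2x
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=100)])

# Step 1: center the data
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix (features in columns)
cov = np.cov(X_centered, rowvar=False)

# Steps 3-4: eigendecomposition; eigh returns eigenvalues in ascending order,
# so the last eigenvector corresponds to the largest eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
pc1 = eigenvectors[:, -1]

# Step 5: project the centered data onto the first principal component
projected = X_centered @ pc1

# pc1 should align (up to sign) with the normalized direction (1, 2)
direction = np.array([1.0, 2.0]) / np.sqrt(5.0)
print(abs(pc1 @ direction))
```

The absolute value is needed because eigenvector signs are arbitrary: flipping a principal component does not change the subspace it spans.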
Implementation
Below is the Python code that implements PCA from scratch:
import numpy as np

def pca(X, n_components):
    # Center the data by subtracting the mean of each feature
    X_meaned = X - np.mean(X, axis=0)

    # Compute the covariance matrix (features as columns)
    cov_matrix = np.cov(X_meaned, rowvar=False)

    # Compute eigenvalues and eigenvectors of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Sort eigenvectors by eigenvalue in descending order
    sorted_indices = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, sorted_indices]
    eigenvalues = eigenvalues[sorted_indices]

    # Select the top n_components eigenvectors
    selected_eigenvectors = eigenvectors[:, :n_components]

    # Transform the data by projecting it onto the principal component axes
    X_reduced = np.dot(X_meaned, selected_eigenvectors)
    return X_reduced
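A common companion question is how much variance each component retains. The helper below is a hypothetical extension, not part of the function above: it reuses the same centering and eigendecomposition steps but returns the sorted eigenvalues normalized to sum to 1 (the explained variance ratio):

```python
import numpy as np

def explained_variance_ratio(X):
    # Hypothetical helper: fraction of total variance captured by each
    # principal component, in descending order
    X_meaned = X - np.mean(X, axis=0)
    cov_matrix = np.cov(X_meaned, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(cov_matrix)[::-1]  # descending
    return eigenvalues / eigenvalues.sum()

# For roughly isotropic data, every component explains a similar share
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
ratios = explained_variance_ratio(X)
print(ratios)
```

Summing the first k ratios tells you how much variance survives a reduction to k components, which is a standard way to choose k.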
Visualization and Application
We can apply PCA on the Iris dataset and visualize the transformed data in a 2D space:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Apply PCA to reduce the dataset to 2 principal components
X_pca = pca(X, 2)

# Plot the transformed data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.colorbar(label='Target Label')
plt.show()
Output Visualization
The plot above visualizes the Iris dataset in 2D after applying PCA. The different colors represent the three target labels (species), and the points are scattered based on the first two principal components. The PCA technique has reduced the data from 4 dimensions to 2, allowing easier visualization while preserving as much variance as possible.
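As a sanity check, the from-scratch projection can be compared with scikit-learn's `PCA` (assuming scikit-learn is available). The snippet below reproduces the implementation so it is self-contained; since eigenvector signs are arbitrary, the comparison allows a per-column sign flip:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

def pca(X, n_components):
    # Same from-scratch implementation as above, repeated for self-containment
    X_meaned = X - np.mean(X, axis=0)
    cov_matrix = np.cov(X_meaned, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    order = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, order]
    return X_meaned @ eigenvectors[:, :n_components]

X = load_iris().data
X_scratch = pca(X, 2)
X_sklearn = PCA(n_components=2).fit_transform(X)

# Compare each component up to a sign flip
for j in range(2):
    match = (np.allclose(X_scratch[:, j], X_sklearn[:, j], atol=1e-6)
             or np.allclose(X_scratch[:, j], -X_sklearn[:, j], atol=1e-6))
    print(f"Component {j + 1} matches sklearn (up to sign): {match}")
```

scikit-learn computes PCA via singular value decomposition rather than an explicit covariance eigendecomposition, so agreement here (up to sign) confirms the two routes are numerically equivalent.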