Project 2 - Principal Component Analysis (PCA) Implementation

Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining as much variance as possible. It is widely used in data preprocessing, visualization, and noise reduction.

Theory and Background

PCA works by performing the following steps (summarized in matrix form after the list):

  1. Compute the Mean: Center the data by subtracting each feature's mean.
  2. Calculate the Covariance Matrix: This symmetric matrix captures how pairs of features vary together.
  3. Compute Eigenvalues and Eigenvectors: The eigenvectors of the covariance matrix are the principal components, which serve as the new axes.
  4. Select Top Principal Components: Keep the 'k' eigenvectors with the largest eigenvalues; these point in the directions of greatest variance.
  5. Transform the Data: Project the centered data onto the selected eigenvectors to reduce its dimensionality.
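
In matrix form, the steps above amount to the following (a brief sketch; here X is the n x d data matrix, \mu its vector of feature means, and W_k the d x k matrix whose columns are the top k eigenvectors):

\bar{X} = X - \mathbf{1}\mu^{\top}, \qquad C = \frac{1}{n-1}\,\bar{X}^{\top}\bar{X}, \qquad C\,w_i = \lambda_i\,w_i, \qquad X_{\mathrm{reduced}} = \bar{X}\,W_k

Each eigenvalue \lambda_i is the variance of the data along its eigenvector w_i, which is why sorting by eigenvalue ranks the components by captured variance.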

Implementation

Below is the Python code that implements PCA from scratch:


import numpy as np

def pca(X, n_components):
    # Center the data by subtracting the mean of each feature
    X_meaned = X - np.mean(X, axis=0)

    # Compute the covariance matrix
    cov_matrix = np.cov(X_meaned, rowvar=False)

    # Compute eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Sort eigenvectors by eigenvalues in descending order
    sorted_indices = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, sorted_indices]
    eigenvalues = eigenvalues[sorted_indices]

    # Select top n_components eigenvectors
    selected_eigenvectors = eigenvectors[:, :n_components]

    # Transform data by projecting it onto the new principal component axes
    X_reduced = np.dot(X_meaned, selected_eigenvectors)
    return X_reduced
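
To see how much information the projection keeps, we can also look at the explained variance ratio, computed from the sorted eigenvalues. Below is a minimal sketch (the helper name explained_variance_ratio is mine, not part of the implementation above):


import numpy as np

def explained_variance_ratio(X):
    # Center the data and compute the covariance matrix, as in pca()
    X_meaned = X - np.mean(X, axis=0)
    cov_matrix = np.cov(X_meaned, rowvar=False)

    # The eigenvalues of the covariance matrix are the variances
    # of the data along each principal component
    eigenvalues = np.linalg.eigh(cov_matrix)[0]
    eigenvalues = np.sort(eigenvalues)[::-1]

    # Fraction of total variance captured by each component
    return eigenvalues / eigenvalues.sum()

For the Iris dataset, the first two components together capture roughly 97-98% of the total variance, which is why the 2D plot in the next section remains informative.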

Visualization and Application

We can apply PCA to the Iris dataset and visualize the transformed data in two dimensions:


import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Apply PCA to reduce the dataset to 2 principal components
X_pca = pca(X, 2)

# Plot the transformed data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.colorbar(label='Target Label')
plt.show()
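
As a sanity check, the result can be compared against scikit-learn's PCA (continuing from the code above). Eigenvectors are only defined up to sign, so the two projections may differ by a sign flip per component; comparing absolute values sidesteps this:


from sklearn.decomposition import PCA

# scikit-learn centers the data internally, so the raw X is passed
X_sklearn = PCA(n_components=2).fit_transform(X)

# Compare up to per-component sign flips
print(np.allclose(np.abs(X_pca), np.abs(X_sklearn), atol=1e-6))  # expected: True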

Output Visualization

The resulting scatter plot shows the Iris dataset in 2D after applying PCA. The colors correspond to the three species (target labels), and each point's position is given by its scores on the first two principal components. PCA has reduced the data from four dimensions to two, making it easy to visualize while preserving as much variance as possible.