12. Dimensionality Reduction (PCA)#

12.1. Principal component analysis (PCA)#

(Monday)

The theory of PCA is just the theory of rotation in linear algebra. When I understood PCA, I finally understood Quantum Mechanics.

Principal Components (figure from https://www.geeksforgeeks.org/principal-component-analysis-pca/)

12.1.1. Cover Robinson 4.6.#

Covariance & PCA

12.1.2. Method:#

  1. Standardize the data (zero mean, unit variance per feature)

  2. Compute the covariance (correlation) matrix of the standardized data

  3. Find the principal components: the eigenvectors of that matrix, ordered by decreasing eigenvalue

  4. Transform the data by projecting it onto the leading components (see the sketch below)
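
A minimal NumPy sketch of these four steps on a toy data matrix; the array X and the choice of keeping two components are placeholders for illustration:

import numpy as np

# toy data: 100 samples, 3 features
X = np.random.default_rng(1).normal(size=(100, 3))

# 1. Standardize (zero mean, unit variance per feature)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Correlation matrix (covariance of the standardized data)
C = np.cov(Z, rowvar=False)

# 3. Principal components: eigenvectors of C, sorted by decreasing eigenvalue
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# 4. Transform: project the data onto the leading two components
Z_pca = Z @ vecs[:, :2]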

12.1.3. Advantages and Disadvantages of Principal Component Analysis#

(from https://www.geeksforgeeks.org/principal-component-analysis-pca/)

Advantages

  • Multicollinearity Handling: Creates new, uncorrelated variables to address issues when original features are highly correlated.

  • Noise Reduction: Eliminates components with low variance (assumed to be noise), enhancing data clarity.

  • Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing.

  • Outlier Detection: Identifies unusual data points by showing which ones deviate significantly in the reduced space.

Disadvantages

  • Interpretation Challenges: The new components are combinations of original variables, which can be hard to explain.

  • Data Scaling Sensitivity: Requires proper scaling of data before application, or results may be misleading.

  • Information Loss: Reducing dimensions may lose some important information if too few components are kept.

  • Assumption of Linearity: Works best when relationships between variables are linear, and may struggle with non-linear data.

  • Computational Complexity: Can be slow and resource-intensive on very large datasets.

  • Risk of Overfitting: Using too many components or working with a small dataset might lead to models that don’t generalize well.

12.1.4. Dimensional Reduction#

12.1.5. PCA in Python#

(Wednesday)

(Example from https://www.geeksforgeeks.org/principal-component-analysis-pca/)

import pandas as pd
import numpy as np

# Here we are using a built-in dataset from scikit-learn
from sklearn.datasets import load_breast_cancer

# instantiating
cancer = load_breast_cancer(as_frame=True)
# creating dataframe
df = cancer.frame

# checking shape
print('Original Dataframe shape :',df.shape)

# Input features
X = df[cancer['feature_names']]
print('Inputs Dataframe shape   :', X.shape)
# Standardization
X_mean = X.mean()
X_std = X.std()
Z = (X - X_mean) / X_std
# covariance
c = Z.cov()

# Plot the covariance matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(c)
plt.show()
eigenvalues, eigenvectors = np.linalg.eig(c)
print('Eigen values:\n', eigenvalues)
print('Eigen values Shape:', eigenvalues.shape)
print('Eigen Vector Shape:', eigenvectors.shape)
# Index the eigenvalues in descending order 
idx = eigenvalues.argsort()[::-1]

# Sort the eigenvalues in descending order 
eigenvalues = eigenvalues[idx]

# sort the corresponding eigenvectors accordingly
eigenvectors = eigenvectors[:,idx]

# Cumulative fraction of the total variance explained by the leading components
explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var
# Number of components needed to explain at least 50% of the variance
n_components = np.argmax(explained_var >= 0.50) + 1
n_components
# Matrix whose columns are the leading principal components (eigenvectors)
u = eigenvectors[:, :n_components]
pca_component = pd.DataFrame(u,
                             index = cancer['feature_names'],
                             columns = [f'PC{i+1}' for i in range(n_components)]
                            )

# plotting heatmap
plt.figure(figsize =(5, 7))
sns.heatmap(pca_component)
plt.title('PCA Component')
plt.show()
# Matrix multiplication or dot Product
Z_pca = Z @ pca_component
# Rename the columns name
Z_pca.rename({'PC1': 'PCA1', 'PC2': 'PCA2'}, axis=1, inplace=True)
# Print the principal component values
print(Z_pca)

Or, equivalently, using scikit-learn's PCA class (the data still needs to be standardized first):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # Can be any size
pca.fit(Z)  # still need to scale data.
x_pca = pca.transform(Z)

# Create the dataframe
df_pca1 = pd.DataFrame(x_pca,
                       columns=[f'PC{i+1}' for i in range(n_components)])
print(df_pca1)
plt.figure(figsize=(8, 6))

plt.scatter(x_pca[:, 0], x_pca[:, 1],
            c=cancer['target'],
            cmap='plasma')

plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
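
To see how much variance the retained components capture with scikit-learn, the fitted PCA object exposes explained_variance_ratio_; a short check might look like:

# Fraction of the total variance explained by each retained component
print(pca.explained_variance_ratio_)
# Cumulative fraction for the retained components (comparable to explained_var above)
print(pca.explained_variance_ratio_.cumsum())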

12.1.6. PCA in R#

Not this year.

12.2. Alternatives to PCA#

(Friday)

PCA benefits and drawbacks:

Pros:

  1. Dimensionality Reduction: PCA effectively reduces the number of features, which is beneficial for models that suffer from the curse of dimensionality.

  2. Feature Independence: Principal components are orthogonal (uncorrelated), meaning they capture independent information, simplifying the interpretation of the reduced features.

  3. Noise Reduction: PCA can help reduce noise by focusing on the components that explain the most significant variance in the data.

  4. Visualization: The reduced-dimensional data can be visualized, aiding in understanding the underlying structure and patterns.

Cons:

  1. Loss of Interpretability: Interpretability of the original features may be lost in the transformed space, as principal components are linear combinations of the original features.

  2. Assumption of Linearity: PCA assumes that the relationships between variables are linear, which may not be true in all cases.

  3. Sensitive to Scaling: PCA is sensitive to the scale of the features, so standardization is often required.

  4. Outliers Impact Results: Outliers can significantly impact the results of PCA, as it focuses on capturing the maximum variance, which may be influenced by extreme values.

When to Use:

  1. High-Dimensional Data: PCA is particularly useful when dealing with datasets with a large number of features to mitigate the curse of dimensionality.

  2. Collinear Features: When features are highly correlated, PCA can be effective in capturing the shared information and representing it with fewer components.

  3. Visualization: PCA is beneficial when visualizing high-dimensional data is challenging. It projects data into a lower-dimensional space that can be easily visualized.

  4. Linear Relationships: When the relationships between variables are mostly linear, PCA is a suitable technique.

Note

For science, the two main issues with PCA are that it does not propagate measurement uncertainties and that it is inherently linear.

12.2.1. Incorporating Uncertainties#

The main issue with PCA in science is that it does not account for uncertainties in the measurements: every data point is treated as exactly known, so noisy measurements pull on the covariance just as strongly as precise ones.
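
One simple way to fold uncertainties in, as a rough sketch rather than a full treatment (iterative weighted EM-PCA style methods do this properly), is to weight each entry by its inverse variance when forming the mean and covariance. The arrays X and sigma below are stand-ins for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # stand-in data (n_samples x n_features)
sigma = rng.uniform(0.1, 1.0, X.shape)    # stand-in 1-sigma uncertainty per entry

w = 1.0 / sigma**2                        # inverse-variance weights

# Weighted mean and centred data
mean_w = (w * X).sum(axis=0) / w.sum(axis=0)
Xc = X - mean_w

# Weighted covariance: C_jk = sum_i w_ij w_ik Xc_ij Xc_ik / sum_i w_ij w_ik
cov_w = (w * Xc).T @ (w * Xc) / (w.T @ w)

# Principal components of the weighted covariance
vals, vecs = np.linalg.eigh(cov_w)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]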

12.2.2. Doing nonlinear dimensionality reduction#

PCA is also inherently linear. A simple shift of a spectral line requires, in principle, infinitely many linear PCA components to represent, whereas a non-linear dimensionality reduction can capture the same shift with a single parameter.
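
As one concrete nonlinear alternative, here is a minimal sketch using scikit-learn's KernelPCA on the standardized features Z from the example above; the RBF kernel and the gamma value are illustrative choices, not tuned recommendations. Other common nonlinear options include manifold methods (e.g. Isomap, t-SNE) and autoencoders.

import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

# Z and cancer come from the PCA-in-Python example above
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.05)
Z_kpca = kpca.fit_transform(Z)

plt.scatter(Z_kpca[:, 0], Z_kpca[:, 1], c=cancer['target'], cmap='plasma')
plt.xlabel('Kernel PC 1')
plt.ylabel('Kernel PC 2')
plt.show()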

12.3. Further Reading#