12. Dimensionality Reduction (PCA)#
12.1. Principal component analysis (PCA)#
(Monday)
The theory of PCA is just the theory of rotation in linear algebra. When I understood PCA, I finally understood Quantum Mechanics.

12.1.1. Cover Robinson 4.6.#
Covariance & PCA
12.1.2. Method:#
- Standardize the data 
- Compute the covariance (correlation) matrix 
- Find the principal components (the eigenvectors of the covariance matrix) 
- Transform the data by projecting it onto the leading components (see the sketch just below) 
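In matrix form these steps amount to the following (a brief sketch in my own notation, not Robinson's; $X$ is the $n \times p$ data matrix):

$$
Z = \frac{X - \bar{X}}{\sigma_X}, \qquad
C = \frac{1}{n-1} Z^\top Z, \qquad
C\, v_i = \lambda_i v_i, \qquad
Z_{\mathrm{PCA}} = Z\, U_k ,
$$

where $U_k$ collects the $k$ eigenvectors with the largest eigenvalues $\lambda_i$, and the fraction of the total variance captured by the first $k$ components is $\sum_{i \le k} \lambda_i / \sum_i \lambda_i$.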
12.1.3. Advantages and Disadvantages of Principal Component Analysis#
(from https://www.geeksforgeeks.org/principal-component-analysis-pca/)
Advantages
- Multicollinearity Handling: Creates new, uncorrelated variables to address issues when original features are highly correlated (illustrated in the short sketch at the end of this list). 
- Noise Reduction: Eliminates components with low variance (assumed to be noise), enhancing data clarity. 
- Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing. 
- Outlier Detection: Identifies unusual data points by showing which ones deviate significantly in the reduced space. 
Disadvantages
- Interpretation Challenges: The new components are combinations of original variables, which can be hard to explain. 
- Data Scaling Sensitivity: Requires proper scaling of data before application, or results may be misleading. 
- Information Loss: Reducing dimensions may lose some important information if too few components are kept. 
- Assumption of Linearity: Works best when relationships between variables are linear, and may struggle with non-linear data. 
- Computational Complexity: Can be slow and resource-intensive on very large datasets. 
- Risk of Overfitting: Using too many components or working with a small dataset might lead to models that don’t generalize well. 
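As a quick illustration of the multicollinearity point above, here is a minimal sketch (toy data and variable names of my own, not from the GeeksforGeeks article) showing that even when the original features are strongly correlated, the principal component scores come out essentially uncorrelated:

import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
# Two strongly correlated toy features
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X_toy = np.column_stack([x1, x2])
print('Correlation of the original features:\n', np.corrcoef(X_toy.T))
# After PCA the component scores are (numerically) uncorrelated
scores = PCA(n_components=2).fit_transform(X_toy)
print('Correlation of the principal components:\n', np.corrcoef(scores.T))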
12.1.4. Dimensionality Reduction#
12.1.5. PCA in Python#
(Wednesday)
(Example from https://www.geeksforgeeks.org/principal-component-analysis-pca/)
import pandas as pd
import numpy as np
# Use the built-in breast cancer dataset from scikit-learn
from sklearn.datasets import load_breast_cancer
# Load the dataset with the data returned as pandas DataFrames
cancer = load_breast_cancer(as_frame=True)
# Full DataFrame (features plus target)
df = cancer.frame
# Check the shape
print('Original Dataframe shape :', df.shape)
# Input features only
X = df[cancer['feature_names']]
print('Inputs Dataframe shape   :', X.shape)
# Standardization: zero mean and unit variance for every feature
X_mean = X.mean()
X_std = X.std()
Z = (X - X_mean) / X_std
# Covariance matrix of the standardized data (the correlation matrix of X)
c = Z.cov()
# Plot the covariance matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(c)
plt.show()
# Eigen-decomposition of the covariance matrix
# (np.linalg.eigh would also work, since c is symmetric)
eigenvalues, eigenvectors = np.linalg.eig(c)
print('Eigen values:\n', eigenvalues)
print('Eigen values Shape:', eigenvalues.shape)
print('Eigen Vector Shape:', eigenvectors.shape)
# Indices that sort the eigenvalues in descending order
idx = eigenvalues.argsort()[::-1]
# Sort the eigenvalues in descending order
eigenvalues = eigenvalues[idx]
# Sort the corresponding eigenvectors accordingly
eigenvectors = eigenvectors[:, idx]
# Cumulative fraction of the total variance explained
explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var
# Smallest number of components that explains at least 50% of the variance
n_components = np.argmax(explained_var >= 0.50) + 1
n_components
# Projection matrix: the eigenvectors of the selected components
u = eigenvectors[:, :n_components]
pca_component = pd.DataFrame(u,
                             index=cancer['feature_names'],
                             columns=['PC1', 'PC2']  # n_components is 2 for this threshold
                            )
# plotting heatmap
plt.figure(figsize =(5, 7))
sns.heatmap(pca_component)
plt.title('PCA Component')
plt.show()
# Project the standardized data onto the principal components (matrix product)
Z_pca = Z @ pca_component
# Rename the columns
Z_pca.rename({'PC1': 'PCA1', 'PC2': 'PCA2'}, axis=1, inplace=True)
# Print the principal component values
print(Z_pca)
Or, equivalently, using scikit-learn's PCA class:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # Can be any size
pca.fit(Z)  # still need to scale data.
x_pca = pca.transform(Z)
# Create the dataframe
df_pca1 = pd.DataFrame(x_pca,
                       columns=['PC{}'.format(i + 1) for i in range(n_components)])
print(df_pca1)
plt.figure(figsize=(8, 6))
plt.scatter(x_pca[:, 0], x_pca[:, 1],
            c=cancer['target'],
            cmap='plasma')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
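Scikit-learn also reports the variance explained by each component directly, which replaces the manual cumulative sum computed above; a small addition using the pca object just fitted:

# Fraction of the total variance explained by each component
print(pca.explained_variance_ratio_)
# Cumulative fraction, analogous to explained_var computed by hand earlier
print(np.cumsum(pca.explained_variance_ratio_))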
12.1.6. PCA in R#
Not this year.
12.2. Alternatives to PCA#
(Friday)
PCA benefits and drawbacks:
Pros:
- Dimensionality Reduction: PCA effectively reduces the number of features, which is beneficial for models that suffer from the curse of dimensionality. 
- Feature Independence: Principal components are orthogonal (uncorrelated), meaning they capture independent information, simplifying the interpretation of the reduced features. 
- Noise Reduction: PCA can help reduce noise by focusing on the components that explain the most significant variance in the data. 
- Visualization: The reduced-dimensional data can be visualized, aiding in understanding the underlying structure and patterns. 
Cons:
- Loss of Interpretability: Interpretability of the original features may be lost in the transformed space, as principal components are linear combinations of the original features. 
- Assumption of Linearity: PCA assumes that the relationships between variables are linear, which may not be true in all cases. 
- Sensitive to Scaling: PCA is sensitive to the scale of the features, so standardization is often required. 
- Outliers Impact Results: Outliers can significantly impact the results of PCA, as it focuses on capturing the maximum variance, which may be influenced by extreme values. 
When to Use:
- High-Dimensional Data: PCA is particularly useful when dealing with datasets with a large number of features to mitigate the curse of dimensionality. 
- Collinear Features: When features are highly correlated, PCA can be effective in capturing the shared information and representing it with fewer components. 
- Visualization: PCA is beneficial when visualizing high-dimensional data is challenging. It projects data into a lower-dimensional space that can be easily visualized. 
- Linear Relationships: When the relationships between variables are mostly linear, PCA is a suitable technique. 
Further reading:
- https://elitedatascience.com/dimensionality-reduction-algorithms 
- https://medium.com/nerd-for-tech/dimensionality-reduction-techniques-pca-lca-and-svd-f2a56b097f7c 
Note
For science, the two main issues with PCA are that it does not propagate measurement uncertainties and that it is inherently linear.
12.2.1. Incorporating Uncertainties#
The main issue with PCA in science is that it does not account for uncertainties in the measurements.
- SNEMO uses Expectation Maximization Factor Analysis (EMFA): https://iopscience.iop.org/article/10.3847/1538-4357/aaec7e/pdf 
- scikit-learn provides a related factor analysis implementation (a short sketch follows below): https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FactorAnalysis.html 
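A minimal sketch of the scikit-learn FactorAnalysis linked above, reusing the standardized matrix Z from the earlier breast cancer example; note that it estimates a separate noise variance for each feature rather than taking your measured uncertainties as input:

from sklearn.decomposition import FactorAnalysis
# Fit a two-factor model; the loadings are stored in fa.components_
fa = FactorAnalysis(n_components=2, random_state=0)
Z_fa = fa.fit_transform(Z)
print('Factor scores shape:', Z_fa.shape)
# Estimated per-feature noise variances (plain PCA does not model these)
print(fa.noise_variance_)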
12.2.2. Doing nonlinear dimensionality reduction#
PCA is also inherently linear. Describing a shifted spectral line exactly would require infinitely many linear PCA components, whereas a non-linear dimensionality reduction can represent the same shift with a single parameter.
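As an illustration of that claim, here is a toy sketch of my own (not from the course) in which Isomap from scikit-learn recovers a single shift parameter from a family of shifted Gaussian "spectral lines":

import numpy as np
from sklearn.manifold import Isomap
rng = np.random.default_rng(0)
# A family of toy spectra that differ only by the position of one line
wavelength = np.linspace(0, 10, 200)
shifts = rng.uniform(3, 7, size=300)
spectra = np.exp(-0.5 * ((wavelength[None, :] - shifts[:, None]) / 0.3) ** 2)
# One non-linear coordinate recovers the shift (up to sign); linear PCA would
# need many components to describe the same family of spectra
embedding = Isomap(n_components=1).fit_transform(spectra)
print('Correlation between the recovered coordinate and the true shift:',
      np.corrcoef(embedding[:, 0], shifts)[0, 1])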
