Dimensionality Reduction in Python
How feature extraction techniques can reduce dimensionality
This tutorial shares what I have learnt in Dimensionality Reduction in Python, capturing the learning objectives as well as my personal notes. The course is taught by Jeroen Boeye from DataCamp and consists of 4 chapters.
High-dimensional datasets are complex and can be computationally expensive to process. You can reduce dimensionality by dropping features that duplicate other features, dropping irrelevant features, and using feature extraction techniques (which compute uncorrelated principal components).
I learnt the following topics:
- Why dimensionality reduction is important and when to use it
- How to explore high-dimensional data
- How to identify duplicate features (high correlation in the correlation matrix)
- How to identify useless features (those with little or no variance)
- How to remove unimportant features
- The difference between feature selection and feature extraction, with their advantages and disadvantages
- High-dimensional data exploration with t-SNE and PCA
- The curse of dimensionality
- How to use models to find important features, e.g. using a Random Forest's feature importances
- Regularization in linear models using LassoCV
- How to understand and apply principal component analysis (PCA), and use PCA in a pipeline
- An example of PCA in image compression
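To illustrate the duplicate-feature check from the list above, here is a minimal sketch using a correlation matrix. The toy data and the 0.95 threshold are my own choices for illustration, not values from the course:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy dataset: "height_in" is an exact linear duplicate of "height_cm"
height_cm = rng.normal(170, 10, 100)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,
    "weight_kg": rng.normal(70, 8, 100),
})

# Absolute correlation matrix; keep only the upper triangle so each
# feature pair is considered once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column whose correlation with an earlier column exceeds 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # ['height_in']
```

Dropping one feature of each highly correlated pair removes the redundancy while keeping the information.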
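For the "useless features" point, scikit-learn's `VarianceThreshold` drops features whose variance falls below a cutoff. The synthetic data and the threshold value here are my own illustrative choices:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.normal(size=50),               # informative feature (variance ~1)
    np.ones(50),                       # constant feature: zero variance
    rng.normal(scale=0.001, size=50),  # near-constant feature
])

# Keep only features whose variance exceeds the threshold
selector = VarianceThreshold(threshold=1e-4)
X_reduced = selector.fit_transform(X)
print(selector.get_support())  # [ True False False]
```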
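A minimal sketch of high-dimensional data exploration with t-SNE, using the iris dataset as a stand-in (the course uses its own datasets): the 4-dimensional feature space is projected down to 2 dimensions so it can be plotted.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# Embed the 4-dimensional iris features into 2 dimensions;
# the result is typically plotted as a scatter plot colored by class
tsne = TSNE(n_components=2, random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)  # (150, 2)
```

Note that t-SNE is for visualization only: it has no `transform` method for new data, unlike PCA.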
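The Random Forest feature-importance idea can be sketched as follows; the synthetic dataset (with `shuffle=False` so the 3 informative features come first) is my own construction for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False, the first 3 features are informative
# and the remaining 5 are pure noise
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Importances sum to 1; higher values mean the feature contributed
# more to the forest's splits
importances = rf.feature_importances_
print(importances.round(2))
```

Features with near-zero importance are candidates for removal.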
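For regularization with `LassoCV`, a minimal sketch on synthetic regression data (the dataset parameters are my own illustrative choices): the L1 penalty shrinks uninformative coefficients to exactly zero, which acts as feature selection.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# LassoCV chooses the regularization strength alpha by cross-validation
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
n_kept = int(np.sum(lasso.coef_ != 0))
print(f"alpha={lasso.alpha_:.3f}, features kept: {n_kept} of 10")
```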
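Using PCA inside a scikit-learn pipeline can be sketched like this; the digits dataset, the choice of 20 components, and the logistic regression classifier are my own illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, reduce the 64 pixel features to 20 principal components, classify
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

Putting scaling and PCA inside the pipeline ensures both are fitted on the training folds only, avoiding data leakage during cross-validation.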
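Finally, a sketch of the PCA image-compression idea, again using the 8x8 digits images as a stand-in dataset and 10 components as an arbitrary choice: projecting onto a few components and reconstructing with `inverse_transform` gives a lossy compression.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 8x8 digit images flattened to 64 features

# Keep 10 components, then reconstruct: a lossy "compression" of the images
pca = PCA(n_components=10).fit(X)
X_compressed = pca.transform(X)
X_reconstructed = pca.inverse_transform(X_compressed)

print(X.shape, X_compressed.shape)  # (1797, 64) (1797, 10)
print(f"variance explained: {pca.explained_variance_ratio_.sum():.2f}")
```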
More notes and codes can be found on my GitHub.
Overall, I enjoyed this course and would highly recommend it!