Commit 1a0c9f0 (1 parent: 3c2ee36)
Showing 1 changed file with 239 additions and 0 deletions.
CMP3751 Machine Learning/workshops/Week 7/Workshop 6.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,239 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<span style=\"font-size: 28px;\"> __Machine Learning Workshp 6 : Dimensionality Reduction__.</span>\n", | ||
"<br>\n", | ||
"<br>\n", | ||
"<div style=\"text-align: center;\">\n", | ||
"<span style=\"font-size: 22px;\"> Welecome to our sixth workshop! </span> </div>\n", | ||
"\n", | ||
"In this workshop we will investigate the fundamentals of the __Dimensionality Reduction__ and how to apply its concept through Principal Component Analysis." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# The wine dataset\n", | ||
"It is a classic and very easy __multi-class__ classification dataset.\n", | ||
"It consists of __three (3)__ classes, __178 total instances__ and __13 real, positive__ features:\n", | ||
"1. It is commonly associated with the __UCL Machine Learning Repository__.\n", | ||
"1. The classes represent __three different types__ of wine.\n", | ||
"1. There are __13 different chemical measurements (features)__ describing each wine sample in terms of alcohol content, malic acid concentration, color intensity, etc.\n", | ||
"1. The dataset typically consists of __178 samples__, with each sample corresponding to one bottle of wine.\n", | ||
"1. Typically, the data in the wine dataset is __preprocessed__ by __standardizing__ the features to have __zero mean__ and __unit variance__. This preprocessing step __ensures__ that each __feature contributes equally__ to the analysis.\n", | ||
"\n", | ||
"__Return Arguments:__\n", | ||
"1. __data__ as ndarray/dataframe of shape $(178,13)$\n", | ||
"1. __target__ as ndarray/series of shape $(178,)$\n", | ||
"1. __feature_names__ as a list of the names of the dataset columns\n", | ||
"\n", | ||
"You could load and use the __wine dataset__ as follows:\n", | ||
"```python\n", | ||
"from sklearn.datasets import load_wine\n", | ||
"# Load the Wine dataset\n", | ||
"wine = load_wine()\n", | ||
"data = wine.data\n", | ||
"target = wine.target\n", | ||
"feature_names = wine.feature_names\n", | ||
"```\n", | ||
"\n", | ||
"The __data pre-processing / standardization__ is being performed as follows:\n", | ||
"```python\n", | ||
"# Standardize the data (mean=0, variance=1)\n", | ||
"mean = np.mean(data, axis=0)\n", | ||
"std_dev = np.std(data, axis=0)\n", | ||
"data_standardized = (data - mean) / std_dev\n", | ||
"```" | ||
] | ||
}, | ||
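{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"__Checking the standardization:__ the following is a minimal sketch, not part of the original workshop code, that repeats the loading and standardization steps above and then checks that every feature ends up with a mean close to 0 and a standard deviation close to 1. Using __`np.allclose`__ for the check is our own choice.\n", | ||
"```python\n", | ||
"import numpy as np\n", | ||
"from sklearn.datasets import load_wine\n", | ||
"\n", | ||
"wine = load_wine()\n", | ||
"data = wine.data\n", | ||
"\n", | ||
"# Standardize each column: subtract its mean, divide by its standard deviation\n", | ||
"mean = np.mean(data, axis=0)\n", | ||
"std_dev = np.std(data, axis=0)\n", | ||
"data_standardized = (data - mean) / std_dev\n", | ||
"\n", | ||
"# The shape is unchanged; only the scale of each feature differs\n", | ||
"print(data_standardized.shape)\n", | ||
"# Both checks compare all 13 per-feature values at once\n", | ||
"print(np.allclose(data_standardized.mean(axis=0), 0.0))\n", | ||
"print(np.allclose(data_standardized.std(axis=0), 1.0))\n", | ||
"```\n", | ||
"Note that __`np.std`__ divides by $N$ by default (`ddof=0`), which is also the convention used by scikit-learn's __`StandardScaler`__." | ||
] | ||
}, | ||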
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Principal Component Analysis / PCA\n", | ||
"1. It is a class belonging to the __```sklearn.decomposition```__.\n", | ||
"1. it performs __linear dimensionality reduction__ using __Singular Value Decomposition / SVD of the data to __project__ it to a __lower dimensional__ space.\n", | ||
"<br>\n", | ||
"<br>\n", | ||
"Its __key parameters__ are:\n", | ||
"1. __`n_components1:__ This parameter specifies the number of principal components to retain. You can set it to an integer or a float between 0.0-1.0, indicating the __desired variance__ explained.\n", | ||
"1. __`whiten`:__ When set to __'True'__. it whitens the data (decorrelates features) during the transformation.\n", | ||
"\n", | ||
"Its __output arguments__ are:\n", | ||
"1. __`components_`:__ This attribute contains the principal components themselves as row vectors.\n", | ||
"1. __`explained variance ratio_`:__ This attribute provides the ratio of variance explained by each of the selected components.\n", | ||
"\n", | ||
"Its __methods__ are:\n", | ||
"1. __`fit(X)`:__This method computes the principal components for the input data __`X`__.\n", | ||
"1. __`transform(X)`:__ It transforms the input data __`X`__ into the new coordinate system defined by the __principal components__.\n", | ||
"1. __`inverse_transform(X_reduced)`:__ This method allows you to transform reduced data back to the original feature space.\n", | ||
"\n", | ||
"__Useful Hint:__ We can use the __`fit`__ and __`transform`__ methods of PCA simultaneously, when we want to __perform dimensionality reduction__ and __obtain the transformed data__ in __one step__. The __`fit_transform`__ method is available for this purpose, combining the __fitting and transformation steps__ into a __single call__as follows:\n", | ||
"```python\n", | ||
"from sklearn.decomposition import PCA\n", | ||
"\n", | ||
"# Create a PCA object with the desired number of components\n", | ||
"pca = PCA(n_components=2)\n", | ||
"\n", | ||
"# Fit and transform the data in one step\n", | ||
"X_reduced = pca.fit_transform(X)\n", | ||
"```" | ||
] | ||
}, | ||
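{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"__Trying out the parameters and attributes:__ the sketch below is an illustration we added, under the assumption that the standardized Wine data is used as __`X`__. It passes a float to __`n_components`__ (the 0.90 threshold is an arbitrary choice), inspects the fitted attributes and uses __`inverse_transform`__ to measure how much information the projection loses. Setting `svd_solver='full'` is a conservative choice so that the float form of `n_components` is accepted.\n", | ||
"```python\n", | ||
"import numpy as np\n", | ||
"from sklearn.datasets import load_wine\n", | ||
"from sklearn.decomposition import PCA\n", | ||
"\n", | ||
"# Standardize the Wine features first, as in the previous section\n", | ||
"X = load_wine().data\n", | ||
"X = (X - X.mean(axis=0)) / X.std(axis=0)\n", | ||
"\n", | ||
"# Keep as many components as needed to explain at least 90% of the variance\n", | ||
"pca = PCA(n_components=0.90, svd_solver='full')\n", | ||
"X_reduced = pca.fit_transform(X)\n", | ||
"\n", | ||
"print('components kept:', pca.n_components_)\n", | ||
"print('components_ shape:', pca.components_.shape)\n", | ||
"print('variance explained:', pca.explained_variance_ratio_.sum())\n", | ||
"\n", | ||
"# Map the reduced data back to the 13 original features (lossy reconstruction)\n", | ||
"X_restored = pca.inverse_transform(X_reduced)\n", | ||
"print('mean squared reconstruction error:', np.mean((X - X_restored) ** 2))\n", | ||
"```\n", | ||
"The reconstruction error is non-zero because the discarded components carried the remaining share of the variance." | ||
] | ||
}, | ||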
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# 3D Visualization of the Principal Components\n", | ||
"The __`mpl_tookkits.mplot3d`__ module includes classes and functions for creating various types of 3D plots. We import the __`Axes3D`__ class from the __`mpl_tookkits.mplot3d`__ module. This class is used to __create 3D axes__ for three-dimensional plots:\n", | ||
"```python\n", | ||
"from mpl_toolkits.mplot3d import Axes3D\n", | ||
"```\n", | ||
"The following line of code is used to create a new figure with size of 8 inches in width and 6 inches in height:\n", | ||
"```python\n", | ||
"plt.figure(figsize=(8, 6))\n", | ||
"```\n", | ||
"The __```plt.figure()```__ function is used to __create a new figure__ for the plot.\n", | ||
"The __```figsize```__ allows you to set the dimensions of the figure.\n", | ||
"\n", | ||
"Then, a list called __```colors```__ is created, which contains three color codes. These colors will be used to distinguish data points for different classes in the 3D scatter plot:\n", | ||
"```python\n", | ||
"colors = ['r', 'g', 'b']\n", | ||
"```\n", | ||
"\n", | ||
"Finally, the __```wine.target_names```__ variable contains the class labels or names for the data points in the wine dataset. It __represents__ the different types of wine. This line assigns these class names to the __```targets```__ variable:\n", | ||
"```python\n", | ||
"targets = wine.target_names\n", | ||
"```\n", | ||
"The __```plt.figure()```__ function creates a new, empty figure which is the basis for our visualization.\n", | ||
"The __```ax```__ is a variable that represents the subplot(axis) you are adding to the figure. It's often referred to as the __`axes`__ object. This is where you will create and customize your 3D plot.\n", | ||
"\n", | ||
"The __```ax = fig.add_subplot(111, projection='3d')```__ is used to add the 3D subplot:\n", | ||
"1. The __```111```__ part specifies that you are adding a subplot in a grid with one row, one column and it is the first (and only) subplot.\n", | ||
"1. The __```projection='3d' ```__ part tells Matplotlib that the subplot should be in 3D projection, which is necessary for creating 3D plots. The __`projection`__ parameter is set to __`3D`__ to enable 3D plotting capabilities. " | ||
] | ||
}, | ||
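{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"__Setting up the 3D axes:__ as a small sketch of the setup described above, the lines below create the figure and the 3D axes and label the three axes. The axis labels are our own assumption about how the plot will be used; the `(1, 1, 1)` form is simply the spelled-out equivalent of `111`.\n", | ||
"```python\n", | ||
"import matplotlib.pyplot as plt\n", | ||
"from mpl_toolkits.mplot3d import Axes3D  # kept for consistency with the import shown above\n", | ||
"\n", | ||
"# Empty canvas, 8 inches wide and 6 inches high\n", | ||
"fig = plt.figure(figsize=(8, 6))\n", | ||
"\n", | ||
"# One row, one column, first subplot: the same as add_subplot(111, ...)\n", | ||
"ax = fig.add_subplot(1, 1, 1, projection='3d')\n", | ||
"\n", | ||
"# A 3D axes object has a z axis in addition to x and y\n", | ||
"ax.set_xlabel('PC1')\n", | ||
"ax.set_ylabel('PC2')\n", | ||
"ax.set_zlabel('PC3')\n", | ||
"```\n", | ||
"Recent Matplotlib versions register the 3D projection automatically, so the __`Axes3D`__ import mainly matters on older installations." | ||
] | ||
}, | ||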
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# How to create a 3D scatter plot for visualizing the principal components\n", | ||
"1. Firstly, we initiate a __`for loop`__ that iterates over pairs of __`target`__ and __`color`__ obtained by __zipping__ two __lists__: __`targets`__ and __`colors`__. The `targets`__ list contains the target class names and __`colors`__ list contains the corresponding colors for the scatter plot.\n", | ||
"```python\n", | ||
"for target, color in zip(targets, colors):\n", | ||
"```\n", | ||
"1. The line __```indices_to_keep = reduced_df['Target'] == wine.target_names.tolist().index(target)```__ creates a boolean mask __```indices_to_keep```__ that is __`True`__ for rows in the DataFrame __`reduced_df`__, where the __`Target`__ column (the original class labels) matches the index of the current __`target`__ class in the list of class names __`wine.target_names`__. It helps filter the data points belonging to the current target class.\n", | ||
"1. The following lines generate a scatter plot for the data points belonging to the current __`target`__ class. It uses the 3D axes __`ax`__ for plotting: \n", | ||
"```python\n", | ||
"ax.scatter(reduced_df.loc[indices_to_keep, 'PC1'],\n", | ||
" reduced_df.loc[indices_to_keep, 'PC2'],\n", | ||
" reduced_df.loc[indices_to_keep, 'PC3'],\n", | ||
" c=color,\n", | ||
" label=target)\n", | ||
"```\n", | ||
"1. The code segment \n", | ||
"```python\n", | ||
"reduced_df.loc[indices_to_keep, 'PC1'],\n", | ||
" reduced_df.loc[indices_to_keep, 'PC2'],\n", | ||
" reduced_df.loc[indices_to_keep, 'PC3'],\n", | ||
"```\n", | ||
"select the data points (PC1, PC2, and PC3) for the current target class using the boolean mask __`indices_to_keep`__.\n", | ||
"1. The __`c=color`__ specifies the color of the points.\n", | ||
"1. The __`label=target`__ provides a label for the legend." | ||
] | ||
}, | ||
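{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"__Putting the loop together:__ the sketch below assembles the fragments above into one runnable cell so that the plotting logic can be tried in isolation. It is only an illustration: the __`PC1`__, __`PC2`__ and __`PC3`__ columns are filled with __random placeholder numbers__ (our assumption, not real PCA output), and the axis labels and the legend call are our own additions.\n", | ||
"```python\n", | ||
"import numpy as np\n", | ||
"import pandas as pd\n", | ||
"import matplotlib.pyplot as plt\n", | ||
"from mpl_toolkits.mplot3d import Axes3D  # kept for consistency with the import shown earlier\n", | ||
"from sklearn.datasets import load_wine\n", | ||
"\n", | ||
"wine = load_wine()\n", | ||
"\n", | ||
"# Placeholder coordinates: random numbers standing in for the real principal components\n", | ||
"rng = np.random.default_rng(0)\n", | ||
"reduced_df = pd.DataFrame(rng.normal(size=(len(wine.target), 3)),\n", | ||
"                          columns=['PC1', 'PC2', 'PC3'])\n", | ||
"reduced_df['Target'] = wine.target\n", | ||
"\n", | ||
"fig = plt.figure(figsize=(8, 6))\n", | ||
"ax = fig.add_subplot(111, projection='3d')\n", | ||
"\n", | ||
"targets = wine.target_names\n", | ||
"colors = ['r', 'g', 'b']\n", | ||
"for target, color in zip(targets, colors):\n", | ||
"    # Boolean mask that is True for the rows of the current class\n", | ||
"    indices_to_keep = reduced_df['Target'] == wine.target_names.tolist().index(target)\n", | ||
"    ax.scatter(reduced_df.loc[indices_to_keep, 'PC1'],\n", | ||
"               reduced_df.loc[indices_to_keep, 'PC2'],\n", | ||
"               reduced_df.loc[indices_to_keep, 'PC3'],\n", | ||
"               c=color,\n", | ||
"               label=target)\n", | ||
"\n", | ||
"ax.set_xlabel('PC1')\n", | ||
"ax.set_ylabel('PC2')\n", | ||
"ax.set_zlabel('PC3')\n", | ||
"ax.legend()  # uses the label passed to each scatter call\n", | ||
"plt.show()\n", | ||
"```\n", | ||
"With random placeholders the three classes overlap completely; once the columns hold real principal components, the same loop shows how well the classes separate." | ||
] | ||
}, | ||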
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Exercise 1\n", | ||
"In this exercise you will use the code and knowledge from the aforementioned cells to perform dimensionality reduction in the Wine Dataset. More specifically, you will perform the following steps:\n", | ||
"1. Load the Wine dataset and differentiate the features and the output in the variables __`data`__ and __`target`__ respectively.\n", | ||
"1. Then standardize the data so as to have zero mean value and unit variance.\n", | ||
"1. Declare an object of the PCA class from the sklearn.decomposition. This model will be initialized so as to retain 3 components.\n", | ||
"1. Then use the __`fit_transform()`__ function to fit and transform the data to a new __`reduced_data`__ function.\n", | ||
"1. Then create a __`reduced_df`__ DataFrame to visualize the reduced data. \n", | ||
"1. Once you created the DataFrame, you should add now the target variable to the DataFrame.\n", | ||
"1. Then, plot this DataFrame in 3D (You may use the previous code for this).\n", | ||
"1. Estimate the explained variance ratio by using the __`explained_variance_ratio`__ attribute.\n", | ||
"1. Print the explained variance ratio for each component and the total variance explained." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Place your code here" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# A goodbye Exercise\n", | ||
"This is an __optional__ exercise and a way to thank you for your collaboration this year.\n", | ||
"Although it is a __strange gift__, it serves to integrate the knowledge taught in a single exercise/project.\n", | ||
"Hopefully, it would be a crash test for the succession of the lessons and workshops.\n", | ||
"More specifically:\n", | ||
"1. Load the Breast Cancer dataset from sklearn.datasets.\n", | ||
"1. Split the dataset into training and testing sets\n", | ||
"1. Train and evaluate k-NN and SVM models on the original data\n", | ||
"1. Define a function to perform PCA and evaluate models\n", | ||
"1. Evaluate and return results (accuracy, precision, recall, F1)\n", | ||
"1. Compare the results before and after PCA\n", | ||
"1. Display the results" | ||
] | ||
}, | ||
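{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"__A possible starting point:__ the sketch below shows one way the steps above could fit together; it is not the reference solution. The helper name __`evaluate`__, the 70/30 split, `random_state=0`, the choice of 5 principal components and the default hyperparameters are all assumptions made for illustration, and you are encouraged to vary them.\n", | ||
"```python\n", | ||
"from sklearn.base import clone\n", | ||
"from sklearn.datasets import load_breast_cancer\n", | ||
"from sklearn.decomposition import PCA\n", | ||
"from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score\n", | ||
"from sklearn.model_selection import train_test_split\n", | ||
"from sklearn.neighbors import KNeighborsClassifier\n", | ||
"from sklearn.pipeline import make_pipeline\n", | ||
"from sklearn.preprocessing import StandardScaler\n", | ||
"from sklearn.svm import SVC\n", | ||
"\n", | ||
"X, y = load_breast_cancer(return_X_y=True)\n", | ||
"X_train, X_test, y_train, y_test = train_test_split(\n", | ||
"    X, y, test_size=0.3, random_state=0, stratify=y)\n", | ||
"\n", | ||
"def evaluate(model):\n", | ||
"    # Fit on the training split and score on the held-out test split\n", | ||
"    model.fit(X_train, y_train)\n", | ||
"    y_pred = model.predict(X_test)\n", | ||
"    return {'accuracy': accuracy_score(y_test, y_pred),\n", | ||
"            'precision': precision_score(y_test, y_pred),\n", | ||
"            'recall': recall_score(y_test, y_pred),\n", | ||
"            'f1': f1_score(y_test, y_pred)}\n", | ||
"\n", | ||
"results = {}\n", | ||
"for name, clf in [('k-NN', KNeighborsClassifier()), ('SVM', SVC())]:\n", | ||
"    # Same classifier without and with a PCA step; the scaler and PCA are fitted on the training data only\n", | ||
"    results[name] = evaluate(make_pipeline(StandardScaler(), clone(clf)))\n", | ||
"    results[name + ' + PCA'] = evaluate(\n", | ||
"        make_pipeline(StandardScaler(), PCA(n_components=5), clone(clf)))\n", | ||
"\n", | ||
"for name, scores in results.items():\n", | ||
"    print(name, {metric: round(value, 3) for metric, value in scores.items()})\n", | ||
"```\n", | ||
"Wrapping the scaler, PCA and classifier in one pipeline keeps the test split out of every fitting step, which makes the comparison before and after PCA fair." | ||
] | ||
}, | ||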
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# References\n", | ||
"[UV Irvine Machine Learning Repository](https://archive.ics.uci.edu/)\n", | ||
"<br>\n", | ||
"[PCA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |