From 1a0c9f0e6e952f15cdaf9456c9e2524f03cee613 Mon Sep 17 00:00:00 2001
From: "Christos A. Frantzidis"
Date: Mon, 4 Nov 2024 12:44:26 +0000
Subject: [PATCH] Adding workshop 6 notebook

---
 .../workshops/Week 7/Workshop 6.ipynb | 239 ++++++++++++++++++
 1 file changed, 239 insertions(+)
 create mode 100644 CMP3751 Machine Learning/workshops/Week 7/Workshop 6.ipynb

diff --git a/CMP3751 Machine Learning/workshops/Week 7/Workshop 6.ipynb b/CMP3751 Machine Learning/workshops/Week 7/Workshop 6.ipynb
new file mode 100644
index 0000000..fcaf163
--- /dev/null
+++ b/CMP3751 Machine Learning/workshops/Week 7/Workshop 6.ipynb
@@ -0,0 +1,239 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "__Machine Learning Workshop 6: Dimensionality Reduction__\n",
+    "<br>\n",
+    "Welcome to our sixth workshop!\n",
+    "\n",
+    "In this workshop we will investigate the fundamentals of __dimensionality reduction__ and how to apply the concept through Principal Component Analysis (PCA)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# The wine dataset\n",
+    "It is a classic and very easy __multi-class__ classification dataset.\n",
+    "It consists of __three (3)__ classes, __178 total instances__ and __13 real, positive__ features:\n",
+    "1. It is commonly associated with the __UCI Machine Learning Repository__.\n",
+    "1. The classes represent __three different types__ of wine.\n",
+    "1. There are __13 different chemical measurements (features)__ describing each wine sample in terms of alcohol content, malic acid concentration, color intensity, etc.\n",
+    "1. The dataset consists of __178 samples__, each corresponding to a single wine.\n",
+    "1. Typically, the data in the wine dataset is __preprocessed__ by __standardizing__ the features to have __zero mean__ and __unit variance__. This preprocessing step __ensures__ that each __feature contributes equally__ to the analysis.\n",
+    "\n",
+    "__Return arguments:__\n",
+    "1. __data__ as an ndarray/DataFrame of shape $(178, 13)$\n",
+    "1. __target__ as an ndarray/Series of shape $(178,)$\n",
+    "1. __feature_names__ as a list of the names of the dataset columns\n",
+    "\n",
+    "You can load and use the __wine dataset__ as follows:\n",
+    "```python\n",
+    "from sklearn.datasets import load_wine\n",
+    "\n",
+    "# Load the Wine dataset\n",
+    "wine = load_wine()\n",
+    "data = wine.data\n",
+    "target = wine.target\n",
+    "feature_names = wine.feature_names\n",
+    "```\n",
+    "\n",
+    "The __data pre-processing / standardization__ is performed as follows:\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "# Standardize the data (mean=0, variance=1)\n",
+    "mean = np.mean(data, axis=0)\n",
+    "std_dev = np.std(data, axis=0)\n",
+    "data_standardized = (data - mean) / std_dev\n",
+    "```\n",
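+    "\n",
+    "Equivalently, you could let scikit-learn handle the standardization with `StandardScaler`; this is just an optional sketch of the same step, not a required part of the workshop:\n",
+    "```python\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "\n",
+    "# fit_transform learns the per-feature mean and standard deviation, then applies the scaling\n",
+    "scaler = StandardScaler()\n",
+    "data_standardized = scaler.fit_transform(data)\n",
+    "```"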
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Principal Component Analysis / PCA\n",
+    "1. It is a class belonging to the __```sklearn.decomposition```__ module.\n",
+    "1. It performs __linear dimensionality reduction__ using __Singular Value Decomposition (SVD)__ of the data to __project__ it to a __lower-dimensional__ space.\n",
+    "\n",
+    "Its __key parameters__ are:\n",
+    "1. __`n_components`:__ This parameter specifies the number of principal components to retain. You can set it to an integer, or to a float between 0.0 and 1.0 indicating the __desired fraction of explained variance__.\n",
+    "1. __`whiten`:__ When set to __`True`__, it whitens the data (decorrelates the features) during the transformation.\n",
+    "\n",
+    "Its __output attributes__ are:\n",
+    "1. __`components_`:__ This attribute contains the principal components themselves as row vectors.\n",
+    "1. __`explained_variance_ratio_`:__ This attribute provides the ratio of variance explained by each of the selected components.\n",
+    "\n",
+    "Its __methods__ are:\n",
+    "1. __`fit(X)`:__ This method computes the principal components for the input data __`X`__.\n",
+    "1. __`transform(X)`:__ It transforms the input data __`X`__ into the new coordinate system defined by the __principal components__.\n",
+    "1. __`inverse_transform(X_reduced)`:__ This method allows you to transform reduced data back to the original feature space.\n",
+    "\n",
+    "__Useful Hint:__ We can use the __`fit`__ and __`transform`__ methods of PCA simultaneously when we want to __perform dimensionality reduction__ and __obtain the transformed data__ in __one step__. The __`fit_transform`__ method is available for this purpose, combining the __fitting and transformation steps__ into a __single call__ as follows:\n",
+    "```python\n",
+    "from sklearn.decomposition import PCA\n",
+    "\n",
+    "# Create a PCA object with the desired number of components\n",
+    "pca = PCA(n_components=2)\n",
+    "\n",
+    "# Fit and transform the data in one step\n",
+    "X_reduced = pca.fit_transform(X)\n",
+    "```\n",
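+    "\n",
+    "As an optional illustration (continuing from the snippet above, so `pca` and `X_reduced` are assumed to exist), the attributes and methods listed earlier could be inspected like this:\n",
+    "```python\n",
+    "# Each row of components_ is one principal component (a direction in the original feature space)\n",
+    "print(pca.components_.shape)\n",
+    "\n",
+    "# Fraction of the total variance captured by each retained component\n",
+    "print(pca.explained_variance_ratio_)\n",
+    "\n",
+    "# Map the reduced data back to (an approximation of) the original feature space\n",
+    "X_approx = pca.inverse_transform(X_reduced)\n",
+    "```"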
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 3D Visualization of the Principal Components\n",
+    "The __`mpl_toolkits.mplot3d`__ module includes classes and functions for creating various types of 3D plots. We import the __`Axes3D`__ class from the __`mpl_toolkits.mplot3d`__ module. This class is used to __create 3D axes__ for three-dimensional plots:\n",
+    "```python\n",
+    "from mpl_toolkits.mplot3d import Axes3D\n",
+    "```\n",
+    "The following line of code creates a new figure that is 8 inches wide and 6 inches high:\n",
+    "```python\n",
+    "plt.figure(figsize=(8, 6))\n",
+    "```\n",
+    "The __```plt.figure()```__ function is used to __create a new figure__ for the plot.\n",
+    "The __```figsize```__ argument allows you to set the dimensions of the figure.\n",
+    "\n",
+    "Then, a list called __```colors```__ is created, which contains three color codes. These colors will be used to distinguish the data points of the different classes in the 3D scatter plot:\n",
+    "```python\n",
+    "colors = ['r', 'g', 'b']\n",
+    "```\n",
+    "\n",
+    "Finally, the __```wine.target_names```__ variable contains the class labels or names for the data points in the wine dataset. It __represents__ the different types of wine. This line assigns these class names to the __```targets```__ variable:\n",
+    "```python\n",
+    "targets = wine.target_names\n",
+    "```\n",
+    "The __```plt.figure()```__ function creates a new, empty figure, which is the basis for our visualization.\n",
+    "The __```ax```__ variable represents the subplot (axis) you are adding to the figure. It is often referred to as the __`axes`__ object. This is where you will create and customize your 3D plot.\n",
+    "\n",
+    "The __```ax = fig.add_subplot(111, projection='3d')```__ call is used to add the 3D subplot:\n",
+    "1. The __```111```__ part specifies that you are adding a subplot to a grid with one row and one column, and that it is the first (and only) subplot.\n",
+    "1. The __```projection='3d'```__ part tells Matplotlib that the subplot should use a 3D projection, which is necessary for creating 3D plots. Setting the __`projection`__ parameter to __`'3d'`__ enables the 3D plotting capabilities.\n",
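+    "\n",
+    "Putting these pieces together, the figure and 3D axes could be set up as in the following sketch (assuming the `wine` dataset has been loaded as in the earlier cell):\n",
+    "```python\n",
+    "import matplotlib.pyplot as plt\n",
+    "from mpl_toolkits.mplot3d import Axes3D\n",
+    "\n",
+    "# Create the figure and add a single 3D subplot to it\n",
+    "fig = plt.figure(figsize=(8, 6))\n",
+    "ax = fig.add_subplot(111, projection='3d')\n",
+    "\n",
+    "# Colors and class names used later when plotting each class\n",
+    "colors = ['r', 'g', 'b']\n",
+    "targets = wine.target_names\n",
+    "```"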
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# How to create a 3D scatter plot for visualizing the principal components\n",
+    "1. Firstly, we initiate a __`for` loop__ that iterates over pairs of __`target`__ and __`color`__ obtained by __zipping__ two __lists__: __`targets`__ and __`colors`__. The __`targets`__ list contains the target class names and the __`colors`__ list contains the corresponding colors for the scatter plot.\n",
+    "```python\n",
+    "for target, color in zip(targets, colors):\n",
+    "```\n",
+    "1. The line __```indices_to_keep = reduced_df['Target'] == wine.target_names.tolist().index(target)```__ creates a boolean mask __```indices_to_keep```__ that is __`True`__ for the rows of the DataFrame __`reduced_df`__ whose __`Target`__ column (the original class labels) matches the index of the current __`target`__ class in the list of class names __`wine.target_names`__. It filters the data points belonging to the current target class.\n",
+    "1. The following lines generate a scatter plot for the data points belonging to the current __`target`__ class, using the 3D axes __`ax`__ for plotting:\n",
+    "```python\n",
+    "ax.scatter(reduced_df.loc[indices_to_keep, 'PC1'],\n",
+    "           reduced_df.loc[indices_to_keep, 'PC2'],\n",
+    "           reduced_df.loc[indices_to_keep, 'PC3'],\n",
+    "           c=color,\n",
+    "           label=target)\n",
+    "```\n",
+    "1. The code segment\n",
+    "```python\n",
+    "reduced_df.loc[indices_to_keep, 'PC1'],\n",
+    "reduced_df.loc[indices_to_keep, 'PC2'],\n",
+    "reduced_df.loc[indices_to_keep, 'PC3'],\n",
+    "```\n",
+    "selects the data points (PC1, PC2 and PC3) for the current target class using the boolean mask __`indices_to_keep`__.\n",
+    "1. The __`c=color`__ argument specifies the color of the points.\n",
+    "1. The __`label=target`__ argument provides a label for the legend. The complete loop is sketched below.\n",
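+    "\n",
+    "Putting the loop together, a sketch of the complete plotting code (assuming `reduced_df`, `targets`, `colors` and the 3D axes `ax` have been created as described above):\n",
+    "```python\n",
+    "for target, color in zip(targets, colors):\n",
+    "    # Boolean mask selecting the rows of the current class\n",
+    "    indices_to_keep = reduced_df['Target'] == wine.target_names.tolist().index(target)\n",
+    "    ax.scatter(reduced_df.loc[indices_to_keep, 'PC1'],\n",
+    "               reduced_df.loc[indices_to_keep, 'PC2'],\n",
+    "               reduced_df.loc[indices_to_keep, 'PC3'],\n",
+    "               c=color,\n",
+    "               label=target)\n",
+    "\n",
+    "# Label the axes, add the legend and display the figure\n",
+    "ax.set_xlabel('PC1')\n",
+    "ax.set_ylabel('PC2')\n",
+    "ax.set_zlabel('PC3')\n",
+    "ax.legend()\n",
+    "plt.show()\n",
+    "```"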
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Exercise 1\n",
+    "In this exercise you will use the code and knowledge from the cells above to perform dimensionality reduction on the Wine dataset. More specifically, you will perform the following steps:\n",
+    "1. Load the Wine dataset and separate the features and the output into the variables __`data`__ and __`target`__ respectively.\n",
+    "1. Standardize the data so that it has zero mean and unit variance.\n",
+    "1. Declare an object of the PCA class from `sklearn.decomposition`, initialized to retain 3 components.\n",
+    "1. Use the __`fit_transform()`__ method to fit and transform the data into a new __`reduced_data`__ variable.\n",
+    "1. Create a __`reduced_df`__ DataFrame to visualize the reduced data.\n",
+    "1. Once you have created the DataFrame, add the target variable to it.\n",
+    "1. Plot this DataFrame in 3D (you may reuse the previous code for this).\n",
+    "1. Estimate the explained variance ratio by using the __`explained_variance_ratio_`__ attribute.\n",
+    "1. Print the explained variance ratio for each component and the total variance explained."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Place your code here"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# A goodbye Exercise\n",
+    "This is an __optional__ exercise and a way to thank you for your collaboration this year.\n",
+    "Although it is a __strange gift__, it brings the knowledge taught so far together in a single exercise/project.\n",
+    "Hopefully, it will serve as a crash test of how well the lectures and workshops fit together.\n",
+    "More specifically:\n",
+    "1. Load the Breast Cancer dataset from `sklearn.datasets`.\n",
+    "1. Split the dataset into training and testing sets.\n",
+    "1. Train and evaluate k-NN and SVM models on the original data.\n",
+    "1. Define a function to perform PCA and evaluate the models.\n",
+    "1. Evaluate and return the results (accuracy, precision, recall, F1).\n",
+    "1. Compare the results before and after PCA.\n",
+    "1. Display the results."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# References\n",
+    "[UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/)\n",
+    "<br>\n",
+    "[PCA documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}