diff --git a/CMP3751 Machine Learning/workshops/Week 4/Workshop 4.ipynb b/CMP3751 Machine Learning/workshops/Week 4/Workshop 4.ipynb new file mode 100644 index 0000000..68be281 --- /dev/null +++ b/CMP3751 Machine Learning/workshops/Week 4/Workshop 4.ipynb @@ -0,0 +1,389 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " __Machine Learning Workshp 4 : The k Nearest Neighbors (kNN) classifier__.\n", + "
\n", + "
\n", + "
\n", + " Welecome to our fourth workshop!
\n", + "
\n", + "In this workshop we will investigate the fundamentals of the kNN classifier and then we will enhance its functionality by finding the optimal k value." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__Table of Contents__\n", + "1. [Background: The kNN Classifier](#Background:-The-kNN-Classifier)\n", + "1. [Exercise 1: Classification](#Exercise-1:-Classification)\n", + "1. [The function make_moons](#The-function-make_moons)\n", + "1. [The numpy.meshgrid function](#The-numpy.meshgrid-function)\n", + "1. [The numpy.ravel() function](#The-numpy.ravel()-function)\n", + "1. [Decision boundary visualization](#Decision-boundary-visualization)\n", + "1. [Exercise 2: Nonlinear classification](#Exercise-2:-Nonlinear-classification)\n", + "1. [kNN Regression](#kNN-Regression)\n", + "1. [Sorting the independent variable](#Sorting-the-independent-variable)\n", + "1. [Nonlinear Data Generation Tips](#Nonlinear-Data-Generation-Tips)\n", + "1. [k-fold cross validation function](#k-fold-cross-validation-function)\n", + "1. [Exercise 3 / Optional](#Exercise-3-/-Optional)\n", + "1. [References](#References)\n", + "1. [Evaluation](#Evaluation)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Background: The kNN Classifier\n", + "It is a classifier implementing the __k-nearest neighbors__ vote.\n", + "Its parameters are:\n", + "1. __n-neighbors__: The number of neighbors to use (int). The default value is 5.\n", + "1. __weights__: {'uniform','distance'}: When we select 'uniform', all points in each neighborhood are weighted equally. The 'distance' option weights points by the inverse of their distance so that the closer neighbors of a query point will have a greater influence than the ones that are further away.\n", + "1. __algorithm__:{'ball_tree','kd_tree','brute','auto'} It is the algorithm used to compute the nearest neighbors. The __auto__ option will attempt to decide the most appropriate algorithm based on the values passed to __fit method__. \n", + "\n", + "Its attributes are:\n", + "1. Class labels known to the classifier: __classes_: array of shape(n_classes)__\n", + "1. Number of features seen during fit: __n_features_in: int__\n", + "1. Names of features seen during fit: __feature_names_in_: ndarray of shape (n_features_in_)__\n", + "\n", + "Below there is the Python code for a kNN Classification example:\n", + "```python\n", + "import numpy as np\n", + "from sklearn.neighbors import KNeighborsClassifier\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Generate sample data with 10 instances per class\n", + "np.random.seed(0)\n", + "\n", + "classes_ = [0] * 10 + [1] * 10\n", + "feature_names_in = [\"Feature1\", \"Feature2\"]\n", + "features_in = np.array([[1.2, 3.4], [2.3, 4.0], [1.9, 3.6], [4.5, 5.5], [5.0, 6.3],\n", + " [8.2, 7.4], [7.3, 8.0], [8.9, 7.6], [9.5, 9.5], [10.0, 10.3],\n", + " [2.5, 2.7], [3.0, 3.5], [3.9, 3.0], [5.5, 6.0], [6.2, 5.8],\n", + " [7.2, 6.7], [6.5, 7.0], [7.0, 7.2], [6.0, 6.0], [5.2, 5.4]])\n", + "\n", + "# Initialize the kNN classifier with the number of neighbors (k)\n", + "k = 3\n", + "knn = KNeighborsClassifier(n_neighbors=k)\n", + "\n", + "# Fit the model using your data\n", + "knn.fit(features_in, classes_)\n", + "\n", + "# Example data point for prediction\n", + "new_data_point = [[3.1, 4.2]]\n", + "\n", + "# Predict the class for the new data point\n", + "predicted_class = knn.predict(new_data_point)\n", + "\n", + "# Visualize the data\n", + "plt.figure(figsize=(8, 6))\n", + "plt.scatter(features_in[:10, 0], features_in[:10, 1], c='b', label='Class 0', marker='o')\n", + "plt.scatter(features_in[10:, 0], features_in[10:, 1], c='r', label='Class 1', marker='x')\n", + "plt.scatter(new_data_point[0][0], new_data_point[0][1], c='g', label='New Data Point', marker='s')\n", + "plt.xlabel(feature_names_in[0],fontsize=14)\n", + "plt.ylabel(feature_names_in[1], fontsize=14)\n", + "plt.grid(True)\n", + "plt.legend()\n", + "plt.title(\"kNN Classification with Sample Data\")\n", + "plt.show()\n", + "\n", + "print(\"Feature Names:\", feature_names_in)\n", + "print(\"New Data Point:\", new_data_point[0])\n", + "print(f\"The predicted class for {new_data_point[0]} is: {predicted_class[0]}\")\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 1: Classification\n", + "In this assignment we will work with the Iris dataset and we will perform kNN classification as follows:\n", + "1. Load the Iris dataset and create the X variable containing the independent variables and the y dataset containing the class label.\n", + "1. Extract the independent variables (features) from the Iris dataset and assign them to the sepal length, sepal_width, petal_length and petal_width.\n", + "1. Create a scatterplot for sepal length vs, sepal width.\n", + "1. Split the data into training and test sets.\n", + "1. Scale the features using StandardScaler\n", + "1. Perform the kNN classification and report the obtained accuracy for a variety of number of neighbors (e.g. 2-7).\n", + "1. Which is the number of neighbours that gave you the best accuracy?" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# Place your code here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The function make_moons\n", + "1. It is a simple toy dataset to demonstrate either clustering or classification algorithms.\n", + "1. It is a function for making two interleaving half circles.\n", + "1. It is particularly useful to deal with __nonlinear decision boundaries__.\n", + "
\n", + "
\n", + "1. Its __parameters are__:\n", + " 1. n_samples: If __int__ the total number of points generated. If two-element tuple, the number of points in each two moons.\n", + " 1. shuffle: Whether to __shuffle the samples__ (bool, default=True).\n", + " 1. noise: Standard deviation of __Gaussian noise__ added to the data (float, default=None).\n", + " 1. random_state (default=None): Determines __random number generation__ for __dataset shuffling and noise__. It passes an __int__ for __reproducible output__ across multiple function calls.\n", + "
\n", + "
\n", + "1. Its __output__ is:\n", + " 1. The generated samples __X__ :ndarray of shape (n_samples,2).\n", + " 1. The __integer labels (0 or 1)__ for class membership of each sample __y__: ndarray of shape n_samples.\n", + "\n", + "For example you may call the function as follows:\n", + "```python\n", + "# Generate synthetic data with a nonlinear decision boundary\n", + "X, y = make_moons(n_samples=200, noise=0.3, random_state=42)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The numpy.meshgrid function\n", + "1. It returns a __list of coordinate matrices__ from __coordinate vectors__.\n", + "1. It is used to __create a rectangular grid__ out of __two given one-dimensional arrays__ representing the Cartesian indexing.\n", + "1. If the x-axis ranges from $-x1...x1$ and the y-axis ranges from $-y1...y1$ integer (for simplicity) valuesm then there are a __total__ of $(2*x1+1) * (2*y1+1)$ points marked in the figure each with a X-coordinate and a Y-coordinate.\n", + "1. The ```numpy.meshgrid()``` function returns two 2-Dimensional arrays representing the X and Y coordinates of all the points.\n", + "\n", + "The following Python code segment uses the ```numpy.meshgrid``` function to create a grid xith x-values from -3 to 3 and y-values from -5 to 5 with step 0.5:\n", + "```python\n", + "import numpy as np\n", + "\n", + "# Define the range and step size\n", + "x_range = np.arange(-3, 3.5, 0.5)\n", + "y_range = np.arange(-5, 5.5, 0.5)\n", + "\n", + "# Create the mesh grid\n", + "xx, yy = np.meshgrid(x_range, y_range)\n", + "\n", + "# Print the generated grids\n", + "print(\"xx (X coordinates):\\n\", xx)\n", + "print(\"\\nyy (Y coordinates):\\n\", yy)\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The numpy.ravel() function\n", + "1. The ```numpy.ravel()``` function returns a contiguous flattened array.\n", + "1. It is an 1-D array with all the inout array elements and with the same type as the input.\n", + "1. It is equivalent to ```reshape(-1)```.\n", + "1. The following code segment flattens a 2D array to a 1D:\n", + "```python\n", + "import numpy as np\n", + "array=np.arange(15).reshape(3,5)\n", + "print(array)\n", + "arr1=array.ravel()\n", + "print(\"Flattening array: \",arr1)\n", + "```\n", + "1. The ```numpy.c_[xx.ravel(),yy.ravel()]``` is a NumPy operation that concatenates the flattened X and Y coordinates into a single 2D array." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Decision boundary visualization\n", + "We use the following code:\n", + "```python\n", + "plt.contourf(xx, yy, Z, alpha=0.8)\n", + "```\n", + "1. The ```contourf``` function from ```matplotlib``` is used to create __filled contour plots__.\n", + "1. A __filled contour plot__ is a graphical representation of a 3D surface where __regions of different values__ are filled with __different colors__.\n", + "1. It is often used in __machine learning__ to __visualize decision boundaries__.\n", + "1. The ```xx``` and ```yy``` are the __X__ and __Y__ coordinate grids created by ```np.meshgrid```. So, they __define__ the __grid of points__ in the __2D feature space__, where ```Z``` will be evaluated.\n", + "1. ```Z``` is the array of __predicted class labels__, typically generated by a machine learning model, for __each point in the feature space grid__.\n", + "1. The ```contourf``` function will use these labels to __determine how to fill different regions__ with colors based on the __class predictions__.\n", + "1. The __alpha parameter__ specifies the __opacity of the filled regions__ in the contour plot. It varies between __0 (transparent)__ and __1 (opaque)__.\n", + "1. The ```plt.cm``` is an __attribute__ of the ```matplotlib``` library that provides access to a variety of __built-in colormaps__. It is used to __specify the colormap__ you wish to apply to your plot." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 2: Nonlinear classification\n", + "In this exercise you will use the KNN classifier to find the optimal classification settings for nonlinear classification:\n", + "1. Firstly, you will create 200 synthetic instances with a nonlinear decision boundary through the make_moons function. Set the noise parameter (standard deviation of Gaussian noise) at 0.3.\n", + "1. Split the data into training and test sets.\n", + "1. Initialize variables to store the best k and corresponding accuracy.\n", + "1. Define a range of k values to test.\n", + "1. For each value within that range, calculate the accuracy on the test set, check if this k value performs the best fit so far and update the best k parameter accordingly.\n", + "1. Once you find the optimal parameter, train with that the kNN classifiee.\n", + "1. Make predictions on the test data.\n", + "1. Calculate accuracy, precision and recall.\n", + "1. Visualize the data points with the class information and the decision boundary.\n", + "1. Display the evaluation metrics." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# kNN Regression\n", + "1. We use the __[sklearn.neighbors.KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)__\n", + "1. It performs regression analysis based on k-nearest neighbors.\n", + "1. The __kNN regression algorithm__ relies on the principle that __similar data points__ should have __similar target values__.\n", + "1. It makes predictions by finding the __k nearest data points__ to the __query point__ (the point for which you want to make a prediction).\n", + "1. For __regression__, instead of __predicting a class label__, it predicts the __target value__ by taking the __average (simple or weighted)__ of the target values of the __k nearest neighbors__.\n", + "1. The __predicted value__ is a __continuous__ number.\n", + "1. It is __useful__ in scenarios where the relationship between the input features and the target variable is __not strictly linear__ and can have __complex patterns__." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Sorting the independent variable\n", + "1. We use the sort() function from the Numpy library.\n", + "1. Its syntax is ```np.sort(a, axis=-1,kind=None)```\n", + "1. It returns a __sorted copy__ of an array.\n", + "1. The parameter __a__ denotes the array to be sorted.\n", + "1. The optional parameter __axis__ (int/None), denotes the axis along which to sort. If __None__, the array is flattened before sorting. The __default__ is $-1$, which sorts along the last axis.\n", + "1. The optional parameter __kind__ {'quicksort','mergesort','heapsort','stable'}, defines the __sorting algorithm__. The 'quicksort' is the __default__ option.\n", + "1. The following code segment is used to sort the values in the array ```X``` according to the first axis, effectively sorting the data points based on their __X-coordinate__: \n", + "```python\n", + "X = np.sort(5 * np.random.rand(points, 1), axis=0)\n", + "```\n", + "1. This is __not a strictly necessary procedure__ but it has been done __to ensure that the data points are organized in a predictable way__ (ascending order based on their X-coordinate)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Nonlinear Data Generation Tips\n", + "1. Firstly, we use the ```np.sin(X)``` function to return a NumPy array with a __shape of ```(n,1)```__, where ```n``` is the __number of data points__.\n", + "1. This __shape__ represents a __column vector__.\n", + "1. However, the ```y``` variable is expected to be a __1D array / flat vector__ with a shape of __```n```__.\n", + "1. So, we use the __```ravel()```__ to __flatten__ the output of __```np.sin(x)```__ and __create a 1D array__ that matches the __expected shape__ of the ```y``` variable: \n", + "```python\n", + "y=np.sin(X).ravel()+noise\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# k-fold cross validation function\n", + "1. The __```sklearn.model_selection.cross_val_score```__ is used to evaluate a score by cross validation.\n", + "1. The __estimator parameter__ (already developed classification/regression model) is the __object__ to use for __data fitting__.\n", + "1. The __```X``` parameter__ (array,list) is the __data to fit__.\n", + "1. Its __shape__ is __```(n_samples,n_features)```__.\n", + "1. The __```y``` parameter__ is the __target variable__ used to __predict__ in the case of __supervised learning__.\n", + "1. The __```cv``` parameter defines__ the __cross-validation strategy__. It can be either an __integer__ (e.g. 10) or a __cross-validation object__ ('kFold', 'StratifiedkFold', 'TimeSeriesSplit'). If not specified, a 5-fold cross-validation is used __by default__.\n", + "1. The __```scoring``` parameter__ specifies the __scoring metric__ used to __evaluate__ the __model's performance__. It can be a __string__ with the name of a __built-in metric__ or a __custom scoring__ function.\n", + "1. The __```cross_val_score```__ function returns an __array of scores__, where each score __corresponds__ to one of the __cross-validation-folds__.\n", + "1. You can __compute statistics__ on these scores, such as the __mean score__ or __standard deviation__:\n", + "```python\n", + "knn_reg = KNeighborsRegressor(n_neighbors=k)\n", + "r2_scores = cross_val_score(knn_reg, X_train, y_train, cv=5, scoring='r2')\n", + "mean_r2 = np.mean(r2_scores)\n", + " \n", + " if mean_r2 > best_r2:\n", + " best_k = k\n", + " best_r2 = mean_r2\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 3 / Optional\n", + "This exercise (optional assignment) is for the interested ones that wish to investigate how the kNN Algorithm can be easily used as a powerful regression algorithm for nonlinear problems. More specifically, please follow these steps:\n", + "1. Generate 100 synthetic point data with a nonlinear relationship. The X-cordinates would be among 0-5 and the Y-coordinates would follow the relationship $y=sin(x)+ 0.1*noise$. Use the __```np.sort()```__ and __```np.ravel()```__ for better compliance.\n", + "1. __Split__ the data into training and test sets.\n", + "1. Perform __linear regression__.\n", + "1. Develop a __```for loop```__ to __investigate the optimal k-value__ for the kNN algorithm. The loop should investigate several k value options (e.g. 1-11). The evaluation should be performed according to the __coefficient of determination__.\n", + "1. Once you find the best k-value, use it to train the kNN regression algorithm using the __```KNeighborsRegressor```__.\n", + "1. Use the __test set__ to __predict the y-values__ for __both algorithms__.\n", + "1. __Evaluate both models__ in terms of the __coefficient of determination__.\n", + "1. __Generate__ a range (0-5) of X values for the regression lines.\n", + "1. __Use__ the previous step to predict the __Y values__ for both regression lines.\n", + "1. __Plot__ both __data points__ and __regression lines__.\n", + "1. __Display__ in the screen the evaluation metrics and the optimal k-value." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# References\n", + "__[Scikit-learn: The kNN Classification](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)__\n", + "
\n", + "__[Scikit-learn User Guide tor Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html)__\n", + "
\n", + "__[The make_moons function](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html)__\n", + "
\n", + "__[Creating 2D grids](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html)__ \n", + "
\n", + "__[Flattening 2D arrays](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html)__\n", + "
\n", + "__[Visualization of decision boundaries](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html)__\n", + "
\n", + "__[How to sort data and arrays](https://numpy.org/doc/stable/reference/generated/numpy.sort.html)__\n", + "
\n", + "__[k-fold cross validation function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)__" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evaluation\n", + "\n", + "Please visit the following link for __[Workshop 4 Evaluation](https://app.wooclap.com/PIHHOO?from=event-page)__\n", + "
\n", + "Tell us your opinion about this workshop and how we could become better in the next one.\n", + "Your opinion matters!!!)__" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}