diff --git a/CMP3751 Machine Learning/workshops/Week 4/Workshop 4.ipynb b/CMP3751 Machine Learning/workshops/Week 4/Workshop 4.ipynb
new file mode 100644
index 0000000..68be281
--- /dev/null
+++ b/CMP3751 Machine Learning/workshops/Week 4/Workshop 4.ipynb
@@ -0,0 +1,389 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ " __Machine Learning Workshp 4 : The k Nearest Neighbors (kNN) classifier__.\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ " Welecome to our fourth workshop!
\n",
+ "
\n",
+ "In this workshop we will investigate the fundamentals of the kNN classifier and then we will enhance its functionality by finding the optimal k value."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__Table of Contents__\n",
+ "1. [Background: The kNN Classifier](#Background:-The-kNN-Classifier)\n",
+ "1. [Exercise 1: Classification](#Exercise-1:-Classification)\n",
+ "1. [The function make_moons](#The-function-make_moons)\n",
+ "1. [The numpy.meshgrid function](#The-numpy.meshgrid-function)\n",
+ "1. [The numpy.ravel() function](#The-numpy.ravel()-function)\n",
+ "1. [Decision boundary visualization](#Decision-boundary-visualization)\n",
+ "1. [Exercise 2: Nonlinear classification](#Exercise-2:-Nonlinear-classification)\n",
+ "1. [kNN Regression](#kNN-Regression)\n",
+ "1. [Sorting the independent variable](#Sorting-the-independent-variable)\n",
+ "1. [Nonlinear Data Generation Tips](#Nonlinear-Data-Generation-Tips)\n",
+ "1. [k-fold cross validation function](#k-fold-cross-validation-function)\n",
+ "1. [Exercise 3 / Optional](#Exercise-3-/-Optional)\n",
+ "1. [References](#References)\n",
+ "1. [Evaluation](#Evaluation)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Background: The kNN Classifier\n",
+ "It is a classifier implementing the __k-nearest neighbors__ vote.\n",
+ "Its parameters are:\n",
+ "1. __n-neighbors__: The number of neighbors to use (int). The default value is 5.\n",
+ "1. __weights__: {'uniform','distance'}: When we select 'uniform', all points in each neighborhood are weighted equally. The 'distance' option weights points by the inverse of their distance so that the closer neighbors of a query point will have a greater influence than the ones that are further away.\n",
+ "1. __algorithm__:{'ball_tree','kd_tree','brute','auto'} It is the algorithm used to compute the nearest neighbors. The __auto__ option will attempt to decide the most appropriate algorithm based on the values passed to __fit method__. \n",
+ "\n",
+ "Its attributes are:\n",
+ "1. Class labels known to the classifier: __classes_: array of shape(n_classes)__\n",
+ "1. Number of features seen during fit: __n_features_in: int__\n",
+ "1. Names of features seen during fit: __feature_names_in_: ndarray of shape (n_features_in_)__\n",
+ "\n",
+ "Below there is the Python code for a kNN Classification example:\n",
+ "```python\n",
+ "import numpy as np\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Generate sample data with 10 instances per class\n",
+ "np.random.seed(0)\n",
+ "\n",
+ "classes_ = [0] * 10 + [1] * 10\n",
+ "feature_names_in = [\"Feature1\", \"Feature2\"]\n",
+ "features_in = np.array([[1.2, 3.4], [2.3, 4.0], [1.9, 3.6], [4.5, 5.5], [5.0, 6.3],\n",
+ " [8.2, 7.4], [7.3, 8.0], [8.9, 7.6], [9.5, 9.5], [10.0, 10.3],\n",
+ " [2.5, 2.7], [3.0, 3.5], [3.9, 3.0], [5.5, 6.0], [6.2, 5.8],\n",
+ " [7.2, 6.7], [6.5, 7.0], [7.0, 7.2], [6.0, 6.0], [5.2, 5.4]])\n",
+ "\n",
+ "# Initialize the kNN classifier with the number of neighbors (k)\n",
+ "k = 3\n",
+ "knn = KNeighborsClassifier(n_neighbors=k)\n",
+ "\n",
+ "# Fit the model using your data\n",
+ "knn.fit(features_in, classes_)\n",
+ "\n",
+ "# Example data point for prediction\n",
+ "new_data_point = [[3.1, 4.2]]\n",
+ "\n",
+ "# Predict the class for the new data point\n",
+ "predicted_class = knn.predict(new_data_point)\n",
+ "\n",
+ "# Visualize the data\n",
+ "plt.figure(figsize=(8, 6))\n",
+ "plt.scatter(features_in[:10, 0], features_in[:10, 1], c='b', label='Class 0', marker='o')\n",
+ "plt.scatter(features_in[10:, 0], features_in[10:, 1], c='r', label='Class 1', marker='x')\n",
+ "plt.scatter(new_data_point[0][0], new_data_point[0][1], c='g', label='New Data Point', marker='s')\n",
+ "plt.xlabel(feature_names_in[0],fontsize=14)\n",
+ "plt.ylabel(feature_names_in[1], fontsize=14)\n",
+ "plt.grid(True)\n",
+ "plt.legend()\n",
+ "plt.title(\"kNN Classification with Sample Data\")\n",
+ "plt.show()\n",
+ "\n",
+ "print(\"Feature Names:\", feature_names_in)\n",
+ "print(\"New Data Point:\", new_data_point[0])\n",
+ "print(f\"The predicted class for {new_data_point[0]} is: {predicted_class[0]}\")\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Exercise 1: Classification\n",
+ "In this assignment we will work with the Iris dataset and we will perform kNN classification as follows:\n",
+ "1. Load the Iris dataset and create the X variable containing the independent variables and the y dataset containing the class label.\n",
+ "1. Extract the independent variables (features) from the Iris dataset and assign them to the sepal length, sepal_width, petal_length and petal_width.\n",
+ "1. Create a scatterplot for sepal length vs, sepal width.\n",
+ "1. Split the data into training and test sets.\n",
+ "1. Scale the features using StandardScaler\n",
+ "1. Perform the kNN classification and report the obtained accuracy for a variety of number of neighbors (e.g. 2-7).\n",
+ "1. Which is the number of neighbours that gave you the best accuracy?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Place your code here"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# The function make_moons\n",
+ "1. It is a simple toy dataset to demonstrate either clustering or classification algorithms.\n",
+ "1. It is a function for making two interleaving half circles.\n",
+ "1. It is particularly useful to deal with __nonlinear decision boundaries__.\n",
+ "
\n",
+ "
\n",
+ "1. Its __parameters are__:\n",
+ " 1. n_samples: If __int__ the total number of points generated. If two-element tuple, the number of points in each two moons.\n",
+ " 1. shuffle: Whether to __shuffle the samples__ (bool, default=True).\n",
+ " 1. noise: Standard deviation of __Gaussian noise__ added to the data (float, default=None).\n",
+ " 1. random_state (default=None): Determines __random number generation__ for __dataset shuffling and noise__. It passes an __int__ for __reproducible output__ across multiple function calls.\n",
+ "
\n",
+ "
\n",
+ "1. Its __output__ is:\n",
+ " 1. The generated samples __X__ :ndarray of shape (n_samples,2).\n",
+ " 1. The __integer labels (0 or 1)__ for class membership of each sample __y__: ndarray of shape n_samples.\n",
+ "\n",
+ "For example you may call the function as follows:\n",
+ "```python\n",
+ "# Generate synthetic data with a nonlinear decision boundary\n",
+ "X, y = make_moons(n_samples=200, noise=0.3, random_state=42)\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# The numpy.meshgrid function\n",
+ "1. It returns a __list of coordinate matrices__ from __coordinate vectors__.\n",
+ "1. It is used to __create a rectangular grid__ out of __two given one-dimensional arrays__ representing the Cartesian indexing.\n",
+ "1. If the x-axis ranges from $-x1...x1$ and the y-axis ranges from $-y1...y1$ integer (for simplicity) valuesm then there are a __total__ of $(2*x1+1) * (2*y1+1)$ points marked in the figure each with a X-coordinate and a Y-coordinate.\n",
+ "1. The ```numpy.meshgrid()``` function returns two 2-Dimensional arrays representing the X and Y coordinates of all the points.\n",
+ "\n",
+ "The following Python code segment uses the ```numpy.meshgrid``` function to create a grid xith x-values from -3 to 3 and y-values from -5 to 5 with step 0.5:\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "# Define the range and step size\n",
+ "x_range = np.arange(-3, 3.5, 0.5)\n",
+ "y_range = np.arange(-5, 5.5, 0.5)\n",
+ "\n",
+ "# Create the mesh grid\n",
+ "xx, yy = np.meshgrid(x_range, y_range)\n",
+ "\n",
+ "# Print the generated grids\n",
+ "print(\"xx (X coordinates):\\n\", xx)\n",
+ "print(\"\\nyy (Y coordinates):\\n\", yy)\n",
+ "\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# The numpy.ravel() function\n",
+ "1. The ```numpy.ravel()``` function returns a contiguous flattened array.\n",
+ "1. It is an 1-D array with all the inout array elements and with the same type as the input.\n",
+ "1. It is equivalent to ```reshape(-1)```.\n",
+ "1. The following code segment flattens a 2D array to a 1D:\n",
+ "```python\n",
+ "import numpy as np\n",
+ "array=np.arange(15).reshape(3,5)\n",
+ "print(array)\n",
+ "arr1=array.ravel()\n",
+ "print(\"Flattening array: \",arr1)\n",
+ "```\n",
+ "1. The ```numpy.c_[xx.ravel(),yy.ravel()]``` is a NumPy operation that concatenates the flattened X and Y coordinates into a single 2D array."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Decision boundary visualization\n",
+ "We use the following code:\n",
+ "```python\n",
+ "plt.contourf(xx, yy, Z, alpha=0.8)\n",
+ "```\n",
+ "1. The ```contourf``` function from ```matplotlib``` is used to create __filled contour plots__.\n",
+ "1. A __filled contour plot__ is a graphical representation of a 3D surface where __regions of different values__ are filled with __different colors__.\n",
+ "1. It is often used in __machine learning__ to __visualize decision boundaries__.\n",
+ "1. The ```xx``` and ```yy``` are the __X__ and __Y__ coordinate grids created by ```np.meshgrid```. So, they __define__ the __grid of points__ in the __2D feature space__, where ```Z``` will be evaluated.\n",
+ "1. ```Z``` is the array of __predicted class labels__, typically generated by a machine learning model, for __each point in the feature space grid__.\n",
+ "1. The ```contourf``` function will use these labels to __determine how to fill different regions__ with colors based on the __class predictions__.\n",
+ "1. The __alpha parameter__ specifies the __opacity of the filled regions__ in the contour plot. It varies between __0 (transparent)__ and __1 (opaque)__.\n",
+ "1. The ```plt.cm``` is an __attribute__ of the ```matplotlib``` library that provides access to a variety of __built-in colormaps__. It is used to __specify the colormap__ you wish to apply to your plot."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Exercise 2: Nonlinear classification\n",
+ "In this exercise you will use the KNN classifier to find the optimal classification settings for nonlinear classification:\n",
+ "1. Firstly, you will create 200 synthetic instances with a nonlinear decision boundary through the make_moons function. Set the noise parameter (standard deviation of Gaussian noise) at 0.3.\n",
+ "1. Split the data into training and test sets.\n",
+ "1. Initialize variables to store the best k and corresponding accuracy.\n",
+ "1. Define a range of k values to test.\n",
+ "1. For each value within that range, calculate the accuracy on the test set, check if this k value performs the best fit so far and update the best k parameter accordingly.\n",
+ "1. Once you find the optimal parameter, train with that the kNN classifiee.\n",
+ "1. Make predictions on the test data.\n",
+ "1. Calculate accuracy, precision and recall.\n",
+ "1. Visualize the data points with the class information and the decision boundary.\n",
+ "1. Display the evaluation metrics."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# kNN Regression\n",
+ "1. We use the __[sklearn.neighbors.KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)__\n",
+ "1. It performs regression analysis based on k-nearest neighbors.\n",
+ "1. The __kNN regression algorithm__ relies on the principle that __similar data points__ should have __similar target values__.\n",
+ "1. It makes predictions by finding the __k nearest data points__ to the __query point__ (the point for which you want to make a prediction).\n",
+ "1. For __regression__, instead of __predicting a class label__, it predicts the __target value__ by taking the __average (simple or weighted)__ of the target values of the __k nearest neighbors__.\n",
+ "1. The __predicted value__ is a __continuous__ number.\n",
+ "1. It is __useful__ in scenarios where the relationship between the input features and the target variable is __not strictly linear__ and can have __complex patterns__."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Sorting the independent variable\n",
+ "1. We use the sort() function from the Numpy library.\n",
+ "1. Its syntax is ```np.sort(a, axis=-1,kind=None)```\n",
+ "1. It returns a __sorted copy__ of an array.\n",
+ "1. The parameter __a__ denotes the array to be sorted.\n",
+ "1. The optional parameter __axis__ (int/None), denotes the axis along which to sort. If __None__, the array is flattened before sorting. The __default__ is $-1$, which sorts along the last axis.\n",
+ "1. The optional parameter __kind__ {'quicksort','mergesort','heapsort','stable'}, defines the __sorting algorithm__. The 'quicksort' is the __default__ option.\n",
+ "1. The following code segment is used to sort the values in the array ```X``` according to the first axis, effectively sorting the data points based on their __X-coordinate__: \n",
+ "```python\n",
+ "X = np.sort(5 * np.random.rand(points, 1), axis=0)\n",
+ "```\n",
+ "1. This is __not a strictly necessary procedure__ but it has been done __to ensure that the data points are organized in a predictable way__ (ascending order based on their X-coordinate)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Nonlinear Data Generation Tips\n",
+ "1. Firstly, we use the ```np.sin(X)``` function to return a NumPy array with a __shape of ```(n,1)```__, where ```n``` is the __number of data points__.\n",
+ "1. This __shape__ represents a __column vector__.\n",
+ "1. However, the ```y``` variable is expected to be a __1D array / flat vector__ with a shape of __```n```__.\n",
+ "1. So, we use the __```ravel()```__ to __flatten__ the output of __```np.sin(x)```__ and __create a 1D array__ that matches the __expected shape__ of the ```y``` variable: \n",
+ "```python\n",
+ "y=np.sin(X).ravel()+noise\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# k-fold cross validation function\n",
+ "1. The __```sklearn.model_selection.cross_val_score```__ is used to evaluate a score by cross validation.\n",
+ "1. The __estimator parameter__ (already developed classification/regression model) is the __object__ to use for __data fitting__.\n",
+ "1. The __```X``` parameter__ (array,list) is the __data to fit__.\n",
+ "1. Its __shape__ is __```(n_samples,n_features)```__.\n",
+ "1. The __```y``` parameter__ is the __target variable__ used to __predict__ in the case of __supervised learning__.\n",
+ "1. The __```cv``` parameter defines__ the __cross-validation strategy__. It can be either an __integer__ (e.g. 10) or a __cross-validation object__ ('kFold', 'StratifiedkFold', 'TimeSeriesSplit'). If not specified, a 5-fold cross-validation is used __by default__.\n",
+ "1. The __```scoring``` parameter__ specifies the __scoring metric__ used to __evaluate__ the __model's performance__. It can be a __string__ with the name of a __built-in metric__ or a __custom scoring__ function.\n",
+ "1. The __```cross_val_score```__ function returns an __array of scores__, where each score __corresponds__ to one of the __cross-validation-folds__.\n",
+ "1. You can __compute statistics__ on these scores, such as the __mean score__ or __standard deviation__:\n",
+ "```python\n",
+ "knn_reg = KNeighborsRegressor(n_neighbors=k)\n",
+ "r2_scores = cross_val_score(knn_reg, X_train, y_train, cv=5, scoring='r2')\n",
+ "mean_r2 = np.mean(r2_scores)\n",
+ " \n",
+ " if mean_r2 > best_r2:\n",
+ " best_k = k\n",
+ " best_r2 = mean_r2\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Exercise 3 / Optional\n",
+ "This exercise (optional assignment) is for the interested ones that wish to investigate how the kNN Algorithm can be easily used as a powerful regression algorithm for nonlinear problems. More specifically, please follow these steps:\n",
+ "1. Generate 100 synthetic point data with a nonlinear relationship. The X-cordinates would be among 0-5 and the Y-coordinates would follow the relationship $y=sin(x)+ 0.1*noise$. Use the __```np.sort()```__ and __```np.ravel()```__ for better compliance.\n",
+ "1. __Split__ the data into training and test sets.\n",
+ "1. Perform __linear regression__.\n",
+ "1. Develop a __```for loop```__ to __investigate the optimal k-value__ for the kNN algorithm. The loop should investigate several k value options (e.g. 1-11). The evaluation should be performed according to the __coefficient of determination__.\n",
+ "1. Once you find the best k-value, use it to train the kNN regression algorithm using the __```KNeighborsRegressor```__.\n",
+ "1. Use the __test set__ to __predict the y-values__ for __both algorithms__.\n",
+ "1. __Evaluate both models__ in terms of the __coefficient of determination__.\n",
+ "1. __Generate__ a range (0-5) of X values for the regression lines.\n",
+ "1. __Use__ the previous step to predict the __Y values__ for both regression lines.\n",
+ "1. __Plot__ both __data points__ and __regression lines__.\n",
+ "1. __Display__ in the screen the evaluation metrics and the optimal k-value."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# References\n",
+ "__[Scikit-learn: The kNN Classification](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)__\n",
+ "
\n",
+ "__[Scikit-learn User Guide tor Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html)__\n",
+ "
\n",
+ "__[The make_moons function](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html)__\n",
+ "
\n",
+ "__[Creating 2D grids](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html)__ \n",
+ "
\n",
+ "__[Flattening 2D arrays](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html)__\n",
+ "
\n",
+ "__[Visualization of decision boundaries](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html)__\n",
+ "
\n",
+ "__[How to sort data and arrays](https://numpy.org/doc/stable/reference/generated/numpy.sort.html)__\n",
+ "
\n",
+ "__[k-fold cross validation function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)__"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Evaluation\n",
+ "\n",
+ "Please visit the following link for __[Workshop 4 Evaluation](https://app.wooclap.com/PIHHOO?from=event-page)__\n",
+ "
\n",
+ "Tell us your opinion about this workshop and how we could become better in the next one.\n",
+ "Your opinion matters!!!)__"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}