
Machine Learning Binary Classification using Jupyter Notebook and Matplotlib


notJamesHan/ML-Binary-Classification

 
 


ACM Research Coding Challenge (Spring 2022)

Question

Binary classification is a type of classification task that labels elements of a set (i.e. dataset) into two different groups. An example of this type of classification would be identifying if people had a specific disease or not based on certain health characteristics. The dataset found in mushrooms.csv holds data (22 different characteristics, specifically) about different types of mushrooms, including a mushroom's cap shape, cap surface texture, cap color, bruising, odor, and more. Remember to split the data into test and training sets (you can choose your own percent split). Information about the meaning of the letters under each column can be found within the file attributelegend.txt.

With the file mushrooms.csv, use an algorithm of your choice to classify whether a mushroom is poisonous or edible.

Explanation of my answer

Libraries used

  • pandas
  • numpy
  • Matplotlib
  • seaborn
  • scikit-learn package
    • preprocessing
    • model_selection
    • metrics
    • linear_model

Documentation used

Approach

When I started this challenge, I had no idea what machine learning and binary classification were. To learn the basics, I searched "binary classification python tutorial" on YouTube and found CS Dojo's video, which helped me set up the environment and learn how to use Jupyter Notebook.

Analyzing Data

Once I understood how to use Jupyter Notebook, I analyzed the CSV data to find out which attributes help separate poisonous mushrooms from edible ones. For each column, I graphed only the poisonous mushrooms and looked at how many fell into each category. I found that gill-attachment, gill-spacing, veil-color, ring-number, and veil-type had the largest poisonous mushroom populations in a single category.
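The per-column count described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame is made up, and the column name `class` with the values `p` (poisonous) and `e` (edible) is an assumption about how mushrooms.csv is laid out.

```python
import pandas as pd

# Made-up sample standing in for mushrooms.csv (assumption: the label column
# is named "class" and uses "p" for poisonous and "e" for edible).
df = pd.DataFrame({
    "class": ["p", "e", "p", "p", "e", "e"],
    "odor": ["f", "n", "f", "p", "n", "a"],
    "gill-spacing": ["c", "c", "c", "w", "w", "c"],
})

# Keep only the poisonous rows.
poisonous = df[df["class"] == "p"]

# For every attribute, count how many poisonous mushrooms have each value.
poisonous_counts = {
    col: poisonous[col].value_counts()
    for col in df.columns
    if col != "class"
}

for col, counts in poisonous_counts.items():
    print(col, counts.to_dict())
```

Each entry in `poisonous_counts` is a pandas Series, so calling `.plot(kind="bar")` on it in a notebook would give one bar chart per column.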

Preprocessing

I split the dataset into two parts: one with every attribute except the class, and one with only the class. Then I converted all non-numerical data to numerical data using scikit-learn's preprocessing module. Finally, I split the data into training and test sets.
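A rough sketch of these three steps is shown here. The use of `LabelEncoder`, the 80/20 split, and the sample rows are assumptions for illustration; the original notebook may use a different encoder or split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Made-up rows standing in for mushrooms.csv (column names are assumptions).
df = pd.DataFrame({
    "class": ["p", "e", "p", "p", "e", "e", "p", "e", "e", "p"],
    "odor": ["f", "n", "f", "p", "n", "a", "f", "n", "a", "p"],
    "gill-spacing": ["c", "c", "c", "w", "w", "c", "c", "w", "c", "w"],
})

# Step 1: separate the attributes from the class column.
X = df.drop(columns=["class"])
y = df["class"]

# Step 2: map every letter code to an integer, one encoder per column.
X_encoded = X.apply(lambda col: LabelEncoder().fit_transform(col))
y_encoded = LabelEncoder().fit_transform(y)  # classes are sorted: "e" -> 0, "p" -> 1

# Step 3: hold out 20% of the rows for testing (the split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_encoded, test_size=0.2, random_state=42
)
```

Label encoding assigns an arbitrary order to categories that have none, so one-hot encoding (for example `pandas.get_dummies`) may be a better fit for a linear model.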

Machine Learning and Confusion Matrix

I used the Logistic Regression algorithm because it is widely used for binary classification. It models the log-odds (logit) of the outcome as a linear combination of the input features. The sigmoid function converts that value into a probability between 0 and 1, and the probability is then thresholded to produce a class of 0 or 1. Using the Logistic Regression algorithm, I achieved 97.05% accuracy.
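The prediction step can be illustrated with a few lines of plain Python. The weights, bias, and 0.5 threshold below are made-up values for illustration, not parameters learned from the mushroom data.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample.

    The log-odds z is a linear combination of the features; the sigmoid
    turns z into a probability, which is thresholded into class 0 or 1.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical weights and bias, chosen only to show the mechanics.
probability, label = predict(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(probability, label)
```

In scikit-learn, `LogisticRegression.predict_proba` returns the probabilities and `predict` returns the thresholded class.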

Looking at the confusion matrix, there were only 48 misclassified mushrooms. The breakdown I recorded was 21 false negatives and 28 false positives, which adds up to 49, so one of these counts is off by one.
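As a small sketch of how these figures relate, the example below builds a confusion matrix from made-up labels and predictions (not the results above) and derives the accuracy from it. It assumes class 1 means poisonous.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth and predictions (assumption: 1 = poisonous, 0 = edible).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

misclassified = fp + fn
accuracy = accuracy_score(y_true, y_pred)

print(f"false negatives={fn}, false positives={fp}, misclassified={misclassified}")
print(f"accuracy={accuracy:.2%}")
```

Accuracy is the number of correct predictions divided by the total, so it equals `(tp + tn) / (tp + tn + fp + fn)`.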
