ACM Research Coding Challenge (Spring 2022)

Question

Binary classification is a type of classification task that labels elements of a set (i.e. dataset) into two different groups. An example of this type of classification would be identifying if people had a specific disease or not based on certain health characteristics. The dataset found in mushrooms.csv holds data (22 different characteristics, specifically) about different types of mushrooms, including a mushroom's cap shape, cap surface texture, cap color, bruising, odor, and more. Remember to split the data into test and training sets (you can choose your own percent split). Information about the meaning of the letters under each column can be found within the file attributelegend.txt.

With the file mushrooms.csv, use an algorithm of your choice to classify whether a mushroom is poisonous or edible.

Explanation for my answer

Libraries used

pandas
numpy
Matplotlib
seaborn
scikit-learn package
- preprocessing
- model_selection
- metrics
- linear_model

Documentation used

Approach

Starting this challenge, I had no idea what machine learning and binary classification were. To learn the basics, I searched "binary classification python tutorial" on YouTube and found CS Dojo's video, which helped me set up the environment and learn how to use Jupyter Notebook.

Analyzing Data

After learning how to use Jupyter Notebook, I analyzed the CSV data to understand which attributes help separate poisonous mushrooms from edible ones. I graphed only the poisonous mushrooms for each column and compared the population of each category. I found that gill-attachment, gill-spacing, veil-color, ring-number, and veil-type were the columns with the largest poisonous mushroom population in a single category.
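The per-column count described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is made-up stand-in data, and `poisonous_counts` is a hypothetical name. It assumes the class column marks poisonous mushrooms with "p".

```python
import pandas as pd

# Toy stand-in for mushrooms.csv; these rows are made up for illustration.
df = pd.DataFrame({
    "class":     ["p", "e", "p", "p", "e", "p"],
    "odor":      ["f", "n", "f", "p", "a", "n"],
    "veil-type": ["p", "p", "p", "p", "p", "p"],
})

# Keep only the poisonous rows (assumes "p" means poisonous in the class column).
poisonous = df[df["class"] == "p"]

# Count how many poisonous mushrooms fall into each category of every column.
poisonous_counts = {
    column: poisonous[column].value_counts()
    for column in poisonous.columns.drop("class")
}

for column, counts in poisonous_counts.items():
    print(f"{column}: {counts.to_dict()}")
```

With the real file, `pd.read_csv("mushrooms.csv")` would replace the toy DataFrame, and a call such as `sns.countplot(data=poisonous, x=column)` from seaborn is one way to turn each count into a bar chart.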

Preprocessing

I divided the data into two sets: one with every attribute except the class, and one with only the class. Then I converted all non-numerical data to numerical data using scikit-learn's preprocessing module. Finally, I split the data into training and test sets.
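A minimal sketch of these three steps is shown below. The README does not say which encoder or split ratio was used, so `LabelEncoder`, the 80/20 split, and `random_state=0` are assumptions, and the DataFrame is made-up stand-in data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for mushrooms.csv; these rows are made up for illustration.
df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e", "p", "e", "e", "p", "e"],
    "cap-shape": ["x", "b", "x", "f", "b", "x", "f", "b", "x", "f"],
    "odor":      ["f", "n", "a", "p", "n", "f", "l", "n", "p", "a"],
})

# Step 1: separate the attributes (X) from the class label (y).
X = df.drop(columns=["class"])
y = df["class"]

# Step 2: map each letter to an integer, column by column.
# LabelEncoder is an assumption; it assigns codes in sorted order.
X_encoded = X.apply(lambda column: LabelEncoder().fit_transform(column))
y_encoded = LabelEncoder().fit_transform(y)  # "e" -> 0, "p" -> 1

# Step 3: hold out 20% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_encoded, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Integer codes impose an arbitrary order on the categories. One-hot encoding is a common alternative for linear models, though nothing in the README indicates it was used here.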

Machine Learning and Confusion Matrix

I used logistic regression because it is widely used for binary classification. The model computes a linear combination of the features (the log-odds, or logit), passes it through the sigmoid function to produce a probability, and then classifies the sample as 0 or 1 based on that probability. With logistic regression, I achieved 97.05% accuracy.
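To make the logit, sigmoid, and 0/1 steps concrete, here is a small sketch that fits scikit-learn's `LogisticRegression` on made-up one-dimensional data and then reproduces its predictions by hand. The data and the 0.5 threshold are illustrative assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, one-feature data: small values are class 0, large values are class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Linear combination of the features: this is the log-odds (logit).
z = X @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid turns the log-odds into a probability of class 1.
probability = 1.0 / (1.0 + np.exp(-z))

# Thresholding the probability at 0.5 gives the 0/1 label.
predicted = (probability > 0.5).astype(int)

print(np.round(probability, 3))
print(predicted)
```

The hand-computed `probability` matches the second column of `model.predict_proba(X)`, and `predicted` matches `model.predict(X)`.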

Looking at the confusion matrix, there were only roughly 48 misclassifications in the result (21 false negatives and 28 false positives).
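As a sketch of how false negatives, false positives, and accuracy relate, the example below builds a confusion matrix with `sklearn.metrics` from made-up labels. It assumes 1 means poisonous and 0 means edible, so a false negative is a poisonous mushroom predicted as edible. The labels are not the notebook's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration (assumes 1 = poisonous, 0 = edible).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Misclassifications are the off-diagonal cells of the matrix.
misclassified = fp + fn

# Accuracy is the share of correct predictions among all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"FN={fn}, FP={fp}, misclassified={misclassified}, accuracy={accuracy:.2%}")
```

The hand-computed `accuracy` agrees with `accuracy_score(y_true, y_pred)`.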
