This Python code utilizes the decision tree algorithm from the scikit-learn library to perform banknote authentication. The code aims to analyze the impact of different train-test split ratios and training set sizes on the accuracy and size of the learned decision tree.
The code uses the "BankNote_Authentication.csv" dataset, which contains four features (variance, skew, curtosis, and entropy) and a class attribute indicating whether a banknote is real or forged.
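As a rough illustration of the dataset layout described above, the sketch below loads a CSV with four feature columns and one class column and separates them. The column names, the helper name `split_features_labels`, and the two sample rows are assumptions made up for this example; the real file's header and values may differ.

```python
import io

import pandas as pd

# Assumed column names; check the header of BankNote_Authentication.csv.
FEATURES = ["variance", "skew", "curtosis", "entropy"]
LABEL = "class"


def split_features_labels(df):
    """Return (X, y): the four feature columns and the class column."""
    return df[FEATURES], df[LABEL]


# Two made-up rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "variance,skew,curtosis,entropy,class\n"
    "3.6,8.7,-2.8,-0.4,0\n"
    "-1.4,3.3,-1.4,-2.0,1\n"
)
df = pd.read_csv(sample_csv)
X, y = split_features_labels(df)
print(X.shape, y.tolist())
```

With the real dataset, `pd.read_csv("BankNote_Authentication.csv")` would replace the in-memory sample.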
The following libraries are imported in the code:
- sklearn.tree: Provides the decision tree classifier.
- pandas: Used for data manipulation and analysis.
- sklearn.model_selection.train_test_split: Splits the data into training and testing sets.
- numpy: Handles mathematical operations and array manipulation.
- matplotlib.pyplot: Enables data visualization.
Calculates the accuracy of the predicted labels (y_pred) compared to the actual labels (y_test). Returns the accuracy as a floating-point value.
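A minimal sketch of such an accuracy calculation is shown here. The function name `accuracy` is hypothetical, and the repository's own implementation may differ; the parameter names follow the description above.

```python
import numpy as np


def accuracy(y_pred, y_test):
    """Return the fraction of predictions that match the true labels."""
    y_pred = np.asarray(y_pred)
    y_test = np.asarray(y_test)
    # Element-wise comparison gives booleans; their mean is the accuracy.
    return float(np.mean(y_pred == y_test))


# Three of the four predictions match the true labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```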
Performs an experiment with a specific train-test split ratio (splitRatio) using the decision tree algorithm. Splits the data into training and testing sets, fits the decision tree model, and predicts the labels for the testing set. Returns the accuracy and the number of nodes in the decision tree.
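The single-experiment step could look roughly like the sketch below. The function name `run_experiment` and the `seed` parameter are assumptions for illustration, and synthetic data from `make_classification` stands in for the banknote dataset so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_experiment(X, y, split_ratio, seed=None):
    """Fit a decision tree on a random split; return (accuracy, node count)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=split_ratio, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    # score() computes the fraction of correctly classified test samples.
    acc = clf.score(X_test, y_test)
    # tree_.node_count counts internal nodes and leaves of the fitted tree.
    return acc, clf.tree_.node_count


# Synthetic stand-in data with four features (not the banknote dataset).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
acc, n_nodes = run_experiment(X, y, 0.75, seed=0)
print(acc, n_nodes)
```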
Calculates the mean, maximum, and minimum values of an input array. Returns the statistics as a NumPy array.
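A minimal sketch of this statistics helper, assuming a hypothetical name `summarize`, follows. The usage line applies it to the five tree sizes reported for the 75% split in this README.

```python
import numpy as np


def summarize(values):
    """Return [mean, max, min] of the input as a NumPy array."""
    values = np.asarray(values, dtype=float)
    return np.array([values.mean(), values.max(), values.min()])


# Tree sizes reported for the five runs with a 75% training ratio.
stats = summarize([25, 31, 39, 27, 31])
print(stats)
```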
Performs multiple experiments with a fixed train-test split ratio (splitRatio). Reruns the experiment five times with different random splits of the data. Returns the accuracies and tree sizes for each experiment.
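One way to sketch the repeated runs is to loop over five seeds so that each run gets a different random split. The name `repeat_experiment`, the use of the loop index as the seed, and the synthetic data are assumptions; the original code may generate its random splits differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def repeat_experiment(X, y, split_ratio, runs=5):
    """Run the split/fit/score cycle `runs` times with different splits."""
    accuracies, tree_sizes = [], []
    for seed in range(runs):
        # A different random_state gives a different random split per run.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, train_size=split_ratio, random_state=seed
        )
        clf = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
        accuracies.append(clf.score(X_test, y_test))
        tree_sizes.append(clf.tree_.node_count)
    return np.array(accuracies), np.array(tree_sizes)


# Synthetic stand-in data with four features (not the banknote dataset).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
accuracies, tree_sizes = repeat_experiment(X, y, 0.75)
print(accuracies, tree_sizes)
```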
Plots the y-axis values against the training set size. Saves the plot as an image file with the specified fileName.
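A plotting helper along these lines might look like the sketch below. The function name `plot_against_train_size`, its parameters, and the axis labels are assumptions; the non-interactive `Agg` backend is chosen only so the example also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt


def plot_against_train_size(train_sizes, y_values, y_label, file_name):
    """Plot y_values against the training set size and save to file_name."""
    fig, ax = plt.subplots()
    ax.plot(train_sizes, y_values, marker="o")
    ax.set_xlabel("Training set size (%)")
    ax.set_ylabel(y_label)
    fig.savefig(file_name)
    plt.close(fig)  # release the figure's memory after saving
```

For example, calling `plot_against_train_size([30, 40, 50, 60, 70], mean_accuracies, "Mean accuracy", "accuracy.png")` with a list of five mean accuracies would write the plot to `accuracy.png`.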
The main function reads the dataset, separates the features (X) and the labels (Y), and initializes matrices for accuracy and tree size statistics. It then runs two sets of experiments:
- The function runs the experiment with a 75% training ratio, recording the accuracies and tree sizes for each iteration.
- The tree size and accuracy of each iteration are shown in the following table:

| Tree Size (nodes) | Accuracy |
| --- | --- |
| 25.0 | 0.9620991253644315 |
| 31.0 | 0.9630709426627794 |
| 39.0 | 0.956268221574344 |
| 27.0 | 0.967930029154519 |
| 31.0 | 0.9689018464528668 |

- The function iterates over a range of training set sizes (30% to 70%) and performs the experiment five times with different random seeds.
- For each training set size, it calculates the mean, maximum, and minimum accuracy and tree size for all iterations.
- The accuracy and tree size statistics for each training set size are shown in the following tables:

| Training Set Size | Mean Accuracy | Max Accuracy | Min Accuracy |
| --- | --- | --- | --- |
| 30% | 0.96774 | 0.97815 | 0.95421 |
| 40% | 0.97282 | 0.97937 | 0.96723 |
| 50% | 0.97376 | 0.98834 | 0.96064 |
| 60% | 0.98069 | 0.98361 | 0.96903 |
| 70% | 0.97961 | 0.99029 | 0.9733 |

| Training Set Size | Mean Tree Size | Max Tree Size | Min Tree Size |
| --- | --- | --- | --- |
| 30% | 31.8 | 37.0 | 25.0 |
| 40% | 37.4 | 41.0 | 35.0 |
| 50% | 35.8 | 45.0 | 27.0 |
| 60% | 41.0 | 47.0 | 35.0 |
| 70% | 47.0 | 51.0 | 41.0 |

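The sweep that produces mean/max/min statistics per training set size can be sketched as follows. The function name `sweep_train_sizes`, the seeding scheme, and the synthetic data are assumptions, so the numbers it prints will not match the tables above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def sweep_train_sizes(X, y, ratios=(0.3, 0.4, 0.5, 0.6, 0.7), runs=5):
    """Return two (len(ratios), 3) arrays of [mean, max, min] statistics."""
    acc_stats = np.zeros((len(ratios), 3))
    size_stats = np.zeros((len(ratios), 3))
    for i, ratio in enumerate(ratios):
        accs, sizes = [], []
        for seed in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=ratio, random_state=seed
            )
            clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
            accs.append(clf.score(X_te, y_te))
            sizes.append(clf.tree_.node_count)
        # One row per training ratio: mean, max, min over the runs.
        acc_stats[i] = [np.mean(accs), np.max(accs), np.min(accs)]
        size_stats[i] = [np.mean(sizes), np.max(sizes), np.min(sizes)]
    return acc_stats, size_stats


# Synthetic stand-in data with four features (not the banknote dataset).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
acc_stats, size_stats = sweep_train_sizes(X, y)
print(acc_stats)
print(size_stats)
```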
To run the code, follow these steps:
- Install the required libraries: scikit-learn (imported as sklearn), pandas, numpy, and matplotlib.
- Download the "BankNote_Authentication.csv" dataset and place it in the same directory as the code file.
- Run the code. The main function will execute the experiments and generate the accuracy and tree size results.
- The code will also generate plots showing the accuracy and tree size against the training set size.
In conclusion, this Python code provides a practical implementation of banknote authentication using a decision tree algorithm. It allows for experimentation with different train-test split ratios and training set sizes, providing insights into how these factors affect the accuracy and size of the decision tree model.
Contributions are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request.
- Khaled Ashraf Hanafy Mahmoud - 20190186.
- Noura Ashraf Abdelnaby Mansour - 20190592.
- Samaa Khalifa Elsayed Othman - 20190247.
This program is licensed under the MIT License.