HyANOVA is a pure Python implementation of the functional ANOVA algorithm, which can be used to analyze the importance of hyperparameters in machine learning algorithms.
To install the package, use pip:
pip install hyanova
Here is a short example of usage. You can download the data from the example folder.
import hyanova
path = './iris[GridSearchCV]Model1.csv' # gridsearch results generated by sklearn
metric = 'mean_test_score' # metric for model performance
df,params = hyanova.read_csv(path,metric)
# df,params = hyanova.read_df(df,metric) You can also load data from pd.DataFrame
importance = hyanova.analyze(df)
The metric is the column you use to evaluate model performance; it must appear as a column in the .csv file or the pandas.DataFrame object. The result will look similar to the output below; see the next section (ANOVA) for more details.
>>> print(importance)
u v_u F_u(v_u/v_all)
0 (alpha,) 0.056885 0.892057
1 (l1_ratio,) 0.002489 0.039030
2 (alpha, l1_ratio) 0.004394 0.068912
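To read the table, it can help to sort it by the importance column. The snippet below is a minimal sketch that uses pandas only: it rebuilds the result shown above by hand instead of calling hyanova, and it assumes the returned DataFrame uses exactly the column names seen in the printed output.

```python
import pandas as pd

# Rebuilt by hand from the printed output above; assumed to mirror
# the layout of the DataFrame returned by hyanova.analyze.
importance = pd.DataFrame({
    "u": [("alpha",), ("l1_ratio",), ("alpha", "l1_ratio")],
    "v_u": [0.056885, 0.002489, 0.004394],
    "F_u(v_u/v_all)": [0.892057, 0.039030, 0.068912],
})

# Sort so the most important hyperparameter (or interaction) comes first.
ranked = importance.sort_values("F_u(v_u/v_all)", ascending=False)
ranked = ranked.reset_index(drop=True)
print(ranked)

# The importances of all subsets should add up to roughly 1.
total = ranked["F_u(v_u/v_all)"].sum()
print(round(total, 3))
```

In this example alpha alone explains about 89% of the variance, and the interaction between alpha and l1_ratio matters slightly more than l1_ratio on its own.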
HyANOVA is designed to analyze the grid search results generated by sklearn. It provides two ways to load the data.
You can use hyanova.read_df(df,metric) to load data from a <class 'pandas.core.frame.DataFrame'> object.
Parameters:
- df: <class 'pandas.core.frame.DataFrame'>, the DataFrame you want to analyze.
- metric: string, the metric you choose.
Returns:
- result_df: <class 'pandas.core.frame.DataFrame'>, a DataFrame with the values of all hyperparameters and of the metric you choose.
- params_list: list, a list of all hyperparameter names.
Use hyanova.read_csv(path,metric) to load data from a .csv file. It is equivalent to hyanova.read_df(pandas.read_csv(path),metric).
Parameters:
- path: string, path of the .csv file you want to analyze.
- metric: string, the metric you choose.
Returns:
- result_df: <class 'pandas.core.frame.DataFrame'>, a DataFrame with the values of all hyperparameters and of the metric you choose.
- params_list: list, a list of all hyperparameter names.
The template file can be found in the example folder. Here is an example.
>>> print(df.head())
mean_fit_time std_fit_time mean_score_time std_score_time param_alpha \
0 0.003899 0.000194 0.048513 0.007621 0.000977
1 0.003401 0.000584 0.042454 0.011295 0.000977
2 0.002706 0.000502 0.048544 0.009059 0.000977
3 0.003304 0.000531 0.040709 0.003031 0.000977
4 0.001801 0.000116 0.000289 0.000014 0.000977
param_l1_ratio params \
0 0.00 {'alpha': 0.0009765625, 'l1_ratio': 0.0}
1 0.25 {'alpha': 0.0009765625, 'l1_ratio': 0.25}
2 0.50 {'alpha': 0.0009765625, 'l1_ratio': 0.5}
3 0.75 {'alpha': 0.0009765625, 'l1_ratio': 0.75}
4 1.00 {'alpha': 0.0009765625, 'l1_ratio': 1.0}
split0_test_score split1_test_score split2_test_score mean_test_score \
0 0.828571 0.971429 0.971429 0.923810
1 0.885714 0.971429 0.942857 0.933333
2 0.885714 1.000000 0.942857 0.942857
3 0.885714 0.914286 0.914286 0.904762
4 0.885714 1.000000 0.942857 0.942857
std_test_score rank_test_score
0 0.067344 4
1 0.035635 3
2 0.046657 1
3 0.013469 5
4 0.046657 1
>>> df,params = hyanova.read_df(df,'mean_test_score')
>>> print(df.head())
alpha l1_ratio mean_test_score
0 0.000977 0.00 0.923810
1 0.000977 0.25 0.933333
2 0.000977 0.50 0.942857
3 0.000977 0.75 0.904762
4 0.000977 1.00 0.942857
>>> print(params)
['alpha', 'l1_ratio']
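The reshaping shown above can be approximated with plain pandas. The following is a sketch of equivalent behavior, inferred only from the input and output printed above; `extract_params` is a hypothetical helper, not part of hyanova, and the library's actual implementation may differ.

```python
import pandas as pd


def extract_params(cv_results: pd.DataFrame, metric: str):
    """Hypothetical stand-in for the reshaping that read_df performs."""
    # GridSearchCV stores each hyperparameter in a 'param_<name>' column.
    param_cols = [c for c in cv_results.columns if c.startswith("param_")]
    names = [c[len("param_"):] for c in param_cols]
    # Keep the hyperparameters first and the metric last.
    out = cv_results[param_cols + [metric]].copy()
    out.columns = names + [metric]
    return out, names


# A two-row excerpt of the grid search results printed above.
raw = pd.DataFrame({
    "mean_fit_time": [0.003899, 0.003401],
    "param_alpha": [0.000977, 0.000977],
    "param_l1_ratio": [0.00, 0.25],
    "params": [{"alpha": 0.000977, "l1_ratio": 0.0},
               {"alpha": 0.000977, "l1_ratio": 0.25}],
    "mean_test_score": [0.923810, 0.933333],
})
tidy, names = extract_params(raw, "mean_test_score")
print(tidy)
print(names)
```

Matching on the `param_` prefix (with the underscore) skips the `params` column, which holds the dictionaries rather than a single hyperparameter.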
Use hyanova.analyze(df,max_iter=-1) to do the functional ANOVA decomposition.
Parameters:
- df: <class 'pandas.core.frame.DataFrame'>, the DataFrame you want to analyze.
- max_iter: int, defaults to -1 (no early stopping).
Returns:
- result_df: <class 'pandas.core.frame.DataFrame'>
The df parameter needs a pandas.DataFrame object in a format similar to the following table. You can use the methods HyANOVA provides to load data easily.
| | alpha | l1_ratio | mean_test_score |
|---|---|---|---|
| 0 | 0.000977 | 0.00 | 0.923810 |
| 1 | 0.000977 | 0.25 | 0.933333 |
| 2 | 0.000977 | 0.50 | 0.942857 |
| 3 | 0.000977 | 0.75 | 0.904762 |
Note: The metric (mean_test_score) should always be in the last column.
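If you build the DataFrame yourself instead of using the loaders, the metric may not be the last column. A minimal pandas sketch for moving it there, using a made-up two-row frame:

```python
import pandas as pd

metric = "mean_test_score"
df = pd.DataFrame({
    "alpha": [0.000977, 0.000977],
    metric: [0.923810, 0.933333],
    "l1_ratio": [0.00, 0.25],
})

# Move the metric to the end while keeping the other columns in order.
ordered = [c for c in df.columns if c != metric] + [metric]
df = df[ordered]
print(list(df.columns))
```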
hyanova.analyze(df) returns a DataFrame with each hyperparameter subset (u), its variance (v_u) and its importance (F_u).
>>> importance = hyanova.analyze(df)
100%|██████████████████████████████████| 3/3 [00:00<00:00, 11.32 it/s]
>>> print(importance)
u v_u F_u(v_u/v_all)
0 (alpha,) 0.056885 0.892057
1 (l1_ratio,) 0.002489 0.039030
2 (alpha, l1_ratio) 0.004394 0.068912
Note: F_u is the ratio of the variance attributable to the hyperparameter subset u (v_u) to the variance over all trials (v_all), so the F_u values of all subsets sum to 1. See the references for more details.
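To make the definition of v_u and F_u concrete, here is a small worked sketch of the textbook ANOVA decomposition on a complete two-hyperparameter grid with equal weight for every grid point. The scores are made up, the hyperparameter names `a` and `b` are placeholders, and this is not taken from hyanova's source, so its internals may differ.

```python
import numpy as np

# Toy scores on a full 3 x 2 grid: rows index one hyperparameter (a),
# columns index another (b). The values are made up for illustration.
scores = np.array([[0.90, 0.92],
                   [0.80, 0.86],
                   [0.60, 0.70]])

mean = scores.mean()
v_all = ((scores - mean) ** 2).mean()  # variance over all trials

# Main effects: marginal mean over the other hyperparameter, centered.
f_a = scores.mean(axis=1) - mean
f_b = scores.mean(axis=0) - mean
v_a = (f_a ** 2).mean()
v_b = (f_b ** 2).mean()

# Interaction effect: what is left after removing both main effects.
f_ab = scores - mean - f_a[:, None] - f_b[None, :]
v_ab = (f_ab ** 2).mean()

for name, v in [("(a,)", v_a), ("(b,)", v_b), ("(a, b)", v_ab)]:
    print(name, round(v, 6), round(v / v_all, 6))
```

On a complete grid the three components are orthogonal, so v_a + v_b + v_ab equals v_all and the three ratios sum to 1.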
Because of Python's performance limitations, functional ANOVA becomes very slow when the number of hyperparameters is high (more than 5). You can end the analysis early by setting the max_iter parameter. In practice, the univariate importances are usually all you need, so set max_iter to the number of hyperparameters for a shorter runtime.
>>> importance = hyanova.analyze(df,max_iter=2)
100%|██████████████████████████████████| 2/2 [00:00<00:00, 8.12 it/s]
>>> print(importance)
u v_u F_u(v_u/v_all)
0 (alpha,) 0.056885 0.892057
1 (l1_ratio,) 0.002489 0.039030
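The runtime grows quickly because n hyperparameters have 2^n - 1 non-empty subsets. The sketch below enumerates them with the standard library. It assumes subsets are visited in order of increasing size, which is consistent with the outputs above but is not confirmed from hyanova's source; the hyperparameter names are hypothetical.

```python
from itertools import combinations


def subsets_by_size(params):
    """Yield every non-empty subset of params, smallest subsets first."""
    for size in range(1, len(params) + 1):
        yield from combinations(params, size)


# Hypothetical hyperparameter names, for illustration only.
params = ["alpha", "l1_ratio", "tol", "max_depth", "gamma"]
all_subsets = list(subsets_by_size(params))

print(len(all_subsets))            # 2**5 - 1 subsets in total
print(all_subsets[:len(params)])   # the first n are the univariate terms
```

Under that ordering, stopping after n iterations covers exactly the single-hyperparameter terms and skips all interactions.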
You can use sklearn to run a hyperparameter search and then use hyanova to analyze the importance of the hyperparameters.
import numpy as np
import pandas as pd
import sklearn.datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import hyanova
# load the iris data set
iris = sklearn.datasets.load_iris()
X = iris.data
y = iris.target
model = SVC()
# search over 10000 values of C for each of the four kernels
grid = {'C': np.linspace(1e-9, 128, 10000),
        'kernel': ('rbf', 'linear', 'poly', 'sigmoid')}
grid_search = GridSearchCV(model,grid)
result = grid_search.fit(X, y)
df = pd.DataFrame(result.cv_results_)
metric = 'mean_test_score'
df, params = hyanova.read_df(df,metric)
importance = hyanova.analyze(df)
- numpy
- pandas
- tqdm
I am completing my undergraduate thesis. To better understand the models used in my article, I looked at many algorithms that can measure the importance of hyperparameters. Among them, functional ANOVA seems to be the most effective. But the original author's implementation is based on Java and uses Python to call Java files, which confused me. I wanted a module that is easier to understand and implemented entirely in Python to help me with the ANOVA decomposition, so I created HyANOVA. I hope it will help you too!
- Hutter, F., Hoos, H. & Leyton-Brown, K. (2014). An Efficient Approach for Assessing Hyperparameter Importance. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(1):754-762.
- https://github.com/frank-hutter/fanova