Diver is the Dataset Inspector, Visualiser and Encoder library, automating and codifying common data science project steps as standardised and reusable methods.
See `example-notebooks/house-price-demo.ipynb` for a full walkthrough, or follow this link: https://tinyurl.com/ye9hfbzp.
A set of functions that check for common dataset issues which can degrade machine learning model performance.
A scikit-learn-formatted module which can perform various data-type encodings in a single go, and save the associated attributes from a train-set encoding to reuse on a test-set encoding:
- The `.fit_transform` method learns various encodings (feature means and variances; categorical feature elements, shown in yellow in the flow chart below) and then performs the various encodings on the feature train set
- The `.transform` method applies train-set encodings to a test set
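The fit-on-train, apply-to-test pattern described above can be sketched with plain scikit-learn components. This is a minimal illustration, not Diver's own implementation: the column names and values are made up, and `ColumnTransformer`, `StandardScaler` and `OneHotEncoder` stand in for whatever Diver uses internally.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical train and test sets (names and values are illustrative only)
train = pd.DataFrame({
    "area": [50.0, 80.0, 110.0],
    "district": ["north", "south", "north"],
})
test = pd.DataFrame({
    "area": [80.0, 140.0],
    "district": ["south", "east"],  # "east" never appears in the train set
})

encoder = ColumnTransformer(
    [
        # Learns the mean and variance of each numeric feature
        ("numeric", StandardScaler(), ["area"]),
        # Learns the set of categories; unseen test categories become all zeros
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["district"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# fit_transform learns the encodings from the train set and applies them to it
train_encoded = encoder.fit_transform(train)

# transform reuses the train-set statistics; nothing is re-learned from the test set
test_encoded = encoder.transform(test)
```

Because the scaler keeps the train-set mean, a test value equal to that mean is encoded as zero regardless of how the test set itself is distributed.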
Functions for visualising aspects of the dataset
- Display the correlation matrix for the top `n` correlating features (`n` specified by the user) against the dependent variable (at the bottom row of the matrix)
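One way to build such a matrix with pandas is sketched below. The helper name `top_n_correlation_matrix` and the example columns are hypothetical and are not part of Diver's API; ranking features by absolute Pearson correlation is also an assumption.

```python
import pandas as pd


def top_n_correlation_matrix(df, target, n):
    """Correlation matrix of the n features most correlated with `target`.

    The target is placed last, so it forms the bottom row of the matrix.
    """
    corr = df.select_dtypes("number").corr()
    # Rank features by absolute correlation with the target
    ranked = corr[target].drop(target).abs().sort_values(ascending=False)
    columns = list(ranked.index[:n]) + [target]
    return corr.loc[columns, columns]


# Illustrative data only
houses = pd.DataFrame({
    "area": [1, 2, 3, 4, 5, 6],
    "rooms": [1, 1, 2, 2, 3, 4],
    "noise": [1, -1, 1, -1, 1, -1],
    "price": [15, 25, 40, 50, 65, 80],
})

matrix = top_n_correlation_matrix(houses, target="price", n=2)
```

The resulting DataFrame can be passed to any heatmap function for display.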
- MAJOR: 0
- MINOR: 2
  - New scikit-learn single-feature missing value imputers (mean, median, zero, most frequent) replace previous manual implementations
- BUGFIX: 0
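The four imputation strategies named above map onto scikit-learn's `SimpleImputer` roughly as follows. This is a sketch of the scikit-learn side only; how Diver wires the imputers in is not shown here, and the dictionary keys and data are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single feature with one missing value in the train set and one in the test set
train_col = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])
test_col = np.array([[np.nan], [4.0]])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    # "zero" is expressed as a constant fill
    "zero": SimpleImputer(strategy="constant", fill_value=0),
}

filled = {}
for name, imputer in imputers.items():
    imputer.fit(train_col)  # the fill value is learned from the train set
    filled[name] = imputer.transform(test_col)[0, 0]  # and reused on the test set
```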
- Option for instances where there are no categorical features
- Choose between either {use means from train set (default), calculate means for test set}
- Implement missing value imputation: https://measuringu.com/handle-missing-data/
- GOOD READING: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
- Create a function to do this
- is_public_holiday : bool https://pypi.org/project/holidays/
- Update above diagram
- Encode year linked to overall numeric encoding
- Memorise training set settings (cardinality reductions, cut features) as attributes in order to apply the same settings to test set
- `fit_transform`/`transform` format as with `dataset_conditioner`
- Seems to be missing timestamps
- Inspector
- Conditioner
- Display correlation matrix for top `n` correlates alongside target at the bottom
- Display pairplot for top `n` correlates alongside target at the bottom
- Or, instead of the `top n` correlates, use a threshold of cumulative variance
- Option to drop lower correlates (lower than threshold) if desired
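The drop option could look something like the sketch below. It assumes the threshold applies to the absolute Pearson correlation with the target, which is only one reading of the item above; the cumulative-variance variant is not covered. The function name `drop_low_correlates` and the data are hypothetical.

```python
import pandas as pd


def drop_low_correlates(df, target, threshold):
    """Keep features whose absolute correlation with `target` meets `threshold`.

    Returns the reduced DataFrame and the list of dropped feature names.
    """
    corr = df.select_dtypes("number").corr()[target].drop(target).abs()
    keep = corr[corr >= threshold].index.tolist()
    dropped = corr[corr < threshold].index.tolist()
    return df[keep + [target]], dropped


# Illustrative data only
houses = pd.DataFrame({
    "area": [1, 2, 3, 4, 5, 6],
    "rooms": [1, 1, 2, 2, 3, 4],
    "noise": [1, -1, 1, -1, 1, -1],
    "price": [15, 25, 40, 50, 65, 80],
})

reduced, dropped = drop_low_correlates(houses, target="price", threshold=0.5)
```

Returning the dropped names alongside the reduced frame makes it straightforward to store them and apply the same cut to a test set later.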