This repo includes a notebook that defines a versatile Python function for deploying Python ML models in PySpark. Several examples demonstrate how Python ML can be deployed in PySpark:
- Deploying a RandomForestRegressor in PySpark
- Deploying an ML Pipeline that scales numerical features
- Deploying an ML Pipeline that can preprocess mixed feature types
Making predictions in PySpark with sophisticated Python ML models is made possible by the spark_predict function defined below.
spark_predict is a wrapper around a pandas_udf. The wrapper is needed so that a Python ML model can be passed to the pandas_udf.
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql import Column
from pyspark.sql.types import DoubleType

def spark_predict(model, cols) -> Column:
    """Deploy a Python ML model in PySpark using the `predict` method of `model`.

    Args:
        model: Python ML model with the sklearn API.
        cols (list-like): Features used for predictions, required to be present
            as columns in the Spark DataFrame used to make predictions.
    """
    @sf.pandas_udf(returnType=DoubleType())
    def predict_pandas_udf(*cols):
        # Inside the UDF, cols is a tuple of pandas.Series, one per feature.
        x = pd.concat(cols, axis=1)
        return pd.Series(model.predict(x))

    return predict_pandas_udf(*cols)
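To see what the pandas UDF does with a single batch, its body can be run locally without Spark. The sketch below is illustrative only: MeanOfFeaturesModel and predict_batch are hypothetical names, and the stand-in model simply averages its features so the result is easy to check by hand.

```python
import pandas as pd


class MeanOfFeaturesModel:
    """Hypothetical stand-in for a fitted model; only `predict` is needed."""

    def predict(self, x):
        # Average the feature columns row by row.
        return x.mean(axis=1).to_numpy()


def predict_batch(model, *cols):
    # Same steps as the body of the pandas UDF: join the per-feature
    # Series into one frame, then call the model's predict method.
    x = pd.concat(cols, axis=1)
    return pd.Series(model.predict(x))


# One batch as a tuple of pandas.Series, one Series per feature.
batch = (pd.Series([1.0, 2.0, 3.0]), pd.Series([3.0, 4.0, 5.0]))
result = predict_batch(MeanOfFeaturesModel(), *batch)
print(result.tolist())
```

Any object with a compatible predict method could take the place of the stand-in model here.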
The deploying-python-ml-in-pyspark notebook demonstrates how spark_predict can be used to deploy Python ML in PySpark. It shows that spark_predict can deploy simple ML models as well as more sophisticated pipelines.
I often use both categorical and numerical features in predictive models, so I have included an example with an sklearn Pipeline designed to scale numerical data and encode categorical data. This pipeline combines two preprocessing pipelines with a random forest to create a full prediction pipeline that transforms categorical and numerical data and fits a model. This pipeline is also deployed in PySpark using the spark_predict function.
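A pipeline of this kind could be assembled roughly as follows. This is a minimal sketch, not the notebook's code: the toy data, the column names (age, income, city), and the choice of StandardScaler and OneHotEncoder are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numerical features and one categorical feature.
train = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, 51.0, 38.0, 29.0],
    "income": [40.0, 52.0, 88.0, 95.0, 61.0, 45.0],
    "city": ["a", "b", "a", "c", "b", "c"],
})
target = pd.Series([1.0, 1.5, 3.2, 3.9, 2.1, 1.3])

numeric_features = ["age", "income"]
categorical_features = ["city"]

# One preprocessing pipeline per feature type.
numeric_pipeline = Pipeline([("scale", StandardScaler())])
categorical_pipeline = Pipeline(
    [("encode", OneHotEncoder(handle_unknown="ignore"))]
)

# Route each group of columns to its own preprocessing pipeline.
preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Full prediction pipeline: preprocessing followed by a random forest.
full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
full_pipeline.fit(train, target)
predictions = full_pipeline.predict(train)
```

The fitted pipeline exposes the same predict method as a single sklearn model, which is the only method spark_predict calls.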
See requirements.txt.
The code used in the deploying-python-ml-in-pyspark notebook requires PySpark to be installed. We leave the installation of PySpark to the user.
- The code is based on the excellent blog post "Prediction at Scale with scikit-learn and PySpark Pandas UDFs" written by Michael Heilman.
- sklearn has more information on column transformers with mixed types.