Feature extraction pipelines allow you to define a repeatable process that transforms a set of input features before you build a machine learning model on the final set of features. When the resulting model is put into production, the same feature pipeline must be rerun on each input feature set before that feature set is passed to the model for scoring.
Seldon feature pipelines are presently available in Python. We plan to provide Spark-based pipelines in the future.
Seldon provides a set of Python modules to help construct feature pipelines for use inside Seldon. We use scikit-learn pipelines and Pandas. For feature extraction and transformation we provide a starter set of scikit-learn Transformers that take Pandas dataframes as input, apply some transformations, and output Pandas dataframes. You can also apply any existing sklearn Transformer to Pandas dataframes with sklearn_transform.
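The fit-once, reapply-at-scoring idea can be sketched without any library. The classes below (`MeanCenter`, `MiniPipeline`) are hypothetical and are not part of Seldon or scikit-learn; they only illustrate that a pipeline learns its parameters from the training data and then reuses those same parameters on every scoring request.

```python
class MeanCenter(object):
    """Hypothetical step: learns a feature's mean in fit(), subtracts it in transform()."""

    def __init__(self, feature):
        self.feature = feature
        self.mean_ = None

    def fit(self, rows):
        values = [row[self.feature] for row in rows]
        self.mean_ = sum(values) / float(len(values))
        return self

    def transform(self, rows):
        out = []
        for row in rows:
            new_row = dict(row)  # copy so the input rows are left untouched
            new_row[self.feature + "_centered"] = row[self.feature] - self.mean_
            out.append(new_row)
        return out


class MiniPipeline(object):
    """Hypothetical pipeline: fits each step in order, then replays them at scoring time."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, rows):
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows


# Training time: parameters (here, the mean of "a") are learned once.
training_rows = [{"a": 1.0}, {"a": 3.0}]
pipeline = MiniPipeline([MeanCenter("a")]).fit(training_rows)

# Scoring time: the stored mean is reused and nothing is refitted.
scored = pipeline.transform([{"a": 5.0}])
print(scored)
```

The scoring row is centered with the mean learned from the training rows (2.0), not with a statistic of the scoring data itself, which is why the fitted pipeline has to be saved alongside the model.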
The currently available example transforms are:
Several small examples can be found in python/examples
import seldon.pipeline.sklearn_transform as ssk
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame.from_dict([{"a":1.0,"b":2.0},{"a":2.0,"b":3.0}])
# Wrap a standard sklearn transformer: read column "a", write the result to "a_scaled"
t = ssk.sklearn_transform(input_features=["a"],output_features=["a_scaled"],transformer=StandardScaler())
t.fit(df)
df_2 = t.transform(df)
print(df_2)
When run this would print:
python sklearn_scaler.py
a b a_scaled
0 1 2 -1
1 2 3 1
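The `a_scaled` values above follow from standardization, which is what scikit-learn's `StandardScaler` does: subtract the column mean and divide by the population standard deviation. The arithmetic can be checked by hand with a small sketch; the helper name `standardize` is ours and is not part of Seldon or scikit-learn.

```python
import math


def standardize(values):
    """Return (v - mean) / std for each value, using the population standard deviation."""
    n = float(len(values))
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


# Column "a" from the example: mean is 1.5, population std is 0.5
a_scaled = standardize([1.0, 2.0])
print(a_scaled)  # [-1.0, 1.0]
```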
import seldon.pipeline.auto_transforms as auto
import pandas as pd

# Note that some rows have no value for "c" or "d"
df = pd.DataFrame([{"a":10,"b":1,"c":"cat"},{"a":5,"b":2,"c":"dog","d":"Nov 13 08:36:29 2015"},{"a":10,"b":3,"d":"Oct 13 10:50:12 2015"}])
# date_cols lists the columns that should be parsed as dates
t = auto.Auto_transform(max_values_numeric_categorical=2,date_cols=["d"])
t.fit(df)
df2 = t.transform(df)
print(df2)
This would:
The converted DataFrame would be:
a b c d d_h1 d_h2 d_hour \
0 a_10 -1.224745 cat NaT NaN NaN hnan
1 a_5 0.000000 dog 2015-11-13 08:36:29 0.866025 -0.500000 h8
2 a_10 1.224745 UKN 2015-10-13 10:50:12 0.500000 -0.866025 h10
d_m1 d_m2 d_month d_w d_w1 d_w2 d_year
0 NaN NaN mnan wnan NaN NaN ynan
1 -0.500000 0.866025 m11 w4 -0.433884 -0.900969 y2015
2 -0.866025 0.500000 m10 w1 0.781831 0.623490 y2015
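The numeric date columns in this output (`d_h1`/`d_h2`, `d_m1`/`d_m2`, `d_w1`/`d_w2`) are consistent with a sine/cosine encoding of the hour, month, and weekday, which keeps cyclic neighbours such as 23:00 and 00:00 close together. The sketch below reproduces the values for row 1; it is our reading of the output, not a copy of the `Auto_transform` implementation, and the helper name `cyclic` is hypothetical.

```python
import math
from datetime import datetime


def cyclic(value, period):
    """Map a cyclic value (e.g. hour of day) to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Row 1 of the example: Nov 13 08:36:29 2015 (a Friday, so weekday() is 4)
dt = datetime(2015, 11, 13, 8, 36, 29)

hour = cyclic(dt.hour, 24)         # compare with d_h1, d_h2
month = cyclic(dt.month, 12)       # compare with d_m1, d_m2
weekday = cyclic(dt.weekday(), 7)  # compare with d_w1, d_w2

for name, pair in [("hour", hour), ("month", month), ("weekday", weekday)]:
    print(name, round(pair[0], 6), round(pair[1], 6))
```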
As the final stage of a pipeline you would usually add an sklearn Estimator. We provide three built-in Estimators, which wrap some popular machine learning toolkits and accept Pandas dataframes as input. There is also a general Estimator that can wrap any scikit-learn compatible estimator.
An example pipeline to do very simple extraction on the Iris dataset is contained within the code at python/docker/examples/iris. This contains pipelines that utilize Seldon's Docker pipeline and create the following Python pipelines:
The pipeline utilizing XGBoost is shown below:
import sys, getopt, argparse
import seldon.pipeline.basic_transforms as bt
import seldon.pipeline.util as sutl
import seldon.pipeline.auto_transforms as pauto
from sklearn.pipeline import Pipeline
import seldon.xgb as xg

def run_pipeline(events,models):
    # Map the "name" feature to a zero-based integer id to use as the prediction target
    tNameId = bt.Feature_id_transform(min_size=0,exclude_missing=True,zero_based=True,input_feature="name",output_feature="nameId")
    # Transform the remaining features automatically, leaving the target columns alone
    tAuto = pauto.Auto_transform(max_values_numeric_categorical=2,exclude=["nameId","name"])
    # Final stage: the XGBoost estimator
    xgb = xg.XGBoostClassifier(target="nameId",target_readable="name",excluded=["name"],learning_rate=0.1,silent=0)
    transformers = [("tName",tNameId),("tAuto",tAuto),("xgb",xgb)]
    p = Pipeline(transformers)

    pw = sutl.Pipeline_wrapper()
    df = pw.create_dataframe(events)
    # Fit all stages, then save the fitted pipeline to the models folder
    p.fit(df)
    pw.save_pipeline(p,models)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(prog='xgb_pipeline')
    parser.add_argument('--events', help='events folder', required=True)
    parser.add_argument('--models', help='output models folder', required=True)
    args = parser.parse_args()
    opts = vars(args)
    run_pipeline([args.events],args.models)
The example is explained in more detail here
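The first stage of the XGBoost pipeline above, `Feature_id_transform`, turns the string feature `name` into a numeric `nameId` that the classifier can use as its target. The sketch below shows one way such a mapping could work, going only by the parameter names (`zero_based`, `exclude_missing`, `input_feature`, `output_feature`). It is an assumption, not the Seldon implementation: the functions `build_id_map` and `apply_id_map` are hypothetical, the sample rows are made up, and the real transform may assign ids in a different order.

```python
def build_id_map(rows, input_feature, zero_based=True):
    """Assign each distinct value of input_feature an integer id, in first-seen order."""
    start = 0 if zero_based else 1
    id_map = {}
    for row in rows:
        value = row.get(input_feature)
        if value is None:
            continue  # rows without the feature do not contribute an id
        if value not in id_map:
            id_map[value] = len(id_map) + start
    return id_map


def apply_id_map(rows, id_map, input_feature, output_feature):
    """Copy each row, adding output_feature when the input value has a known id."""
    out = []
    for row in rows:
        new_row = dict(row)
        value = row.get(input_feature)
        if value in id_map:
            new_row[output_feature] = id_map[value]
        out.append(new_row)
    return out


# Made-up events using the three Iris species names
events = [{"name": "setosa"}, {"name": "versicolor"},
          {"name": "setosa"}, {"name": "virginica"}]

id_map = build_id_map(events, "name")
encoded = apply_id_map(events, id_map, "name", "nameId")
print(id_map)
print(encoded)
```

Keeping the readable `name` next to `nameId` is what would let an estimator report predictions by name rather than by id, which appears to be the purpose of the `target_readable` argument in the pipeline above.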
There are two modules for helping in testing and optimizing pipelines:
There is a notebook showing how to use these in a simple example.
Further examples can be found in python/examples