Pipelining
A pipeline is a sequential composition of a number of transformers and a final estimator. Ibex supports composing pipelines both in the explicit sklearn way and through a more succinct pipeline syntax.
In this chapter we’ll use the following Iris dataset:
>>> import numpy as np
>>> from sklearn import datasets
>>> import pandas as pd
>>>
>>> iris = datasets.load_iris()
>>> features, iris = iris['feature_names'], pd.DataFrame(
... np.c_[iris['data'], iris['target']],
... columns=iris['feature_names']+['class'])
>>>
>>> iris.columns
Index([...'sepal length (cm)', ...'sepal width (cm)', ...'petal length (cm)',
...'petal width (cm)', ...'class'],
dtype='object')
We’ll also use SVC and PCA:
>>> from ibex.sklearn.svm import SVC as PdSVC
>>> from ibex.sklearn.decomposition import PCA as PdPCA
sklearn Alternative
Using sklearn.pipeline.Pipeline, we can create a pipeline of steps:
>>> from sklearn.pipeline import Pipeline
>>>
>>> clf = Pipeline([('pca', PdPCA(n_components=2)), ('svm', PdSVC(kernel="linear"))])
Note that the step names are specified explicitly. The second step is named 'svm', even though that name is unrelated to the name of the class.
>>> clf.steps
[('pca', Adapter[PCA](...
...)), ('svm', Adapter[SVC](...
...
...
...))]
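As a usage sketch, a pipeline built this way can be fitted like any other estimator, and its steps can be looked up by the names given above. The example below assumes plain sklearn estimators in place of the Ibex adapters (so that it runs without Ibex installed) and uses the raw Iris arrays rather than the DataFrame built earlier.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: plain sklearn classes stand in for PdPCA and PdSVC.
X, y = datasets.load_iris(return_X_y=True)

clf = Pipeline([("pca", PCA(n_components=2)), ("svm", SVC(kernel="linear"))])
clf.fit(X, y)

# Steps are retrievable by the names chosen when building the pipeline.
fitted_pca = clf.named_steps["pca"]
reduced_shape = fitted_pca.transform(X).shape

# The pipeline as a whole predicts one class label per sample.
predictions = clf.predict(X)
```

Here `named_steps` is keyed by the explicit step names, so the classifier is reached as `clf.named_steps["svm"]`, not by its class name.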
Tip
Step names are important, as they are used by sklearn.pipeline.Pipeline.set_params() and sklearn.pipeline.Pipeline.get_params().
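A minimal sketch of how step names appear in parameter access follows. It assumes plain sklearn estimators instead of the Ibex adapters; nested parameters are addressed as the step name, two underscores, then the parameter name.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: plain sklearn classes stand in for PdPCA and PdSVC.
clf = Pipeline([("pca", PCA(n_components=2)), ("svm", SVC(kernel="linear"))])

# Set parameters of individual steps through their names.
clf.set_params(pca__n_components=3, svm__C=10.0)

# Read them back using the same "<step>__<parameter>" keys.
params = clf.get_params()
n_components = params["pca__n_components"]
svm_c = params["svm__C"]
```

Renaming a step changes these keys, which matters for anything that addresses parameters by name, such as a parameter grid.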
Pipeline-Syntax Alternative
Using the pipeline syntax, we can use | to create a pipeline:
>>> clf = PdPCA(n_components=2) | PdSVC(kernel="linear")
Note that the name of the second step is 'svc':
>>> clf.steps
[('pca', Adapter[PCA](...
...)), ('svc', Adapter[SVC](...
...
...
...))]
This is because the name of the class (in lowercase) is 'svc':
>>> PdSVC.__name__.lower()
'svc'
In fact, this is exactly how sklearn.pipeline.make_pipeline() derives step names. With same-class objects, however, names derived this way would be duplicated. Ibex detects this case and numbers the same-class steps:
>>> from ibex import trans
>>>
>>> (trans(np.sin) | trans(np.cos)).steps
[('functiontransformer_0', FunctionTransformer(...
...)), ('functiontransformer_1', FunctionTransformer(...
...))]
>>>
>>> (trans(np.sin) | trans(np.cos) | trans(np.tan)).steps
[('functiontransformer_0', FunctionTransformer(...
...)), ('functiontransformer_1', FunctionTransformer(...
...)), ('functiontransformer_2', FunctionTransformer(...
...))]
This alternative, therefore, is more succinct, but allows less control over the steps’ names.
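The naming scheme described in this section can be sketched roughly as follows. The first part uses sklearn's make_pipeline to show names derived from lowercased class names. The second part is a hypothetical helper, derive_step_names, written only for illustration: it is not Ibex's implementation, and its handling of pipelines that mix repeated and unique classes is an assumption.

```python
from collections import Counter

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# make_pipeline names each step after its lowercased class name.
auto = make_pipeline(PCA(n_components=2), SVC(kernel="linear"))
auto_names = [name for name, _ in auto.steps]


def derive_step_names(estimators):
    """Hypothetical helper: lowercase class names, numbered when a class repeats."""
    bases = [type(est).__name__.lower() for est in estimators]
    totals = Counter(bases)  # how often each class name occurs overall
    seen = Counter()         # how many of each have been named so far
    names = []
    for base in bases:
        if totals[base] > 1:
            # Repeated class: append a running index, as in 'functiontransformer_0'.
            names.append(f"{base}_{seen[base]}")
            seen[base] += 1
        else:
            # Unique class: the lowercased class name alone is unambiguous.
            names.append(base)
    return names


derived = derive_step_names([PCA(), SVC(), PCA()])
```

In this sketch `auto_names` is `['pca', 'svc']` and `derived` is `['pca_0', 'svc', 'pca_1']`. Names generated this way depend on the classes and their order, which is why explicit names remain the better choice when other code refers to steps by name.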