Pipelining

A pipeline is a sequential composition of a number of transformers, and a final estimator. Ibex allows pipeline compositions in both the original sklearn explicit way, as well as a more succinct pipeline-syntax version.

In this chapter we’ll use the following Iris dataset:

>>> import numpy as np
>>> from sklearn import datasets
>>> import pandas as pd
>>>
>>> iris = datasets.load_iris()
>>> features, iris = iris['feature_names'], pd.DataFrame(
...     np.c_[iris['data'], iris['target']],
...     columns=iris['feature_names']+['class'])
>>>
>>> iris.columns
Index([...'sepal length (cm)', ...'sepal width (cm)', ...'petal length (cm)',
       ...'petal width (cm)', ...'class'],
      dtype='object')

We’ll also use SVC and PCA:

>>> from ibex.sklearn.svm import SVC as PdSVC
>>> from ibex.sklearn.decomposition import PCA as PdPCA

sklearn Alternative

Using sklearn.pipeline.Pipeline, we can create a pipeline of steps:

>>> from sklearn.pipeline import Pipeline
>>>
>>> clf = Pipeline([('pca', PdPCA(n_components=2)), ('svm', PdSVC(kernel="linear"))])

Note how the step names can be exactly specified. The name of the second step is 'svm', even though that is unrelated to the name of the class.

>>> clf.steps
[('pca', Adapter[PCA](...
  ...)), ('svm', Adapter[SVC](...
      ...
      ...
  ...))]

Pipeline-Syntax Alternative

Using the pipeline syntax, we can use | to create a pipeline:

>>> clf = PdPCA(n_components=2) | PdSVC(kernel="linear")

Note that the name of the second step is 'svc':

>>> clf.steps
[('pca', Adapter[PCA](...
  ...)), ('svc', Adapter[SVC](...
      ...
      ...
  ...))]

This is because the name of the class (in lowercase) is 'svc':

>>> PdSVC.__name__.lower()
'svc'

In fact, this is exactly the behavior of sklearn.pipeline.make_pipeline(). The make_pipeline function, however, does not allow using same-class objects, as the names would be duplicated. Ibex allows this by detecting this, and numbering same-class steps:

>>> from ibex import trans
>>>
>>> (trans(np.sin) | trans(np.cos)). steps
[('functiontransformer_0', FunctionTransformer(...
          ...)), ('functiontransformer_1', FunctionTransformer(...
          ...))]
>>>
>>> (trans(np.sin) | trans(np.cos) | trans(np.tan)). steps
[('functiontransformer_0', FunctionTransformer(...
          ...)), ('functiontransformer_1', FunctionTransformer(...
          ...)), ('functiontransformer_2', FunctionTransformer(...
          ...))]

This alternative, therefore, is more succinct, but allows less control over the steps’ names.