Uniting Features

A feature-union horizontally concatenates the pandas.DataFrame results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
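As a minimal sketch of what "concatenates the results" means, the snippet below uses plain sklearn transformers on NumPy arrays rather than the Ibex adapters, so the output is an array and not a pandas.DataFrame. The variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Two transformers applied to the same input, results stacked side by side.
union = FeatureUnion([('pca', PCA(n_components=2)), ('best', SelectKBest(k=1))])
combined = union.fit_transform(X, y)

# Fit the same two transformers separately for comparison.
pca_part = PCA(n_components=2).fit_transform(X)
best_part = SelectKBest(k=1).fit_transform(X, y)

# The union has one column per output column of each transformer.
print(pca_part.shape, best_part.shape, combined.shape)

# The union's output matches the horizontal stack of the separate outputs.
same = np.allclose(combined, np.hstack([pca_part, best_part]))
print(same)
```

The union is two columns of principal components followed by one selected feature, which is the same layout the Ibex version produces with labeled columns.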

In this chapter we’ll use the following Iris dataset:

>>> import numpy as np
>>> from sklearn import datasets
>>> import pandas as pd
>>>
>>> iris = datasets.load_iris()
>>> features, iris = iris['feature_names'], pd.DataFrame(
...     np.c_[iris['data'], iris['target']],
...     columns=iris['feature_names']+['class'])
>>>
>>> iris.columns
Index([...'sepal length (cm)', ...'sepal width (cm)', ...'petal length (cm)',
       ...'petal width (cm)', ...'class'],
      dtype='object')

We’ll also use PCA and univariate feature selection:

>>> from ibex.sklearn.decomposition import PCA as PdPCA
>>> from ibex.sklearn.feature_selection import SelectKBest as PdSelectKBest

sklearn Alternative

Using sklearn.pipeline.FeatureUnion, we can create a feature-union of steps:

>>> from sklearn.pipeline import FeatureUnion
>>>
>>> trn = FeatureUnion([('pca', PdPCA(n_components=2)), ('best', PdSelectKBest(k=1))])

Note that the step names can be specified exactly. The name of the second step is 'best', even though that is unrelated to the name of the class.

>>> trn.transformer_list
[('pca', Adapter[PCA](...
  ...), ('best', Adapter[SelectKBest](...)]
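To see what explicit naming buys us, the following sketch compares it with sklearn's own make_union helper, which derives names from the class names. It uses plain sklearn transformers, on the assumption that the naming behavior is what matters here and not the pandas adapters.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, make_union

# Names chosen by us.
explicit = FeatureUnion([('pca', PCA(n_components=2)), ('best', SelectKBest(k=1))])

# Names derived automatically from the lowercased class names.
automatic = make_union(PCA(n_components=2), SelectKBest(k=1))

explicit_names = [name for name, _ in explicit.transformer_list]
automatic_names = [name for name, _ in automatic.transformer_list]

print(explicit_names)   # ['pca', 'best']
print(automatic_names)  # ['pca', 'selectkbest']
```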

Tip

Step names are important, because ibex.sklearn.pipeline.FeatureUnion.set_params() and ibex.sklearn.pipeline.FeatureUnion.get_params() use them to address the parameters of individual steps.
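A short sketch of how step names address nested parameters, shown with plain sklearn.pipeline.FeatureUnion. It assumes the Ibex version follows the same `<step name>__<parameter>` convention, which this page does not spell out.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

union = FeatureUnion([('pca', PCA(n_components=2)), ('best', SelectKBest(k=1))])

# Parameters of a step are addressed as '<step name>__<parameter name>'.
union.set_params(pca__n_components=3, best__k=2)

params = union.get_params()
print(params['pca__n_components'])  # 3
print(params['best__k'])            # 2

# The step itself is available under its bare name.
pca_step = params['pca']
print(pca_step.n_components)        # 3
```

Had the second step been named differently, the `best__k` key would change with it, which is why stable, meaningful names are worth choosing.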

Pipeline-Syntax Alternative

Using the pipeline syntax, we can use + to create a feature union:

>>> trn = PdPCA(n_components=2) + PdSelectKBest(k=1)

The output using this, however, discards the meaning of the columns:

>>> trn = PdPCA(n_components=2) + PdSelectKBest(k=1)
>>> trn.fit_transform(iris[features], iris['class'])
          pca                 selectkbest
       comp_0    comp_1 petal length (cm)
0   -2.684207 ...0.326607               1.4
1   -2.715391 ...0.169557               1.4
2   -2.889820 ...0.137346               1.3
3   -2.746437 ...0.311124               1.5
4   -2.728593 ...0.333925               1.4
    ...
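The two-level column header above can be approximated with plain pandas. The sketch below is not Ibex's implementation, only an illustration of the resulting structure: each transformer's output becomes a DataFrame, and pandas.concat with keys adds the transformer name as the top column level. The names pca_df, best_df, and union are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

data = load_iris()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.Series(data['target'], name='class')

# PCA output has no natural column names, so generic ones are assigned.
pca = PCA(n_components=2).fit(X)
pca_df = pd.DataFrame(pca.transform(X), index=X.index, columns=['comp_0', 'comp_1'])

# Feature selection keeps an original column, so its name is still meaningful.
selector = SelectKBest(k=1).fit(X, y)
best_df = X.loc[:, selector.get_support()]

# keys= adds the transformer name as the first level of a column MultiIndex.
union = pd.concat([pca_df, best_df], axis=1, keys=['pca', 'selectkbest'])
print(list(union.columns))

# Selecting on the first level recovers one transformer's columns.
print(union['pca'].shape)
```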

A better way would be to combine the + syntax with ibex.trans(), which lets us name the output columns:

>>> from ibex import trans
>>>
>>> trn = trans(PdPCA(n_components=2), out_cols=['pc1', 'pc2']) + trans(PdSelectKBest(k=1), out_cols='best', pass_y=True)
>>> trn.fit_transform(iris[features], iris['class'])
    functiontransformer_0           functiontransformer_1
                      pc1       pc2                  best
0               -2.684207 ...0.326607                   1.4
1               -2.715391 ...0.169557                   1.4
2               -2.889820 ...0.137346                   1.3
3               -2.746437 ...0.311124                   1.5
4               -2.728593 ...0.333925                   1.4
    ...

Note the names of the transformers:

>>> trn.transformer_list
[('functiontransformer_0', FunctionTransformer(func=Adapter[PCA](...
  ...
          ...
          ...)), ('functiontransformer_1', FunctionTransformer(func=Adapter[SelectKBest](...
          ...))]

This is similar to the discussion of the pipeline-syntax alternative in Pipelining.

Note

Just as with sklearn.pipeline.Pipeline vs. |, sklearn.pipeline.FeatureUnion gives greater control over step names than +. Note, however, that FeatureUnion also provides control over further aspects, e.g., the ability to run steps in parallel.
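As a sketch of those further aspects, the snippet below passes n_jobs and transformer_weights to plain sklearn.pipeline.FeatureUnion. Both are constructor parameters of the sklearn class; the weights chosen here are arbitrary, and the make_steps helper is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)


def make_steps():
    # Fresh transformer instances for each union, so no fitted state is shared.
    return [('pca', PCA(n_components=2)), ('best', SelectKBest(k=1))]


plain = FeatureUnion(make_steps())

# n_jobs=2 asks joblib to fit the steps concurrently;
# transformer_weights multiplies each step's output by the given factor.
tuned = FeatureUnion(
    make_steps(),
    n_jobs=2,
    transformer_weights={'pca': 1.0, 'best': 10.0},
)

plain_out = plain.fit_transform(X, y)
tuned_out = tuned.fit_transform(X, y)

# The PCA columns are unchanged; the selected feature is scaled by 10.
pca_unchanged = np.allclose(tuned_out[:, :2], plain_out[:, :2])
best_scaled = np.allclose(tuned_out[:, 2], 10.0 * plain_out[:, 2])
print(pca_unchanged, best_scaled)
```

With only two cheap steps on a small dataset, parallel fitting is unlikely to be faster; the option pays off when individual transformers are expensive.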