Uniting Features¶
A feature-union horizontally concatenates the pandas.DataFrame
results of multiple transformer objects.
This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
In this chapter we’ll use the following Iris dataset:
>>> import numpy as np
>>> from sklearn import datasets
>>> import pandas as pd
>>>
>>> iris = datasets.load_iris()
>>> features, iris = iris['feature_names'], pd.DataFrame(
... np.c_[iris['data'], iris['target']],
... columns=iris['feature_names']+['class'])
>>>
>>> iris.columns
Index([...'sepal length (cm)', ...'sepal width (cm)', ...'petal length (cm)',
...'petal width (cm)', ...'class'],
dtype='object')
We’ll also use PCA and univariate feature selection:
>>> from ibex.sklearn.decomposition import PCA as PdPCA
>>> from ibex.sklearn.feature_selection import SelectKBest as PdSelectKBest
sklearn
Alternative¶
Using sklearn.pipeline.FeatureUnion
, we can create a feature-union of steps:
>>> from sklearn.pipeline import FeatureUnion
>>>
>>> trn = FeatureUnion([('pca', PdPCA(n_components=2)), ('best', PdSelectKBest(k=1))])
Note how the step names can be exactly specified. The name of the second step is 'best'
, even though that is unrelated to the name of the class.
>>> trn.transformer_list
[('pca', Adapter[PCA](...
...), ('best', Adapter[SelectKBest](...)]
Tip
Steps’ names are important, as they are used by
ibex.sklearn.pipeline.FeatureUnion.set_params()
and ibex.sklearn.pipeline.FeatureUnion.get_params()
.
Pipeline-Syntax Alternative¶
Using the pipeline syntax, we can use +
to create a pipeline:
>>> trn = PdPCA(n_components=2) + PdSelectKBest(k=1)
The output using this, however, discards the meaning of the columns:
>>> trn = PdPCA(n_components=2) + PdSelectKBest(k=1)>>> trn.fit_transform(iris[features], iris['class']) pca selectkbest comp_0 comp_1 petal length (cm) 0 -2.684207 ...0.326607 1.4 1 -2.715391 ...0.169557 1.4 2 -2.889820 ...0.137346 1.3 3 -2.746437 ...0.311124 1.5 4 -2.728593 ...0.333925 1.4 ...
A better way would be to combine this with ibex.trans()
:
>>> from ibex import trans >>> >>> trn = trans(PdPCA(n_components=2), out_cols=['pc1', 'pc2']) + trans(PdSelectKBest(k=1), out_cols='best', pass_y=True)>>> trn.fit_transform(iris[features], iris['class']) functiontransformer_0 functiontransformer_1 pc1 pc2 best 0 -2.684207 ...0.326607 1.4 1 -2.715391 ...0.169557 1.4 2 -2.889820 ...0.137346 1.3 3 -2.746437 ...0.311124 1.5 4 -2.728593 ...0.333925 1.4 ...
Note the names of the transformers:
>>> trn.transformer_list
[('functiontransformer_0', FunctionTransformer(func=Adapter[PCA](...
...
...
...)), ('functiontransformer_1', FunctionTransformer(func=Adapter[SelectKBest](...
...))]
This is similar to the discussion of pipeline_pipeline_syntax_alternative in Pipelining.
Note
Just as with sklearn.pipeline.Pipeline
vs. |
, also sklearn.pipeline.FeatureUnion
gives greater control over steps name
relative to +
. Note, however that FeatureUnion
provides control over further aspects, e.g., the ability to run steps in parallel.