Extending
Writing new estimators is easy. One way of doing this is to write an estimator conforming to the scikit-learn protocol, and then wrap it with ibex.frame()
(see Adapting Estimators). A different way is to write it directly as a pandas
estimator. This might be the only way to go if the logic of the estimator is
pandas-specific. This chapter shows how to write a new estimator from scratch.
Example Transformation
Suppose we have a pandas.DataFrame like this:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'a': [1, 3, 2, 1, 2], 'b': range(5), 'c': range(2, 7)})
>>> df
   a  b  c
0  1  0  2
1  3  1  3
2  2  2  4
3  1  3  5
4  2  4  6
We think that, for each row, the mean values of 'b' and 'c', aggregated by 'a', might make a useful feature. In pandas, we could write this as follows:
>>> df.groupby(df.a).transform(np.mean)
     b    c
0  1.5  3.5
1  1.0  3.0
2  3.0  5.0
3  1.5  3.5
4  3.0  5.0
We now want to write a transformer to do this, so that it can be used in more general settings (e.g., cross validation).
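To see why a fitted transformer is useful here, the following is a minimal sketch in plain pandas (no ibex) that learns the group means from one frame and applies them to another. The split into `train` and `test` and all variable names are illustrative assumptions, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2, 1, 2], 'b': range(5), 'c': range(2, 7)})

# Hypothetical split: learn the statistics on the first four rows only.
train, test = df.iloc[:4], df.iloc[4:]

# "Fit": per-group means of 'b' and 'c', indexed by the values of 'a'.
group_means = train.groupby('a')[['b', 'c']].mean()

# "Transform": look up the learned means for each row of the unseen frame.
test_features = test[['a']].join(group_means, on='a')[['b', 'c']]
print(test_features)
```

Row 4 gets 2.0 and 4.0 here, the means of the training rows with a == 2. The whole-frame computation above gives 3.0 and 5.0 for the same row, because it includes the row itself in the mean.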
Writing A New Transformer Step
We can write a (slightly more general) estimator, as follows:
>>> from sklearn import base
>>> import ibex
>>> class GroupbyAggregator(
...         base.BaseEstimator,  # (1)
...         base.TransformerMixin,  # (2)
...         ibex.FrameMixin):  # (3)
...
...     def __init__(self, group_col, agg_func=np.mean):
...         self._group_col, self._agg_func = group_col, agg_func
...
...     def fit(self, X, _=None):
...         self.x_columns = X.columns  # (4)
...         # Aggregate X by the values of its group column.
...         self._agg = X.groupby(X[self._group_col]).apply(self._agg_func)
...         return self
...
...     def transform(self, X):
...         Xt = X[self.x_columns]  # (5)
...         # Left-merge on the group column to look up each row's aggregates.
...         Xt = pd.merge(
...             Xt[[self._group_col]],
...             self._agg,
...             how='left')
...         return Xt[[c for c in Xt.columns if c != self._group_col]]
Note the following general points:
- (1) We subclass sklearn.base.BaseEstimator, as this is an estimator.
- (2) We subclass sklearn.base.TransformerMixin, as, in this case, this is specifically a transformer.
- (3) We subclass ibex.FrameMixin, as this estimator deals with pandas entities.
- (4) In fit, we make sure to set ibex.FrameMixin.x_columns; this ensures that the transformer will “remember” the columns it should see in further calls.
- (5) In transform, we first use x_columns. This verifies the columns of X, and also reorders them according to the original order seen in fit (if needed).
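The effect of selecting by the remembered columns can be sketched in plain pandas. This only illustrates what the expression `X[self.x_columns]` does on an ordinary DataFrame; the name `fitted_columns` is hypothetical, and any additional checks performed by ibex.FrameMixin itself are not shown.

```python
import pandas as pd

# What fit would have recorded (hypothetical stand-in for x_columns).
fitted_columns = pd.Index(['a', 'b', 'c'])

# Same columns in a different order: selection restores the fitted order.
shuffled = pd.DataFrame({'c': [2, 3], 'a': [1, 3], 'b': [0, 1]})
reordered = shuffled[fitted_columns]

# A frame lacking one of the fitted columns is rejected with a KeyError.
incomplete = pd.DataFrame({'a': [1, 3], 'b': [0, 1]})
try:
    incomplete[fitted_columns]
    missing_detected = False
except KeyError:
    missing_detected = True
```

Note that plain selection silently drops extra columns that were not seen in fit, so it acts as a check only for missing ones.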
The rest is logic specific to this transformer.
- In __init__, the group column and aggregation function are stored.
- In fit, X is aggregated by the group column according to the aggregation function, and the result is recorded.
- In transform, X (which is not necessarily the one used in fit) is left-merged with the aggregation result, and then the relevant columns of the result are returned.
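The left-merge step has two consequences that are easy to miss, shown in the sketch below using plain pandas. The frames `agg` and `new` are made-up stand-ins for the recorded aggregate and for the data passed to transform; they are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical fitted aggregate: one row per group value seen in fit.
agg = pd.DataFrame({'a': [1, 2], 'b': [1.5, 3.0], 'c': [3.5, 5.0]})

# New data with a non-default index and a group (a == 7) absent during fit.
new = pd.DataFrame({'a': [2, 7, 1]}, index=[10, 11, 12])

# A left merge keeps every row of `new`, in its original row order.
merged = pd.merge(new, agg, on='a', how='left')
features = merged[['b', 'c']]
print(features)
```

Groups that were not seen in fit produce NaN features, and merging on a column replaces the original index with a fresh 0..n-1 index. A caller that needs the original row labels would have to restore them afterwards.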
We can now use this as a regular step. If we fit it on df and transform the same df, we get the result above:
>>> GroupbyAggregator('a').fit(df).transform(df)
     b    c
0  1.5  3.5
1  1.0  3.0
2  3.0  5.0
3  1.5  3.5
4  3.0  5.0
We can, however, now use it for fitting on one DataFrame, and transforming another:
>>> try:
...     from sklearn.model_selection import train_test_split
... except ImportError:  # Older sklearn versions
...     from ibex.sklearn.cross_validation import train_test_split
>>>
>>> tr, te = train_test_split(df, random_state=3)
>>> GroupbyAggregator('a').fit(tr).transform(te)
     b    c
0  0...  2...
1  2...  4...