Verification and Processing¶
Since sklearn is defined in terms of numpy.ndarray (and not pandas.DataFrame), Ibex estimators perform verification and processing on their inputs and outputs.
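To see why such processing is needed, the following minimal sketch uses only pandas and numpy (no Ibex or sklearn calls). It shows that converting a DataFrame to an array drops the labels, which then have to be re-attached. The name frame and the re-attachment step are illustrative assumptions, not Ibex internals.

```python
import numpy as np
import pandas as pd

# Illustrative frame (hypothetical name), with a non-default index.
frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[10, 20])

# Converting to the array type that sklearn works with drops all labels.
arr = frame.to_numpy()
print(type(arr).__name__, hasattr(arr, 'columns'))

# Assumed post-processing step: re-attach index and columns by hand.
restored = pd.DataFrame(arr, index=frame.index, columns=frame.columns)
print(restored.equals(frame))
```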
In this chapter we’ll use a DataFrame X, with columns 'a' and 'b' and the (implied) index 0, 1:
>>> import pandas as pd
>>> X = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
We’ll also use a scaling transformer trn, which is fitted on X:
>>> from ibex.sklearn import preprocessing as pd_preprocessing
>>> trn = pd_preprocessing.StandardScaler().fit(X)
Finally, we’ll use a linear-regression predictor prd, which is also fitted on X:
>>> from ibex.sklearn import linear_model as pd_linear_model
>>> prd = pd_linear_model.LinearRegression().fit(X, pd.Series([3, 4]))
Input Verification¶
Following the call to fit, we can apply further methods of trn to any DataFrame with the same column-set. For example, this is OK:
>>> X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})
>>> trn.transform(X_1)
a b
0 -1... -1...
1 1... 1...
2 3... 3...
but this is not
>>> X_2 = X_1.rename(columns={'b': 'c'})
>>> trn.transform(X_2)
Traceback (most recent call last):
...
KeyError: "...'b'...not in index"
Once an estimator has been fitted, the order of columns of further inputs no longer matters:
>>> trn.transform(X_1[['a', 'b']])
a b
0 -1... -1...
1 1... 1...
2 3... 3...
>>> trn.transform(X_1[['b', 'a']])
a b
0 -1... -1...
1 1... 1...
2 3... 3...
The estimator reorders the columns of the input DataFrame to match the order seen by fit.
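The verification and reordering described in this section can be sketched in plain pandas. The helper check_and_reorder below is hypothetical and is not part of the Ibex API. It only checks for missing columns, and the error message merely imitates the one shown above.

```python
import pandas as pd


def check_and_reorder(X, fit_columns):
    """Hypothetical helper: verify X has the fitted columns, then reorder."""
    missing = [c for c in fit_columns if c not in X.columns]
    if missing:
        # Imitates the KeyError shown above; Ibex's real message may differ.
        raise KeyError('{} not in index'.format(missing))
    # Selecting by a list of labels returns the columns in that order.
    return X[list(fit_columns)]


X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})

# Columns given in a different order come back in the fitted order.
reordered = check_and_reorder(X_1[['b', 'a']], ['a', 'b'])
print(list(reordered.columns))
```

Calling the helper on X_1.rename(columns={'b': 'c'}) raises KeyError, because 'b' is no longer present.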
Output Processing¶
Indexes¶
The index of a returned DataFrame or Series object is that of the input:
>>> X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}, index=[10, 20, 30])
>>> trn.transform(X_1)
a b
10 -1... -1...
20 1... 1...
30 3... 3...
>>>
>>> prd.predict(X_1)
10 3...
20 4...
30 5...
dtype: ...
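As a rough illustration of index preservation, the sketch below wraps an array-based predictor so that its result carries the input's index. Both predict_with_index and the stand-in predictor row_sums are hypothetical names chosen for this example; they are not Ibex functions.

```python
import pandas as pd


def predict_with_index(predict_array, X):
    """Hypothetical wrapper: call an array-based predictor, keep X's index."""
    y = predict_array(X.to_numpy())
    return pd.Series(y, index=X.index)


def row_sums(arr):
    """Stand-in for a fitted predictor that only understands arrays."""
    return arr.sum(axis=1)


X_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}, index=[10, 20, 30])
y_hat = predict_with_index(row_sums, X_1)
print(y_hat)
```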
DataFrame Columns¶
In general, the columns of an output DataFrame are those on which the estimator was fitted:
>>> trn.transform(X_1[['a', 'b']])
a b
10 -1... -1...
20 1... 1...
30 3... 3...
>>> trn.transform(X_1[['b', 'a']])
a b
10 -1... -1...
20 1... 1...
30 3... 3...
Some output DataFrame objects have a different number of columns from the input, so the fitted column names cannot be reused. In the example below, the single resulting column is named comp_0 rather than after an input column:
>>> from ibex.sklearn import decomposition as pd_decomposition
>>> pd_decomposition.PCA(n_components=1).fit(X).transform(X)
comp_0
0 -0.707107
1 0.707107
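The following sketch shows one way generated column names could be attached when the output width differs from the input width. It is an assumption-laden illustration: numpy's SVD stands in for PCA, project_with_names is a hypothetical helper, and the comp_ prefix simply mirrors the column name in the output above. The sign of each component is arbitrary, so the printed values may be negated relative to those shown above.

```python
import numpy as np
import pandas as pd


def project_with_names(X, n_components, prefix='comp_'):
    """Hypothetical helper: project onto principal axes, name the columns."""
    values = X.to_numpy(dtype=float)
    centered = values - values.mean(axis=0)
    # Rows of vt are the principal axes, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    # Assumed naming scheme: prefix followed by the component number.
    columns = ['{}{}'.format(prefix, i) for i in range(n_components)]
    return pd.DataFrame(scores, index=X.index, columns=columns)


X = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
out = project_with_names(X, n_components=1)
print(out)
```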
Note
In some cases, we might want greater control over the naming of output columns. For example, when transforming with a 2-component PCA, we might want to name the DataFrame columns 'pc1' and 'pc2'. Specifying Output Columns in Transforming shows how to do this.