# Preprocessing

`dask_ml.preprocessing` contains some scikit-learn style transformers that
can be used in `Pipelines` to perform various data transformations as part
of the model fitting process. These transformers will work well on dask
collections (`dask.array`, `dask.dataframe`), NumPy arrays, or pandas
dataframes. They’ll fit and transform in parallel.

## Scikit-Learn Clones

Some of the transformers are (mostly) drop-in replacements for their scikit-learn counterparts.

| Transformer | Description |
| --- | --- |
| `MinMaxScaler` ([feature_range, copy]) | Transforms features by scaling each feature to a given range. |
| `QuantileTransformer` ([n_quantiles, …]) | Transforms features using quantile information. |
| `RobustScaler` ([with_centering, with_scaling, …]) | Scale features using statistics that are robust to outliers. |
| `StandardScaler` ([copy, with_mean, with_std]) | Standardize features by removing the mean and scaling to unit variance. |
| `LabelEncoder` | Encode labels with value between 0 and n_classes-1. |

These can be used just like the scikit-learn versions, except that:

- They operate on dask collections in parallel
- `.transform` will return a `dask.array` or `dask.dataframe` when the input is a dask collection

See `sklearn.preprocessing` for more information about any particular
transformer.

## Additional Transformers

Other transformers are specific to dask-ml.

| Transformer | Description |
| --- | --- |
| `Categorizer` ([categories, columns]) | Transform columns of a DataFrame to categorical dtype. |
| `DummyEncoder` ([columns, drop_first]) | Dummy (one-hot) encode categorical columns. |
| `OrdinalEncoder` ([columns]) | Ordinal (integer) encode categorical columns. |

Both `dask_ml.preprocessing.Categorizer` and
`dask_ml.preprocessing.DummyEncoder` deal with converting non-numeric
data to numeric data. They are useful as a preprocessing step in a pipeline
where you start with heterogeneous data (a mix of numeric and non-numeric), but
the estimator requires all numeric data.

In this toy example, we use a dataset with two columns. `'A'` is numeric and
`'B'` contains text data. We make a small pipeline to:

- Categorize the text data
- Dummy encode the categorical data
- Fit a logistic regression

```
In [1]: from dask_ml.preprocessing import Categorizer, DummyEncoder
In [2]: from sklearn.linear_model import LogisticRegression
In [3]: from sklearn.pipeline import make_pipeline
In [4]: import pandas as pd
In [5]: import dask.dataframe as dd
In [6]: df = pd.DataFrame({"A": [1, 2, 1, 2], "B": ["a", "b", "c", "c"]})
In [7]: X = dd.from_pandas(df, npartitions=2)
In [8]: y = dd.from_pandas(pd.Series([0, 1, 1, 0]), npartitions=2)
In [9]: pipe = make_pipeline(
   ...:     Categorizer(),
   ...:     DummyEncoder(),
   ...:     LogisticRegression()
   ...: )
   ...:
In [10]: pipe.fit(X, y)
Out[10]:
Pipeline(memory=None,
steps=[('categorizer', Categorizer(categories=None, columns=None)), ('dummyencoder', DummyEncoder(columns=None, drop_first=False)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])
```

`Categorizer` will convert a subset of the columns in `X` to categorical
dtype (see the pandas documentation for more about how pandas handles
categorical data). By default, it converts all the `object` dtype columns.
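The default behavior described above can be approximated in plain pandas. This is only an illustrative analogue, not dask-ml's implementation: it selects the `object` dtype columns and converts each one to categorical dtype. The name `categorized` is made up for this sketch, and column `'B'` is built with an explicit `object` dtype so the example does not depend on how pandas infers string columns.

```python
import pandas as pd

# Same toy data as in the pipeline example; 'B' is forced to object dtype
# so the sketch does not depend on pandas' string-dtype inference.
df = pd.DataFrame({
    "A": [1, 2, 1, 2],
    "B": pd.Series(["a", "b", "c", "c"], dtype=object),
})

# Select the object-dtype columns, mirroring the documented default.
object_columns = df.select_dtypes(include="object").columns

# Convert each selected column to categorical dtype on a copy;
# numeric columns such as 'A' are left untouched.
categorized = df.copy()
for name in object_columns:
    categorized[name] = categorized[name].astype("category")

print(categorized.dtypes)
print(list(categorized["B"].cat.categories))
```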

`DummyEncoder` will dummy (or one-hot) encode the dataset. This replaces a
categorical column with multiple columns, one per category, where the values
are either 0 or 1, depending on whether the value in the original column
matches that category.

```
In [11]: df['B']
Out[11]:
0    a
1    b
2    c
3    c
Name: B, dtype: object
In [12]: pd.get_dummies(df['B'])
Out[12]:
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  1
```

Wherever the original was `'a'`, the transformed data now has a `1` in the `a`
column and a `0` everywhere else.

Why was the `Categorizer` step necessary? Why couldn’t we operate directly
on the `object` (string) dtype column? Doing this would be fragile,
especially when using `dask.dataframe`, since *the shape of the output would
depend on the values present*. For example, suppose that we just saw the first
two rows in the training dataset, and the last two rows in the test dataset.
Then, when training, our transformed columns would be:

```
In [13]: pd.get_dummies(df.loc[[0, 1], 'B'])
Out[13]:
   a  b
0  1  0
1  0  1
```

while on the test dataset, they would be:

```
In [14]: pd.get_dummies(df.loc[[2, 3], 'B'])
Out[14]:
   c
2  1
3  1
```

Which is incorrect! The columns don’t match.

When we categorize the data, we can be confident that all the possible values
have been specified, so the output shape no longer depends on the values in
whatever subset of the data we currently see. Instead, it depends on the
`categories`, which are identical in all the subsets.
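To make this concrete, here is a pandas-only sketch of the same train/test split once the column has categorical dtype. The categories are declared by hand here as an assumption for illustration; in the pipeline above, `Categorizer` is the step that supplies them.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 1, 2], "B": ["a", "b", "c", "c"]})

# Declare the full set of categories up front (hand-written for this sketch).
dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
b = df["B"].astype(dtype)

# Dummy encode the same two subsets as before.
train = pd.get_dummies(b.loc[[0, 1]])
test = pd.get_dummies(b.loc[[2, 3]])

# Both subsets now produce one column per declared category,
# including categories that do not occur in that subset.
print(list(train.columns))
print(list(test.columns))
```

Both subsets yield the columns `a`, `b`, and `c`, so an estimator fit on one subset sees the same layout on the other.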