# Incremental Learning

Some estimators can be trained incrementally, without seeing the entire
dataset at once. Scikit-Learn provides the `partial_fit` API to stream batches
of data to an estimator that can be fit in batches.

Normally, if you pass a Dask Array to an estimator expecting a NumPy array, the Dask Array will be converted to a single, large NumPy array. On a single machine, you’ll likely run out of RAM and crash the program. On a distributed cluster, all the workers will send their data to a single machine and crash it.

`dask_ml.wrappers.Incremental` provides a bridge between Dask and
Scikit-Learn estimators supporting the `partial_fit` API. You wrap the
underlying estimator in `Incremental`. Dask-ML will sequentially pass each
block of a Dask Array to the underlying estimator’s `partial_fit` method.

## Incremental Meta-estimator

| `wrappers.Incremental`([estimator, scoring, …]) | Metaestimator for feeding Dask Arrays to an estimator blockwise. |

`dask_ml.wrappers.Incremental` is a meta-estimator (an estimator that
takes another estimator) that bridges scikit-learn estimators, which expect
NumPy arrays, and users with large Dask Arrays.

Each *block* of a Dask Array is fed to the underlying estimator’s
`partial_fit` method. The training is entirely sequential, so you won’t
notice massive training time speedups from parallelism. In a distributed
environment, you should notice some speedup from avoiding extra IO, and from
the fact that models are typically much smaller than data, and so are faster
to move between machines.
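The sequential, blockwise training described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Dask-ML's implementation: `RunningMeanRegressor` and `fit_blockwise` are hypothetical names, and nested lists stand in for the blocks of a Dask Array.

```python
class RunningMeanRegressor:
    """Toy estimator (hypothetical): predicts the mean of all targets seen so far."""

    def __init__(self):
        self.n_seen_ = 0
        self.mean_ = 0.0

    def partial_fit(self, X, y):
        # Update the running mean one target at a time; X is ignored by this toy model.
        for target in y:
            self.n_seen_ += 1
            self.mean_ += (target - self.mean_) / self.n_seen_
        return self


def fit_blockwise(estimator, X_blocks, y_blocks):
    """Feed each block to ``partial_fit`` in order, one block at a time."""
    for X_block, y_block in zip(X_blocks, y_blocks):
        estimator.partial_fit(X_block, y_block)
    return estimator


# Two blocks of two samples each (stand-ins for Dask Array blocks).
X_blocks = [[[0.0], [1.0]], [[2.0], [3.0]]]
y_blocks = [[1.0, 3.0], [5.0, 7.0]]

model = fit_blockwise(RunningMeanRegressor(), X_blocks, y_blocks)
print(model.n_seen_, model.mean_)
```

Only one block is needed at a time, and the state carried between blocks (here, two numbers) is far smaller than the data, which is the property the paragraph above relies on.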

```
In [1]: from dask_ml.datasets import make_classification
In [2]: from dask_ml.wrappers import Incremental
In [3]: from sklearn.linear_model import SGDClassifier
In [4]: X, y = make_classification(chunks=25)
In [5]: X
Out[5]: dask.array<normal, shape=(100, 20), dtype=float64, chunksize=(25, 20)>
In [6]: estimator = SGDClassifier(random_state=10, max_iter=100)
In [7]: clf = Incremental(estimator)
In [8]: clf.fit(X, y, classes=[0, 1])
Out[8]:
Incremental(estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=10, shuffle=True,
tol=None, verbose=0, warm_start=False),
random_state=None, scoring=None, shuffle_blocks=True)
```

In this example, we make a (small) random Dask Array. It has 100 samples, broken into 4 blocks of 25 samples each. The chunking is only along the first axis (the samples). There is no chunking along the features.
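To make the chunk layout concrete, here is a small plain-Python sketch of splitting 100 rows of 20 features into blocks of 25 rows. The helper `row_blocks` is a hypothetical name used for illustration; Dask does its own chunking, and this only mirrors the resulting shapes.

```python
def row_blocks(rows, chunk_size):
    """Split a list of rows into consecutive blocks of at most ``chunk_size`` rows."""
    return [rows[start:start + chunk_size] for start in range(0, len(rows), chunk_size)]


# 100 samples with 20 features each, chunked along the sample axis only.
rows = [[0.0] * 20 for _ in range(100)]
blocks = row_blocks(rows, 25)

# Each block keeps all 20 features; only the number of rows per block changes.
block_shapes = [(len(block), len(block[0])) for block in blocks]
print(block_shapes)
```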

You instantiate the underlying estimator as usual. It really is just a
scikit-learn compatible estimator, and will be trained normally via its
`partial_fit`.

Notice that we call the regular `.fit` method, not `partial_fit`, for
training. Dask-ML takes care of passing each block to the underlying estimator
for you.

Just like `sklearn.linear_model.SGDClassifier.partial_fit()`, we need to
pass the `classes` argument to `fit`. In general, any argument that is
required for the underlying estimator’s `partial_fit` becomes required for
the wrapped `fit`.
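One way to see why such arguments become required is a toy wrapper whose `fit` forwards its extra keyword arguments to every `partial_fit` call. The classes below (`MajorityClassifier`, `ToyIncremental`) are hypothetical stand-ins written for this sketch; the real `Incremental` does more than this, so treat it as an illustration of the forwarding idea only.

```python
class MajorityClassifier:
    """Toy classifier (hypothetical) whose ``partial_fit`` requires ``classes``."""

    def __init__(self):
        self.counts_ = None

    def partial_fit(self, X, y, classes):
        # The full set of labels must be known up front, since a single
        # block may not contain every class.
        if self.counts_ is None:
            self.counts_ = {label: 0 for label in classes}
        for label in y:
            self.counts_[label] += 1
        return self


class ToyIncremental:
    """Toy wrapper (hypothetical): ``fit`` forwards keyword arguments to ``partial_fit``."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X_blocks, y_blocks, **fit_kwargs):
        for X_block, y_block in zip(X_blocks, y_blocks):
            self.estimator.partial_fit(X_block, y_block, **fit_kwargs)
        return self


X_blocks = [[[0.1], [0.9]], [[0.8], [0.7]]]
y_blocks = [[0, 1], [1, 1]]

# Passing ``classes`` to ``fit`` satisfies the underlying ``partial_fit``.
clf = ToyIncremental(MajorityClassifier()).fit(X_blocks, y_blocks, classes=[0, 1])

# Omitting it fails inside the first ``partial_fit`` call.
try:
    ToyIncremental(MajorityClassifier()).fit(X_blocks, y_blocks)
    missing_classes_error = None
except TypeError as exc:
    missing_classes_error = exc
```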

Note

Take care with the behavior of `Incremental.score()`. Most estimators
inherit the default scoring methods of R2 score for regressors and accuracy
score for classifiers. For these estimators, we automatically use Dask-ML’s
scoring methods, which are able to operate on Dask arrays.

If your underlying estimator uses a different scoring method, you’ll need
to ensure that the scoring method is able to operate on Dask arrays. You
can also explicitly pass a Dask-aware scorer via `scoring=`.

We can get the accuracy score on our dataset.

```
In [9]: clf.score(X, y)
Out[9]: 0.66
```

All of the attributes learned during training, like `coef_`, are available
on the `Incremental` instance.

```
In [10]: clf.coef_
Out[10]:
array([[-3.87697485e+01, 1.26584857e+01, 1.87150599e+01,
-2.70747837e+01, 3.73381123e+01, 2.17271018e+01,
8.48468581e+00, 1.30247905e+00, 1.70305938e+01,
1.30879222e+01, 1.54894296e+01, 9.74946977e+00,
1.10897874e+01, 1.96946533e+01, -2.33983632e+00,
-8.12584269e-03, 7.46519250e+01, 9.51162165e+00,
-5.33063198e+01, 2.70544337e+00]])
```

If necessary, the actual estimator trained is available as `Incremental.estimator_`:

```
In [11]: clf.estimator_
Out[11]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=10, shuffle=True,
tol=None, verbose=0, warm_start=False)
```

### Incremental Learning and Hyper-parameter Optimization

`Incremental` is a meta-estimator.
To search over the hyper-parameters of the underlying estimator, use the usual scikit-learn convention of
prefixing the parameter name with `<name>__`. For `Incremental`, `name` is always `estimator`.
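The prefix convention can be illustrated with a small plain-Python sketch that separates a wrapper's own parameters from those meant for a nested estimator. The function `route_params` is a hypothetical helper written for this example, a simplification of the naming convention rather than scikit-learn's actual code.

```python
def route_params(params):
    """Split parameters into the wrapper's own and those addressed to nested objects.

    A key such as ``estimator__alpha`` is read as "parameter ``alpha`` of the
    nested object named ``estimator``". Keys without ``__`` belong to the wrapper.
    """
    own, nested = {}, {}
    for key, value in params.items():
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = route_params({"estimator__alpha": 0.1, "shuffle_blocks": False})
print(own)     # parameters of the wrapper itself
print(nested)  # parameters grouped by the nested object's name
```

Here `shuffle_blocks` stays with the wrapper, while `alpha` is routed to the object named `estimator`, which is why the grid for `Incremental` uses keys of the form `estimator__<parameter>`.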

```
In [12]: from sklearn.model_selection import GridSearchCV
In [13]: param_grid = {'estimator__alpha': [0.10, 10.0]}
In [14]: gs = GridSearchCV(clf, param_grid)
In [15]: gs.fit(X, y, classes=[0, 1])
Out[15]:
GridSearchCV(cv=None, error_score='raise',
estimator=Incremental(estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', max_iter=100, n_iter=None,
n_jobs=1, penalty='l2', power_t=0.5, random_state=10, shuffle=True,
tol=None, verbose=0, warm_start=False),
random_state=None, scoring=None, shuffle_blocks=True),
fit_params=None, iid=True, n_jobs=1,
param_grid={'estimator__alpha': [0.1, 10.0]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)
```

This can be mixed with Joblib to use a cluster for training in parallel, even if you’re RAM-bound.