# Hyper Parameter Search

Tools for performing hyperparameter optimization of Scikit-Learn API-compatible models using Dask. Dask-ML implements GridSearchCV and RandomizedSearchCV.

| Estimator | Description |
| --- | --- |
| `sklearn.model_selection.GridSearchCV` (…[, …]) | Exhaustive search over specified parameter values for an estimator. |
| `dask_ml.model_selection.GridSearchCV` (…[, …]) | Exhaustive search over specified parameter values for an estimator. |
| `sklearn.model_selection.RandomizedSearchCV` (…) | Randomized search on hyper parameters. |
| `dask_ml.model_selection.RandomizedSearchCV` (…) | Randomized search on hyper parameters. |

The variants in Dask-ML implement many (but not all) of the same parameters, and should be a drop-in replacement for the subset that they do implement. In that case, why use Dask-ML’s versions?

- Flexible Backends: Hyperparameter optimization can be done in parallel using threads, processes, or distributed across a cluster.
- Works well with Dask collections: Dask arrays, dataframes, and delayed can be passed to `fit`.
- Avoid repeated work: Candidate estimators with identical parameters and inputs will only be fit once. For composite estimators such as `Pipeline`, this can be significantly more efficient as it can avoid expensive repeated computations.

Both scikit-learn’s and Dask-ML’s model selection meta-estimators can be used with Dask’s joblib backend.

## Flexible Backends

Dask-ML’s search estimators can use any of the Dask schedulers. By default the threaded scheduler is used, but this can easily be swapped out for the multiprocessing or distributed scheduler:

```
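# Distribute grid-search across a cluster
from dask.distributed import Client
scheduler_address = '127.0.0.1:8786'
# Creating a Client registers the distributed scheduler as the default
client = Client(scheduler_address)
# `search` is a Dask-ML search estimator; `digits` is the dataset
# returned by sklearn.datasets.load_digits()
search.fit(digits.data, digits.target)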
```

## Works Well With Dask Collections

Dask collections such as `dask.array`, `dask.dataframe`, and `dask.delayed` can be passed to `fit`. This means you can use Dask to do your data loading and preprocessing as well, allowing for a clean workflow. This also allows you to work with remote data on a cluster without ever having to pull it locally to your computer:

```
import dask.dataframe as dd
# Load data from s3
df = dd.read_csv('s3://bucket-name/my-data-*.csv')
# Do some preprocessing steps
df['x2'] = df.x - df.x.mean()
# ...
# Pass to fit without ever leaving the cluster
search.fit(df[['x', 'x2']], df['y'])
```

## Avoid Repeated Work

When searching over composite estimators like `sklearn.pipeline.Pipeline` or `sklearn.pipeline.FeatureUnion`, Dask-ML will avoid fitting the same estimator + parameter + data combination more than once. For pipelines with expensive early steps this can be faster, as repeated work is avoided.

For example, given the following 3-stage pipeline and grid (modified from this scikit-learn example):

```
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
parameters = {'vect__ngram_range': [(1, 1)],
              'tfidf__norm': ['l1', 'l2'],
              'clf__alpha': [1e-3, 1e-4, 1e-5]}
```

the Scikit-Learn grid-search implementation looks something like (simplified):

```
scores = []
for ngram_range in parameters['vect__ngram_range']:
    for norm in parameters['tfidf__norm']:
        for alpha in parameters['clf__alpha']:
            vect = CountVectorizer(ngram_range=ngram_range)
            X2 = vect.fit_transform(X, y)
            tfidf = TfidfTransformer(norm=norm)
            X3 = tfidf.fit_transform(X2, y)
            clf = SGDClassifier(alpha=alpha)
            clf.fit(X3, y)
            scores.append(clf.score(X3, y))
best = choose_best_parameters(scores, parameters)
```

As a directed acyclic graph, this might look like:

In contrast, the dask version looks more like:

```
scores = []
for ngram_range in parameters['vect__ngram_range']:
    vect = CountVectorizer(ngram_range=ngram_range)
    X2 = vect.fit_transform(X, y)
    for norm in parameters['tfidf__norm']:
        tfidf = TfidfTransformer(norm=norm)
        X3 = tfidf.fit_transform(X2, y)
        for alpha in parameters['clf__alpha']:
            clf = SGDClassifier(alpha=alpha)
            clf.fit(X3, y)
            scores.append(clf.score(X3, y))
best = choose_best_parameters(scores, parameters)
```

With a corresponding directed acyclic graph:

Looking closely, you can see that the Scikit-Learn version ends up fitting earlier steps in the pipeline multiple times with the same parameters and data. Due to the increased flexibility of Dask over Joblib, we’re able to merge these tasks in the graph and only perform the fit step once for any parameter/data/estimator combination. For pipelines that have relatively expensive early steps, this can be a big win when performing a grid search.
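To make the saving concrete, the following standard-library sketch counts how many times each pipeline step would be fit under the two strategies for the grid above. It is a simplified model and not Dask-ML's implementation: it ignores cross-validation folds, and the helper names `count_fits_naive` and `count_fits_merged` are hypothetical.

```python
from itertools import product

# Same grid as in the example above.
grid = {
    "vect__ngram_range": [(1, 1)],
    "tfidf__norm": ["l1", "l2"],
    "clf__alpha": [1e-3, 1e-4, 1e-5],
}


def count_fits_naive(grid):
    """Every candidate refits all three pipeline steps."""
    fits = {"vect": 0, "tfidf": 0, "clf": 0}
    for _ in product(*grid.values()):
        for step in fits:
            fits[step] += 1
    return fits


def count_fits_merged(grid):
    """A step is fit once per distinct (step, upstream parameters) key."""
    seen = set()
    fits = {"vect": 0, "tfidf": 0, "clf": 0}
    for ngram_range, norm, alpha in product(*grid.values()):
        # A step's result depends on its own parameter and on every
        # parameter of the steps before it in the pipeline.
        keys = [
            ("vect", ngram_range),
            ("tfidf", ngram_range, norm),
            ("clf", ngram_range, norm, alpha),
        ]
        for key in keys:
            if key not in seen:
                seen.add(key)
                fits[key[0]] += 1
    return fits


naive = count_fits_naive(grid)
merged = count_fits_merged(grid)
print(naive)   # {'vect': 6, 'tfidf': 6, 'clf': 6}
print(merged)  # {'vect': 1, 'tfidf': 2, 'clf': 6}
```

In this model the final classifier is fit the same number of times either way. The saving comes entirely from the earlier steps, which is why pipelines with expensive early stages benefit most.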

### Pipelines

Dask-ML uses scikit-learn’s `sklearn.pipeline.Pipeline` to express pipelines of estimators that are chained together. If the individual estimators work well with Dask’s collections, the pipeline will as well.