dask-ml provides some meta-estimators that parallelize or scale out
certain tasks that may not be parallelized within scikit-learn itself.
ParallelPostFit will parallelize the predict, predict_proba, and
transform methods, enabling them
to work on large (possibly larger-than-memory) datasets.
Parallel Prediction and Transformation
wrappers.ParallelPostFit is a meta-estimator for parallelizing
post-fit tasks like prediction and transformation. It can wrap any
scikit-learn estimator to provide parallel predict, predict_proba, and
transform methods.
ParallelPostFit does not parallelize the training step. The underlying
estimator's .fit method is called normally.
Since just the predict, predict_proba, and transform methods are parallelized,
wrappers.ParallelPostFit is most useful in situations where
your training dataset is relatively small (fits in a single machine’s memory),
and prediction or transformation must be done on a much larger dataset (perhaps
larger than a single machine’s memory).
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.datasets
import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit
In this example, we'll make a small 1,000-sample training dataset.
X, y = sklearn.datasets.make_classification(n_samples=1000, random_state=0)
Training is identical to just calling
estimator.fit(X, y). Aside from
copying over learned attributes, that's all that ParallelPostFit does.
clf = ParallelPostFit(estimator=GradientBoostingClassifier())
clf.fit(X, y)
This class is useful for predicting on or transforming large datasets.
We’ll make a larger dask array
X_big with 10,000 samples per block.
X_big, _ = dask_ml.datasets.make_classification(n_samples=100000,
                                                chunks=10000,
                                                random_state=0)
clf.predict(X_big)
This returned a
dask.array. Like any dask array, the result is computed lazily: calling
.compute() will cause the scheduler to execute tasks in parallel. If you've connected to a
dask.distributed.Client, the computation will be parallelized across your
cluster of machines.
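To make that concrete, here is a minimal sketch of triggering the computation,
reusing the clf and X_big objects from the example above. The scheduler
address in the comment is hypothetical; a bare Client() starts a local cluster.

from dask.distributed import Client

client = Client()  # or Client("tcp://scheduler-address:8786") for a remote cluster

predictions = clf.predict(X_big)  # lazy dask array; nothing is computed yet
results = predictions.compute()   # blocks are predicted in parallel

# predict_proba is parallelized the same way
probabilities = clf.predict_proba(X_big).compute()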
See parallelizing prediction for an example of how this scales for a support vector classifier.
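As a rough sketch of that pattern (the SVC settings here are illustrative,
not the linked example's exact code):

from sklearn.svm import SVC

# Fit on the small in-memory training set; prediction on the large dask
# array then runs block-wise in parallel.
svc = ParallelPostFit(estimator=SVC(gamma="scale"))
svc.fit(X, y)
svc.predict(X_big)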
Comparison to other Estimators in dask-ml
dask-ml re-implements some estimators from scikit-learn, for example
dask_ml.preprocessing.QuantileTransformer. This raises
the question: should I use the reimplemented dask-ml version, or should I wrap
the scikit-learn version in a meta-estimator? It varies from estimator to estimator,
and depends on your tolerance for approximate solutions and the size of your
training data. In general, if your training data is small, you should be fine
wrapping the scikit-learn version with a dask-ml meta-estimator.
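For example, with QuantileTransformer the two options look like this (a
sketch, reusing the X and X_big arrays from above):

import sklearn.preprocessing
import dask_ml.preprocessing
from dask_ml.wrappers import ParallelPostFit

# Option 1: wrap the scikit-learn estimator. fit runs in memory on the
# small dataset; transform is parallelized over the blocks of X_big.
qt_wrapped = ParallelPostFit(estimator=sklearn.preprocessing.QuantileTransformer())
qt_wrapped.fit(X)
qt_wrapped.transform(X_big)

# Option 2: the dask-ml re-implementation, which can also fit directly
# on a dask array.
qt_dask = dask_ml.preprocessing.QuantileTransformer()
qt_dask.fit(X_big)
qt_dask.transform(X_big)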