# Cross Validation

See the scikit-learn cross validation documentation for a fuller discussion of cross validation. This document only describes the extensions made to support Dask arrays.

The simplest way to split one or more Dask arrays is with dask_ml.model_selection.train_test_split().

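import dask.array as da
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split

# Build a small regression dataset as Dask arrays with 50-row chunks
X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)
X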


The interface for splitting Dask arrays is the same as scikit-learn’s version.

X_train, X_test, y_train, y_test = train_test_split(X, y)

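While it’s possible to pass Dask arrays to sklearn.model_selection.train_test_split(), we recommend using the Dask version for performance reasons. There are two major differences that make the Dask version faster.

First, the Dask version shuffles blockwise. In a distributed setting, shuffling between blocks may require sending large amounts of data between machines, which can be slow. However, if there is a strong pattern in how your data is ordered, you will want to perform a full shuffle instead.

Second, the Dask version avoids allocating large intermediate NumPy arrays to store the indexes used for slicing. For very large datasets, creating and transmitting np.arange(n_samples) can be expensive.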
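
One way to picture how a split can avoid a global index array such as np.arange(n_samples) is to split every chunk on its own. The sketch below is a plain-Python illustration of that idea, not dask-ml's actual implementation. The helper name `blockwise_split` and its parameters are hypothetical, and the chunk sizes match the 125-row example with `chunks=50`.

```python
import random


def blockwise_split(chunk_sizes, test_fraction, seed=0):
    """Split row indices chunk by chunk (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    train, test = [], []
    offset = 0  # global row number of the first row in the current chunk
    for size in chunk_sizes:
        # Shuffle only the positions inside this chunk, so no row has to
        # move to another chunk and the only index list built is chunk-sized.
        local = list(range(size))
        rng.shuffle(local)
        n_test = round(size * test_fraction)
        test.append([offset + i for i in local[:n_test]])
        train.append([offset + i for i in local[n_test:]])
        offset += size
    return train, test


# Chunk sizes for 125 rows stored with chunks=50.
train_idx, test_idx = blockwise_split([50, 50, 25], test_fraction=0.2)
print([len(block) for block in train_idx], [len(block) for block in test_idx])
```

Each chunk contributes its own train and test rows, so the work can proceed chunk by chunk. The trade-off is that rows are never mixed across chunks, which is why data with a strong ordering pattern may still need a full shuffle.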