API Reference

This page lists all of the estimators and top-level functions in dask_ml. Unless otherwise noted, the estimators implemented in dask-ml are appropriate for parallel and distributed training.

dask_ml.model_selection: Model Selection

Utilities for hyperparameter optimization.

These estimators will operate in parallel. Their scalability depends on the underlying estimators being used.

Dask-ML has a few cross validation utilities.

model_selection.train_test_split(*arrays, …) Split arrays into random train and test matrices.

model_selection.train_test_split() is a simple helper that uses model_selection.ShuffleSplit internally.

model_selection.ShuffleSplit([n_splits, …]) Random permutation cross-validator.
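
As a quick illustration, here is a minimal sketch of splitting a Dask array with train_test_split; the data shapes and chunk sizes below are made up for the example.

    import dask.array as da
    from dask_ml.model_selection import train_test_split

    # Synthetic feature matrix and labels, chunked along the rows.
    X = da.random.random((1000, 4), chunks=100)
    y = da.random.randint(0, 2, size=(1000,), chunks=100)

    # Returns Dask arrays split into train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)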

Dask-ML provides drop-in replacements for grid and randomized search.

model_selection.GridSearchCV(estimator, …) Exhaustive search over specified parameter values for an estimator.
model_selection.RandomizedSearchCV(…[, …]) Randomized search on hyper parameters.
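
For example, a hedged sketch of GridSearchCV used as a drop-in for the scikit-learn version; the estimator and parameter grid here are illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from dask_ml.model_selection import GridSearchCV

    X, y = make_classification(n_samples=200, random_state=0)
    param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}

    # Candidate models are fit in parallel on the Dask scheduler.
    search = GridSearchCV(SVC(gamma="auto"), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)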

dask_ml.linear_model: Generalized Linear Models

The dask_ml.linear_model module implements linear models for classification and regression.

linear_model.LinearRegression([penalty, …]) Estimator for linear regression.
linear_model.LogisticRegression([penalty, …]) Estimator for logistic regression.
linear_model.PoissonRegression([penalty, …]) Estimator for Poisson regression.
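
A minimal sketch of fitting LogisticRegression on Dask arrays, assuming synthetic data and the default solver and penalty:

    import dask.array as da
    from dask_ml.linear_model import LogisticRegression

    # Synthetic data chunked along the rows.
    X = da.random.random((10000, 5), chunks=1000)
    y = (X[:, 0] > 0.5).astype(int)

    lr = LogisticRegression()
    lr.fit(X, y)
    predictions = lr.predict(X)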

Meta-estimators for scikit-learn

dask-ml provides meta-estimators that let you use regular scikit-learn compatible estimators with Dask arrays.

wrappers.ParallelPostFit([estimator, scoring]) Meta-estimator for parallel predict and transform.
wrappers.Incremental([estimator, scoring, …]) Meta-estimator for feeding Dask Arrays to an estimator blockwise.
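
As an illustration, a hedged sketch of wrappers.Incremental feeding Dask Array blocks to a scikit-learn estimator via partial_fit; the SGDClassifier and data are assumptions for the example.

    import dask.array as da
    from sklearn.linear_model import SGDClassifier
    from dask_ml.wrappers import Incremental

    X = da.random.random((10000, 10), chunks=1000)
    y = da.random.randint(0, 2, size=(10000,), chunks=1000)

    # Each block of X/y is passed to SGDClassifier.partial_fit in turn;
    # `classes` is forwarded so partial_fit knows all labels up front.
    clf = Incremental(SGDClassifier(), scoring="accuracy")
    clf.fit(X, y, classes=[0, 1])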

dask_ml.cluster: Clustering

Unsupervised Clustering Algorithms

cluster.KMeans([n_clusters, init, …]) Scalable KMeans for clustering
cluster.SpectralClustering([n_clusters, …]) Apply parallel Spectral Clustering
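
A hedged sketch of cluster.KMeans on a Dask array, using synthetic data and the default initialization:

    import dask.array as da
    from dask_ml.cluster import KMeans

    X = da.random.random((10000, 3), chunks=1000)

    km = KMeans(n_clusters=4)
    km.fit(X)
    labels = km.labels_        # cluster assignment for each row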

dask_ml.decomposition: Matrix Decomposition

decomposition.PCA([n_components, copy, …]) Principal component analysis (PCA)
decomposition.TruncatedSVD([n_components, …]) Dimensionality reduction using truncated SVD (aka LSA).
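
For example, a minimal sketch of decomposition.PCA on a tall-and-skinny Dask array; chunking along the rows only is assumed here.

    import dask.array as da
    from dask_ml.decomposition import PCA

    # Tall-and-skinny array, chunked along the rows only.
    X = da.random.random((10000, 20), chunks=(1000, 20))

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)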

dask_ml.preprocessing: Preprocessing Data

Utilities for preprocessing data.

preprocessing.StandardScaler([copy, …]) Standardize features by removing the mean and scaling to unit variance
preprocessing.RobustScaler([with_centering, …]) Scale features using statistics that are robust to outliers.
preprocessing.MinMaxScaler([feature_range, copy]) Transforms features by scaling each feature to a given range.
preprocessing.QuantileTransformer([…]) Transforms features using quantile information.
preprocessing.Categorizer([categories, columns]) Transform columns of a DataFrame to categorical dtype.
preprocessing.DummyEncoder([columns, drop_first]) Dummy (one-hot) encode categorical columns.
preprocessing.OrdinalEncoder([columns]) Ordinal (integer) encode categorical columns.
preprocessing.LabelEncoder Encode labels with value between 0 and n_classes-1.
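
As a small illustration, a hedged sketch chaining Categorizer, DummyEncoder, and StandardScaler on a Dask DataFrame; the column names and values are made up.

    import pandas as pd
    import dask.dataframe as dd
    from dask_ml.preprocessing import Categorizer, DummyEncoder, StandardScaler

    df = dd.from_pandas(
        pd.DataFrame({"color": ["red", "blue", "red", "green"],
                      "amount": [1.0, 2.0, 3.0, 4.0]}),
        npartitions=2,
    )

    # Categorizer converts the column to categorical dtype,
    # which DummyEncoder then one-hot encodes.
    df = Categorizer(columns=["color"]).fit_transform(df)
    df = DummyEncoder(columns=["color"]).fit_transform(df)

    # StandardScaler standardizes the numeric column.
    scaled = StandardScaler().fit_transform(df[["amount"]])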

dask_ml.metrics: Metrics

Score functions, performance metrics, and pairwise distance computations.

Regression Metrics

metrics.mean_absolute_error(y_true, y_pred) Mean absolute error regression loss
metrics.mean_squared_error(y_true, y_pred[, …]) Mean squared error regression loss
metrics.r2_score(y_true, y_pred[, …]) R^2 (coefficient of determination) regression score function.
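
A minimal sketch of the regression metrics on small Dask arrays; the values are illustrative.

    import numpy as np
    import dask.array as da
    from dask_ml.metrics import mean_squared_error, r2_score

    y_true = da.from_array(np.array([3.0, -0.5, 2.0, 7.0]), chunks=2)
    y_pred = da.from_array(np.array([2.5, 0.0, 2.0, 8.0]), chunks=2)

    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)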

Classification Metrics

metrics.accuracy_score(y_true, y_pred[, …]) Accuracy classification score.
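
And similarly for accuracy_score on Dask arrays of labels:

    import numpy as np
    import dask.array as da
    from dask_ml.metrics import accuracy_score

    y_true = da.from_array(np.array([0, 1, 1, 0, 1]), chunks=2)
    y_pred = da.from_array(np.array([0, 1, 0, 0, 1]), chunks=2)

    acc = accuracy_score(y_true, y_pred)   # 0.8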

dask_ml.tensorflow: TensorFlow

start_tensorflow

dask_ml.xgboost: XGBoost

Train an XGBoost model on dask arrays or dataframes.

This may be used for training an XGBoost model on a cluster. XGBoost will be set up in distributed mode alongside your existing dask.distributed cluster.

XGBClassifier([max_depth, learning_rate, …])
XGBRegressor([max_depth, learning_rate, …])
train(client, params, data, labels[, …]) Train an XGBoost model on a Dask Cluster
predict(client, model, data) Distributed prediction with XGBoost
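
A hedged sketch of end-to-end training and prediction with the functional API, assuming an existing dask.distributed cluster and that the optional XGBoost dependencies are installed; the parameters and data below are illustrative.

    import dask.array as da
    from dask.distributed import Client
    from dask_ml import xgboost as dxgb

    client = Client()   # connect to (or start) a dask.distributed cluster

    X = da.random.random((10000, 10), chunks=1000)
    y = da.random.randint(0, 2, size=(10000,), chunks=1000)

    params = {"objective": "binary:logistic", "max_depth": 3}
    model = dxgb.train(client, params, X, y)      # trained xgboost.Booster
    predictions = dxgb.predict(client, model, X)  # Dask array of predictions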