API Reference

This page lists all of the estimators and top-level functions in dask_ml. Unless otherwise noted, the estimators implemented in dask-ml are appropriate for parallel and distributed training.

dask_ml.model_selection: Model Selection

Utilities for hyperparameter optimization.

These estimators will operate in parallel. Their scalability depends on the underlying estimators being used.

Dask-ML has a few cross-validation utilities.

model_selection.train_test_split(*arrays, …) Split arrays into random train and test matrices.

model_selection.train_test_split() is a simple helper that uses model_selection.ShuffleSplit internally.
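For example, splitting a Dask Array into train and test sets (a minimal sketch; the same call also accepts NumPy arrays and Dask DataFrames):

import dask.array as da
from dask_ml.model_selection import train_test_split

X = da.random.random((1000, 4), chunks=(250, 4))
y = da.random.randint(0, 2, size=(1000,), chunks=(250,))

# Returns dask arrays without materializing the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)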

model_selection.ShuffleSplit([n_splits, …]) Random permutation cross-validator.
model_selection.KFold([n_splits, shuffle, …]) K-Folds cross-validator

Dask-ML provides drop-in replacements for grid and randomized search.

model_selection.GridSearchCV(estimator, …) Exhaustive search over specified parameter values for an estimator.
model_selection.RandomizedSearchCV(…[, …]) Randomized search on hyperparameters.
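A minimal sketch of the drop-in usage, assuming a scikit-learn SGDClassifier as the underlying estimator:

from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import GridSearchCV

# The interface mirrors scikit-learn; the candidate fits are scheduled with Dask.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2], "average": [True, False]}
search = GridSearchCV(SGDClassifier(tol=1e-3), param_grid, cv=3)
# search.fit(X, y)   # X, y may be NumPy arrays or Dask collections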

dask_ml.linear_model: Generalized Linear Models

The dask_ml.linear_model module implements linear models for classification and regression.

linear_model.LinearRegression([penalty, …]) Estimator for linear regression.
linear_model.LogisticRegression([penalty, …]) Estimator for logistic regression.
linear_model.PoissonRegression([penalty, …]) Estimator for Poisson regression.
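A hedged sketch of fitting one of these estimators on Dask Arrays; the keyword arguments follow the scikit-learn convention (e.g. penalty, C):

import dask.array as da
from dask_ml.linear_model import LogisticRegression

X = da.random.random((1000, 5), chunks=(250, 5))
y = (da.random.random(1000, chunks=250) > 0.5).astype(int)

lr = LogisticRegression()
lr.fit(X, y)            # fit on the dask array without loading it into memory at once
preds = lr.predict(X)   # a lazy dask array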

dask_ml.wrappers: Meta-Estimators

Dask-ML provides meta-estimators for using regular estimators that follow the scikit-learn API. These meta-estimators make the underlying estimator work well with Dask Arrays or DataFrames.

wrappers.ParallelPostFit([estimator, scoring]) Meta-estimator for parallel predict and transform.
wrappers.Incremental([estimator, scoring, …]) Meta-estimator for feeding Dask Arrays to an estimator blockwise.
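A minimal sketch of both wrappers around a scikit-learn SGDClassifier (the estimator choice here is only illustrative):

from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental, ParallelPostFit

# Incremental feeds each block of a dask array to the estimator's partial_fit.
inc = Incremental(SGDClassifier(tol=1e-3), scoring="accuracy")
# inc.fit(X, y, classes=[0, 1])

# ParallelPostFit fits on in-memory data, then predicts/transforms blockwise.
clf = ParallelPostFit(estimator=SGDClassifier(tol=1e-3))
# clf.fit(X_small, y_small)
# clf.predict(X_large)   # lazy, blockwise prediction on a dask collection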

dask_ml.cluster: Clustering

Unsupervised Clustering Algorithms

cluster.KMeans([n_clusters, init, …]) Scalable KMeans for clustering
cluster.SpectralClustering([n_clusters, …]) Apply parallel Spectral Clustering
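A short sketch with KMeans on a Dask Array; k-means|| is the scalable initialization used by default:

import dask.array as da
from dask_ml.cluster import KMeans

X = da.random.random((10000, 3), chunks=(2500, 3))
km = KMeans(n_clusters=4, init='k-means||')
km.fit(X)
labels = km.labels_   # cluster assignment for each sample, as a dask array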

dask_ml.decomposition: Matrix Decomposition

decomposition.PCA([n_components, copy, …]) Principal component analysis (PCA)
decomposition.TruncatedSVD([n_components, …]) Dimensionality reduction using truncated SVD (aka LSA).
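A brief sketch of PCA on a tall-and-skinny Dask Array; this assumes the array is chunked along rows only:

import dask.array as da
from dask_ml.decomposition import PCA

X = da.random.random((10000, 20), chunks=(2500, 20))
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)   # lazy dask array with 2 columns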

dask_ml.preprocessing: Preprocessing Data

Utilities for preprocessing data.

class dask_ml.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

Standardize features by removing the mean and scaling to unit variance

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.

Read more in the User Guide.

copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean : boolean, True by default
If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std : boolean, True by default
If True, scale the data to unit variance (or equivalently, unit standard deviation).
scale_ : ndarray, shape (n_features,)

Per feature relative scaling of the data.

New in version 0.17: scale_

mean_ : array of floats with shape [n_features]
The mean value for each feature in the training set.
var_ : array of floats with shape [n_features]
The variance for each feature in the training set. Used to compute scale_
n_samples_seen_ : int
The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[ 0.5  0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[ 3.  3.]]

scale: Equivalent function without the estimator API.

sklearn.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

fit(X, y=None)

Compute the mean and std to be used for later scaling.

X : {array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the mean and standard deviation used for later scaling along the features axis.

y : Passthrough for Pipeline compatibility.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(X, copy=None)

Scale back the data to the original representation

X : array-like, shape [n_samples, n_features]
The data used to scale along the features axis.
copy : bool, optional (default: None)
Copy the input X or not.
X_tr : array-like, shape [n_samples, n_features]
Transformed array.
partial_fit(X, y=None)

Online computation of mean and std on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible due to very large number of n_samples or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b of Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.
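The sketch below illustrates that merge rule with plain NumPy; the helper name is hypothetical and not part of the dask-ml API:

import numpy as np

def combine_batch_stats(n_a, mean_a, m2_a, batch):
    # Merge running (count, mean, sum of squared deviations) with a new batch,
    # following Chan, Golub & LeVeque (1983), Eq. 1.5a,b.
    n_b = batch.shape[0]
    mean_b = batch.mean(axis=0)
    m2_b = ((batch - mean_b) ** 2).sum(axis=0)

    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2   # variance of everything seen so far is m2 / n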

X : {array-like, sparse matrix}, shape [n_samples, n_features]
The data used to compute the mean and standard deviation used for later scaling along the features axis.

y : Passthrough for Pipeline compatibility.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(X, y=None, copy=None)

Perform standardization by centering and scaling

X : array-like, shape [n_samples, n_features]
The data used to scale along the features axis.
y : (ignored)

Deprecated since version 0.19: This parameter will be removed in 0.21.

copy : bool, optional (default: None)
Copy the input X or not.
class dask_ml.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

Transforms features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, i.e. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Read more in the User Guide.

feature_range : tuple (min, max), default=(0, 1)
Desired range of transformed data.
copy : boolean, optional, default True
Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).
min_ : ndarray, shape (n_features,)
Per feature adjustment for minimum.
scale_ : ndarray, shape (n_features,)

Per feature relative scaling of the data.

New in version 0.17: scale_ attribute.

data_min_ : ndarray, shape (n_features,)

Per feature minimum seen in the data

New in version 0.17: data_min_

data_max_ : ndarray, shape (n_features,)

Per feature maximum seen in the data

New in version 0.17: data_max_

data_range_ : ndarray, shape (n_features,)

Per feature range (data_max_ - data_min_) seen in the data

New in version 0.17: data_range_

>>> from sklearn.preprocessing import MinMaxScaler
>>>
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> print(scaler.data_max_)
[  1.  18.]
>>> print(scaler.transform(data))
[[ 0.    0.  ]
 [ 0.25  0.25]
 [ 0.5   0.5 ]
 [ 1.    1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[ 1.5  0. ]]

minmax_scale: Equivalent function without the estimator API.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

fit(X, y=None)

Compute the minimum and maximum to be used for later scaling.

X : array-like, shape [n_samples, n_features]
The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(X, y=None, copy=None)

Undo the scaling of X according to feature_range.

X : array-like, shape [n_samples, n_features]
Input data that will be transformed. It cannot be sparse.
partial_fit(X, y=None)

Online computation of min and max on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible due to very large number of n_samples or because X is read from a continuous stream.

X : array-like, shape [n_samples, n_features]
The data used to compute the mean and standard deviation used for later scaling along the features axis.

y : Passthrough for Pipeline compatibility.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(X, y=None, copy=None)

Scaling features of X according to feature_range.

X : array-like, shape [n_samples, n_features]
Input data that will be transformed.
class dask_ml.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Centering and scaling happen independently on each feature (or each sample, depending on the axis argument) by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

New in version 0.17.

Read more in the User Guide.

with_centering : boolean, True by default
If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_scaling : boolean, True by default
If True, scale the data to interquartile range.
quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0

Quantile range used to calculate scale_. Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR.

New in version 0.18.

copy : boolean, optional, default is True
If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
center_ : array of floats
The median value for each feature in the training set.
scale_ : array of floats

The (scaled) interquartile range for each feature in the training set.

New in version 0.17: scale_ attribute.

robust_scale: Equivalent function without the estimator API.

sklearn.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

https://en.wikipedia.org/wiki/Median_(statistics) https://en.wikipedia.org/wiki/Interquartile_range
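A small sketch showing the fitted statistics in the presence of an outlier; the values here are illustrative only:

import numpy as np
import dask.array as da
from dask_ml.preprocessing import RobustScaler

X = da.from_array(np.array([[1.0], [2.0], [3.0], [100.0]]), chunks=2)
scaler = RobustScaler(quantile_range=(25.0, 75.0))
Xt = scaler.fit_transform(X)
# scaler.center_ holds the per-feature median,
# scaler.scale_ the per-feature interquartile range; the single outlier
# barely changes either statistic.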

fit(X, y=None)

Compute the median and quantiles to be used for scaling.

X : array-like, shape [n_samples, n_features]
The data used to compute the median and quantiles used for later scaling along the features axis.
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(X)

Scale back the data to the original representation

X : array-like
The data used to scale along the specified axis.

This implementation was copied and modified from Scikit-Learn.

See License information here: https://github.com/scikit-learn/scikit-learn/blob/master/README.rst

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(X)

Center and scale the data.

Can be called on sparse input, provided that RobustScaler has been fitted to dense input and with_centering=False.

X : {array-like, sparse matrix}
The data used to scale along the specified axis.

This implementation was copied and modified from Scikit-Learn.

See License information here: https://github.com/scikit-learn/scikit-learn/blob/master/README.rst

class dask_ml.preprocessing.QuantileTransformer(n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)

Transforms features using quantile information.

This implementation differs from the scikit-learn implementation by using approximate quantiles. The scikit-learn docstring follows.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. The cumulative density function of a feature is used to project the original values. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
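A minimal sketch with a Dask Array; the approximate quantile computation noted above is what lets this scale past memory (the n_quantiles value here is an arbitrary choice):

import dask.array as da
from dask_ml.preprocessing import QuantileTransformer

X = da.random.normal(0.5, 0.25, size=(10000, 3), chunks=(2500, 3))
qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
Xt = qt.fit_transform(X)   # values mapped approximately onto [0, 1]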

Read more in the User Guide.

n_quantiles : int, optional (default=1000)
Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative density function.
output_distribution : str, optional (default=’uniform’)
Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.
ignore_implicit_zeros : bool, optional (default=False)
Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
subsample : int, optional (default=1e5)
Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Note that this is used by subsampling and smoothing noise.
copy : boolean, optional, (default=True)
Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).
quantiles_ : ndarray, shape (n_quantiles, n_features)
The values corresponding to the quantiles of reference.
references_ : ndarray, shape(n_quantiles, )
Quantiles of references.
>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X) 
array([...])

quantile_transform : Equivalent function without the estimator API.
StandardScaler : perform standardization that is faster, but less robust to outliers.
RobustScaler : perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

fit(X, y=None)

Compute the quantiles used for transforming.

X : ndarray or sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
self : object
Returns self
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(X)

Back-projection to the original space.

X : ndarray or sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
Xt : ndarray or sparse matrix, shape (n_samples, n_features)
The projected data.
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(X)

Feature-wise transformation of the data.

X : ndarray or sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
Xt : ndarray or sparse matrix, shape (n_samples, n_features)
The projected data.
class dask_ml.preprocessing.Categorizer(categories=None, columns=None)

Transform columns of a DataFrame to categorical dtype.

This is a useful pre-processing step for dummy, one-hot, or categorical encoding.

categories : mapping, optional

A dictionary mapping column name to instances of pandas.api.types.CategoricalDtype. Alternatively, a mapping of column name to (categories, ordered) tuples.

columns : sequence, optional

A sequence of column names to limit the categorization to. This argument is ignored when categories is specified.

This transformer only applies to dask.DataFrame and pandas.DataFrame. By default, all object-type columns are converted to categoricals. The set of categories will be the values present in the column and the categoricals will be unordered. Pass dtypes to control this behavior.

All other columns are included in the transformed output untouched.

For dask.DataFrame, any unknown categoricals will become known.

columns_ : pandas.Index
The columns that were categorized. Useful when categories is None, and we detect the categorical and object columns
categories_ : dict
A dictionary mapping column names to dtypes. For pandas>=0.21.0, the values are instances of pandas.api.types.CategoricalDtype. For older pandas, the values are tuples of (categories, ordered).
>>> import pandas as pd
>>> from dask_ml.preprocessing import Categorizer
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']})
>>> ce = Categorizer()
>>> ce.fit_transform(df).dtypes
A       int64
B    category
dtype: object
>>> ce.categories_
{'B': CategoricalDtype(categories=['a', 'b'], ordered=False)}

Using CategoricalDtypes for specifying the categories:

>>> from pandas.api.types import CategoricalDtype
>>> ce = Categorizer(categories={"B": CategoricalDtype(['a', 'b', 'c'])})
>>> ce.fit_transform(df).B.dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
fit(X, y=None)

Find the categorical columns.

X : pandas.DataFrame or dask.DataFrame
y : ignored

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(X, y=None)

Transform the columns in X according to self.categories_.

X : pandas.DataFrame or dask.DataFrame
y : ignored

X_trn : pandas.DataFrame or dask.DataFrame
Same type as the input. The columns in self.categories_ will be converted to categorical dtype.
class dask_ml.preprocessing.DummyEncoder(columns=None, drop_first=False)

Dummy (one-hot) encode categorical columns.

columns : sequence, optional
The columns to dummy encode. Must be categorical dtype. Dummy encodes all categorical dtype columns by default.
drop_first : bool, default False
Whether to drop the first category in each column.
columns_ : Index
The columns in the training data before dummy encoding
transformed_columns_ : Index
The columns in the training data after dummy encoding
categorical_columns_ : Index
The categorical columns in the training data
noncategorical_columns_ : Index
The rest of the columns in the training data
categorical_blocks_ : dict
Mapping from column names to slice objects. The slices represent the positions in the transformed array that the categorical column ends up at
dtypes_ : dict

Dictionary mapping column name to either

  • instances of CategoricalDtype (pandas >= 0.21.0)
  • tuples of (categories, ordered)

This transformer only applies to dask and pandas DataFrames. For dask DataFrames, all of your categoricals should be known.

The inverse transformation can be used on a dataframe or array.

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask_ml.preprocessing import DummyEncoder
>>> data = pd.DataFrame({"A": [1, 2, 3, 4],
...                      "B": pd.Categorical(['a', 'a', 'a', 'b'])})
>>> de = DummyEncoder()
>>> trn = de.fit_transform(data)
>>> trn
   A  B_a  B_b
0  1    1    0
1  2    1    0
2  3    1    0
3  4    0    1
>>> de.columns_
Index(['A', 'B'], dtype='object')
>>> de.non_categorical_columns_
Index(['A'], dtype='object')
>>> de.categorical_columns_
Index(['B'], dtype='object')
>>> de.dtypes_
{'B': CategoricalDtype(categories=['a', 'b'], ordered=False)}
>>> de.categorical_blocks_
{'B': slice(1, 3, None)}
>>> de.fit_transform(dd.from_pandas(data, 2))
Dask DataFrame Structure:
                   A    B_a    B_b
npartitions=2
0              int64  uint8  uint8
2                ...    ...    ...
3                ...    ...    ...
Dask Name: get_dummies, 4 tasks
fit(X, y=None)

Determine the categorical columns to be dummy encoded.

X : pandas.DataFrame or dask.dataframe.DataFrame
y : ignored

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(X)

Inverse dummy-encode the columns in X

X : array or dataframe
Either the NumPy, dask, or pandas version
data : DataFrame
Dask array or dataframe will return a Dask DataFrame. Numpy array or pandas dataframe will return a pandas DataFrame
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(X, y=None)

Dummy encode the categorical columns in X

X : pd.DataFrame or dd.DataFrame
y : ignored

transformed : pd.DataFrame or dd.DataFrame
Same type as the input
class dask_ml.preprocessing.OrdinalEncoder(columns=None)

Ordinal (integer) encode categorical columns.

columns : sequence, optional
The columns to encode. Must be categorical dtype. Encodes all categorical dtype columns by default.
columns_ : Index
The columns in the training data before/after encoding
categorical_columns_ : Index
The categorical columns in the training data
noncategorical_columns_ : Index
The rest of the columns in the training data
dtypes_ : dict

Dictionary mapping column name to either

  • instances of CategoricalDtype (pandas >= 0.21.0)
  • tuples of (categories, ordered)

This transformer only applies to dask and pandas DataFrames. For dask DataFrames, all of your categoricals should be known.

The inverse transformation can be used on a dataframe or array.

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask_ml.preprocessing import OrdinalEncoder
>>> data = pd.DataFrame({"A": [1, 2, 3, 4],
...                      "B": pd.Categorical(['a', 'a', 'a', 'b'])})
>>> enc = OrdinalEncoder()
>>> trn = enc.fit_transform(data)
>>> trn
   A  B
0  1  0
1  2  0
2  3  0
3  4  1
>>> enc.columns_
Index(['A', 'B'], dtype='object')
>>> enc.non_categorical_columns_
Index(['A'], dtype='object')
>>> enc.categorical_columns_
Index(['B'], dtype='object')
>>> enc.dtypes_
{'B': CategoricalDtype(categories=['a', 'b'], ordered=False)}
>>> enc.fit_transform(dd.from_pandas(data, 2))
Dask DataFrame Structure:
                   A     B
npartitions=2
0              int64  int8
2                ...   ...
3                ...   ...
Dask Name: assign, 8 tasks
fit(X, y=None)

Determine the categorical columns to be encoded.

X : pandas.DataFrame or dask.dataframe.DataFrame
y : ignored

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(X)

Inverse ordinal-encode the columns in X

X : array or dataframe
Either the NumPy, dask, or pandas version
data : DataFrame
Dask array or dataframe will return a Dask DataFrame. Numpy array or pandas dataframe will return a pandas DataFrame
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(X, y=None)

Ordinal encode the categorical columns in X

X : pd.DataFrame or dd.DataFrame
y : ignored

transformed : pd.DataFrame or dd.DataFrame
Same type as the input
class dask_ml.preprocessing.LabelEncoder(use_categorical=True)

Encode labels with value between 0 and n_classes-1.

Note

This differs from the scikit-learn version for Categorical data. When passed a categorical y, this implementation will use the categorical information for the label encoding and transformation. You will receive different answers when

  1. Your categories are not monotonically increasing
  2. You have unobserved categories

Specify use_categorical=False to recover the scikit-learn behavior.

use_categorical : bool, default True
Whether to use the categorical dtype information when y is a dask or pandas Series with a categorical dtype.
classes_ : array of shape (n_class,)
Holds the label for each class.
dtype_ : Optional CategoricalDtype
For Categorical y, the dtype is stored here.

LabelEncoder can be used to normalize labels.

>>> from dask_ml import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) 
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) 
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

When using Dask, we strongly recommend using a Categorical dask Series if possible. This avoids a (potentially expensive) scan of the values and enables a faster transform algorithm.

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> data = dd.from_pandas(pd.Series(['a', 'a', 'b'], dtype='category'),
...                       npartitions=2)
>>> le.fit_transform(data)
dask.array<values, shape=(nan,), dtype=int8, chunksize=(nan,)>
>>> le.fit_transform(data).compute()
array([0, 0, 1], dtype=int8)
fit(y)

Fit label encoder

y : array-like of shape (n_samples,)
Target values.

self : returns an instance of self.

fit_transform(y)

Fit label encoder and return encoded labels

y : array-like of shape [n_samples]
Target values.

y : array-like of shape [n_samples]

get_params(deep=True)

Get parameters for this estimator.

deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(y)

Transform labels back to original encoding.

y : numpy array of shape [n_samples]
Target values.

y : numpy array of shape [n_samples]

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

self

transform(y)

Transform labels to normalized encoding.

y : array-like of shape [n_samples]
Target values.

y : array-like of shape [n_samples]

preprocessing.StandardScaler([copy, …]) Standardize features by removing the mean and scaling to unit variance
preprocessing.RobustScaler([with_centering, …]) Scale features using statistics that are robust to outliers.
preprocessing.MinMaxScaler([feature_range, copy]) Transforms features by scaling each feature to a given range.
preprocessing.QuantileTransformer([…]) Transforms features using quantile information.
preprocessing.Categorizer([categories, columns]) Transform columns of a DataFrame to categorical dtype.
preprocessing.DummyEncoder([columns, drop_first]) Dummy (one-hot) encode categorical columns.
preprocessing.OrdinalEncoder([columns]) Ordinal (integer) encode categorical columns.
preprocessing.LabelEncoder([use_categorical]) Encode labels with value between 0 and n_classes-1.

dask_ml.compose: Composite Estimators

Meta-estimators for building composite models with transformers.

compose.ColumnTransformer
compose.make_column_transformer
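A hedged sketch of composing per-column preprocessing on a DataFrame; the column names used here are hypothetical:

from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import MinMaxScaler, StandardScaler

ct = ColumnTransformer(
    transformers=[
        ("standardize", StandardScaler(), ["age", "income"]),
        ("rescale", MinMaxScaler(), ["height"]),
    ]
)
# ct.fit_transform(df)   # df: a pandas or dask DataFrame with those columns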

dask_ml.impute: Imputing Missing Data

impute.SimpleImputer
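A short sketch of mean imputation on a Dask DataFrame (the column name is illustrative):

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

df = dd.from_pandas(pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0]}), npartitions=2)
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(df)   # NaNs in 'a' replaced by the column mean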

dask_ml.metrics: Metrics

Score functions, performance metrics, and pairwise distance computations.

Regression Metrics

metrics.mean_absolute_error(y_true, y_pred) Mean absolute error regression loss
metrics.mean_squared_error(y_true, y_pred[, …]) Mean squared error regression loss
metrics.r2_score(y_true, y_pred[, …]) R^2 (coefficient of determination) regression score function.

Classification Metrics

metrics.accuracy_score(y_true, y_pred[, …]) Accuracy classification score.
metrics.log_loss(y_true, y_pred[, eps, …]) Log loss, aka logistic loss or cross-entropy loss.
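These metrics accept Dask Arrays as inputs; a small sketch with hand-chosen values:

import numpy as np
import dask.array as da
from dask_ml.metrics import accuracy_score, mean_squared_error

y_true = da.from_array(np.array([0.0, 1.0, 1.0, 0.0]), chunks=2)
y_pred = da.from_array(np.array([0.0, 1.0, 0.0, 0.0]), chunks=2)

acc = accuracy_score(y_true, y_pred)      # 0.75
mse = mean_squared_error(y_true, y_pred)  # 0.25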

dask_ml.tensorflow: TensorFlow

start_tensorflow

dask_ml.xgboost: XGBoost

XGBClassifier
XGBRegressor
train
predict
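A hedged sketch of the scikit-learn style wrapper; training is coordinated through a dask.distributed Client, which is assumed to be available:

from dask.distributed import Client
from dask_ml.xgboost import XGBRegressor

client = Client()        # connect to, or start, a distributed scheduler
est = XGBRegressor()
# est.fit(X, y)          # X, y: dask arrays or dask dataframes
# preds = est.predict(X)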