Thanks for helping to build dask-ml!
Cloning the Repository¶
Make a fork of the dask-ml repo and clone the fork:

git clone https://github.com/<your-github-username>/dask-ml
cd dask-ml
You may want to add https://github.com/dask/dask-ml as an upstream remote:
git remote add upstream https://github.com/dask/dask-ml
Creating an environment¶
We have conda environment YAML files with all the necessary dependencies in the ci directory. If you’re using conda, you can run

conda env create -f ci/environment-3.6.yml --name=dask-ml-dev

to create a conda environment and install all the dependencies. Note there is also a ci/environment-2.7.yml file if you need to use Python 2.7.
If you’re using pip, you can view the list of all the required and optional dependencies in setup.py (see the install_requires field for the required dependencies and extras_require for the optional dependencies). You’ll at least need the build dependencies: NumPy, setuptools, setuptools_scm, and Cython.
The library has some C-extensions, so installing is a bit more complicated than normal. If you have a compiler and everything is set up correctly, you should be able to install Cython and all the required dependencies.
From within the repository:
python setup.py build_ext --inplace
python -m pip install -e ".[dev]"
If you have any trouble with the build step, please open an issue in the dask-ml issue tracker.
Dask-ml uses py.test for testing. You can run the tests from the main dask-ml directory as follows:

pytest
Alternatively, you may choose to run only a subset of the full test suite. For example, to test only the preprocessing submodule, we would run the tests as follows:

pytest tests/preprocessing
In addition to running tests, dask-ml verifies code style uniformity with the flake8 tool:

pip install flake8
flake8
For the most part, we follow scikit-learn’s API design. If you’re implementing a new estimator, it will ideally pass scikit-learn’s estimator check.
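As a sketch of what passing scikit-learn’s estimator check looks like in practice, here is a minimal example using scikit-learn’s own StandardScaler as a stand-in (substitute your new estimator class; the choice of StandardScaler here is just for illustration):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator

# check_estimator runs scikit-learn's suite of API-conformance tests
# (fit/transform contracts, input validation, cloning, get_params/set_params).
# It raises an AssertionError at the first failing check; returning
# quietly means the estimator conforms.
check_estimator(StandardScaler())
```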
We have some additional decisions to make in the dask context. Ideally:

- All attributes learned during .fit should be concrete, i.e. they should not be dask collections.
- To the extent possible, transformers should support .partial_fit.
- If possible, transformers should accept a columns keyword to limit the transformation to just those columns, while passing through other columns untouched. inverse_transform should behave similarly (ignoring other columns), so that inverse_transform(transform(X)) recovers X.
- Methods returning arrays (like .predict) should return the same type as the input. So if a dask.array is passed in, a dask.array with the same chunks should be returned.
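To make the columns convention above concrete, here is a minimal, hypothetical sketch in plain NumPy (not actual dask-ml code; the class name ColumnScaler and its details are invented for illustration). It limits the transformation to the selected columns, passes the others through untouched, keeps its learned attributes concrete, and has an inverse_transform that ignores the untouched columns:

```python
import numpy as np


class ColumnScaler:
    """Toy transformer: centers only the given columns, passes others through.

    Hypothetical sketch of the ``columns`` convention; not dask-ml code.
    """

    def __init__(self, columns=None):
        self.columns = columns  # None means "all columns"

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        cols = self.columns if self.columns is not None else range(X.shape[1])
        self.columns_ = list(cols)
        # Learned attributes are concrete (a plain NumPy array, never a
        # dask collection), per the guidelines above.
        self.mean_ = X[:, self.columns_].mean(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        X[:, self.columns_] -= self.mean_  # only the selected columns change
        return X

    def inverse_transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        X[:, self.columns_] += self.mean_  # other columns are ignored
        return X
```

With columns=[0], fitting on [[1, 10], [3, 20]] centers only the first column and leaves the second untouched, and inverse_transform(transform(X)) recovers the original input.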