Thanks for helping to build dask-ml!
Cloning the Repository¶
Make a fork of the dask-ml repo and clone the fork:

git clone https://github.com/<your-github-username>/dask-ml
cd dask-ml
You may want to add https://github.com/dask/dask-ml as an upstream remote:
git remote add upstream https://github.com/dask/dask-ml
Creating an environment¶
We have conda environment YAML files with all the necessary dependencies in the ci directory. If you’re using conda, you can run

conda env create -f ci/environment-3.6.yml --name=dask-ml-dev

to create a conda environment and install all the dependencies. Note there is also a ci/environment-2.7.yml file if you need to use Python 2.7.
If you’re using pip, you can view the list of all the required and optional dependencies in setup.py (see the install_requires field for the required dependencies and extras_require for the optional dependencies). You’ll at least need the build dependencies: NumPy, setuptools, setuptools_scm, and Cython.
The library has some C-extensions, so installing is a bit more complicated than normal. If you have a compiler and everything is set up correctly, you should be able to install Cython and all the required dependencies.
From within the repository:
python setup.py build_ext --inplace
python -m pip install -e ".[dev]"
If you have any trouble with the build step, please open an issue in the dask-ml issue tracker.
Dask-ml uses py.test for testing. You can run the tests from the main dask-ml directory as follows:

pytest
Alternatively, you may choose to run only a subset of the full test suite. For example, to test only the preprocessing submodule, we would run the tests as follows:

pytest tests/preprocessing
In addition to running tests, dask-ml verifies code style uniformity with the flake8 tool:

pip install flake8
flake8
For the most part, we follow scikit-learn’s API design. If you’re implementing a new estimator, it will ideally pass scikit-learn’s estimator check.
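As a sketch of what passing scikit-learn’s estimator check looks like in practice, here is a minimal example using scikit-learn’s own StandardScaler as a stand-in (substitute your new estimator class; the choice of StandardScaler here is just for illustration):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator

# check_estimator runs scikit-learn's suite of API-conformance tests
# (fit/transform contracts, input validation, cloning, get_params/set_params).
# It raises an AssertionError at the first failing check; returning
# quietly means the estimator conforms.
check_estimator(StandardScaler())
```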
We have some additional decisions to make in the dask context. Ideally:

- All attributes learned during .fit should be concrete, i.e. they should not be dask collections.
- To the extent possible, transformers should support .partial_fit.
- If possible, transformers should accept a columns keyword to limit the transformation to just those columns, while passing through other columns untouched. inverse_transform should behave similarly (ignoring other columns), so that inverse_transform(transform(X)) recovers X.
- Methods returning arrays (like .predict) should return the same type as the input. So if a dask.array is passed in, a dask.array with the same chunks should be returned.
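To make the columns convention above concrete, here is a minimal, hypothetical sketch in plain NumPy (not actual dask-ml code; the class name ColumnScaler and its details are invented for illustration). It limits the transformation to the selected columns, passes the others through untouched, keeps its learned attributes concrete, and has an inverse_transform that ignores the untouched columns:

```python
import numpy as np


class ColumnScaler:
    """Toy transformer: centers only the given columns, passes others through.

    Hypothetical sketch of the ``columns`` convention; not dask-ml code.
    """

    def __init__(self, columns=None):
        self.columns = columns  # None means "all columns"

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        cols = self.columns if self.columns is not None else range(X.shape[1])
        self.columns_ = list(cols)
        # Learned attributes are concrete (a plain NumPy array, never a
        # dask collection), per the guidelines above.
        self.mean_ = X[:, self.columns_].mean(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        X[:, self.columns_] -= self.mean_  # only the selected columns change
        return X

    def inverse_transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        X[:, self.columns_] += self.mean_  # other columns are ignored
        return X
```

With columns=[0], fitting on [[1, 10], [3, 20]] centers only the first column and leaves the second untouched, and inverse_transform(transform(X)) recovers the original input.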