-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] [dask] decide and document how users should provide a Dask client at training time #3808
Comments
I tried to write down a useful summary of the problem above, but it's possible that I've missed things or made it unclear. Apologies if that's the case. @ffineis @hcho3 @trivialfis @StrikerRUS whenever you have time, could you review this and let me know if you have strong opinions or additional information that might help make this decision? I'm going to hold back my personal opinion on this until I've heard others, to not bias the outcome 😀 |
@jameslamb I personally don't like option I saw in official examples scikit-learn uses similar to option
Maybe we can create a voting among Dask users? Special Dask repo for discussions or/and Twitter with appropriate hashtags/mentions?.. |
I think the dasky-est approach would be to use not allow the user to explicitly set the client and instead assume that they have initiated a client already. This is basically Option 2, but I want to clarify that it doesn't necessitate that training be run inside a context manager. It is enough to have already instantiated a client object. I think you'll use Other questions
I would expect it to accept both pandas/numpy and dask collections. This is similar to the behavior of |
Thanks @jsignell ! To be fair, i also just added an Option 4. I forgot that |
I've invited more commentary on dask/community#104 |
I mostly agree with @jsignell , but I do think that allowing an optional argument to the model constructor is a reasonable and useful addition; you might just want to have separate things fitting on different clusters simultaneously. |
I forgot to mention that. Yeah, that could be the way to implement fitting on different clusters. It could be documented in an advanced section. |
Thanks for the comments and suggestions, everyone! I've just opened #3883 with my proposal for this issue. I'm proposing removing the # never telling LightGBM to use a specific client (uses distributed.default_client())
clf = lgb.DaskLGBMClassifier()
clf.fit(X, y)
# keyword arg in constructor
clf = lgb.DaskLGBMClassifier(client=client)
clf.fit(X, y)
# setting an attribute
clf = lgb.DaskLGBMClassifier()
clf.set_params(client = client)
clf.fit(X, y) If you have time and interest, comments on #3883 are welcome. |
Thanks @jsignell and @martindurant for joining this issue and giving us your expertise! I just merged #3883 with the changes described in the comment above. |
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Summary
The model objects in https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/dask.py currently inherit from the classes in https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py. For example,
lightgbm.dask.DaskLGBMRegressor
is a child oflightgbm.sklearn.LGBMRegressor
.These Dask estimators perform training on data stored in Dask collections (DataFrame and Array). To do this, they require a Dask client to talk to the cluster that those collections are stored on.
This issue describes the current state and possible alternatives for how LightGBM chooses the client to use for that task.
Option 1 - provide
client
in a keyword arg to.fit()
(current state)In Dask training, users provide a client to the cluster that they'd like to use for training. Currently, this is done in the
.fit()
method (which overrides the parent method from thesklearn
interface).LightGBM/python-package/lightgbm/dask.py
Lines 337 to 339 in 9cc3777
Good properties of this pattern:
.fit()
, soDaskLGBMClassifier
andDaskLGBMRegressor
can be serialized withpickle
/cloudpickle
(I need to confirm that...but I know for sure thatyou cannot pickle a Dask client)Bad properties of this pattern
.fit()
method look different from thesklearn
equivalent, because of the extra parameterclient
Option 2 - don't allow users to customize client
We could choose not to expose
client
at all as a parameter, and just use the default client that's available, for example by asking people to run training in a context manager (shown below), more generally, by usingdistributed.get_client()
Good properties
.fit()
methods identical to thesklearn
equivalents.fit()
, soDaskLGBMClassifier
andDaskLGBMRegressor
can be serialized withpickle
/cloudpickle
(I need to confirm that...but I know for sure that you cannot pickle a Dask client)Bad properties
Option 3 - allow users to set
.client
after constructionIn this option, we could allow users to set
.client
after constructing an estimator, and use that property if it's set. Otherwise, use default client.This is what
xgboost
does in their Dask interface --> https://github.com/dmlc/xgboost/blob/7bc56fa0eda6984197152ed4aa98d2fc9e9e8cc8/python-package/xgboost/dask.py#L171Good properties
.fit()
methods identical to thesklearn
equivalentsBad properties
.fit()
DaskLGBMRegressor
object (for example) can be pickledOption 4 - optionally pass a client in the model object's constructor
This is what
cuml
does for their Dask interface, for example: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.dask.cluster.KMeans. Ifclient = None
, then incuml
you get the default client based on the context.Good properties
Bad properties
Other related questions
This project's Dask interface is very new, and there are other unresolved questions that might be related to this one. Like
.predict()
on Dask estimators only accept inputs that are Dask collections?[will insert issue link here soon]
pandas
ornumpy
)The text was updated successfully, but these errors were encountered: