-
-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Allow pure numpy array (not dask array) as inputs #90
base: main
Are you sure you want to change the base?
Conversation
@mrocklin @pentschev I just added one test for now. If it is ok, could you please suggest which other tests I should add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxiongshu I added a few requests to make the code easier and more Dask-like, also a few questions on things that aren't clear to me. Please take a look when you have a moment.
@@ -11,7 +11,7 @@ | |||
from scipy.optimize import fmin_l_bfgs_b | |||
|
|||
|
|||
from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client | |||
from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client, safe_zeros_like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is safe_zeros_like
coming from? I suppose you wanted to from dask.array.utils import zeros_like_safe
instead, from https://github.com/dask/dask/blob/48a4d4a5c5769f6b78881adeb1b3973a950e5f43/dask/array/utils.py#L350
if isinstance(X, da.Array): | ||
return np.zeros_like(X._meta, shape=shape) | ||
return np.zeros_like(X, shape=shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(X, da.Array): | |
return np.zeros_like(X._meta, shape=shape) | |
return np.zeros_like(X, shape=shape) | |
return zeros_like_safe(meta_from_array(X)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also need to from dask.array.utils import meta_from_array
at the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply, I think I might misunderstand our other conversion. #89 (comment)
This PR intends to enable dask-glm
to deal with pure numpy arrays
. Please let me know if not so and dask-glm
should only accept dask arrays
.
dask-glm/dask_glm/algorithms.py
Lines 100 to 101 in 7b2f85f
beta = np.zeros_like(X._meta, shape=p) | |
Let's say the input X
is a pure numpy
or cupy
array, not a dask array. beta = np.zeros_like(X._meta)
will be an error. The safe_zeros_like
(bad naming) I implemented will check if X
is a pure numpy/cupy array
or a dask array
and return a pure numpy/cupy array
. In contrast, da.utils.zeros_like_safe
returns a dask array
. In this case the beta should be a pure numpy/cupy array
.
Let me know if this clears things up. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
safe_zeros_like
(bad naming) I implemented will check ifX
is apure numpy/cupy array
or adask array
and return apure numpy/cupy array
.
That's exactly what meta_from_array
does. It will return an array of the type _meta
has (i.e., chunk type), so if the input is a NumPy array or a Dask array backed by NumPy, the result is an empty numpy.ndarray
, and if the input is a CuPy array or a Dask array backed by CuPy, the result is an empty cupy.ndarray
.
In contrast,
da.utils.zeros_like_safe
returns adask array
.
That isn't necessarily true, it will only return a Dask array if the reference array is a Dask array. Because we're getting the underlying chunk type with meta_from_array
, the resulting array will be either a NumPy or CuPy array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, that works! I will make the changes.
@@ -149,6 +149,11 @@ def add_intercept(X): | |||
return X_i | |||
|
|||
|
|||
@dispatch(object) | |||
def add_intercept(X): | |||
return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1) | |
return np.concatenate([X, ones_like_safe(X, shape=(X.shape[0], 1))], axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needs from dask.array.utils import ones_like_safe
.
X, y = make_classification(n_samples=100, n_features=5, chunksize=10, is_sparse=is_sparse) | ||
if is_numpy: | ||
X, y = dask.compute(X, y) | ||
lr = LogisticRegression(fit_intercept=fit_intercept) | ||
lr.fit(X, y) | ||
lr.predict(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand this test. When is is_numpy
the case in a real-world example, IOW, will you ever have X
and y
be pure NumPy arrays that's worth testing with LogisticRegression
? I assumed you'd only have Dask arrays (backed by Sparse or not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's exactly what I tried to do, where both X
and y
are pure numpy/cupy arrays
. Is that a feature we want? The current dask-glm
only accepts dask arrays
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's a feature we need to support explicitly, I believe anybody using dask-glm
would want to use Dask arrays rather than pure NumPy/CuPy ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I'll prioritize #89 then.
Currently
dask_glm.estimators
only acceptsdask.array
as inputs due to the line below and other places where._meta
is accessed without checking the data type.dask-glm/dask_glm/estimators.py
Line 67 in 7b2f85f
dask-glm/dask_glm/utils.py
Lines 120 to 124 in 7b2f85f
Click to see the example code and error
Code:
Error:
This PR allows numpy arrays (not dask numpy array) as input directly.