[WIP] Allow pure numpy array (not dask array) as inputs #90

Open
wants to merge 4 commits into
base: main

daxiongshu
Contributor

@daxiongshu daxiongshu commented Oct 29, 2020

Currently, dask_glm.estimators accepts only dask.array inputs, because the line below (and several other places) accesses ._meta without checking the input type.

if is_dask_array_sparse(X):

dask-glm/dask_glm/utils.py

Lines 120 to 124 in 7b2f85f

def is_dask_array_sparse(X):
    """
    Check using _meta if a dask array contains sparse arrays
    """
    return isinstance(X._meta, sparse.SparseArray)

Code:

from dask_glm.estimators import LogisticRegression
import numpy
x = numpy.random.rand(10,4)
y = numpy.random.rand(10)

lr = LogisticRegression()
lr.fit(x,y)

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-e644bf405118> in <module>
----> 1 lr.fit(x,y)

~/rapids/daskml_cupy/dask-glm/dask_glm/estimators.py in fit(self, X, y)
     65         X_ = self._maybe_add_intercept(X)
     66         fit_kwargs = dict(self._fit_kwargs)
---> 67         if is_dask_array_sparse(X):
     68             fit_kwargs['normalize'] = False
     69 

~/rapids/daskml_cupy/dask-glm/dask_glm/utils.py in is_dask_array_sparse(X)
    122     Check using _meta if a dask array contains sparse arrays
    123     """
--> 124     return isinstance(X._meta, sparse.SparseArray)
    125 
    126 

AttributeError: 'numpy.ndarray' object has no attribute '_meta'

This PR allows plain NumPy arrays (not NumPy-backed Dask arrays) to be passed directly as input.
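
One way to avoid the AttributeError shown above is to fall back to the array itself when it has no _meta attribute. The following is a sketch under assumptions, not the change made in this PR: is_array_sparse is a hypothetical name, and the optional import of sparse exists only so the example runs without that package installed.

```python
import numpy as np

try:
    import sparse
    _SPARSE_TYPES = (sparse.SparseArray,)
except ImportError:  # pydata/sparse is treated as optional in this sketch
    _SPARSE_TYPES = ()


def is_array_sparse(X):
    """Hypothetical variant of is_dask_array_sparse that tolerates non-Dask input."""
    # Dask arrays expose _meta; plain NumPy/CuPy arrays do not, so use X itself.
    meta = getattr(X, "_meta", X)
    return isinstance(meta, _SPARSE_TYPES)


x = np.random.rand(10, 4)
result = is_array_sparse(x)  # no AttributeError for a plain ndarray
```

With this guard, the check in fit would simply take the dense path for plain arrays instead of failing.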

@daxiongshu
Contributor Author

@mrocklin @pentschev I have added just one test for now. If it looks OK, could you please suggest which other tests should also cover NumPy input? Thank you!

@daxiongshu
Contributor Author

daxiongshu commented Oct 29, 2020

I think I'm going to finish this first and then move on to #89.
Update: actually, no. I'll move on to #89.

Member

@pentschev pentschev left a comment

@daxiongshu I added a few requests to make the code simpler and more Dask-like, along with a few questions about things that aren't clear to me. Please take a look when you have a moment.

@@ -11,7 +11,7 @@
 from scipy.optimize import fmin_l_bfgs_b


-from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client
+from dask_glm.utils import dot, normalize, scatter_array, get_distributed_client, safe_zeros_like
Member

Where is safe_zeros_like coming from? I suppose you meant from dask.array.utils import zeros_like_safe instead, defined at https://github.com/dask/dask/blob/48a4d4a5c5769f6b78881adeb1b3973a950e5f43/dask/array/utils.py#L350

Comment on lines +216 to +218
if isinstance(X, da.Array):
    return np.zeros_like(X._meta, shape=shape)
return np.zeros_like(X, shape=shape)
Member

Suggested change
-if isinstance(X, da.Array):
-    return np.zeros_like(X._meta, shape=shape)
-return np.zeros_like(X, shape=shape)
+return zeros_like_safe(meta_from_array(X), shape=shape)

Member

You'll also need to add from dask.array.utils import meta_from_array at the top.

Contributor Author

@daxiongshu daxiongshu Nov 11, 2020

Sorry for the late reply. I think I may have misunderstood our other conversation: #89 (comment)

This PR is intended to let dask-glm handle pure NumPy arrays. Please let me know if that is not wanted and dask-glm should accept only Dask arrays.

beta = np.zeros_like(X._meta, shape=p)

Suppose the input X is a pure numpy or cupy array, not a dask array. Then beta = np.zeros_like(X._meta, shape=p) raises an AttributeError, because X has no _meta attribute. The safe_zeros_like (bad naming) I implemented will check if X is a pure numpy/cupy array or a dask array and return a pure numpy/cupy array. In contrast, da.utils.zeros_like_safe returns a dask array. In this case, beta should be a pure numpy/cupy array.

Let me know if this clears things up. Thank you!

Member

"The safe_zeros_like (bad naming) I implemented will check if X is a pure numpy/cupy array or a dask array and return a pure numpy/cupy array."

That's exactly what meta_from_array does. It will return an array of the type _meta has (i.e., chunk type), so if the input is a NumPy array or a Dask array backed by NumPy, the result is an empty numpy.ndarray, and if the input is a CuPy array or a Dask array backed by CuPy, the result is an empty cupy.ndarray.

"In contrast, da.utils.zeros_like_safe returns a dask array."

That isn't necessarily true: it returns a Dask array only if the reference array is a Dask array. Because we're getting the underlying chunk type with meta_from_array, the resulting array will be either a NumPy or CuPy array.
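
The behaviour described here can be checked with a short sketch. It assumes NumPy-backed inputs only (no CuPy), and zeros_for_beta is a hypothetical helper name. It calls np.zeros_like directly, which accepts shape= on NumPy 1.17 and later, rather than zeros_like_safe, so it does not depend on that helper being present in the installed Dask release.

```python
import numpy as np
import dask.array as da
from dask.array.utils import meta_from_array


def zeros_for_beta(X, p):
    """Return a length-p zero vector of X's chunk type (hypothetical helper)."""
    # meta_from_array gives a zero-size array of the underlying chunk type,
    # whether X is a Dask array or already a plain array.
    meta = meta_from_array(X)
    return np.zeros_like(meta, shape=(p,))


X_np = np.random.rand(10, 4)
X_da = da.from_array(X_np, chunks=(5, 4))

beta_np = zeros_for_beta(X_np, 4)  # plain NumPy input
beta_da = zeros_for_beta(X_da, 4)  # Dask input backed by NumPy

print(type(beta_np).__name__, type(beta_da).__name__)
```

In both cases the result is a concrete array of the chunk type rather than a lazy Dask array, which is what beta needs to be.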

Contributor Author

Aha, that works! I will make the changes.

@@ -149,6 +149,11 @@ def add_intercept(X):
     return X_i


+@dispatch(object)
+def add_intercept(X):
+    return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
Member

Suggested change
-return np.concatenate([X, np.ones_like(X, shape=(X.shape[0], 1))], axis=1)
+return np.concatenate([X, ones_like_safe(X, shape=(X.shape[0], 1))], axis=1)

Member

This also needs from dask.array.utils import ones_like_safe at the top.
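
For a plain NumPy input, the fallback under discussion amounts to appending a column of ones. Here is a minimal sketch using np.ones_like with shape= (available on NumPy 1.17 and later); add_intercept_plain is a hypothetical name, not the dispatched function in dask_glm.utils.

```python
import numpy as np


def add_intercept_plain(X):
    """Append a column of ones to a 2-D array (hypothetical standalone helper)."""
    # ones_like with shape= keeps X's dtype and array type while changing shape.
    ones = np.ones_like(X, shape=(X.shape[0], 1))
    return np.concatenate([X, ones], axis=1)


X = np.arange(6, dtype=float).reshape(3, 2)
X_i = add_intercept_plain(X)
print(X_i.shape)
```

Passing X as the prototype, rather than calling np.ones, is what lets the same line produce a CuPy column when X is a CuPy array.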

X, y = make_classification(n_samples=100, n_features=5, chunksize=10, is_sparse=is_sparse)
if is_numpy:
    X, y = dask.compute(X, y)
lr = LogisticRegression(fit_intercept=fit_intercept)
lr.fit(X, y)
lr.predict(X)
Member

I don't think I understand this test. When would is_numpy be true in a real-world example? In other words, would X and y ever be pure NumPy arrays in a way that's worth testing with LogisticRegression? I assumed you'd only have Dask arrays (backed by Sparse or not).

Contributor Author

That's exactly the case I was trying to support: both X and y are pure NumPy/CuPy arrays. Is that a feature we want? dask-glm currently accepts only Dask arrays.

Member

I don't think that's a feature we need to support explicitly. I believe anybody using dask-glm would want to use Dask arrays rather than pure NumPy/CuPy ones.
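
Given that position, a caller holding in-memory NumPy data can wrap it in a Dask array before fitting. The sketch below assumes arbitrary chunk sizes and made-up data; the estimator call is left commented out so the example does not depend on dask_glm being installed.

```python
import numpy as np
import dask.array as da

x = np.random.rand(10, 4)
y = np.random.randint(0, 2, size=10)

# Wrap the in-memory arrays; the chunk sizes here are arbitrary.
X_da = da.from_array(x, chunks=(5, 4))
y_da = da.from_array(y, chunks=5)

# Dask arrays carry the _meta attribute that is_dask_array_sparse reads.
print(type(X_da._meta).__name__)

# from dask_glm.estimators import LogisticRegression
# LogisticRegression().fit(X_da, y_da)
```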

Contributor Author

Thank you! I'll prioritize #89 then.

@daxiongshu daxiongshu changed the title [WIP] Allow numpy array (not dask array) as inputs [WIP] Allow pure numpy array (not dask array) as inputs Nov 11, 2020
Base automatically changed from master to main February 10, 2021 01:06
2 participants