support for multi-dimensional "label" for regressions? #38

ExpandingMan · 2017-02-03T16:16:40Z

Hello all. I haven't dug too far into the source code yet, but I'm wondering if it's possible to do regressions where the "label" (target value) consists of multi-dimensional data points. (i.e. the label argument of the xgboost function would be an Array{T<:Number,2}.) This seems like a pretty important feature, but I can't find any literature about it in the xgboost documentation for any language.

It seems to me that even if it's not explicitly supported this should be possible by setting a custom loss function, however I get the following error any time I try to pass a matrix-valued "label":

ERROR: LoadError: MethodError: no method matching (::XGBoost.#_setinfo#8)(::Ptr{Void}, ::String, ::Array{Float64,2})
Closest candidates are:
  _setinfo{T<:Number}(::Ptr{Void}, ::String, ::Array{T<:Number,1}) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:10
 in (::XGBoost.##call#7#11)(::Array{Any,1}, ::Type{T}, ::Array{Float64,2}, ::Bool, ::Float32) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:59
 in (::Core.#kw#Type)(::Array{Any,1}, ::Type{XGBoost.DMatrix}, ::Array{Float64,2}, ::Bool, ::Float32) at ./<missing>:0
 in makeDMatrix(::Array{Float64,2}, ::Array{Float64,2}) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:137
 in #xgboost#20(::Array{Float64,2}, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Type{T}, ::Type{T}, ::Array{Any,1}, ::Array{Any,1}, ::XGBoost.#xgboost, ::Array{Float64,2}, ::Int64) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:147
 in (::XGBoost.#kw##xgboost)(::Array{Any,1}, ::XGBoost.#xgboost, ::Array{Float64,2}, ::Int64) at ./<missing>:0
 in include_from_node1(::String) at ./loading.jl:488
while loading /home/user/RatingsPrediction/xgboost0.jl, in expression starting on line 43

Taking a look at the source code I get the impression it is not designed to pass labels that aren't Vectors into the C code. Certainly the above error seems to indicate that it is impossible to set a "label" that cannot be converted to Vector.

Is there any way around this? Does the Python API support this? Thanks.


slundberg · 2017-02-03T16:25:56Z

See https://github.com/dmlc/xgboost/blob/master/doc/parameter.md and the multi:softprob objective for how a vector output would be handled (as a flattened matrix). However a deeper question is what you expect to happen in the gradient boosting regression model with a vector output that is different than running a separate model for each dimension. If you can clarify what you want to be different (other than just easier coding), then it will be easier to see if XGBoost supports that. - Scott
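To make the "flattened matrix" remark concrete, here is a minimal sketch in plain Python. It assumes the flat vector is laid out row by row (all outputs for sample 0, then all outputs for sample 1, and so on). The helper name unflatten_row_major and the sample values are made up for illustration and are not part of XGBoost.

```python
from typing import List, Sequence


def unflatten_row_major(flat: Sequence[float], n_cols: int) -> List[List[float]]:
    """Reshape a flat, row-major vector of length n_rows * n_cols into rows."""
    if n_cols <= 0 or len(flat) % n_cols != 0:
        raise ValueError("len(flat) must be a multiple of a positive n_cols")
    # Each consecutive chunk of n_cols values is the output vector of one sample.
    return [list(flat[i:i + n_cols]) for i in range(0, len(flat), n_cols)]


# Hypothetical flat output for 2 samples and 3 classes.
flat_probs = [0.7, 0.2, 0.1, 0.1, 0.3, 0.6]
probs = unflatten_row_major(flat_probs, n_cols=3)
print(probs)
```

With NumPy arrays the same reshape would be flat.reshape(n_rows, n_cols).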


ExpandingMan · 2017-02-03T17:49:00Z

Thanks for your prompt response.

I don't see any significant problem with using multiple models (as far as I can think, in the case of gradient boosted trees this should be exactly equivalent to "one" multi-dimensional model). Of course, one usually doesn't have to resort to this (from an API standpoint), hence the issue. Apart from convenience, I'd be a bit concerned about performance issues if I were fitting in a high-dimensional space, but perhaps that's unwarranted.

slundberg · 2017-02-03T17:56:59Z

Deep learning APIs often allow vector output because they share parameters during such multitask learning. My guess is that since GBMs don't typically do this, running separate models is the most explicit way of doing this without implying that any parameter sharing is happening. I think Tianqi wrote a paper with Carlos a while back on accounting for certain types of dependence among the output features, so you might also check that out if you want. - Scott
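The "separate model per dimension" approach can be sketched roughly as follows. This is an illustration under stated assumptions, not XGBoost code: fit is any hypothetical training function that takes features plus a 1-D target and returns a callable model, and fit_mean is a trivial stand-in learner used only so the example runs without any library installed.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]
Model = Callable[[Row], float]  # maps one feature row to one scalar prediction
Fit = Callable[[Sequence[Row], Sequence[float]], Model]


def fit_per_dimension(fit: Fit, X: Sequence[Row], Y: Sequence[Row]) -> List[Model]:
    """Train one independent scalar model per column of the 2-D target Y."""
    n_dims = len(Y[0])
    return [fit(X, [row[d] for row in Y]) for d in range(n_dims)]


def predict_per_dimension(models: Sequence[Model], X: Sequence[Row]) -> List[List[float]]:
    """Stack the scalar predictions back into one output vector per sample."""
    return [[model(x) for model in models] for x in X]


def fit_mean(X: Sequence[Row], y: Sequence[float]) -> Model:
    """Stand-in learner (NOT XGBoost): always predicts the mean of y."""
    mean = sum(y) / len(y)
    return lambda row: mean


X = [[1, 0], [2, 2], [0, 3], [4, 4]]
Y = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
models = fit_per_dimension(fit_mean, X, Y)
preds = predict_per_dimension(models, X[:1])
print(preds)
```

No parameters are shared between the models, which is the point made above: each output dimension is fitted in isolation, so the cost grows linearly with the number of dimensions.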


mangolzy · 2022-10-13T02:57:40Z

I have a related point of confusion. According to some out-of-date documentation, e.g.
https://xgboost.readthedocs.io/en/release_0.72/python/python_api.html
label ([list] or numpy 1-D array, optional) – Label of the training data.
it seems that only a 1-D array is accepted as the label when constructing a DMatrix.
But in the current version,
https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.training
label (array_like) – Label of the training data.
the shape of the label is not restricted, and we can indeed pass a 2-D array as the label. However, something strange happens: when we call dmatrix.get_label() to inspect this 2-D label, the underlying code seems to have flattened it and kept only the first "sample length" elements, like this:

import numpy as np
import pandas as pd
import xgboost as xgb

X = pd.DataFrame(data=[[1,0], [2,2], [0,3], [4,4]])
y = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
dsoft_fake = xgb.DMatrix(X.values, label=y)
dsoft_fake.get_label()

output:

array([1., 2., 3., 4.], dtype=float32)
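The values above are what one would get by flattening the label matrix row by row and keeping only the first n_samples entries. The plain-Python sketch below reproduces that arithmetic. It only illustrates the hypothesis and makes no claim about what XGBoost does internally.

```python
from itertools import chain

y = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
n_samples = len(y)

# Row-major flatten: 1, 2, 3, 4, 5, ..., 12
flat = list(chain.from_iterable(y))

# Keep only as many entries as there are samples, converted to float.
truncated = [float(v) for v in flat[:n_samples]]
print(truncated)  # [1.0, 2.0, 3.0, 4.0]
```

With NumPy the same thing is y.ravel()[:len(y)]. Note that the result is the first row plus the start of the second row, not the first column of labels.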

So my questions are:

If a 2-D array is accepted as the label, how should it be used correctly, under which circumstances, and for solving what kind of problem?
Or, if we do want the label of one sample to be a vector, which can be considered a soft label consisting of probabilities for the different classes (more than 2) that sum to 1, does xgboost support this now? In this case, I don't think a separate model for each dimension is suitable.

Thanks in advance for any explanation.

trivialfis · 2022-10-13T04:37:23Z

The matrix input for labels is a recent addition (1.6) for multi-output and multi-label tasks; the getter cannot return the matrix yet.

ExpandingMan mentioned this issue Feb 9, 2017

compliance with ScikitLearn API #39

Closed
