what is the role of sparse DMatrix constructors? #160

Open
dmoliveira opened this issue Jan 13, 2023 · 25 comments

Comments

@dmoliveira

dmoliveira commented Jan 13, 2023

I have been using version v1.1.1 for some time now and have been able to successfully load large amounts of data (5-10 GB) into memory using XGBoost and DMatrix, as I have ample RAM to do so. However, after updating to v2.1.1, I am experiencing an 'Out-of-Memory' exception when attempting to load significantly smaller amounts of data (200 MB-700 MB). I am concerned that this update has made the library less efficient and effective for training large models. This issue is a major obstacle for real-world usage and requires immediate attention and resolution. Can you please take a look at this and provide a fix? @ExpandingMan @aviks Thank you.

Julia Version: 1.8
Machine RAM: 64GB
Data Size (MB): 700 MB
Data Stats: X:(1705844, 29996), Y:(1705844,)
Data Format: LibSVM Format

The error is when executing the DMatrix conversion using a SparseMatrixCSC{Float32, Int32}

@info "2/4. Transforming to dmatrix ..."
dtrain = XGBoost.DMatrix(X_train, y_train)

Error Message:

ERROR: LoadError: OutOfMemoryError()
@dmoliveira changed the title from "Urgent: Memory Error with DMatrix in XGBoost v2.1.0 - Need Fix" to "Urgent: Memory Error with DMatrix in XGBoost v2.1.1 - Need Fix" on Jan 13, 2023
@ExpandingMan
Collaborator

The bad news is that it's probably converting to a dense matrix and running out of memory. The "good" news is that I don't think that libxgboost actually does support sparse matrices which would mean that this is not a bug but intended behavior.
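As a rough check on the dense-conversion explanation, the size of a dense copy can be estimated from the shape in the report. This is only a sketch: it assumes 4-byte Float32 values (the element type of the reported SparseMatrixCSC) and ignores any extra copies the wrapper might make.

```python
# Back-of-the-envelope size of a dense copy of the reported matrix.
rows, cols = 1_705_844, 29_996   # shape of X from the report
bytes_per_value = 4              # assumption: dense Float32 storage
ram_gib = 64                     # machine RAM from the report

dense_bytes = rows * cols * bytes_per_value
dense_gib = dense_bytes / 2**30

print(f"dense copy: {dense_gib:.1f} GiB, available RAM: {ram_gib} GiB")
```

Under those assumptions the dense copy needs roughly 190 GiB, about three times the reported RAM, which would be consistent with the OutOfMemoryError.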

I believe that, when sparse formats are created via XGDMatrixCreateFromCSCEx it's actually interpreting the zeros as missing, not 0. Therefore, passing a SparseMatrixCSC to this function results in a DMatrix where the 0s are replaced by nulls. We had an extensive discussion about this here.

That's not to say there is no feature to be added here: we ought to have something for calling XGDMatrixCreateFromCSCEx, I just don't quite know what that would look like. It doesn't seem likely that anyone is going to create an implementation of SparseMatrixCSC with missing instead of 0 anytime soon, so maybe we should just expose low-level functions for special cases.

If you are up for experimentation you can see the deleted method here.

@trivialfis can you confirm that I'm getting this right? I think what I'm saying is consistent with our earlier discussion, but I don't think we explicitly mentioned XGDMatrixCreateFromCSCEx when we had it.

@ExpandingMan
Collaborator

I've been thinking a bit, and it seems to me that there are likely quite a few situations in which treating 0's as nulls in a sparse dataset for the purpose of training an xgboost model might actually make some sense. This would explain how you seem to have been content with whatever was happening in v1.1.1 which might have been exactly this.

Perhaps one of the things we need to do here is add a new function which creates a DMatrix from a SparseMatrixCSC, null0sparse or something like that.

I'm sure trivialfis has some thoughts on this (but one ping is enough 🙂 ).

@dmoliveira
Author

@ExpandingMan, thanks for bringing up the question about the handling of 0s in the sparse matrix. Comparing various implementations, such as XGBoost in Julia, Python XGBoost, LightGBM, and CatBoost, I've found that performance is on par when using version 1.1.1. In my experience, for datasets with millions of rows and mostly sparse and categorical data, it's fine to ignore the 0s, since that lets me fit huge datasets in memory. One potential solution could be to assign a different internal representation for 'explicit' 0s and ignore the missing data.

I'll give your suggestion a try and run some baselines using the same data in Python to compare quality performance (e.g., precision, etc). If the results are satisfactory, I'd propose this fix as one option. Although I'm not very familiar with the C bindings of the library, I have some experience with it in general. I'll take a look at the code and see if I can incorporate this fix there. Thanks for your input on this issue!

@dmoliveira
Author

Great news, @ExpandingMan! Your suggestion worked. I implemented the code you suggested and was able to load more than 1.7M rows in the DMatrix using version 1.1.1 as expected.

I'll now proceed to test the quality of the model. If it is consistent with the previous metrics, I'm happy with this solution as it appears to be effective.

image

@trivialfis
Member

trivialfis commented Jan 13, 2023

I believe that, when sparse formats are created via XGDMatrixCreateFromCSCEx it's actually interpreting the zeros as missing, not 0. Therefore, passing a SparseMatrixCSC to this function results in a DMatrix where the 0s are replaced by nulls. We had an extensive discussion about this dmlc/xgboost#8459.

Hmm, I need to write a document on the behavior and get some reviews from all participants.

I think it's not that complicated; my presentation just added unnecessary complications. Assume this is a CSC matrix with 0 as a valid (non-missing) value:

indptr = [0, 3, 5]
data = [0, 1, 2, 3, 4]
row_idx = [0, 1, 2, 0, 2]

so:

[0, 3,
 1, missing,
 2, 4]

The corresponding xgboost.DMatrix is a CSR with exactly the same entries. 0 is there, missing is NOT there. Nothing is changed, the data used inside DMatrix is exactly the same as the CSC passed in.
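To make the example concrete, here is a small pure-Python sketch that expands the three CSC arrays above into rows, writing `None` where no entry is stored. The helper name `csc_to_rows` is made up for illustration and is not part of xgboost.

```python
def csc_to_rows(indptr, row_idx, data, n_rows):
    """Expand CSC arrays into a list of rows; entries that are not stored become None."""
    n_cols = len(indptr) - 1
    rows = [[None] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Stored entries of column `col` live in data[indptr[col]:indptr[col + 1]].
        for k in range(indptr[col], indptr[col + 1]):
            rows[row_idx[k]][col] = data[k]
    return rows


indptr = [0, 3, 5]
data = [0, 1, 2, 3, 4]
row_idx = [0, 1, 2, 0, 2]

dense = csc_to_rows(indptr, row_idx, data, n_rows=3)
for row in dense:
    print(row)
```

The explicit 0 at position (0, 0) survives as a value, while the entry at (1, 1), which was never stored, is the only one shown as missing.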

@trivialfis
Member

let me unify the csc input with csr. I believe that's where the confusion comes in. The csc api is legacy.

@ExpandingMan
Collaborator

ExpandingMan commented Jan 13, 2023

Thanks @trivialfis . My main question here was more about how xgboost treats a CSC (or even a CSR) input. In every CSC matrix implementation I'm aware of, the non-explicit elements, i.e. those not explicitly included in the values array, are 0 and this is not changeable because their main purpose is matrix multiplication the algorithms for which have huge simplifications when this is the case. It seems to me that, both for CSC and CSR, xgboost treats these non-explicit elements as null (which I believe means that the tree algorithm simply never splits on them) which is different than 0.

The second part of this, which occurred to me last night, is that it actually might make some sense to treat 0's as missing when training an xgboost model in cases in which the data is very sparse. My reasoning is basically "there are so many 0's, don't try to split on them because all your meaningful signal is coming from the non-zeros". If that's the case, then the CSC method could be quite useful, particularly since this is the most common type of sparse matrix in Julia. The only catch is that the method for using it must not be DMatrix, as this implies that the 0's remain 0's, but we could always add a new function.

Incidentally, I've recently had to do some machine learning on sparse data, and it can be pretty challenging, so now I'm interested in experimenting with the "treat the 0's as missing" idea.

@dmoliveira
Author

dmoliveira commented Jan 16, 2023

Here is an update from my side. I tried to use the commented-out code in the dmatrix.jl file:

function DMatrix(x::SparseMatrixCSC{<:Real,<:Integer}; kw...)
    o = Ref{DMatrixHandle}()
    (colptr, rowval, nzval) = _sparse_csc_components(x)
    xgbcall(XGDMatrixCreateFromCSCEx, colptr, rowval, nzval,
            size(colptr,1), nnz(x), size(x,1),
            o,
           )
    DMatrix(o[]; kw...)
end

Good news: I could load huge sparse data into it and train my models.
Bad news: The results with v2.1.1 are far behind those with v1.1.1 (the version I was previously using with this data).

Below is the data summarizing my runs:

image

Metric P means Precision@N, for N=1,5, 10, and 20.
Param Ite means NumRound parameter (num of trees added).
The baseline is my model with XGB v1.1.1 with similar parameters.
The other runs are using XGB v2.1.1 with different iterations number.

So, is this a problem produced by the way that we are reading the sparse data in the new DMatrix or has some other side effect from version 2.1.1 that we need to consider? Maybe I'm missing something about how to read the sparse data into the DMatrix correctly.

@ExpandingMan
Collaborator

I don't see what could be causing this other than the data getting into the DMatrix incorrectly.

You can now get data back out of the DMatrix by accessing its elements or using XGBoost.getdata. You should see that something is wrong with your matrix. I can't think of anything else that could possibly cause it, once it goes into the DMatrix the wrapper isn't doing anything to it anymore.

@dmoliveira
Author

Ok. Let me investigate more about the differences between these two versions and post them here.

@trivialfis
Member

trivialfis commented Jan 16, 2023

Apologies for the slow response.

I think this is making things too complicated. DMatrix is real simple:

csr = load_data() # a csr matrix loaded by your library code, same for dense
# internal DMatrix is doing this
class DMatrix:
    """I'm a CSR matrix called DMatrix, with parameter `missing` and constants `NaN` and `Inf` removed."""
    def __init__(self, missing: float):
        self.missing = missing
        self.indptr = []
        self.data = []
        self.indices = []

    def load_csr(self, csr):
        for i, v in enumerate(csr.data):
            # pseudocode: drop entries equal to `missing`, NaN or Inf
            if v not in (self.missing, NaN, Inf):
                self.data.append(v)
                self.indices.append(csr.indices[i])
        self.handle_indptr()  # pseudocode: rebuild row pointers (not shown)

For CSC matrix before dmlc/xgboost#8672 can be merged:

csc = load_data() # a csc matrix loaded by your library code
# internal DMatrix is doing this
class DMatrix:
    """I'm a CSR matrix called DMatrix, with constants `NaN` and `Inf` removed."""
    def __init__(self):
        self.indptr = []
        self.data = []
        self.indices = []

    def load_csc(self, csc):
        for i, v in enumerate(csc.data):
            # pseudocode: drop NaN and Inf only; there is no `missing` parameter here
            if v not in (NaN, Inf):
                self.data.append(v)
                self.indices.append(csc.indices[i])
        self.handle_indptr()  # pseudocode: rebuild column pointers (not shown)
        self.transform_to_csr()

After the PR, it's the same as CSR and dense.
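For anyone who wants to run the idea, here is a self-contained sketch of the CSR case that follows the pseudocode above. It is a toy model, not xgboost's implementation: the class name `SketchDMatrix` and its method names are invented, and the row-pointer handling is my own guess at what `handle_indptr` stands for.

```python
import math


class SketchDMatrix:
    """Toy CSR container that drops `missing`, NaN and Inf entries while loading."""

    def __init__(self, missing=math.nan):
        self.missing = missing
        self.indptr = [0]
        self.indices = []
        self.data = []

    def _is_dropped(self, v):
        # NaN never compares equal to itself, so it needs an explicit check.
        if math.isnan(v) or math.isinf(v):
            return True
        return not math.isnan(self.missing) and v == self.missing

    def load_csr(self, indptr, indices, data):
        for row in range(len(indptr) - 1):
            for k in range(indptr[row], indptr[row + 1]):
                if not self._is_dropped(data[k]):
                    self.indices.append(indices[k])
                    self.data.append(data[k])
            # Close the row after filtering so the pointers stay consistent.
            self.indptr.append(len(self.data))
        return self


# Two rows, three columns; the first row stores an explicit 0.0.
indptr, indices, data = [0, 2, 4], [0, 2, 1, 2], [0.0, 5.0, math.nan, 7.0]

keep_zeros = SketchDMatrix().load_csr(indptr, indices, data)
drop_zeros = SketchDMatrix(missing=0.0).load_csr(indptr, indices, data)
print(keep_zeros.data, keep_zeros.indptr)
print(drop_zeros.data, drop_zeros.indptr)
```

With the default `missing` the explicit 0.0 is kept as a real value; with `missing=0.0` it is filtered out like the NaN. In both cases entries that were never stored in the input stay absent.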

@dmoliveira
Author

Thank you @ExpandingMan and @trivialfis for your feedback and support. The DMatrix appears to be working correctly.

Sample - Original Sparse SVMLight Data:

image

Sample - DMatrix Representation:
image

I am using a Pairwise optimization method in version 1.1.1 that utilizes group information. I will check if this information is being correctly passed to the DMatrix using the setgroup function in the new version. Another option is to simplify the configuration and compare the results between the two versions. I will keep you updated and consider any additional suggestions.
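For reference, the group information for ranking is a list of sizes of consecutive query groups, so one quick sanity check is to recompute it from the per-row query ids and confirm the sizes add up to the number of rows. A minimal sketch, assuming rows belonging to the same query are stored contiguously; `group_sizes` is a name made up for this example.

```python
from itertools import groupby


def group_sizes(qids):
    """Run-length encode consecutive query ids into a list of group sizes."""
    return [sum(1 for _ in run) for _, run in groupby(qids)]


qids = [150, 150, 21, 21, 5]
sizes = group_sizes(qids)
print(sizes)

# The sizes must cover every row exactly once.
print(sum(sizes) == len(qids))
```

If the sizes do not sum to the row count, or if the same query id appears in two separate runs, the groups seen by the ranking objective will differ from what was intended.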

@dmoliveira
Author

Research Update

Score Model Analysis

I have been running a comparison of two versions of XGBoost, v1.1.1 and v2.1.1, using the same model parameters and data. I have plotted the model score for the two classes (blue=negative, orange=positive). In my case, scores where the two classes are far apart with minimal overlap are better, and this is directly related to better results in the offline evaluation.

As a result, I have found that v1.1.1 does a better job of separating the classes and consequently performs better in the offline evaluation. I am currently investigating whether this difference is due to how we use the DMatrix or how we set parameters before starting the training in the new version.

XGBoost v1.1.1
image

XGBoost v2.1.1
image

Any thoughts or suggestions on this issue are welcome.

@dmoliveira
Author

To keep the thread fresh: I am currently investigating the differences between v2.1.1 and v1.1.1. Despite conducting multiple experiments, I have been unable to achieve the same performance as v1.1.1. In my offline tests, v2.1.1 scores about 10 times lower than v1.1.1 (e.g., 0.001 vs. 0.01 for P@5). Additionally, I have not identified any significant differences in the creation of the DMatrix between the old and new code. At this point, it appears that the issue may not be related to treating 0 as a missing value or to differences in the DMatrix implementation but rather to something else. I will continue to conduct further experiments and will update this post with any relevant findings.

@ExpandingMan
Collaborator

It seems like there is a lot of confusion here. First, I have not tested the sparse DMatrix constructors on my end, but I have not otherwise seen any evidence of a problem in the latest tagged release.

@dmoliveira this is only actionable if you can show some evidence of what is causing the discrepancy. You've shown detailed results, but you need to provide a complete working example with data in order for anything to happen here. I frankly have no idea what to make of what you have shown thus far.

@trivialfis I think we are talking past each other somewhat. I understand how to construct a DMatrix using any of the constructors in the library, what I am unsure about is what xgboost does with it. Does it treat the non-explicit entries (anything not in data in your example above) as ordinary 0's the same as if they were passed in a dense matrix, or does it filter them out of the decision tree process entirely? The former is consistent with the structure of those matrices as they exist in any standard implementation (including all the Julia and Python implementations I am aware of) while the latter is not, but may still be useful for us here.

@ExpandingMan
Collaborator

I've just tested some constructors. When converting the DMatrix constructed via XGDMatrixCreateFromCSCEx back to a Matrix (via Matrix(::DMatrix) on latest master) it does indeed look like the non-explicit elements are being converted to missing, which I assume means they are ignored during training. This would seem to answer my above question to @trivialfis .

At the moment I can only speculate on the wisdom of actually using this as a technique to train on sparse data. Again, I can see how simply ignoring the 0's might be a reasonable thing to do if your data is very sparse. This approach would not be without major caveats, in particular data points in which every dimension is 0 provide no information, e.g. you couldn't have a random sparsity pattern of the type returned by sprand.

@ExpandingMan changed the title from "Urgent: Memory Error with DMatrix in XGBoost v2.1.1 - Need Fix" to "what is the role of sparse DMatrix constructors?" on Jan 25, 2023
@trivialfis
Member

trivialfis commented Jan 28, 2023

Your observation is correct. It's filtered out. If one wants the 0s to be used during training, a dense matrix should be preferred. There was a proposal to restore the zero elements in the input sparse matrix, but we opted not to implement it due to code complexity.

In order for xgboost to restore the 0s, there are 2 options:

  • Restore the 0s in DMatrix and QuantileDMatrix, which might defeat the purpose of using sparse matrix input in the first place, as the memory saved by sparse input is now consumed by DMatrix. One might just use dense input instead.
  • Generate the 0s on the fly. This will significantly increase code complexity on both CPU and GPU (especially GPU), and for all the code that uses the data. We can look for some abstractions to handle this in some specific use cases. For instance, we might be able to do it with hist and gpu_hist in the future but not for others.
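To illustrate what "on the fly" could mean in the second option, here is a sketch of a generator that reads CSR arrays and yields one dense row at a time, filling entries that are not stored with 0. It only shows the idea; it says nothing about how xgboost would implement it, and `iter_dense_rows` is an invented name.

```python
def iter_dense_rows(indptr, indices, data, n_cols, fill=0.0):
    """Yield dense rows from CSR arrays, materializing only one row at a time."""
    for row in range(len(indptr) - 1):
        out = [fill] * n_cols
        for k in range(indptr[row], indptr[row + 1]):
            out[indices[k]] = data[k]
        yield out


# Three rows, three columns; the middle row has no stored entries at all.
indptr, indices, data = [0, 1, 1, 3], [2, 0, 1], [5.0, 1.0, 2.0]
dense_rows = list(iter_dense_rows(indptr, indices, data, n_cols=3))
for row in dense_rows:
    print(row)
```

Only one dense row exists in memory at any moment, which is the appeal of this option; the cost is that every consumer of the data has to go through an access layer like this instead of reading the stored entries directly.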

Let me know if you think this should be prioritized. I can open an issue to track it.

@ExpandingMan
Collaborator

Thanks, that clears things up!

I don't really have an opinion on this right now. Like I said, I'm still curious about the idea that maybe omitting 0's from sparse data actually makes sense for training in some special cases.

I suppose the bigger question is whether xgboost is likely to do a good job on sparse data with the 0's included. If yes, I think it would be really nice to have an implementation that respects the sparse matrix structure.

I also think it will be hard to gauge how much demand there really is for this. I suspect that in most cases, if you have data too big to fit in memory, it's not going to be sparse, but if it does just so happen to be sparse, the existence of sparse matrix constructors can potentially save you an enormous amount of effort.

@dmoliveira
Author

dmoliveira commented Feb 7, 2023

A small repository has been created (https://github.com/dmoliveira/xgboost-benchmark) that contains a single file to demonstrate the issue and provide the necessary data. The script relies solely on the XGBoost library, which can be installed based on the version (version 1.5.2 and 2.2.3 were used in the tests). The tests were conducted using Julia v1.8.

This repository contains a simple learning-to-rank task aimed at properly ranking documents. For some reason, the performance in metrics fluctuates depending on when the prediction metric is called. The cause of this issue is still unknown, but version 1 of the library appears to be more stable. The results show that when plotted, the prediction quality decreases, particularly for relevant versus irrelevant documents in lower and higher scores.

Ranking Score Expectation
image

Ranking Score Results for V2 depending on when you call predict
image

Results Run XGBoost v1.5.2
VERSION=1.5.2 ITE=100 ./run_xgb_experiment.jl

[ Info: (1) TRAIN - Precision@N: p@5:0.87714 p@10:0.85536 p@20:0.83332
[ Info: (2) TEST - Precision@N: p@5:0.88097 p@10:0.85939 p@20:0.83138

Invert Test and Train prediction call order for Evaluation

[ Info: (1) TRAIN - Precision@N: p@5:0.88097 p@10:0.85939 p@20:0.83138
[ Info: (2) TEST - Precision@N: p@5:0.87714 p@10:0.85536 p@20:0.83332

Results Run XGBoost v2.2.3
VERSION=2.2.3 ITE=100 ./run_xgb_experiment.jl

[ Info: (1) TRAIN - Precision@N: p@5:0.853 p@10:0.83667 p@20:0.82057
[ Info: (2) TEST - Precision@N: p@5:0.87873 p@10:0.85398 p@20:0.83702

Invert Test and Train prediction call order for Evaluation

[ Info: (1) TRAIN - Precision@N: p@5:0.54813 p@10:0.53545 p@20:0.51238
[ Info: (2) TEST - Precision@N: p@5:0.88808 p@10:0.87128 p@20:0.85495

@dmoliveira
Author

dmoliveira commented Feb 7, 2023

Full code used to generate the results.

#!/usr/bin/env julia

# -- Import basic libraries for test
using Pkg
using SparseArrays

# -- Auxiliary Functions for test

function getdata(filename::String; rows::Int=-1)::Tuple{SparseMatrixCSC, Vector{Float32}, Vector{Int}}
    """Get feature as sparse matrix, labels and group ids after reading a SVMLight formatted file"""
    # ys holds Float32 labels so that non-integer labels do not fail on push!
    I, J, V, ys, qids = Int[], Int[], Float32[], Float32[], Int[]
    for (i, rawrow) in enumerate(eachline(filename))
        row = split(rawrow, " ")
        push!(ys, parse(Float32, row[1]))
        push!(qids, parse(Int, last(split(row[2], ":"))))
        for f in row[3:end]
            j, v = split(f, ":")
            push!(I, i)
            push!(J, parse(Int, j))
            push!(V, parse(Float32, v))
        end
        rows > 0 && length(ys) >= rows && break
    end
    sparse(I, J, V), ys, countqids(qids)
end

function countqids(qids::Vector{Int})::Vector{Int}
    """Return a count of qids. Example: qid input '[150, 150, 21, 21, 5]' return '[2, 2, 1]'."""
    last_qid, counts = first(qids) , Int[0]
    for qid in qids
        if last_qid == qid
            counts[end] += 1
        else
            push!(counts, 1)
            last_qid = qid
        end
    end
    counts
end

function setgroup(dmatrix, qids::Vector{Int})
    """Set groups 'qids' to DMatrix"""
    group = convert(Vector{UInt32}, qids)
    group_size = convert(UInt64, size(group, 1))
    XGBoost.XGDMatrixSetUIntInfo(dmatrix.handle, "group", group, group_size)
end

function precision_at_k(y, yhat, qids; k::Int, threshold::Float64=0.0)::Float32
    """Return metric precision at k"""
    score = 0.0
    s = 1
    for qid in qids
        allowed_k = convert(Int, min(qid, k))
        e = s + allowed_k - 1
        hits = sum((yhat[s:e] .> threshold) .== (y[s:e] .> threshold))
        score += hits/allowed_k
        s += qid
    end

    round(score/length(qids), digits=5)
end

function confusion_matrix(ys, yshat, qids; threshold::Float64=0.0)
    """Show confusion matrix"""
    s = 1
    tp, fp, tn, fn = 0, 0, 0, 0
    for qid in qids
        e = s + qid - 1
        for (yhat, y) in zip(yshat[s:e] .> threshold, ys[s:e] .> threshold)
            if yhat == y
                if yhat == 1
                    tp += 1
                else
                    tn += 1
                end
            else
                if yhat == 1
                    fp += 1
                else
                    fn += 1
                end
            end
        end
        s += qid
    end

    precision = round(tp/(tp+fp), digits=5)
    recall = round(tp/(tp+fn), digits=5)
    f1 = round(2 * (precision * recall)/(precision + recall), digits=5)
    accuracy = round((tp+tn)/(tp+fp+tn+fn), digits=5)
    prevalence = round((tp+fn)/(tp+fn+fp+tn), digits=5)
    @info "TP:$tp FP:$fp TN:$tn FN:$fn Precision:$precision Recall:$recall F1:$f1 Accuracy:$accuracy Prevalence:$prevalence"
end

function evalmetrics(bst, dtrain, y_train, qids_train, dtest, y_test, qids_test; threshold::Float64=0.0)
    """Evaluate model quality in training and test data using Precision@N metric"""

    yhat_train = XGBoost.predict(bst, dtrain)
    yhat_test  = XGBoost.predict(bst, dtest)

    save_predictions(yhat_train, y_train, "train_prediction_xgb_v$(VERSION).csv")
    save_predictions(yhat_test, y_test, "test_prediction_xgb_v$(VERSION).csv")

    pat5  = precision_at_k(y_train, yhat_train, qids_train, k=5,  threshold=threshold)
    pat10 = precision_at_k(y_train, yhat_train, qids_train, k=10, threshold=threshold)
    pat20 = precision_at_k(y_train, yhat_train, qids_train, k=20, threshold=threshold)
    @info "(1) TRAIN - Precision@N: p@5:$pat5 p@10:$pat10 p@20:$pat20"
    #confusion_matrix(y_train, yhat_train, qids_train; threshold=threshold)

    pat5  = precision_at_k(y_test, yhat_test, qids_test, k=5,  threshold=threshold)
    pat10 = precision_at_k(y_test, yhat_test, qids_test, k=10, threshold=threshold)
    pat20 = precision_at_k(y_test, yhat_test, qids_test, k=20, threshold=threshold)
    @info "(2) TEST - Precision@N: p@5:$pat5 p@10:$pat10 p@20:$pat20"
    #confusion_matrix(y_test, yhat_test, qids_test; threshold=threshold)
end

function ConvertDMatrix(x::SparseMatrixCSC{<:Real,<:Integer}; kw...)
    """Transform sparse matrix 'x' to DMatrix in XGBoost v.2 generation"""
    o = Ref{XGBoost.DMatrixHandle}()
    (colptr, rowval, nzval) = XGBoost._sparse_csc_components(x)
    XGBoost.xgbcall(XGBoost.XGDMatrixCreateFromCSCEx, colptr, rowval, nzval,
                    size(colptr,1), nnz(x), size(x,1), o)
    XGBoost.DMatrix(o[]; kw...)
end

function save_predictions(ypred::Vector, yreal::Vector, filename::String)
    """Save predictions into CSV format with columns 'yreal' and 'ypred'"""
    io = open(filename, "w")
    write(io, "yreal,ypred\n")
    for i=1:length(ypred)
        write(io, "$(yreal[i]),$(ypred[i])" * (i < length(ypred) ? "\n" : ""))
    end
    flush(io)
    close(io)
end
    
# -- Code Execution

# Parameters
VERSION = get(ENV, "VERSION", "1.5.2")
NUM_ITERATIONS = parse(Int, get(ENV, "ITE", "10"))

println("\nXGBoost v1 x v2 Experiment\n")

# Define parameters for XGBoost model
params = Dict(
    "seed" => 1,
    "num_round" => NUM_ITERATIONS,
    "booster" =>"gbtree",
    "objective" => "rank:pairwise",
    "verbosity" => 1,
    "eta" => 0.01,
    "gamma" => 0,
    "max_depth" => 7,
    "min_child_weight" => 1,
    "max_delta_step" => 0,
    "subsample" => 0.9,
    "colsample_bytree" => 0.9,
    "colsample_bylevel" => 0.9,
    "colsample_bynode" => 1.0,
    "lambda" => 1,
    "alpha" => 0,
    "refresh_leaf" => 1,
    "process_type" => "default",
    "tree_method" => "hist",
    "num_parallel_tree" => 1,
    "grow_policy" => "depthwise",
    "max_bin" => 256,
    "predictor" => "auto" )

@info "-- Loading Data..."
x_train, y_train, qids_train = getdata("train.svmlight")
x_test, y_test, qids_test = getdata("test.svmlight")

@info "Train Data X:$(size(x_train)) Y:$(length(y_train)) QIDs:$(length(qids_train))"
@info "Test Data X:$(size(x_test)) Y:$(length(y_test)) QIDs:$(length(qids_test))"
@info "-- ended data loading"; println()

if occursin(r"^1[.]", VERSION) 
    @info "Execute Analysis for XGBoost v$VERSION"
    Pkg.add(Pkg.PackageSpec(name="XGBoost", version=VERSION), io=devnull)
    using XGBoost
    
    dtrain    = XGBoost.makeDMatrix(x_train, y_train)
    dtest     = XGBoost.makeDMatrix(x_test, y_test)
    watchlist = [(dtrain, "train"), (dtest, "eval")]

    setgroup(dtrain, qids_train)
    setgroup(dtest, qids_test)
    
    bst = XGBoost.xgboost(dtrain, NUM_ITERATIONS, metrics=["auc"], watchlist=watchlist, param=params)
    
    println("\nResults Run XGBoost v$VERSION")
    evalmetrics(bst,
                XGBoost.DMatrix(x_train), y_train, qids_train,
                XGBoost.DMatrix(x_test),  y_test,  qids_test,
                threshold=0.0)

    println("\nInvert Test and Train for Evaluation")
    evalmetrics(bst,
                XGBoost.DMatrix(x_test), y_test, qids_test,
                XGBoost.DMatrix(x_train),  y_train,  qids_train,
                threshold=0.0)
    
end

if occursin(r"^2[.]", VERSION) 
    @info "Execute Analysis for XGBoost v$VERSION"
    Pkg.add(Pkg.PackageSpec(name="XGBoost", version=VERSION), io=devnull)
    using XGBoost

    dtrain = ConvertDMatrix(x_train, label=y_train)
    dtest  = ConvertDMatrix(x_test, label=y_test)

    setgroup(dtrain, qids_train)
    setgroup(dtest, qids_test)
    
    params["watchlist"] = Dict("train" => dtrain, "eval" => dtest)
    params["eval_metric"] = "auc"
    kwargs = Dict{Symbol, Any}(Symbol(k) => v for (k,v) in params)
    
    bst = XGBoost.xgboost(dtrain; kwargs...)
    
    println("\nResults Run XGBoost v$VERSION")
    evalmetrics(bst,
                ConvertDMatrix(x_train), y_train, qids_train,
                ConvertDMatrix(x_test),  y_test,  qids_test, threshold=0.0)

    println("\nInvert Test and Train for Evaluation")
    evalmetrics(bst,
                ConvertDMatrix(x_test), y_test, qids_test,
                ConvertDMatrix(x_train),  y_train,  qids_train, threshold=0.0)
end

@dmoliveira
Author

@ExpandingMan @trivialfis do you have any idea what could be happening?

@ExpandingMan
Collaborator

Empirical observation of the results is really not useful to me here. To take action, I would need to know specifically what the discrepancy is between the elements of the constructed DMatrix and what you would expect from it. It seems highly unlikely that something is going wrong once the data gets into the DMatrix, because one would have to explain why it does not manifest itself in every wrapper of xgboost.

The DMatrix can be inspected since whichever patch added XGDMatrixGetDataAsCSR, so it should be possible to see whatever your issue is here, I just have no a priori idea of what that might be based on your posts above.

@dmoliveira
Author

@ExpandingMan sure. I attempted some experiments to address the issue, but I concur that we require a more pragmatic approach. Unfortunately, I'm currently pressed for time, but I'm eager to investigate the root cause. As I use these models for production, which caters to millions of users, I'm stuck on version v1 rather than v2. I'll return when I have more availability, and perhaps we can schedule a brief discussion then.

@trivialfis
Member

Apologies for missing the ping. I can try to reproduce it once I finish some other experiments.

@trivialfis
Member

I have tested XGBoost.jl 2.5.1; the issue seems to be fixed there?


3 participants