docs: clarification on DMatrix CSR return #8459

ExpandingMan · 2022-11-11T18:23:45Z

I'm in the process of implementing the new function XGDMatrixGetDataAsCSR in the XGBoost.jl Julia wrapper.

I'm a bit confused about what this is supposed to be returning in the presence of null values. Values passed to the DMatrix as NaN32 do not show up in the sparse matrix, but values passed as 0.0f0 are there. This raises the question of what is the default value of the resulting sparse matrix: is it null or zero?

Both options seem to present problems:

If the former, there is a major bug in the Python implementation which returns a scipy.sparse.csr_matrix which has default values of 0.
If the latter, either the DMatrix constructors or the matrix returned by XGDMatrixGetDataAsCSR is faulty because NaN values are getting converted to 0's somewhere.

Can someone provide clarification on the intended behavior? Thanks!


trivialfis · 2022-11-12T22:15:13Z

Hi, the DMatrix constructor has a parameter called missing, which defaults to NaN. Any entry equal to the missing value is removed during construction of the DMatrix.

> This raises the question of what is the default value of the resulting sparse matrix: is it null or zero?

It's the argument passed to missing. Any floating point number is valid.

> If the former, there is a major bug in the Python implementation which returns a scipy.sparse.csr_matrix which has default values of 0.

from scipy import sparse
import xgboost as xgb
from numpy import testing
import numpy as np

values = [0.0, 1.0, 0.0]
indptr = [0, 3]
indices = [0, 1, 2]
csr0 = sparse.csr_matrix((values, indices, indptr), shape=(1, 3))

# 0s are preserved
testing.assert_allclose(csr0.data, np.array(values))
X = xgb.DMatrix(csr0, missing=np.nan)
csr1 = X.get_data()
# 0s are preserved
testing.assert_allclose(np.array(values, dtype=np.float32), csr1.data)

ExpandingMan · 2022-11-13T17:31:49Z

I'm a bit confused. The default value for sparse.csr_matrix is always 0, is it not? If that's the case, the returned csr_matrix will not match the input data unless the default value passed on construction is 0. Indeed it looks to me like in the code you provided you are only comparing the non-zero value data and not checking the elements of the matrix which are not given explicitly as part of values.

Also, is there a way to retrieve the default value from the DMatrix? I don't see a function that does that in the API.

Addendum: The documentation for csr_matrix also does not seem to give any indication that there is a way to change the default value from 0.

trivialfis · 2022-11-15T02:45:45Z

Let's distinguish "zero" from "missing". Zero means the real number 0 on the number line, while missing means the values omitted from a sparse matrix, which may or may not be 0. Another way of saying missing is "unknown" or "invalid". In a dense matrix, one can represent "unknown" with any value: NaN, None, Inf, 0, -1, or whatever the practitioner sees fit. It's a placeholder symbol. In a sparse matrix, the entry is simply omitted.

For instance, given a dataset that contains a predictor storing the temperature of different locations, 0 is a valid temperature value. For locations where we haven't been able to collect the temperature, we might express it as NaN or Inf in a table:

import numpy as np
import pandas as pd
from xgboost import DMatrix

# We don't know the temperature of location 4, use NaN to represent this piece of missing information.
X = pd.DataFrame(
    {
        "temperature": [0, 20, 13, 4, np.nan],
        "location_code": [0, 1, 2, 3, 4],
        "has_dog": [True, False, False, True, True],
    }
)

m = DMatrix(X, missing=np.nan)

In the above example, whether a value is missing depends on whether it represents an unknown value, not on whether it is 0.

> The default value for sparse.csr_matrix is always 0, is it not

It can be something else, like NaN as in the example in my previous comment. 0 is a valid value for both the csr_matrix and the DMatrix in that example.

> Indeed it looks to me like in the code you provided you are only comparing the non-zero value data and

I was comparing 0 values as well; you can print out the content of the CSR matrix if you wish. But indeed, I'm not comparing the "unknown" values directly. Anything that's not in the input csr_matrix is unknown.

> not checking the elements of the matrix which are not given explicitly as part of values.

Indeed. But I'm not sure what there is to compare. The known values are correctly returned, and the rest of the values are unknown. The two sets partition the matrix, so if the known values are correct, the unknown values are correct as well.

> Also, is there a way to retrieve the default value from the DMatrix?

No. Only the user who created the dataset knows what value should be considered unknown/missing. It's just a symbol with one meaning: we don't know the value in those entries, and the machine learning algorithm should take that into consideration when building the model.

> documentation for csr_matrix also does not seem to give any indication that there is a way to change the default value from 0.

Indeed, it's a hidden property. I picked the scipy CSR as the return value only because it's the most used CSR implementation in Python and scipy is already a hard dependency of XGBoost. There can be better choices with an option for the user to specify what's valid and what's not instead of obeying the tradition of using 0. Since you are using Julia, you can safely ignore the scipy quirks.

In short, just think of DMatrix as a generalized CSR matrix for which the user can specify what value to remove.

import numpy as np
from xgboost import DMatrix

# A dense matrix with shape `(1, 6)` where `-1` means that the entry is invalid/unknown/missing.
# Note the nested list: a flat list would give a 1-D array of shape `(6,)` instead.
array = np.array([[1, 2, 3, 4, 5, -1]])
# m is a CSR matrix that stores values `[1, 2, 3, 4, 5]` and shape `(1, 6)` with the value -1 being omitted.
m = DMatrix(array, missing=-1)
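
To make the "generalized CSR" idea concrete, here is a minimal pure-Python sketch of a dense-to-CSR conversion with a user-chosen missing marker. The helper `dense_to_csr` is hypothetical and written only for illustration. It is not XGBoost's implementation, and it models only the semantics described in this comment.

```python
import math


def dense_to_csr(rows, missing=math.nan):
    """Convert a dense list of rows to CSR triplets, omitting `missing` entries.

    Hypothetical helper for illustration; not part of XGBoost.
    """
    def is_missing(x):
        # NaN never compares equal to itself, so it needs a dedicated check.
        if isinstance(missing, float) and math.isnan(missing):
            return isinstance(x, float) and math.isnan(x)
        return x == missing

    values, indices, indptr = [], [], [0]
    for row in rows:
        for col, x in enumerate(row):
            if not is_missing(x):
                values.append(x)
                indices.append(col)
        # Each row ends where the next one begins.
        indptr.append(len(values))
    return values, indices, indptr


# -1 marks unknown entries: it is dropped, everything else is stored.
print(dense_to_csr([[1, 2, 3, 4, 5, -1]], missing=-1))
# NaN marks unknown entries: the explicit 0.0 is kept as a real value.
print(dense_to_csr([[0.0, math.nan, 1.0]]))
```

In this sketch a stored 0.0 and an omitted entry are different things, which is the distinction between "zero" and "missing" made above.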

ExpandingMan · 2022-11-15T14:44:04Z

Thanks for the clarification.

Just to confirm, are you saying that it was intentional that the returned scipy sparse matrix may return values other than the default missing value? To be abundantly clear, when constructing such a matrix

import scipy.sparse

v = [5, 8, 3, 6]
colidx = [0,1,2,1]
# Row pointer array (scipy calls it indptr): row i owns v[rowidx[i]:rowidx[i+1]].
rowidx = [0,1,2,3,4]

A = scipy.sparse.csr_matrix((v, colidx, rowidx), (4,4))

Indexing an entry not listed in v returns 0, for example A[1,0] == 0. I think this is the source of my confusion: I was not expecting this to be inconsistent with the default values of the constructed DMatrix. In this case the matrix returned from get_data differs from the matrix used to construct the DMatrix (even in the sense of not satisfying == for each element, not only in that they may have different types).
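
The round-trip mismatch described here can be reproduced without any library. The sketch below uses a hypothetical helper, `csr_to_dense`, that expands CSR triplets with a caller-chosen fill value. It assumes a dense row `[0.0, NaN, 1.0]` whose NaN was dropped as missing, and is an illustration only, not how scipy or XGBoost are implemented.

```python
import math


def csr_to_dense(values, indices, indptr, ncols, fill):
    """Expand CSR triplets to a dense list of rows, using `fill` for omitted entries."""
    dense = []
    for r in range(len(indptr) - 1):
        row = [fill] * ncols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = values[k]
        dense.append(row)
    return dense


# Assumed original dense row: [0.0, NaN, 1.0]. The NaN was dropped as missing,
# leaving these CSR triplets.
values, indices, indptr = [0.0, 1.0], [0, 2], [0, 2]

zero_filled = csr_to_dense(values, indices, indptr, ncols=3, fill=0.0)
nan_filled = csr_to_dense(values, indices, indptr, ncols=3, fill=math.nan)
print(zero_filled)  # the omitted entry reads back as 0.0
print(nan_filled)   # the omitted entry reads back as NaN
```

With a fill value of 0 the omitted entry becomes indistinguishable from the stored 0.0, so the dense result no longer matches the original input element by element.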

trivialfis · 2022-11-17T02:03:51Z

> indexing by values not listed in v, for example A[1,0] == 0

Indeed, this is inconsistent with the semantics of DMatrix. If A[1, 0] is missing, then indexing into A[1, 0] should throw an error instead of returning 0, if the semantics were respected.
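
As a rough sketch of what "respecting the semantics" could look like, the hypothetical lookup below raises an error for entries that are not stored instead of returning 0. `csr_get` is an illustrative name, not an XGBoost or scipy function, and the triplets are the ones from the 4x4 example in the previous comment.

```python
def csr_get(values, indices, indptr, row, col):
    """Return the stored value at (row, col), or raise KeyError if it is omitted."""
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return values[k]
    raise KeyError(f"entry ({row}, {col}) is missing")


# CSR triplets from the 4x4 example: values, column indices, row pointers.
v = [5, 8, 3, 6]
colidx = [0, 1, 2, 1]
rowidx = [0, 1, 2, 3, 4]

print(csr_get(v, colidx, rowidx, 1, 1))  # stored entry
try:
    csr_get(v, colidx, rowidx, 1, 0)  # not stored, so treated as unknown
except KeyError as err:
    print("missing:", err)
```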

trivialfis · 2022-11-21T09:06:14Z

Closing now. We can revisit if there's a better alternative to scipy sparse.

ExpandingMan mentioned this issue Nov 11, 2022

DMatrix is now an AbstractMatrix dmlc/XGBoost.jl#136

Merged

trivialfis closed this as completed Nov 21, 2022

ExpandingMan mentioned this issue Jan 13, 2023

what is the role of sparse DMatrix constructors? dmlc/XGBoost.jl#160

Open
