docs: clarification on DMatrix CSR return #8459

Closed
ExpandingMan opened this issue Nov 11, 2022 · 6 comments

@ExpandingMan

I'm in the process of implementing the new function XGDMatrixGetDataAsCSR in the XGBoost.jl Julia wrapper.

I'm a bit confused about what this is supposed to be returning in the presence of null values. Values passed to the DMatrix as NaN32 do not show up in the sparse matrix, but values passed as 0.0f0 are there. This raises the question of what is the default value of the resulting sparse matrix: is it null or zero?

Both options seem to present problems:

  • If the former, there is a major bug in the Python implementation which returns a scipy.sparse.csr_matrix which has default values of 0.
  • If the latter, either the DMatrix constructors or the matrix returned by XGDMatrixGetDataAsCSR is faulty because NaN values are getting converted to 0's somewhere.

Can someone provide clarification on the intended behavior? Thanks!

@trivialfis
Member

trivialfis commented Nov 12, 2022

Hi, the DMatrix constructor has a parameter called missing, which defaults to NaN. Any entry equal to the missing value is removed during construction of the DMatrix.

> This raises the question of what is the default value of the resulting sparse matrix: is it null or zero?

It's the argument passed to missing. Any floating point number is valid.

> If the former, there is a major bug in the Python implementation which returns a scipy.sparse.csr_matrix which has default values of 0.

from scipy import sparse
import xgboost as xgb
from numpy import testing
import numpy as np

values = [0.0, 1.0, 0.0]
indptr = [0, 3]
indices = [0, 1, 2]
csr0 = sparse.csr_matrix((values, indices, indptr), shape=(1, 3))

# 0s are preserved
testing.assert_allclose(csr0.data, np.array(values))
X = xgb.DMatrix(csr0, missing=np.nan)
csr1 = X.get_data()
# 0s are preserved
testing.assert_allclose(np.array(values, dtype=np.float32), csr1.data)

@ExpandingMan
Author

ExpandingMan commented Nov 13, 2022

I'm a bit confused. The default value for sparse.csr_matrix is always 0, is it not? If that's the case, the returned csr_matrix will not match the input data unless the default value passed on construction is 0. Indeed it looks to me like in the code you provided you are only comparing the non-zero value data and not checking the elements of the matrix which are not given explicitly as part of values.

Also, is there a way to retrieve the default value from the DMatrix? I don't see a function that does that in the API.

Addendum: The documentation for csr_matrix also does not seem to give any indication that there is a way to change the default value from 0.

@trivialfis
Member

trivialfis commented Nov 15, 2022

Let's distinguish "zero" and "missing". Zero means the real number 0 on the number line, while missing means the values omitted from a sparse matrix, which may or may not be 0. Another way of saying missing is "unknown" or "invalid". With a dense matrix, one can represent the concept of "unknown" using any value: NaN, None, Inf, 0, -1, or whatever the practitioner sees fit. It's a placeholder symbol. In a sparse matrix, the entry is simply omitted.

For instance, given a dataset that contains a predictor storing the temperature of different locations, 0 is a valid temperature value. For locations where we haven't been able to collect the temperature, we might express it as NaN or Inf in a table:

import numpy as np
import pandas as pd
from xgboost import DMatrix

# We don't know the temperature of location 4, use NaN to represent this piece of missing information.
X = pd.DataFrame(
    {
        "temperature": [0, 20, 13, 4, np.nan],
        "location_code": [0, 1, 2, 3, 4],
        "has_dog": [True, False, False, True, True],
    }
)

m = DMatrix(X, missing=np.nan)

In the above example, whether a value is missing depends on whether it represents an unknown quantity, not on whether it equals 0.

> The default value for sparse.csr_matrix is always 0, is it not

It can be something else, like NaN in the example from my previous comment, where 0 is a valid value for both the csr_matrix and the DMatrix.

> Indeed it looks to me like in the code you provided you are only comparing the non-zero value data and

I was comparing 0 values as well; you can print out the content of the CSR matrix if you wish. But indeed I'm not comparing the "unknown" values directly. Anything that's not in the input csr_matrix is unknown.

> not checking the elements of the matrix which are not given explicitly as part of values.

Indeed. But I'm not sure what there is to compare. The known values are correctly returned, and the rest of the values are unknown. Known and unknown entries form a partition: if the known values are correct, the unknown values are also correct.
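
One way to make the partition argument concrete is to compare the stored positions as well as the stored values. The sketch below works on raw `(indptr, indices, data)` triples in plain Python; `stored_entries` and `same_known_values` are hypothetical helper names introduced for illustration, not part of XGBoost or scipy.

```python
def stored_entries(indptr, indices, data):
    """Map each stored (row, col) position to its value."""
    entries = {}
    for row, (start, stop) in enumerate(zip(indptr, indptr[1:])):
        for col, value in zip(indices[start:stop], data[start:stop]):
            entries[(row, col)] = value
    return entries


def same_known_values(csr_a, csr_b):
    """True if both triples store the same values at the same positions."""
    return stored_entries(*csr_a) == stored_entries(*csr_b)


# The 1 x 3 matrix from the earlier example, with two explicit zeros.
original = ([0, 3], [0, 1, 2], [0.0, 1.0, 0.0])
returned = ([0, 3], [0, 1, 2], [0.0, 1.0, 0.0])
# A variant that lost the explicit zero at position (0, 0).
shifted = ([0, 2], [1, 2], [1.0, 0.0])

print(same_known_values(original, returned))
print(same_known_values(original, shifted))
```

If the known positions and values agree, every remaining position is unknown in both matrices, so nothing else needs to be compared.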

> Also, is there a way to retrieve the default value from the DMatrix?

No. Only the user who created the dataset knows what value should be considered unknown/missing. It's just a symbol with one meaning: we don't know the value in those entries, and the machine learning algorithm should take that into consideration when building the model.

> documentation for csr_matrix also does not seem to give any indication that there is a way to change the default value from 0.

Indeed, it's a hidden property. I picked the scipy CSR as the return value only because it's the most used CSR implementation in Python and scipy is already a hard dependency of XGBoost. There could be better choices that let the user specify what's valid and what's not, instead of following the tradition of using 0. Since you are using Julia, you can safely ignore the scipy quirks.

In short, treat DMatrix as a generalized CSR matrix for which the user specifies which value to remove.

import numpy as np
from xgboost import DMatrix

# A dense matrix with shape `(1, 6)` where `-1` means that the entry is invalid/unknown/missing.
array = np.array([[1, 2, 3, 4, 5, -1]])
# m is a CSR matrix that stores values `[1, 2, 3, 4, 5]` and shape `(1, 6)` with the value -1 being omitted.
m = DMatrix(array, missing=-1)
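
The "generalized CSR" idea can be sketched in plain Python, independent of XGBoost. The helpers below (`is_missing`, `dense_to_csr`, `csr_to_dense`) are hypothetical names used only for illustration; this is a minimal model of the semantics described above, not XGBoost's implementation.

```python
import math


def is_missing(value, missing):
    """True if `value` matches the missing marker (NaN never equals itself)."""
    if math.isnan(missing):
        return math.isnan(value)
    return value == missing


def dense_to_csr(rows, missing):
    """Drop entries equal to `missing`; keep everything else, including 0."""
    indptr, indices, data = [0], [], []
    for row in rows:
        for col, value in enumerate(row):
            if not is_missing(value, missing):
                indices.append(col)
                data.append(value)
        indptr.append(len(data))
    return indptr, indices, data


def csr_to_dense(indptr, indices, data, n_cols, fill):
    """Rebuild dense rows, writing `fill` wherever no entry is stored."""
    rows = []
    for start, stop in zip(indptr, indptr[1:]):
        row = [fill] * n_cols
        for col, value in zip(indices[start:stop], data[start:stop]):
            row[col] = value
        rows.append(row)
    return rows


dense = [[1, 2, 3, 4, 5, -1], [0, -1, 0, 7, -1, 9]]
indptr, indices, data = dense_to_csr(dense, missing=-1)

# Filling with the original marker recovers the input exactly.
restored = csr_to_dense(indptr, indices, data, n_cols=6, fill=-1)
# Filling with 0 (the scipy convention) merges unknowns with real zeros.
zero_filled = csr_to_dense(indptr, indices, data, n_cols=6, fill=0)
```

The round trip only recovers the input when the caller supplies the same marker again, which matches the point that the CSR data itself does not record what the missing value was.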

@ExpandingMan
Author

ExpandingMan commented Nov 15, 2022

Thanks for the clarification.

Just to confirm, are you saying that it was intentional that the returned scipy sparse matrix may return values other than the default missing value? To be abundantly clear, when constructing such a matrix

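import scipy.sparse

v = [5, 8, 3, 6]          # stored values
colidx = [0, 1, 2, 1]     # column index of each stored value
rowidx = [0, 1, 2, 3, 4]  # row pointers: row i spans v[rowidx[i]:rowidx[i + 1]]

A = scipy.sparse.csr_matrix((v, colidx, rowidx), shape=(4, 4))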

indexing by values not listed in v, for example A[1,0] == 0. I think this is the source of my confusion: I was not expecting this to be inconsistent with the default values from the constructed DMatrix. In this case the matrix returned from get_data is different from the matrix used to construct the DMatrix, in the sense that individual elements fail ==, not merely that they have different types.

@trivialfis
Member

> indexing by values not listed in v, for example A[1,0] == 0

Indeed, this is inconsistent with the semantics of DMatrix. If A[1, 0] is missing, then indexing into A[1, 0] should throw an error instead of returning 0, if the semantics were respected.
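
A lookup that respects these semantics could be sketched as follows, working directly on the three CSR arrays from the example above. `strict_get` is a hypothetical helper written for illustration; neither scipy nor XGBoost provides it.

```python
def strict_get(indptr, indices, data, row, col):
    """Return the stored value at (row, col), or raise if no entry is stored."""
    start, stop = indptr[row], indptr[row + 1]
    for pos in range(start, stop):
        if indices[pos] == col:
            return data[pos]
    raise KeyError(f"entry ({row}, {col}) is missing")


v = [5, 8, 3, 6]
colidx = [0, 1, 2, 1]
rowidx = [0, 1, 2, 3, 4]

# A stored entry is returned as-is.
value = strict_get(rowidx, colidx, v, 1, 1)

# An entry with no stored value raises instead of silently yielding 0.
try:
    strict_get(rowidx, colidx, v, 1, 0)
    raised = False
except KeyError:
    raised = True
```

A wrapper in another language, such as the Julia one discussed here, could follow the same pattern and choose its own representation for missing entries.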

@trivialfis
Member

Closing now. We can revisit if there's a better alternative to scipy sparse.

2 participants