docs: clarification on DMatrix CSR return #8459
Comments
Hi, DMatrix constructor has a parameter called
It's the argument passed to
from scipy import sparse
import xgboost as xgb
from numpy import testing
import numpy as np
values = [0.0, 1.0, 0.0]
indptr = [0, 3]
indices = [0, 1, 2]
csr0 = sparse.csr_matrix((values, indices, indptr), shape=(1, 3))
# 0s are preserved
testing.assert_allclose(csr0.data, np.array(values))
X = xgb.DMatrix(csr0, missing=np.nan)
csr1 = X.get_data()
# 0s are preserved
testing.assert_allclose(np.array(values, dtype=np.float32), csr1.data)
I'm a bit confused. The default value for
Also, is there a way to retrieve the default value from the
Addendum: The documentation for
Let's distinguish "zero" and "missing". Zero means the real number 0 on the number line, while
For instance, given a dataset that contains a predictor storing the temperature of different locations, 0 is a valid temperature value. For locations where we haven't been able to collect the temperature, we might express it as
import numpy as np
import pandas as pd
from xgboost import DMatrix

# We don't know the temperature of location 4, so use NaN to represent this piece of missing information.
X = pd.DataFrame(
    {
        "temperature": [0, 20, 13, 4, np.nan],
        "location_code": [0, 1, 2, 3, 4],
        "has_dog": [True, False, False, True, True],
    }
)
m = DMatrix(X, missing=np.nan)
From the above example, whether a value is missing is determined by whether it represents an unknown value, not by whether it is 0.
It can be something else like
I was comparing 0 values as well; you can print out the content of the CSR matrix if you wish. But indeed I'm not comparing the "unknown" value directly: anything that's not in the input csr_matrix is unknown.
Indeed. But I'm not sure what there is to compare. The known values are correctly returned, and the rest of the values are unknown. Together they form a partition: if the known values are correct, the unknown values are also correct.
No. Only the user who created the dataset knows what value should be considered unknown/missing. It's just a symbol with one meaning: we don't know the value in those entries, and the machine learning algorithm should take that into consideration when building the model.
Indeed, it's a hidden property. I picked the scipy CSR as the return value only because it's the most used CSR implementation in Python and scipy is already a hard dependency of XGBoost. There can be better choices with an option for the user to specify what's valid and what's not instead of obeying the tradition of using 0. Since you are using Julia, you can safely ignore the scipy quirks. In short, just take
# A dense matrix with shape `(1, 6)` where `-1` means that the entry is invalid/unknown/missing.
array = np.array([[1, 2, 3, 4, 5, -1]])
# m is a CSR matrix that stores values `[1, 2, 3, 4, 5]` and shape `(1, 6)` with the value -1 being omitted.
m = DMatrix(array, missing=-1)
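To make the "omit the missing value, keep the zeros" rule concrete, here is a minimal pure-Python sketch of a dense-to-CSR conversion. The function `dense_to_csr` is a hypothetical helper written only for illustration; it is not part of XGBoost or scipy and does not claim to mirror XGBoost's internal implementation.

```python
import math


def dense_to_csr(rows, missing):
    """Convert a dense list of rows to CSR arrays, dropping `missing` entries.

    Hypothetical helper for illustration only. Zeros are stored like any
    other known value; only entries equal to `missing` are omitted.
    """

    def is_missing(x):
        # NaN never compares equal to itself, so it needs a dedicated check.
        if isinstance(missing, float) and math.isnan(missing):
            return isinstance(x, float) and math.isnan(x)
        return x == missing

    values, indices, indptr = [], [], [0]
    for row in rows:
        for col, x in enumerate(row):
            if not is_missing(x):
                values.append(x)
                indices.append(col)
        # Each row ends where the running count of stored values stands.
        indptr.append(len(values))
    return values, indices, indptr


# Same data as the example above: -1 marks the unknown entry.
values, indices, indptr = dense_to_csr([[1, 2, 3, 4, 5, -1]], missing=-1)

# With NaN as the missing marker, the explicit 0.0 is kept and NaN is dropped.
nan_values, nan_indices, nan_indptr = dense_to_csr(
    [[0.0, float("nan"), 1.0]], missing=float("nan")
)
```

In this sketch an entry absent from `indices` means "unknown", not "zero", which is the reading of the CSR arrays described in this thread.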
Thanks for the clarification. Just to confirm, are you saying that it was intentional that the returned
import scipy.sparse
v = [5, 8, 3, 6]
colidx = [0, 1, 2, 1]
# rowidx holds the CSR row pointers (indptr), not per-entry row indices.
rowidx = [0, 1, 2, 3, 4]
A = scipy.sparse.csr_matrix((v, colidx, rowidx), (4, 4))
indexing by values not listed in
Indeed, this is inconsistent with the semantics for DMatrix. If
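The inconsistency can be illustrated by expanding the same CSR arrays back to a dense row under two different conventions for unstored entries. This is a sketch under stated assumptions: `csr_to_dense` and its `fill` parameter are hypothetical names introduced here, not an XGBoost or scipy API.

```python
import math


def csr_to_dense(values, indices, indptr, n_cols, fill):
    """Expand CSR arrays to dense rows, using `fill` for unstored entries.

    Hypothetical helper for illustration only.
    """
    dense = []
    for r in range(len(indptr) - 1):
        row = [fill] * n_cols
        # Overwrite the fill value wherever an entry was explicitly stored.
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = values[k]
        dense.append(row)
    return dense


# One row with a stored 0.0 at column 0, nothing at column 1, and 1.0 at column 2.
values, indices, indptr = [0.0, 1.0], [0, 2], [0, 2]

# scipy convention: unstored entries read back as 0.
scipy_view = csr_to_dense(values, indices, indptr, 3, fill=0.0)

# DMatrix reading described in this thread: unstored entries are unknown.
dmatrix_view = csr_to_dense(values, indices, indptr, 3, fill=math.nan)
```

Under the zero-fill convention the stored 0.0 and the unstored entry become indistinguishable, whereas the NaN-fill reading keeps them apart.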
Closing now. We can revisit if there's a better alternative to scipy sparse.
I'm in the process of implementing the new function XGDMatrixGetDataAsCSR in the XGBoost.jl Julia wrapper.
I'm a bit confused about what this is supposed to be returning in the presence of null values. Values passed to the DMatrix as NaN32 do not show up in the sparse matrix, but values passed as 0.0f0 are there. This raises the question of what is the default value of the resulting sparse matrix: is it null or zero?
Both options seem to present problems:
scipy.sparse.csr_matrix which has default values of 0.
DMatrix constructors or the matrix returned by XGDMatrixGetDataAsCSR is faulty because NaN values are getting converted to 0's somewhere.
Can someone provide clarification on the intended behavior? Thanks!