
Fix the nx_char type for numpy to and . #554

Open · wants to merge 4 commits into master
Conversation

@RubelMozumder (Collaborator) commented Feb 18, 2025

@lukaspie, you can also check this issue here.

@lukaspie (Collaborator) commented:

What is this solving? The linked issue is about integers, this is about strings/bytes.

Why don't we allow chararrays anymore?

@rettigl (Collaborator) commented Feb 18, 2025

Please see #555 and the way of solving this discussed there. We should drop the checks on array types generally and check the array dtype instead.

@RubelMozumder (Collaborator, Author) commented:

> What is this solving? The linked issue is about integers, this is about strings/bytes.
>
> Why don't we allow chararrays anymore?

chararray is a data structure like ndarray which stores data according to its primitive datatype. With the current implementation, we are not checking the data structure itself (list or ndarray), but rather the primitive types of that ndarray, such as int, bytes_, float, str_, bool.
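
For illustration, here is a minimal sketch (not the code from this PR) of what checking the primitive element type instead of the container type could look like; the helper name and the NX_CHAR type mapping are assumptions:

```python
import numpy as np

# Assumed mapping of NX_CHAR to Python/numpy scalar types (illustrative only).
NX_CHAR_TYPES = (str, np.str_, np.bytes_)

def elements_match_nx_char(value) -> bool:
    """Check the primitive element type, not the container (hypothetical helper)."""
    if isinstance(value, np.ndarray):
        # np.chararray is a subclass of np.ndarray, so checking the dtype kind
        # ('U' = unicode string, 'S' = byte string) covers both containers.
        return value.dtype.kind in ("U", "S")
    if isinstance(value, (list, tuple)):
        return all(isinstance(item, NX_CHAR_TYPES) for item in value)
    return isinstance(value, NX_CHAR_TYPES)

print(elements_match_nx_char(np.array(["a", "b"])))  # True, dtype kind 'U'
print(elements_match_nx_char(np.array([1, 2])))      # False, dtype kind 'i'
```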

@RubelMozumder (Collaborator, Author) commented:

@GinzburgLev

Can you remember which issue was handled in the modification https://github.com/FAIRmat-NFDI/pynxtools/blame/nx_char_type/src/pynxtools/dataconverter/helpers.py#L729?
Currently, this code converts an integer to a string and skips the validation warning.

Then I can also test my code for that issue as well.

@GinzburgLev (Contributor) commented:

> @GinzburgLev
>
> Can you remember which issue was handled in the modification https://github.com/FAIRmat-NFDI/pynxtools/blame/nx_char_type/src/pynxtools/dataconverter/helpers.py#L729? Currently, this code converts an integer to a string and skips the validation warning.
>
> Then I can also test my code for that issue as well.

This change was addressing the following issue: #393

The example from this issue should reproduce it: a NeXus file with one of the datasets of type NX_FLOAT filled with the integer 0 (such as the mpes example upload file, with sp.binned.attrs["metadata"]["energy_calibration"]["tof"] = 0 instead of 0.0). The is_valid_data_field() function in helpers.py successfully converted it to float(0.0), but this result was later interpreted as False (as in, failure to convert) by whatever was calling is_valid_data_field(). I can send you the files I used for testing (too big to be attached here).
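
To make the described pitfall concrete, here is a small, hypothetical illustration (plain Python, not the actual pynxtools code) of how a successful conversion to 0.0 can be mistaken for a failure when the caller only checks truthiness:

```python
def convert_to_float(value):
    """Hypothetical conversion helper that returns the converted value, or False on failure."""
    try:
        return float(value)          # the integer 0 converts fine -> 0.0
    except (TypeError, ValueError):
        return False                 # conversion really failed

result = convert_to_float(0)         # 0.0, a successful conversion
if not result:                       # but 0.0 is falsy, so this branch fires anyway
    print("treated as a failed conversion, although 0 -> 0.0 succeeded")

# Returning an explicit (success, value) pair, or testing `result is False`,
# keeps a valid 0.0 distinguishable from an actual failure.
```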

@RubelMozumder (Collaborator, Author) commented:

> @GinzburgLev
> Can you remember which issue was handled in the modification https://github.com/FAIRmat-NFDI/pynxtools/blame/nx_char_type/src/pynxtools/dataconverter/helpers.py#L729? Currently, this code converts an integer to a string and skips the validation warning.
> Then I can also test my code for that issue as well.

> This change was addressing the following issue: #393
>
> The example from this issue should reproduce it: a NeXus file with one of the datasets of type NX_FLOAT filled with the integer 0 (such as the mpes example upload file, with sp.binned.attrs["metadata"]["energy_calibration"]["tof"] = 0 instead of 0.0). The is_valid_data_field() function in helpers.py successfully converted it to float(0.0), but this result was later interpreted as False (as in, failure to convert) by whatever was calling is_valid_data_field(). I can send you the files I used for testing (too big to be attached here).

Thanks, I see it is not from your code. Can you help me to reproduce this error?

@GinzburgLev (Contributor) commented Feb 19, 2025

> Then I can also test my code for that issue as well.
>
> This change was addressing the following issue: #393
> The example from this issue should reproduce it: a NeXus file with one of the datasets of type NX_FLOAT filled with the integer 0 (such as the mpes example upload file, with sp.binned.attrs["metadata"]["energy_calibration"]["tof"] = 0 instead of 0.0). The is_valid_data_field() function in helpers.py successfully converted it to float(0.0), but this result was later interpreted as False (as in, failure to convert) by whatever was calling is_valid_data_field(). I can send you the files I used for testing (too big to be attached here).
>
> Thanks, I see it is not from your code. Can you help me to reproduce this error?

In the version of pynxtools that we had back then, I used dataconverter in the following way:

dataconverter --reader mpes --nxdl NXmpes raw_files/mZlhpEzOQY6Ms4hJbIIu3g/config_file.json raw_files/mZlhpEzOQY6Ms4hJbIIu3g/MoTe2_float.h5

dataconverter --reader mpes --nxdl NXmpes raw_files/mZlhpEzOQY6Ms4hJbIIu3g/config_file.json raw_files/mZlhpEzOQY6Ms4hJbIIu3g/MoTe2_int.h5

The first one worked fine, no errors; the second gave something like "Field /ENTRY[entry]/PROCESS_MPES[process]/energy_calibration/original_axis written without documentation." (aside from the usual messages). The resulting file was missing the corresponding field (or its value, I am not sure anymore) (output.nxs/entry/process/energy_calibration/original_axis).

The issue was present before this PR was merged:
#522

@RubelMozumder (Collaborator, Author) commented Feb 20, 2025

Currently, the verification of the NX data type no longer converts the data to another data type; instead, it raises a warning for any data type inconsistency. For example, the value 2.0 for NX_INT would not be converted to 2, and vice versa. In NeXus we have the specific data types NX_INT and NX_FLOAT for this case. If 2.0 and 2 are to be treated as equivalent, one can use NX_NUMBER instead.

Such conversion creates issues in some cases, e.g. numbers > 1 being converted to True, or a string such as "2" being converted to the int/float 2/2.0; the validator then silently passes them without any warning messages. Such an issue was also observed in #555. In that issue, a vector attribute has been defined as NX_CHAR in the application definition, which is incorrect. As the validator does not create any warning message here, nobody can detect that error coming from the application definition.
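
As a plain-Python illustration of the silent coercions described above (this is standard Python/numpy behaviour, not pynxtools code), all of the following "succeed" even though the input does not match the intended type:

```python
import numpy as np

# Each of these coercions succeeds silently; without an explicit type check,
# a validator relying on conversion alone would accept all of them.
print(bool(2))                          # True  -> any number > 1 becomes a "valid" bool
print(float("2"))                       # 2.0   -> a string becomes a "valid" float
print(int("2"))                         # 2     -> a string becomes a "valid" int
print(str(np.array([1.0, 0.0, 0.0])))   # "[1. 0. 0.]" -> a vector could pass an NX_CHAR check based on str()
```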

@RubelMozumder linked an issue Feb 20, 2025 that may be closed by this pull request
@RubelMozumder (Collaborator, Author) commented:

  • Nyaml needs to be fixed and merged before merging this PR, so that we can circumvent the errors.

@lukaspie (Collaborator) left a comment:


LGTM, thanks for the fix. Left some small comments.

Let's wait with the merge until we have the correct definitions here and the plugin tests pass.

Comment on lines +599 to +600
# Not to be confused with `np.byte` and `np.ubyte`, these store
# and integer of `8bit` and `unsigned 8bit` respectively.
(Collaborator):

Suggested change:
- # Not to be confused with `np.byte` and `np.ubyte`, these store
- # and integer of `8bit` and `unsigned 8bit` respectively.
+ # Not to be confused with `np.byte` and `np.ubyte`, these store
+ # integers of `8bit` and `unsigned 8bit` respectively.

What exactly is not to be confused? Maybe write "np.xxx is not to be confused..."
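
For context on what the comment is distinguishing (standard numpy facts, not code from this PR): np.byte and np.ubyte are 8-bit integer types, whereas np.bytes_ is the byte-string type relevant for NX_CHAR.

```python
import numpy as np

print(np.byte is np.int8)            # True  -> signed 8-bit integer
print(np.ubyte is np.uint8)          # True  -> unsigned 8-bit integer
print(issubclass(np.bytes_, bytes))  # True  -> a byte string, not an integer
```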

    return False

def check_all_children_for_callable(
    objects: Union[list, np.ndarray],
    checker: Optional[Callable] = None,
(Collaborator):

I suggest renaming this to callable.

Comment on lines +712 to +713
This function also converts bool value comes in str format. In case, it fails to
convert, it raises an Exception.
(Collaborator):

Suggested change:
- This function also converts bool value comes in str format. In case, it fails to
- convert, it raises an Exception.
+ This function also converts boolean value that are given as strings (i.e., "True" to True).

It doesn't really raise an Exception, but just a ValidationProblem.InvalidDatetime warning. What were you trying to say here?

    collector.collect_and_log(
        path, ValidationProblem.InvalidType, accepted_types, nxdl_type
    )
    return False, value
(Collaborator):

This return value of False here is interpreted as "undocumented field" in the calling function, so we get an additional, wrong warning if the dtype does not match.

(Collaborator, Author):

I see that it is just for internal logic (please see lines 750-766 in validation.py). In is_documented it checks the datatype and the node type (field or group), but the real type of the errors is collected and printed according to the validation error type in and from the `collect` function.

(Collaborator):

I don't understand what you are trying to say here. My point is that is_documented returns False if the datatype does not match, which I would say is wrong, because it is documented, but just has the wrong data type.

(Collaborator):

This behavior was btw. introduced in #522, I believe. Maybe @GinzburgLev or @lukaspie can comment?

Comment on lines +715 to +717
Returns two values:
boolean (True if the the value corresponds to nxdl_type, False otherwise)
converted_value bool value.
(Collaborator):

Add typing annotation. The calling function expects a single bool.
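
A sketch of what the requested annotation could look like, assuming the (bool, converted value) return described in the docstring; the parameter names here are placeholders, not necessarily the real signature:

```python
from typing import Any, Tuple

def is_valid_data_field(value: Any, nxdl_type: str, path: str) -> Tuple[bool, Any]:
    """Return (is_valid, possibly converted value); placeholder signature for illustration."""
    ...
```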

(Collaborator):

The return handling is correct; nevertheless, I suggest adding typing everywhere.

(Collaborator):

Adjust all occurrences of is_valid_data_field (e.g. line 552).

(Collaborator):

@RubelMozumder @sanbrock Returning False in L552 if the data type of a field does not fit effectively makes this field "undocumented". Is this intentional? The produced warnings, at least, are rather contradictory:

WARNING: The value at /ENTRY[entry]/data/@axes should be one of: (<class 'str'>, <class 'numpy.str_'>, <class 'numpy.bytes_'>), as defined in the NXDL as NX_CHAR.
WARNING: Field /ENTRY[entry]/data/@axes written without documentation.

(Collaborator):

@sanbrock @sherjeelshabih Also, apparently during this second round of checking for undocumented fields that were not found before, no checks for enumerations are done. I don't completely understand the logic of this checking, but I suggest that a check for enums should happen here as well.
Why can't we re-use the mechanism in recurse_tree here, i.e. the handling_map and corresponding functions?

    if not check(obj, *args):
        return False

def check_all_children_for_callable(
    objects: Union[list, np.ndarray],
(Collaborator):

We should extend this to all iterable data types (e.g. tuples, ...)
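
One possible way to cover arbitrary iterables (tuples, sets, generators, ...) instead of only list and np.ndarray; this is a hedged sketch of the idea, not the code in this PR, and the exact signature is assumed:

```python
from collections.abc import Iterable
from typing import Any, Callable, Optional, Union
import numpy as np

def check_all_children_for_callable(
    objects: Union[Iterable, np.ndarray],
    checker: Optional[Callable[..., bool]] = None,
    *args: Any,
) -> bool:
    """Apply `checker` to every element of any iterable container (sketch)."""
    if checker is None:
        return True
    if isinstance(objects, np.ndarray):
        objects = objects.ravel()  # flatten so each element is a scalar
    # Strings/bytes are iterable but should be treated as scalar values here.
    if isinstance(objects, (str, bytes)) or not isinstance(objects, Iterable):
        return bool(checker(objects, *args))
    return all(checker(obj, *args) for obj in objects)
```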

    tmp_arr = None
    if isinstance(objects, list):
        # Handles list and list of list
        tmp_arr = np.array(objects)
(Collaborator):

Will this work if the dtypes within a list are not the same?
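
The concern can be illustrated with plain numpy behaviour (not code from this PR): np.array() on a mixed-type list either promotes everything to a common dtype or falls back to dtype=object, so a single dtype check may not reflect the original element types.

```python
import numpy as np

print(np.array([1, 2.5]).dtype)        # float64 -> the int was silently promoted
print(np.array([1, "a"]).dtype)        # <U21    -> everything became a unicode string
print(np.array([1, {"a": 1}]).dtype)   # object  -> per-element checks would still be needed
```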

@rettigl (Collaborator) commented Feb 27, 2025

@RubelMozumder @sherjeelshabih I suggest we meet and discuss how to handle this PR and #557. It does not make sense to independently develop different solutions for the same problem.

@sherjeelshabih (Collaborator) commented:

Agreed. I already plan to merge what Rubel has here with some changes I want to keep in the other branch.

Successfully merging this pull request may close these issues: Datatype checking for array dtype.