NeXus `search_quantities` in NOMAD #542

lukaspie · 2025-02-07T15:01:31Z

Recently, we have started playing around with adding more and more search quantities in NOMAD (mostly in #525). In particular, this includes adding all NeXus attributes as NOMAD quantities (using the __<attribute_name> convention) and field aggregation statistics (using __mean, __var, etc.), with the idea of powering the search in the apps that we are developing.

However, we soon started running into some limitations. Most notably, the GUI becomes very slow for groups that contain many fields. A notable case is the XPS example, where the default NXdata group contains many cycle and scan repetitions (see image). Loading the data/data/ENTRY:0/data tab here takes up to 1 min due to lag in the GUI. Similar situations are expected in descriptions of microstructures, with potentially millions of instances of the same concept.

So, we are faced with a question: which search_quantities do we want to expose from the NeXus definitions in NOMAD? This issue is meant to summarize the current situation and the suggested ideas, and to store any results from upcoming discussions on this, e.g. in the TF meetings.

One suggestion (made by @sanbrock and @rettigl) is that we only make those concepts available for search that are specifically mentioned in the application definition. That includes all required and recommended elements, as well as the optional ones, but explicitly not those that are only defined in a base class and merely inherited by the application definition.
Comments so far on this:
1. This could be a good solution, at least for now. We must consider that the appdefs will likely need to get more comprehensive to be usable in NOMAD and will probably blow up a bit. For MPES, Laurenz and I already started to mention more terms explicitly in the application definition (including all the AXISNAME and DATA fields in NXdata), see MPES: new concepts from NIAC discussions, searchable fields nexus_definitions#329.
2. It is a bit strange to make the contents of the appdef depend on how we want to use them in NOMAD. Of course, adding more optional elements (as done above) is not a problem, but it feels slightly backwards to build the appdef for an experimental technique based on the search and visualization capabilities of NOMAD.
3. If a new use case for an appdef comes up, where more concepts from the base classes are to be used, you would need to blow up the appdef even more or make an extension for it (like NXxps extending NXmpes), to allow for more search_quantities that are relevant for this new use case. The second option raises a problem/question which I have been asking myself for a while: If an appdef extends another one (think NXxps extending NXmpes), is it possible to add more elements, or can they only specialize what is already in the super-appdef? If I have an NXfit class in NXxps, but not in NXmpes, is a file that implements NXxps with a fit even compliant with NXmpes anymore?
There is an alternative proposed by @mkuehbach: we ship a "concept filtration configuration" with pynxtools that explicitly states for each application definition what the searchable quantities should be. This would be a YAML/JSON file that defines a selected set of elements you can search for. This may include all or some of the appdef concepts, but could also include some of those from the base classes.
Comments so far on this:
1. This is of course another abstraction layer that the user is not aware of and that makes it difficult to understand which elements are searchable for which appdef.
2. Whenever the application definition or base classes change, we also need to update these configurations.
3. Such configurations should probably be shipped with the main pynxtools, not with the reader plugins. This avoids conflicts that could arise for two plugins writing to the same appdef (pynxtools-mpes/xps both write to NXmpes) or a plugin that can touch many appdefs (like pynxtools-igor).
4. We use a similar option already for the configuration of the multiformat reader. These are JSON files that map concepts defined in the vendor-specific files to the concepts in the appdef. For this, we have a CLI function that generates a template for a given application definition. We could likely reuse the Python code of that script to generate the default search_quantities config, i.e. all concepts defined in the appdef. This would then need to be customized.
5. Another option this approach could open up: we could use the same filter for exporting from NOMAD. That is, a pynxtools-adjacent tool takes an existing NeXus file/data archive in NOMAD and a filter map and exports a smaller NeXus file. This would add to the macro issue in NOMAD about exporting from NOMAD, going beyond just downloading the file.

We could also go for a combination of the two approaches: for specific appdefs (i.e., the ones designed by FAIRmat) you have a "concept filtration configuration", whereas for the other application definitions you can only search for what is in the application definition itself (this would be the default).
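To make the combined approach more concrete, here is a minimal sketch of how such a lookup could behave. Everything in it is an assumption for illustration: the config layout, the concept paths, and the function names `searchable_concepts` and `export_subset` are hypothetical and not part of pynxtools.

```python
import json

# Hypothetical "concept filtration configuration". The layout and the
# concept paths are made up for illustration only.
FILTER_CONFIG = json.loads("""
{
  "NXmpes": [
    "/ENTRY/INSTRUMENT/beam_probe/incident_energy",
    "/ENTRY/SAMPLE/name"
  ]
}
""")


def searchable_concepts(appdef, appdef_concepts, config=FILTER_CONFIG):
    """Return the concept paths that may become search quantities.

    If a filter configuration exists for the appdef, it is used.
    Otherwise the default applies: everything named in the appdef itself.
    """
    if appdef in config:
        return set(config[appdef])
    return set(appdef_concepts)


def export_subset(flat_data, allowed):
    """Reuse the same filter to export a smaller data set (comment 5 above)."""
    return {path: value for path, value in flat_data.items() if path in allowed}
```

An appdef without an entry in the config simply falls back to its own concepts, so only the appdefs with a shipped configuration would deviate from the default.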

Further comments:

Regarding the aggregation statistics: in order not to increase the number of quantities, we could bring NeXus fields to NOMAD as subsections. This was in fact implemented in the first version of the NexusParser (as there was no support for attributes). So, we could make a subsection for each numeric quantity, like 'energy__field'. This subsection would contain all its stats, and we could also bring all its attributes here as Quantities.
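As a rough illustration of the difference, the sketch below contrasts the flat layout with the subsection layout using plain Python objects. It does not use the NOMAD metainfo API. The helper names, the way the attribute name is appended to the field name, and the choice of population variance are assumptions made only for this example.

```python
from dataclasses import dataclass, field
from statistics import fmean, pvariance


def flat_quantities(name, values, attrs):
    """Flat layout: every statistic and attribute is its own quantity."""
    out = {
        f"{name}__mean": fmean(values),
        f"{name}__var": pvariance(values),  # assumption: population variance
    }
    for key, value in attrs.items():
        out[f"{name}__{key}"] = value  # assumed naming for attributes
    return out


@dataclass
class FieldSection:
    """Subsection layout: one subsection per numeric field."""

    mean: float
    var: float
    attributes: dict = field(default_factory=dict)


def as_subsection(name, values, attrs):
    """Wrap the statistics and attributes of one field in a single subsection."""
    section = FieldSection(fmean(values), pvariance(values), dict(attrs))
    return {f"{name}__field": section}
```

In the flat layout the parent section gains one entry per statistic and per attribute of each field, whereas in the subsection layout it gains a single `energy__field` entry that holds them.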

Sorry for the wall of text, but I just wanted to summarize the current situation accurately. Looking for input here and in the upcoming TF meetings. @FAIRmat-NFDI/areab


rettigl · 2025-02-10T13:52:51Z

2. It is a bit strange to make the contents of the appdef depend on how we want to use them in NOMAD. Of course, adding more optional elements (as done above) is not a problem, but it feels slightly backwards to build the appdef for an experimental technique based on the search and visualization capabilities of NOMAD.

I don't think that this depends on NOMAD. If we consider "searching for an element" as an application, then having all searchable fields in the corresponding application definition is natural.

rettigl · 2025-02-10T13:54:13Z

3. If a new use case for an appdef comes up, where more concepts from the base classes are to be used, you would need to blow up the appdef even more or make an extension for it (like NXxps extending NXmpes), to allow for more search_quantities that are relevant for this new use case. The second option raises a problem/question which I have been asking myself for a while: If an appdef extends another one (think NXxps extending NXmpes), is it possible to add more elements, or can they only specialize what is already in the super-appdef? If I have an NXfit class in NXxps, but not in NXmpes, is a file that implements NXxps with a fit even compliant with NXmpes anymore?

That is a good point. I think the current mechanism in NeXus does not prevent adding arbitrary entries, so there is no relationship of the kind NXxps is_a NXmpes.

rettigl · 2025-02-10T13:55:09Z

This is of course another abstraction layer that the user is not aware of and that makes it difficult to understand which elements are searchable for which appdef.

For that reason alone, I have some issues with this idea.

rettigl · 2025-02-10T13:57:50Z

Regarding the aggregation statistics: in order not to increase the number of quantities, we could bring NeXus fields to NOMAD as subsections. This was in fact implemented in the first version of the NexusParser (as there was no support for attributes). So, we could make a subsection for each numeric quantity, like 'energy__field'. This subsection would contain all its stats, and we could also bring all its attributes here as Quantities.

I am not sure I understand what this would mean conceptually.

lukaspie · 2025-02-10T16:56:20Z

Problems:

1. In XPS, we have hundreds of scans that are all instantiated as NXdata/DATA, creating too many quantities in one subsection (crashing/slowing down the GUI).
2. In APM, we have two problems that hit the limit of Elasticsearch (~10k search quantities, an arbitrary limit in NOMAD that will not be increased due to space restrictions):
   a. too many concepts with instantiable data
   b. too many instances for some of the defined concepts (think 2000 different samples)
3. In the MPES app, we cannot search for specific concepts with variadic names (the example was beamTYPE in the application definition, which can be used for beam_probe and beam_pump, when you only want to search for the energy of beam_probe).

Discussion/Decisions:

For 1.: try to clean up the GUI problems.
- Solution for now is to not show all fields by default, but rather to have a "more" section in the GUI that reveals more fields on demand -> we agreed on this
- Possibility: make a registry that creates grouping of these fields for easier navigation in the GUI (possibly alphabetical)
For 2.:
- You can increase the number of Elasticsearch quantities in an OASIS for a specific use case
- Parsing should still continue even if there are more instances than allowed search quantities. Currently the whole processing chain stops in that case. Instead, a warning should be thrown, but the data should be stored anyway in JSON.
For 3.:
- Clarify the application definition(s) to clear up the ontological mismatch. In the example above, we implemented beam_probe and beam_pump as separate concepts in the application definition.
- Combined/conditional search: today not supported, but could possibly be in the future
  - Example: give me the mean of a concept (like beamTYPE) if the instance name has a certain value (beam_probe)
  - Example: give me all the user_name where the user_affiliation is "FHI"
    - Question: do you want to type "FHI" in the search bar or do you want to do this in a specific widget
  - We would like to have these combinations in one new widget -> this will probably be pretty involved
Search bar: should there be an autocompletion for searching, i.e. to simplify the syntax?
- will not be implemented in the old GUI, but will be in the new UI
We keep NeXus fields and attributes as NOMAD quantities. This hinges on the GUI's ability to handle this.
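
The decision for problem 2 (keep processing, warn, and store the data anyway) could look roughly like the following sketch. It is not the NOMAD processing code: the function name, the tiny limit, and the dictionary-based data model are assumptions chosen to keep the example self-contained.

```python
import warnings

MAX_SEARCH_QUANTITIES = 3  # tiny, hypothetical limit for demonstration


def register_search_quantities(quantities, limit=MAX_SEARCH_QUANTITIES):
    """Index at most `limit` quantities, but never drop the data itself."""
    names = list(quantities)
    indexed = {name: quantities[name] for name in names[:limit]}
    overflow = names[limit:]
    if overflow:
        # Warn instead of raising, so the processing chain keeps running.
        warnings.warn(
            f"{len(overflow)} quantities exceed the search limit of {limit}; "
            "they are stored but not searchable."
        )
    # The full data is kept regardless of what was indexed.
    return indexed, dict(quantities)
```

Which quantities get indexed first when the limit is hit is an open design question. This sketch simply takes them in insertion order.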

Action items for now:

Clarify the application definition(s) to clear up the ontological mismatch (i.e., where a variadic name actually describes two or more separate concepts)
Visualization of many concepts in one lane using a "more" section
Change: make the processing chain continue even if there are more instances than allowed search quantities, and throw a warning instead of stopping
Lauri is working on the scatter plot, check that it is working (i.e., a scatter plot shows up)
Lauri is working on the search bar, check that it is working (i.e., valid queries are actually showing correctly)
Support OR queries for the terms widget

Action items (postponed):

Make a registry that creates grouping of these fields for easier navigation in the GUI (possibly alphabetical)
Documentation: what should you do if there are more instances than allowed search quantities?
For quantities with variadic names, make their instance names searchable (already achieved for sections)
Terms widget: add a scroll bar if possible
Create a new search widget that implements conditional search as outlined above (low priority)
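
To illustrate what the conditional search outlined above would have to compute, here is a small sketch over in-memory records. It says nothing about how this would be expressed in Elasticsearch or in the GUI. The function name and all records, including the energy values, are made up.

```python
def conditional_values(records, want, where_key, where_value):
    """Return `want` from each record whose `where_key` equals `where_value`."""
    return [
        record[want]
        for record in records
        if record.get(where_key) == where_value and want in record
    ]


# Made-up records purely for illustration.
users = [
    {"user_name": "Alice", "user_affiliation": "FHI"},
    {"user_name": "Bob", "user_affiliation": "Other Lab"},
    {"user_name": "Carol", "user_affiliation": "FHI"},
]
beams = [
    {"instance_name": "beam_probe", "incident_energy__mean": 21.7},
    {"instance_name": "beam_pump", "incident_energy__mean": 1.55},
]

print(conditional_values(users, "user_name", "user_affiliation", "FHI"))
# prints ['Alice', 'Carol']
print(conditional_values(beams, "incident_energy__mean", "instance_name", "beam_probe"))
# prints [21.7]
```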

For future discussion, if the action items don't solve our problems right now, we may have to think about ways of limiting the allowed search_quantities.

Suggestions for limiting the search_quantities:

The only allowed search_quantities are those fields and attributes that
- are explicitly mentioned in the application definitions under fixed and variadic names
  - Use case: a single lab has a specific axis in NXdata that they always use. They don't want to make a new application definition, but still want to search for this axis. If we have a variadic NXdata/DATA in the application definition, this search is allowed as well.
  - Use case: if we have beamTYPE in the application definition, it is also allowed to search for beam_laser.
- are mentioned in those base classes that are specialized in the application definition and that are using nameType=specified, i.e. not those that are variadic. An example of what we don't want to make searchable are instances of NXdata/DATA for which the name is not explicitly given in the appdef.
  - Use case: this solves problem 1 above: in XPS, we have hundreds of scans that are all instantiated as NXdata/DATA, creating too many quantities in one subsection
- What about nested definitions, think NXmpes/INSTRUMENT, can all fields in base_classes mentioned in NXinstrument be searched?
Consequently, we only parse those fields that fulfill these requirements into the Data section.
For the aggregation statistics, we keep creating them as before, but now we will just have fewer fields, so the GUI issues will (hopefully) go away.
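
The two rules above can be sketched as a small name check. This is a simplification under stated assumptions: a name is treated as variadic whenever it contains uppercase letters (in the definitions this is governed by nameType), each uppercase run is treated as a placeholder for any letters, digits, or underscores, and the helper names `name_pattern` and `is_searchable` are hypothetical.

```python
import re


def name_pattern(concept_name):
    """Turn a concept name into a regex; uppercase runs act as placeholders."""
    parts = re.split(r"([A-Z]+)", concept_name)
    regex = "".join(
        r"[A-Za-z0-9_]+" if part.isupper() else re.escape(part)
        for part in parts
        if part
    )
    return re.compile(regex)


def is_searchable(instance_name, appdef_names, base_class_names):
    """Apply the two suggested rules to a single instance name."""
    # Rule 1: fixed or variadic names given in the application definition.
    if any(name_pattern(n).fullmatch(instance_name) for n in appdef_names):
        return True
    # Rule 2: base-class names only if they are fixed (no placeholder part).
    fixed = [n for n in base_class_names if not re.search(r"[A-Z]", n)]
    return instance_name in fixed


print(is_searchable("beam_laser", ["beamTYPE"], []))           # prints True
print(is_searchable("scan_17", ["energy"], ["DATA", "title"]))  # prints False
```

With `DATA` listed only in the base class, an instance such as `scan_17` is rejected, which is the XPS case from problem 1. Listing `DATA` in the appdef itself would allow it again.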
