
feat: Weighted cutflow #1272

Merged
merged 13 commits into from
Feb 24, 2025
Conversation

@NJManganelli (Collaborator) commented Feb 6, 2025

This PR adds an (opinionated) take on weighted cutflows in coffea, permitting the cutflow to be instantiated with a Weights object and a weights modifier (i.e. a weight-variation name, as in Weights.weight("jesUp")).

Edit: An equivalent in NMinusOne would also be highly desirable, but this PR has become large enough that it should be split off (and incorporate any changes requested in review).
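Conceptually, a weighted cutflow replaces raw event counts with sums of per-event weights at each stage of the cumulative selection. A minimal numpy sketch of the idea (illustrative only; this is not coffea's implementation, and the function name is made up):

```python
import numpy as np

# Illustrative sketch of a weighted cutflow (NOT coffea's actual code):
# each cut is a boolean mask over events; the weighted yield after cut i
# is the sum of per-event weights for events passing cuts 0..i cumulatively.
def weighted_cutflow(masks, weights):
    yields = [weights.sum()]  # weighted yield before any cut
    cumulative = np.ones_like(weights, dtype=bool)
    for mask in masks:
        cumulative = cumulative & mask
        yields.append(weights[cumulative].sum())
    return yields

weights = np.array([1.0, 0.5, 2.0, 1.5])
cut_a = np.array([True, True, False, True])
cut_b = np.array([True, False, False, True])
print(weighted_cutflow([cut_a, cut_b], weights))  # [5.0, 3.0, 2.5]
```

Swapping in a different weight variation (e.g. a "jesUp" modifier) just means passing a different `weights` array through the same machinery.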

@NJManganelli NJManganelli changed the title [Feature] Weighted cutflow feat: Weighted cutflow Feb 7, 2025
@lgray (Collaborator) commented Feb 7, 2025

FAILED tests/test_analysis_tools.py::test_packed_selection_cutflow_dak[True] - TypeError: meta must be an instance of an Awkward Array, not <class 'module'>.
FAILED tests/test_analysis_tools.py::test_packed_selection_cutflow_dak[False] - TypeError: meta must be an instance of an Awkward Array, not <class 'module'>.
FAILED tests/test_analysis_tools.py::test_packed_selection_cutflow_dak_uproot_only[True] - TypeError: meta must be an instance of an Awkward Array, not <class 'module'>.
FAILED tests/test_analysis_tools.py::test_packed_selection_cutflow_dak_uproot_only[False] - TypeError: meta must be an instance of an Awkward Array, not <class 'module'>.

@NJManganelli (Collaborator, Author):

Aye, those failures are from the attempt to use fill_flattened in dask, which doesn't currently work, so it will be rolled back to manual flattening for now.

Nick Manganelli added 2 commits February 17, 2025 17:41
…edSelection class

Adjust result object to not break unpacking of cutflow result, restoring one test; need dask-contrib/dask-awkward#531

Remove unnecessary return Cutflow

Add boolean_masks_to_categorical_integers for optimization of histogram fill in cutflow

WIP add the optimized version of filling cutflows, with a testable v2 variable temporarily, and new categorical option for additional axis (needs documentation and example)

Rename slice to slc

WIP weighted cutflow with commonmask, categorical axes, plot_vars still needs much work and cleanup

Untested plot_vars function with commonmask and categorical support, using broadcast_arrays func

Fixes for original tests to pass

WIP - fixes from tests, expanded print header

revert change to masks in compute for npz

Fix commonmask handling of initial weighted event counts

Some logic fixes for weights in npz file

Fix for v2 eager mode

Handle delayed cutflow to_npz pathway, Weights does not have a compute method

black src/coffea/analysis_tools.py
…low from PackedSelection class

weights testing okay, histos next

weighted and commonmask tests in place, categorical next

stricter tests for weights, passing

Try v2 yieldhist

categorical tests added for yieldhist

Added categorical tests, need one more specific to histograms then cleanup

Added cutflow_extended_dak tests

black tests/test_analysis_tools.py
Nick Manganelli and others added 6 commits February 19, 2025 18:39
…ficant performance difference, fix delayed commonmasked path, remove debug print statement, add scale for all paths of yieldhist, plot_vars; reduce LOC and make more consistent
…rmitting parallel testing by pytest-xdist; 1 core speed: '52 passed, 729287 warnings in 487.82s (0:08:07)', 4 core speed: '52 passed, 732599 warnings in 148.59s (0:02:28)'
@NJManganelli (Collaborator, Author) commented Feb 19, 2025

I'll convert this to ready for review, presuming all the tests (very, very slowly) pass.

I tacked on a change to the npz tests to write to temporary directories, which should allow running them in parallel (verified to work on my laptop); cross-reference #1278.
Also, see this issue, which might be alleviated by this change: #1246

The extended cutflow tests technically cover the non-extended tests when all non-dask options are False, but I left the original tests in place out of paranoia. Importantly, though, the dak tests call compute a lot (several dozen times), so an individual test can be O(1 minute). These tests might benefit from refactoring to follow the ideal dask pattern a bit more: set up the graph, then execute/compute once. But that is best left to a dedicated PR, IMO (no time to fold it into this one right now). Edit: The optimization has now been added in a pair of commits; see the Feb 20th updates.

Now the opinionated part:

  1. The cutflow expects a Weights() instance and, optionally, a specific (str) modifier to pass to it. This is partly because I originally envisioned (optionally) passing in a list of weight variations and being able to fill a new systematics axis with them. That may be worth doing one day, but not in this PR.
  2. I added a "commonmask" feature. This lets you apply another boolean array, potentially external to the PackedSelection, to every individual and stacked cut. Originally I used this to handle samples where many different GenModels are stored together per file, with an event-level bit indicating that an event belongs to a particular GenModel hypothesis. Since there can easily be 100s of these hypotheses in the same file (which, BTW, is absolutely terrible for performance when looping over the corresponding masks and filling histograms per hypothesis), they don't fit in a PackedSelection, and creating copies of PackedSelection per hypothesis, when the only cut that differs is effectively a single boolean array, is very wasteful and blows up the task-graph complexity. That was the original motivation, but it works just as well, if not better, for many other cases, and lets you start the cutflows off from some 'base' more easily (one can also ignore the initial unmasked fill, but I like this a bit better).
  3. I added a "categorical" feature. This was also developed for the GenModel case originally, when it became apparent that masking every array for each hypothesis and filling separate histograms, etc. was insanely expensive time-wise. Ergo, this allows one to pass in an event-by-event categorical value (ideally an int, to be filled into an Integer or IntCategory histogram axis) and instantiates an additional axis that can be sliced/projected/etc. Eventually, I'd like this to support a StrCategory axis in an efficient way, but that will depend on getting this feature implemented: https://iris-hep.slack.com/archives/C03UA418VJ8/p1719933533612429
  4. I added a scale to the print, yieldhist, and plot_vars methods. Originally I only put it in print, figuring it is really for the user to scale histograms afterwards; in particular, one cannot (currently) scale a dask-histogram prior to compute and have it registered in the dask task graph (so no scaling will occur pre-compute). Since #1270 ("support per-event weights for PackedSelection.cutflow") indicated a desire for both per-event weighting (which I hope can cover the use case detailed further on the iris-hep slack, with external tools providing weights of some kind, but I'm not sure) and broadcastable weights, I worked an implementation in for the histogram methods as well. I hesitated to balloon the testing time further with an explicit scale test, but I did a variation locally that worked as expected, so it would make sense to add something later (also for the print function, which doesn't have any coverage in the tests I based these extended ones upon)...
  5. The yieldhist and plot_vars methods do not follow the same design pattern (as I wrote the latter recently and the former close to a year ago; this is unfortunate, but it also seemed a necessity to handle the jagged arrays expected for plot_vars). The latter follows a simpler cut-by-cut filling scheme and has a unified method for delayed/non-delayed (I hope using an appropriate approach, but let me know if not). The former, in order to support all features, has three code paths (two being the updated non-delayed and delayed versions from before, now supporting weights and commonmasks; the third supporting categorical fills with any combination of the other features, but at a performance cost, since it cannot rely on the nevonecut/nevcutflow counts, or the corresponding weighted variants, to do a fast few-entry fill). The last code path covers all use cases, but in an end-to-end test ingesting a ~2GB file, setting up PackedSelection and Weights instances with just a few entries, then calling yieldhist and computing, it averaged about 1 minute for the eager version on the 'fast path' and about 3 minutes on the per-event fills in the unified/categorical path. That prompted me to leave the updated delayed and non-delayed paths in. A future change might unify these, presuming others agree it's good to keep them around, since faster is better where possible (at the expense of more code maintenance, of course...). The yieldhist method also relies on some code that converts a list of boolean arrays into a (ragged) set of categorical integers. That was originally developed to improve performance when I was trying to discover the best way forward a year ago. Whether it's worth it now, in hindsight, would take more testing and mimicking the cut-by-cut approach again.
Finally, I'll cross-reference dask-contrib/dask-awkward#531 ("TypeError: meta must be an instance of an Awkward Array, not <class 'module'> triggered by fill_flattened for a dask-histogram and dask-awkward arrays"), as fill_flattened would reduce LOC by a good chunk but needs some development (as outlined by Angus in his reply to that issue).
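The commonmask and categorical ideas above can be sketched in plain numpy (a conceptual sketch only; the function name, signature, and table layout here are hypothetical, not coffea's API):

```python
import numpy as np

# Hypothetical sketch (NOT coffea's implementation) of combining a commonmask
# with a per-event categorical axis in a weighted cutflow.
def categorical_cutflow(masks, weights, categories, commonmask=None):
    # commonmask: an extra boolean array ANDed into every stage, e.g. a
    # per-event GenModel-hypothesis bit stored outside the PackedSelection.
    base = commonmask if commonmask is not None else np.ones(len(weights), dtype=bool)
    cumulative = base.copy()
    ncat = categories.max() + 1
    table = []  # one row of per-category weighted yields per cutflow stage
    for mask in [None] + list(masks):
        if mask is not None:
            cumulative = cumulative & mask
        # a single weighted bincount replaces a per-category loop of fills
        row = np.bincount(categories[cumulative],
                          weights=weights[cumulative], minlength=ncat)
        table.append(row)
    return np.array(table)

weights = np.array([1.0, 0.5, 2.0, 1.5])
categories = np.array([0, 1, 0, 1])          # e.g. hypothesis index per event
cut = np.array([True, True, False, True])
common = np.array([True, True, True, False])  # base mask applied everywhere
print(categorical_cutflow([cut], weights, categories, commonmask=common))
```

The single `bincount` (standing in for one categorical histogram fill) is what avoids the per-hypothesis loop over masks that the comment describes as terrible for performance.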

Fin
(but not really)

@NJManganelli NJManganelli marked this pull request as ready for review February 19, 2025 20:59
@NJManganelli (Collaborator, Author):

[Screenshot 2025-02-19 22:24:00: non-parallel testing times, for comparison to the pytest-xdist PR]

@lgray (Collaborator) commented Feb 19, 2025

Have we got any idea about the really large time taken for the new tests?

@lgray (Collaborator) commented Feb 19, 2025

Could you please mark the tests that use dask, as was done in the recent PR from Andrzej?

@NJManganelli (Collaborator, Author) commented Feb 19, 2025

Should any test using dask be marked, or only those instantiating/utilizing a Client()?

BTW, I tested these cutflow_dak tests with a Client and it was slower on my machine than without.

The original cutflow_dak test (and therefore the 16 variations of cutflow_extended_dak, which started as a carbon copy of cutflow_dak) takes more time than it really should because of the frequent calls to compute. It should be refactored to reduce those calls, as noted above, but it'll take a bit of moving things around to make that work.
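The "compute once" refactor being described follows the standard dask idiom: build all delayed results first, then materialize them with a single dask.compute call so shared graph nodes are evaluated once instead of once per .compute(). A toy sketch (illustrative, not the actual test code):

```python
import dask

@dask.delayed
def yield_after_cut(weights, mask):
    # toy stand-in for a delayed cutflow-yield computation
    return sum(w for w, m in zip(weights, mask) if m)

weights = [1.0, 0.5, 2.0]
a = yield_after_cut(weights, [True, True, False])
b = yield_after_cut(weights, [True, False, True])

# Anti-pattern: a.compute(); b.compute() -- two separate graph executions.
# Preferred: one traversal evaluates both results together.
res_a, res_b = dask.compute(a, b)
print(res_a, res_b)  # 1.5 3.0
```

In a test this means collecting every histogram/count into one dask.compute at the end, then asserting on the materialized results.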

For comparison, the test times after Andrzej's PR was merged in:
[Screenshot 2025-02-20 00:30:27: test times after Andrzej's PR]

@NJManganelli (Collaborator, Author):

From my understanding, marking these tests as with-dask would prevent them from running in parallel, and these ones are closer to ideal scaling than not. Let me know if I've misunderstood.

Some refactoring has eliminated pretty much all the extraneous computes. There's about a 7x improvement (read: reduction) in the dak tests added for this PR; times for just that test and for the whole analysis_tools test suite follow:

d98a039==a184c35 corresponds to optimized tests
98d68ef corresponds to a halfway point where many compute calls were removed
c68955c corresponds to the version whose times I posted most immediately above, with dozens of compute calls.

Commit d98a0395, pytest -n 1 tests/test_analysis_tools.py, 52 passed, 208063 warnings in 159.53s (0:02:39)
Commit d98a0395, pytest -n 4 tests/test_analysis_tools.py, 52 passed, 211375 warnings in 52.04s
Commit d98a0395, pytest -n 4 tests/test_analysis_tools.py::test_packed_selection_cutflow_extended_dak, 16 passed, 74500 warnings in 19.75s
Commit 98d68ef2, pytest -n 1 tests/test_analysis_tools.py, 52 passed, 348223 warnings in 248.91s (0:04:08)
Commit 98d68ef2, pytest -n 4 tests/test_analysis_tools.py, 52 passed, 351535 warnings in 75.05s (0:01:15)
Commit 98d68ef2, pytest -n 4 tests/test_analysis_tools.py::test_packed_selection_cutflow_extended_dak, 16 passed, 214660 warnings in 49.92s
Commit c68955c6, pytest -n 1 tests/test_analysis_tools.py, 52 passed, 729287 warnings in 489.46s (0:08:09)
Commit c68955c6, pytest -n 4 tests/test_analysis_tools.py, 52 passed, 732599 warnings in 148.17s (0:02:28)
Commit c68955c6, pytest -n 4 tests/test_analysis_tools.py::test_packed_selection_cutflow_extended_dak, 16 passed, 595724 warnings in 132.85s (0:02:12)

@ikrommyd (Collaborator):

Thanks for the refactoring, Nick! It was a mistake when I wrote those tests like that (I didn't know the impact back then) and I just never fixed it afterwards.

@NJManganelli (Collaborator, Author):

To play devil's advocate: while it's a bit of an anti-pattern, the tests are a lot clearer to write with all those computes.

Note that I've left the original ('non-extended') test alone. On my M4 Pro with 1 core, it takes 50s to run both variations of it (optimization on/off), whereas the extended_dak test I added based on it has 2x2x2 (i.e. 8x) as many variations, hence the painfully long additional time on the GitHub runners. While that's an optimization opportunity, it'll take a similar amount of shuffling around as for the extended_dak test.

For posterity, the run times now:
[Screenshot 2025-02-20 19:58:05: updated run times]

@ikrommyd (Collaborator):

It may also make sense not to check both optimization options at some point. Those were catching issues in earlier stages, when optimization was messing something up, but they haven't found anything in a while. I would argue that it's generally bad to remove tests, but maybe it's fine to always have optimization on in the future.

@lgray (Collaborator) commented Feb 21, 2025

Honestly, for the weights you may only need to check that the typetracer and the graph build without failing.
The mechanisms for touching data on individual columns are pretty thoroughly tested and probably don't need further exercise here. Just a thought.

@NJManganelli (Collaborator, Author):

Any feedback/change requests on the implementation?

For the new tests, I could disable things now (just optimization-off?), though at about 19s to run all variations multicore on my Mac (maybe 2x worse on the GitHub runners?) there won't be huge savings there. Disabling it in the older test could save a similar amount of time. Just let me know. This also seems like it could be rolled into a dedicated test-optimization pass to assist in improving the scikit-hep integration testing.

@lgray (Collaborator) commented Feb 24, 2025

I think it's in a reasonable state at this point; the tests are a bit long, but we can whittle on this.
The changes have grown the code into a large enough base that factoring it out into its own library area within coffea, for readability, probably makes sense, but that can come as another PR.

Forking off the optimization on/off tests into their own jobs is a good idea, but that can also come in another PR.

Thanks for the addition!

@lgray lgray merged commit 7f65c2f into scikit-hep:master Feb 24, 2025
21 checks passed
@ikrommyd (Collaborator):

@NJManganelli Would you like to extend this to N-1 hists as well at some point? If you ever get some time, of course. It would mostly be copy-pasting, because the original code is pretty much the same.

@NJManganelli (Collaborator, Author):

Hi @ikrommyd, yeah, it should be extended there too. As tempting as it is to 'balance things out,' I'll need to wait a while and work on some Upgrade, Analysis, and R&D work first.

Thanks @lgray , should I make an issue or two to track these to-dos? Test time reductions, refactorization of analysis tools, extension of N-1...

@lgray (Collaborator) commented Feb 25, 2025

Yes, please make a tracking issue! Thanks again!
