[r][cpp] add pseudobulking to matrices #128

immanuelazn · 2024-09-18T05:39:51Z

Description

Added function matrix_quantile_per_cell() to find the nth quantile value of each cell in a matrix. Allows for clipping using min_by_row() and min_by_col()
Added pseudobulk_matrix() with option to clip by quantile, and to aggregate by non-zeros, mean, sum, variance.

Tests

For pseudobulk_matrix(), add tests for clipping, non-zeros, mean, sum, and variance in both transpose states.
For matrix_quantile_per_cell(), add test comparing against R quantile function across transpose states and quantile values

Details

I've iterated on pseudobulk_matrix() a few times as shown in commit history. I tried to be a little bit smarter by using matrix multiplies to utilize multi-threading. However, the solution for non-zeros and variance is probably non-optimal due to requiring the additional iterative functions.

These iterative functions are not multi-threaded, which makes me think I should have utilized a strategy like computeMatrixStats() in Concat{Cols,Rows}. I think a better way would be to create a child class inheriting from MatrixLoader that finds matrix subsets based off of cell grouping. Then utilize the default MatrixLoader<T>::computeMatrixStats() and manipulate it to fit the output matrix we're looking for. This would probably allow for use of threading, while also limiting the amount of duplicate code.

Follow-up checklist:

(Added during code review)

Move row/colQuantiles into a different file. Perhaps make a matrixStats.R file that collects all of the matrixStats interface functions
Split function documentation to have the matrixStats interfaces as a page, and then maybe split up the arithmetic functions into a few groups too

change usage of percentile to quantile

…ix_quantile_per_cell()`

change return type of `pseudobulk_matrix()` into a named list. add in rcpp layer to pseudobulk and quantile functions.

immanuelazn · 2024-09-21T00:48:54Z

Added in the following changes

R interrupt checking as well as removal of Rcpp code from physical logic has been finished for pseudobulk_matrix() and matrix_quantile_per_cell(). These functions now follow the same styling as other functions, with their own cpp and header files.
pseudobulk_matrix() calculates everything within a single pass, rather than separate passes for each aggregation method. Using inspiration from calculateMatrixStats()
pseudobulk_matrix() now returns a named list built from a struct, as we calculate the simpler aggregation methods during our pass anyways.
matrix_quantile_per_cell() has been restricted to usage in non transposed matrices. Should the user call this on a transposed matrix, they will be advised to call transpose_storage_order() first.
Simple typo fixes and docstring clarity changes

bnprks

I've gotten through a detailed review pass, and ended up picking up on a lot of things I didn't notice in the first pass for high level feedback.

I think the most important change to think about is how to make the R interface of pseudobulk_matrix a nice as we can, particularly with my comment about return values of list vs. raw matrix return. We should agree on what the right solution for it is, but I put some of my thoughts for options in the comment.

There were a couple other efficiency comments scattered through the C++, happy to chat more if the rationale for any of those is unclear or if how to implement is uncertain.

r/R/singlecell_utils.R

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp

r/tests/testthat/test-pseudobulk.R

change `matrix_quantile_per_cell()` to S3 generalizable `colQuantile()` function change `colQuantile()` to use type 7 quantile calculation change `pseudobulk_matrix()` to use a numeric clip_values representing quantile rather than boolean set to .99 change `pseudobulk_matrix()` to return a single matrix rather than a named list when only one method given remove `matrix_quantile_per_row()` fix problem with sum calculation in `pseudobulk_matrix()` when requesting more complex method various documentation changes

immanuelazn · 2024-09-27T08:07:42Z

change matrix_quantile_per_cell() to S3 generalizable colQuantile() function
change colQuantile() to use type 7 quantile calculation
change pseudobulk_matrix() to use a numeric clip_values representing quantile rather than boolean set to .99
change pseudobulk_matrix() to return a single matrix rather than a named list when only one method given
remove matrix_quantile_per_row()
fix problem with sum calculation in pseudobulk_matrix() when requesting more complex method
various documentation changes

r/src/bpcells-cpp/matrixUtils/Quantile.cpp

r/R/singlecell_utils.R

r/R/transforms.R

bnprks

Here's a second round of changes. Good job overall with the various performance changes requested from last round. I commented a couple spots for further refinements within those changes you made.

One style change I've noticed a few times (perhaps copied from my own bad habits) is making a separate variable to track a vector.size(). This often just introduces more ways for things to go wrong. And I recently found from the docs that calling vector.resize() to a smaller size is guaranteed not to cause extra allocations/frees, so even when using a vector as just a memory buffer, it's always fine for performance to resize it down to the actual used size.

Other than that, I've been starting to get the hang of checking the generated assembly for what optimizations happened -- would be cool to demo that for you some time.

r/R/transforms.R

r/src/bpcells-cpp/matrixUtils/Pseudobulk.h

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp

r/src/bpcells-cpp/matrixUtils/Quantile.h

r/src/matrix_utils.cpp

r/R/singlecell_utils.R

r/pkgdown/_pkgdown.yml

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp

r/src/bpcells-cpp/matrixUtils/Quantile.cpp

r/tests/testthat/test-pseudobulk.R

immanuelazn · 2024-10-08T00:40:12Z

r/R/singlecell_utils.R

+
+
+#' Find the nth quantile value(s) of each row in a matrix. Only supports transposed matrices.


I noticed that I can't add params in, or else the IterableMatrix methods page gets all screwed up. It is probably a good time to start splitting that page up into a new section

Agreed. I think splitting into at least a new page with the matrixStats interface functions makes sense, though we can put that in a follow-up PR.

immanuelazn · 2024-10-08T00:44:27Z

Few notes:

Added rowQuantiles()
Change colQuantiles() to return a matrix if multiple probs are given
Change colQuantiles() to work with any of the continuous quantile algorithms (4-9)
Added tests for checking other types of quantiles, overall consolidation of quantile tests
Changed PseudobulkStatMethod to use bitlags
Changes pseudobulk_matrix() to not check the transpose conditional with every value check in the matrix. Also cleaned up the code and incorporated aforementioned bit flags

bnprks

Overall comments:

I think we should ditch the clipping option. I think the ideal clipping for pseudobulks would measure per-gene quantiles within each pseudobulk, then apply the clipping within each pseudobulk.
- A non-ideal but acceptable option would be to clip along rows. This at least applies the clipping per-gene, but might cause issues when a rare cell population expresses a certain gene much higher than other cells. Then the clipping might eliminate the high expression of that marker.
You should remember to worry about how invalid arguments get handled. You don't need to add unit tests for these, but I like to manually run a small example to check nothing bad happens if I pass a bad value to each argument in sequence. (See the code polishing checklist)
- In general, I find doing a little bit of manual testing is a good way to check error messages and edge cases more quickly. Those don't always require added tests (particularly edge cases that turn out to work correctly without any special handling in the code)
- Prefer col-major or row-major over saying "non-transposed" or "transposed". This is a little bit of a mismatch between the internal variable names BPCells uses and the external communication. Functions like storage_order and show return the row or col major language, and error messages should match that too.

I've also put these follow-up tasks into the initial description of this PR so we can remember to come back:

Move row/colQuantiles into a different file. Perhaps make a matrixStats.R file that collects all of the matrixStats interface functions
Split function documentation to have the matrixStats interfaces as a page, and then maybe split up the arithmetic functions into a few groups too

r/R/singlecell_utils.R

r/NEWS.md

r/R/singlecell_utils.R

r/src/bpcells-cpp/matrixUtils/Quantile.cpp

r/src/bpcells-cpp/matrixUtils/Quantile.h

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp

bnprks · 2024-10-08T22:24:39Z

r/tests/testthat/test-pseudobulk.R

+  expect_error(pseudobulk_matrix(m1, cell_group, method = "nonexistent_method"))
+  expect_error(pseudobulk_matrix(m1, cell_group, method = c("nonexistent_method", "sum")))


For these expect_error tests, it's often helpful to fill in the 2nd argument (regex on what the error message should say). In this case, you did have an error but it was uncontrolled rather than printing out something useful so the tests gave some false confidence.

r/tests/testthat/test-pseudobulk.R

Co-authored-by: Ben Parks <[email protected]>

immanuelazn · 2024-10-09T22:42:35Z

I think I have addressed all comments, I also gave it a pass with manually checking everything. As for the two points that you have put up, I put them into an issuue within the projects page, for a new PR. Thanks for being so patient and detailed with your review Ben

immanuelazn · 2024-10-09T22:47:22Z

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp

@@ -21,26 +21,29 @@ namespace BPCells {
 template <typename T>
 PseudobulkStats pseudobulk_matrix(std::unique_ptr<MatrixLoader<T>> &&mat,
                                  const std::vector<uint32_t>& cell_groups,
-                                  PseudobulkStatsMethod method,
+                                  int method,


Not entirely sure if it is 100% worth the syntactic cleanliness, if we have to pass these in now as ints. I feel like this kind of obfuscates how to use the method argument.

Yeah, I agree it's not obviously worth the trade-off. For the future if it seems like a review style suggestion isn't really working well, you can just not make the change and give a short reply with your rationale. (or ping me on slack if you want fast confirmation about ditching the suggestion)

bnprks

Sorry for the slow review, almost all there.

The pseudobulk_matrix function looks completely good now. You're right that switching the C++ bitflag code was not obviously beneficial -- sorry for the in retrospect unhelpful suggestion.

There are a few small issues still on the rowQuantiles and colQuantiles interfaces which I've commented in place -- rowQuantiles isn't quite forwarding all args and colQuantiles is returning transposed relative to matrixStats. As a reminder, you'll need to update the return type docs for colQuantiles once you fix it to match matrixStats.

r/R/singlecell_utils.R

bnprks · 2024-10-17T00:56:33Z

r/R/singlecell_utils.R

+  colnames(res) <- colnames(x)
+  if (length(probs) == 1) return(res[1,])
+  # `quantile()` from base R returns rownames as percentages, so we follow that convention
+  rownames(res) <- paste0(format(100 * probs, trim = TRUE, digits = digits), "%")


This is still returning a matrix transposed relative to matrixStats::colQuantiles(). rowQuantiles() appears to be fine, though

Ah, this is my bad. I thought that the transposed case was specific to rowQuantiles(). I think the obvious change is to make the test cases themselves use the matrixStats functions for non-iterablematrices so the behaviour is guaranteed to work as expected

r/NEWS.md

bnprks · 2024-10-17T18:39:50Z

r/R/singlecell_utils.R

+    MatrixGenerics::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, digits = digits, ..., useNames = useNames, drop = drop)
  } else if (requireNamespace("matrixStats", quietly = TRUE)) {
-    matrixStats::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, ...)
+    matrixStats::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, digits = digits, ..., useNames = useNames, drop = drop)


rowQuantiles is missing rows and cols argument forwarding. colQuantiles looks fine

bnprks · 2024-10-17T18:48:45Z

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp

@@ -21,26 +21,29 @@ namespace BPCells {
 template <typename T>
 PseudobulkStats pseudobulk_matrix(std::unique_ptr<MatrixLoader<T>> &&mat,
                                  const std::vector<uint32_t>& cell_groups,
-                                  PseudobulkStatsMethod method,
+                                  int method,


Yeah, I agree it's not obviously worth the trade-off. For the future if it seems like a review style suggestion isn't really working well, you can just not make the change and give a short reply with your rationale. (or ping me on slack if you want fast confirmation about ditching the suggestion)

Co-authored-by: Ben Parks <[email protected]>

immanuelazn · 2024-10-21T22:58:03Z

Changed quantile tests to use matrixStats functions, so we can ensure correct syntax and behaviour
Reverted PseudobulkStatsMethod bit flagging to no longer be cast as ints.
Transposed colQuantiles() to match matrixStats

bnprks

For quantile tests, need a skip_if_not_installed("matrixStats") call since matrixStats is only a test-time dependency so might not necessarily be installed
With the changed orientation of matrix_quantile_per_col(), the Quantile.h file comment needs updating to be accurate for the return value description
Line 50 of Quantile.cpp contains commented out code
singlecell_utils.R line 132 docstring description should say rowQuantiles for both references
singlecell_utils.R line 130, 193 need to prefix with - FunctionName(): to render properly in the website docs
Discussion: what file to put the quantile code in, and also what files to put tests in
- Quantile tests -> test_matrix_stats.R
- Quantile R implemenation -> initialize matrix_stats.R
- Pseudobulk tests -> test_singlecell_utils.R

immanuelazn added 10 commits September 5, 2024 12:58

[r][cpp] iterable pseudobulk approach wip

a2e4155

[r][cpp] wip

22cba8e

[r][cpp] add mean, change pseudobulk_counts_cpp() to use eigen

c5a015d

[r] add pseubulk test case

129cd67

[r][cpp] add test cases for pseudobulk aggregation

4c2cb78

[r][cpp] add matrix multiply version of pseudobulk_counts()

c11f904

[r] add multithreading to pseudobulks

2a92b44

[r][cpp] add percentile clipping to pseudobulk_counts()

896b2cb

[r][cpp] fix bugs with pseudobulk transpose

d8dfd47

change usage of percentile to quantile

[cpp] add in variance wip

58acb71

immanuelazn force-pushed the ia/pseudobulking-matrices branch from 8b6240b to ea89bda Compare September 18, 2024 05:52

[r][cpp] add non-zeros to pseudobulk_matrix(), fix mem issue in `matr…

7c7ab72

…ix_quantile_per_cell()`

immanuelazn force-pushed the ia/pseudobulking-matrices branch from ea89bda to 7c7ab72 Compare September 18, 2024 21:52

immanuelazn added 3 commits September 18, 2024 15:06

[r] remove unintended testing file

ce5809e

[r] fix small typos in stop messages

6132034

[r][cpp] leverage one pass logic for calculating pseudobulk_matrix().

8ff2ac9

change return type of `pseudobulk_matrix()` into a named list. add in rcpp layer to pseudobulk and quantile functions.

bnprks reviewed Sep 23, 2024

View reviewed changes

immanuelazn commented Sep 27, 2024

View reviewed changes

r/src/bpcells-cpp/matrixUtils/Quantile.cpp Outdated Show resolved Hide resolved

immanuelazn commented Sep 27, 2024

View reviewed changes

r/R/singlecell_utils.R Outdated Show resolved Hide resolved

immanuelazn commented Sep 27, 2024

View reviewed changes

r/R/singlecell_utils.R Outdated Show resolved Hide resolved

immanuelazn commented Sep 27, 2024

View reviewed changes

r/R/transforms.R Outdated Show resolved Hide resolved

[cpp] small refactor of matrix_quantile_per_col()

9ab448c

immanuelazn force-pushed the ia/pseudobulking-matrices branch from 9fdd898 to 9ab448c Compare September 27, 2024 19:02

[r] add in clip_values value checking for pseudobulk_matrix()

bbdc234

bnprks reviewed Sep 30, 2024

View reviewed changes

immanuelazn added 2 commits October 2, 2024 17:37

[cpp] fix pseudobulk imports

1c70d42

[r][cpp] add in pseudobulk_matrix() and quantile() fixes

16b8c1d

immanuelazn commented Oct 8, 2024

View reviewed changes

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp Show resolved Hide resolved

immanuelazn commented Oct 8, 2024

View reviewed changes

r/src/bpcells-cpp/matrixUtils/Pseudobulk.cpp Show resolved Hide resolved

immanuelazn commented Oct 8, 2024

View reviewed changes

r/src/bpcells-cpp/matrixUtils/Quantile.cpp Outdated Show resolved Hide resolved

immanuelazn commented Oct 8, 2024

View reviewed changes

r/tests/testthat/test-pseudobulk.R Outdated Show resolved Hide resolved

immanuelazn commented Oct 8, 2024

View reviewed changes

bnprks reviewed Oct 8, 2024

View reviewed changes

immanuelazn and others added 3 commits October 9, 2024 11:52

Update r/R/singlecell_utils.R

e5ad6f3

Co-authored-by: Ben Parks <[email protected]>

Apply pseudobulk suggestions from code review

fcc43fc

Co-authored-by: Ben Parks <[email protected]>

[r][cpp] add in suggested pseudobulk and quantile changes

448e2ae

immanuelazn commented Oct 9, 2024

View reviewed changes

immanuelazn added 2 commits October 9, 2024 15:48

[cpp] expand pseudobulk_matrix() docstring

856d834

[r][cpp] small row/colQuantiles() docs changes

be0ab27

bnprks reviewed Oct 17, 2024

View reviewed changes

immanuelazn and others added 4 commits October 21, 2024 13:54

Apply suggestions from code review

e6d50c4

Co-authored-by: Ben Parks <[email protected]>

[cpp] revert bitflaging pseudobulkstatsmethod changes

c62ef5b

[r][cpp] transpose colQuantiles to match matrixStats, update docs

d646968

[r] update quantile tests to use matrixStats functions

d02ade9

[r] skip quantile tests if matrixStats not installed

4a91e1f

bnprks reviewed Oct 21, 2024

View reviewed changes

immanuelazn added 6 commits October 21, 2024 17:34

[r][cpp] address requested PR changes for quantile

49b2c16

[r] move quantile and pseudobulk tests

46bc43c

[r] move {row,col}quantiles() to matrix_stats.R

c3fbaa4

[r] remove unintended wilcox test change

f7068d3

[r] change quantile usage from generic

051b150

[r] change quantile usage from generic

61a609c

immanuelazn merged commit 334f172 into main Oct 22, 2024

bnprks deleted the ia/pseudobulking-matrices branch October 25, 2024 23:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[r][cpp] add pseudobulking to matrices #128

[r][cpp] add pseudobulking to matrices #128

immanuelazn commented Sep 18, 2024 •

edited by bnprks

Loading

immanuelazn commented Sep 21, 2024

bnprks left a comment

immanuelazn commented Sep 27, 2024

bnprks left a comment

immanuelazn Oct 8, 2024

bnprks Oct 8, 2024

immanuelazn commented Oct 8, 2024

bnprks left a comment

bnprks Oct 8, 2024

immanuelazn commented Oct 9, 2024

immanuelazn Oct 9, 2024

bnprks Oct 17, 2024

bnprks left a comment •

edited

Loading

bnprks Oct 17, 2024

immanuelazn Oct 21, 2024

bnprks Oct 17, 2024

bnprks Oct 17, 2024

immanuelazn commented Oct 21, 2024

bnprks left a comment •

edited by immanuelazn

Loading



		#' Find the nth quantile value(s) of each row in a matrix. Only supports transposed matrices.

		expect_error(pseudobulk_matrix(m1, cell_group, method = "nonexistent_method"))
		expect_error(pseudobulk_matrix(m1, cell_group, method = c("nonexistent_method", "sum")))

[r][cpp] add pseudobulking to matrices #128

[r][cpp] add pseudobulking to matrices #128

Conversation

immanuelazn commented Sep 18, 2024 • edited by bnprks Loading

Description

Tests

Details

Follow-up checklist:

immanuelazn commented Sep 21, 2024

bnprks left a comment

Choose a reason for hiding this comment

immanuelazn commented Sep 27, 2024

bnprks left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

immanuelazn commented Oct 8, 2024

bnprks left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

immanuelazn commented Oct 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bnprks left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

immanuelazn commented Oct 21, 2024

bnprks left a comment • edited by immanuelazn Loading

Choose a reason for hiding this comment

immanuelazn commented Sep 18, 2024 •

edited by bnprks

Loading

bnprks left a comment •

edited

Loading

bnprks left a comment •

edited by immanuelazn

Loading