Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r][cpp] add pseudobulking to matrices #128

Merged
merged 36 commits into from
Oct 22, 2024
Merged

Conversation

immanuelazn
Copy link
Collaborator

@immanuelazn immanuelazn commented Sep 18, 2024

Description

Added function matrix_quantile_per_cell() to find the nth quantile value of each cell in a matrix. Allows for clipping using min_by_row() and min_by_col()
Added pseudobulk_matrix() with option to clip by quantile, and to aggregate by non-zeros, mean, sum, variance.

Tests

  • For pseudobulk_matrix(), add tests for clipping, non-zeros, mean, sum, and variance in both transpose states.
  • For matrix_quantile_per_cell(), add test comparing against R quantile function across transpose states and quantile values

Details

I've iterated on pseudobulk_matrix() a few times as shown in commit history. I tried to be a little bit smarter by using matrix multiplies to utilize multi-threading. However, the solution for non-zeros and variance is probably non-optimal due to requiring the additional iterative functions.

These iterative functions are not multi-threaded, which makes me think I should have utilized a strategy like computeMatrixStats() in Concat{Cols,Rows}. I think a better way would be to create a child class inheriting from MatrixLoader that finds matrix subsets based off of cell grouping. Then utilize the default MatrixLoader<T>::computeMatrixStats() and manipulate it to fit the output matrix we're looking for. This would probably allow for use of threading, while also limiting the amount of duplicate code.

Follow-up checklist:

(Added during code review)

  • Move row/colQuantiles into a different file. Perhaps make a matrixStats.R file that collects all of the matrixStats interface functions
  • Split function documentation to have the matrixStats interfaces as a page, and then maybe split up the arithmetic functions into a few groups too

@immanuelazn immanuelazn force-pushed the ia/pseudobulking-matrices branch from 8b6240b to ea89bda Compare September 18, 2024 05:52
@immanuelazn immanuelazn force-pushed the ia/pseudobulking-matrices branch from ea89bda to 7c7ab72 Compare September 18, 2024 21:52
change return type of `pseudobulk_matrix()` into a named list.
add in rcpp layer to pseudobulk and quantile functions.
@immanuelazn
Copy link
Collaborator Author

Added in the following changes

  • R interrupt checking as well as removal of Rcpp code from physical logic has been finished for pseudobulk_matrix() and matrix_quantile_per_cell(). These functions now follow the same styling as other functions, with their own cpp and header files.
  • pseudobulk_matrix() calculates everything within a single pass, rather than separate passes for each aggregation method. Using inspiration from calculateMatrixStats()
  • pseudobulk_matrix() now returns a named list built from a struct, as we calculate the simpler aggregation methods during our pass anyways.
  • matrix_quantile_per_cell() has been restricted to usage in non transposed matrices. Should the user call this on a transposed matrix, they will be advised to call transpose_storage_order() first.
  • Simple typo fixes and docstring clarity changes

Copy link
Owner

@bnprks bnprks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gotten through a detailed review pass, and ended up picking up on a lot of things I didn't notice in the first pass for high level feedback.

I think the most important change to think about is how to make the R interface of pseudobulk_matrix a nice as we can, particularly with my comment about return values of list vs. raw matrix return. We should agree on what the right solution for it is, but I put some of my thoughts for options in the comment.

There were a couple other efficiency comments scattered through the C++, happy to chat more if the rationale for any of those is unclear or if how to implement is uncertain.

change `matrix_quantile_per_cell()` to S3 generalizable `colQuantile()` function
change `colQuantile()` to use type 7 quantile calculation
change `pseudobulk_matrix()` to use a numeric clip_values representing quantile rather than boolean set to .99
change `pseudobulk_matrix()` to return a single matrix rather than a named list when only one method given
remove `matrix_quantile_per_row()`
fix problem with sum calculation in `pseudobulk_matrix()` when requesting more complex method
various documentation changes
@immanuelazn
Copy link
Collaborator Author

change matrix_quantile_per_cell() to S3 generalizable colQuantile() function
change colQuantile() to use type 7 quantile calculation
change pseudobulk_matrix() to use a numeric clip_values representing quantile rather than boolean set to .99
change pseudobulk_matrix() to return a single matrix rather than a named list when only one method given
remove matrix_quantile_per_row()
fix problem with sum calculation in pseudobulk_matrix() when requesting more complex method
various documentation changes

@immanuelazn immanuelazn force-pushed the ia/pseudobulking-matrices branch from 9fdd898 to 9ab448c Compare September 27, 2024 19:02
Copy link
Owner

@bnprks bnprks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a second round of changes. Good job overall with the various performance changes requested from last round. I commented a couple spots for further refinements within those changes you made.

One style change I've noticed a few times (perhaps copied from my own bad habits) is making a separate variable to track a vector.size(). This often just introduces more ways for things to go wrong. And I recently found from the docs that calling vector.resize() to a smaller size is guaranteed not to cause extra allocations/frees, so even when using a vector as just a memory buffer, it's always fine for performance to resize it down to the actual used size.

Other than that, I've been starting to get the hang of checking the generated assembly for what optimizations happened -- would be cool to demo that for you some time.



#' Find the nth quantile value(s) of each row in a matrix. Only supports transposed matrices.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that I can't add params in, or else the IterableMatrix methods page gets all screwed up. It is probably a good time to start splitting that page up into a new section

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think splitting into at least a new page with the matrixStats interface functions makes sense, though we can put that in a follow-up PR.

@immanuelazn
Copy link
Collaborator Author

Few notes:

  • Added rowQuantiles()
  • Change colQuantiles() to return a matrix if multiple probs are given
  • Change colQuantiles() to work with any of the continuous quantile algorithms (4-9)
  • Added tests for checking other types of quantiles, overall consolidation of quantile tests
  • Changed PseudobulkStatMethod to use bitlags
  • Changes pseudobulk_matrix() to not check the transpose conditional with every value check in the matrix. Also cleaned up the code and incorporated aforementioned bit flags

Copy link
Owner

@bnprks bnprks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall comments:

  • I think we should ditch the clipping option. I think the ideal clipping for pseudobulks would measure per-gene quantiles within each pseudobulk, then apply the clipping within each pseudobulk.
    • A non-ideal but acceptable option would be to clip along rows. This at least applies the clipping per-gene, but might cause issues when a rare cell population expresses a certain gene much higher than other cells. Then the clipping might eliminate the high expression of that marker.
  • You should remember to worry about how invalid arguments get handled. You don't need to add unit tests for these, but I like to manually run a small example to check nothing bad happens if I pass a bad value to each argument in sequence. (See the code polishing checklist)
    • In general, I find doing a little bit of manual testing is a good way to check error messages and edge cases more quickly. Those don't always require added tests (particularly edge cases that turn out to work correctly without any special handling in the code)
    • Prefer col-major or row-major over saying "non-transposed" or "transposed". This is a little bit of a mismatch between the internal variable names BPCells uses and the external communication. Functions like storage_order and show return the row or col major language, and error messages should match that too.

I've also put these follow-up tasks into the initial description of this PR so we can remember to come back:

  • Move row/colQuantiles into a different file. Perhaps make a matrixStats.R file that collects all of the matrixStats interface functions
  • Split function documentation to have the matrixStats interfaces as a page, and then maybe split up the arithmetic functions into a few groups too

Comment on lines 107 to 108
expect_error(pseudobulk_matrix(m1, cell_group, method = "nonexistent_method"))
expect_error(pseudobulk_matrix(m1, cell_group, method = c("nonexistent_method", "sum")))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these expect_error tests, it's often helpful to fill in the 2nd argument (regex on what the error message should say). In this case, you did have an error but it was uncontrolled rather than printing out something useful so the tests gave some false confidence.

@immanuelazn
Copy link
Collaborator Author

I think I have addressed all comments, I also gave it a pass with manually checking everything. As for the two points that you have put up, I put them into an issuue within the projects page, for a new PR. Thanks for being so patient and detailed with your review Ben

@@ -21,26 +21,29 @@ namespace BPCells {
template <typename T>
PseudobulkStats pseudobulk_matrix(std::unique_ptr<MatrixLoader<T>> &&mat,
const std::vector<uint32_t>& cell_groups,
PseudobulkStatsMethod method,
int method,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure if it is 100% worth the syntactic cleanliness, if we have to pass these in now as ints. I feel like this kind of obfuscates how to use the method argument.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it's not obviously worth the trade-off. For the future if it seems like a review style suggestion isn't really working well, you can just not make the change and give a short reply with your rationale. (or ping me on slack if you want fast confirmation about ditching the suggestion)

Copy link
Owner

@bnprks bnprks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow review, almost all there.

The pseudobulk_matrix function looks completely good now. You're right that switching the C++ bitflag code was not obviously beneficial -- sorry for the in retrospect unhelpful suggestion.

There are a few small issues still on the rowQuantiles and colQuantiles interfaces which I've commented in place -- rowQuantiles isn't quite forwarding all args and colQuantiles is returning transposed relative to matrixStats. As a reminder, you'll need to update the return type docs for colQuantiles once you fix it to match matrixStats.

Comment on lines 247 to 251
colnames(res) <- colnames(x)
if (length(probs) == 1) return(res[1,])
# `quantile()` from base R returns rownames as percentages, so we follow that convention
rownames(res) <- paste0(format(100 * probs, trim = TRUE, digits = digits), "%")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still returning a matrix transposed relative to matrixStats::colQuantiles(). rowQuantiles() appears to be fine, though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is my bad. I thought that the transposed case was specific to rowQuantiles(). I think the obvious change is to make the test cases themselves use the matrixStats functions for non-iterablematrices so the behaviour is guaranteed to work as expected

Comment on lines 146 to 148
MatrixGenerics::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, digits = digits, ..., useNames = useNames, drop = drop)
} else if (requireNamespace("matrixStats", quietly = TRUE)) {
matrixStats::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, ...)
matrixStats::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, digits = digits, ..., useNames = useNames, drop = drop)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rowQuantiles is missing rows and cols argument forwarding. colQuantiles looks fine

@@ -21,26 +21,29 @@ namespace BPCells {
template <typename T>
PseudobulkStats pseudobulk_matrix(std::unique_ptr<MatrixLoader<T>> &&mat,
const std::vector<uint32_t>& cell_groups,
PseudobulkStatsMethod method,
int method,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it's not obviously worth the trade-off. For the future if it seems like a review style suggestion isn't really working well, you can just not make the change and give a short reply with your rationale. (or ping me on slack if you want fast confirmation about ditching the suggestion)

@immanuelazn
Copy link
Collaborator Author

  • Changed quantile tests to use matrixStats functions, so we can ensure correct syntax and behaviour
  • Reverted PseudobulkStatsMethod bit flagging to no longer be cast as ints.
  • Transposed colQuantiles() to match matrixStats

Copy link
Owner

@bnprks bnprks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • For quantile tests, need a skip_if_not_installed("matrixStats") call since matrixStats is only a test-time dependency so might not necessarily be installed
  • With the changed orientation of matrix_quantile_per_col(), the Quantile.h file comment needs updating to be accurate for the return value description
  • Line 50 of Quantile.cpp contains commented out code
  • singlecell_utils.R line 132 docstring description should say rowQuantiles for both references
  • singlecell_utils.R line 130, 193 need to prefix with - FunctionName(): to render properly in the website docs
  • Discussion: what file to put the quantile code in, and also what files to put tests in
    • Quantile tests -> test_matrix_stats.R
    • Quantile R implemenation -> initialize matrix_stats.R
    • Pseudobulk tests -> test_singlecell_utils.R

@immanuelazn immanuelazn merged commit 334f172 into main Oct 22, 2024
@bnprks bnprks deleted the ia/pseudobulking-matrices branch October 25, 2024 23:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants