-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[r][cpp] add pseudobulking to matrices #128
Conversation
change usage of percentile to quantile
8b6240b
to
ea89bda
Compare
…ix_quantile_per_cell()`
ea89bda
to
7c7ab72
Compare
change return type of `pseudobulk_matrix()` into a named list. add in rcpp layer to pseudobulk and quantile functions.
Added in the following changes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gotten through a detailed review pass, and ended up picking up on a lot of things I didn't notice in the first pass for high level feedback.
I think the most important change to think about is how to make the R interface of pseudobulk_matrix
a nice as we can, particularly with my comment about return values of list vs. raw matrix return. We should agree on what the right solution for it is, but I put some of my thoughts for options in the comment.
There were a couple other efficiency comments scattered through the C++, happy to chat more if the rationale for any of those is unclear or if how to implement is uncertain.
change `matrix_quantile_per_cell()` to S3 generalizable `colQuantile()` function change `colQuantile()` to use type 7 quantile calculation change `pseudobulk_matrix()` to use a numeric clip_values representing quantile rather than boolean set to .99 change `pseudobulk_matrix()` to return a single matrix rather than a named list when only one method given remove `matrix_quantile_per_row()` fix problem with sum calculation in `pseudobulk_matrix()` when requesting more complex method various documentation changes
change |
9fdd898
to
9ab448c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a second round of changes. Good job overall with the various performance changes requested from last round. I commented a couple spots for further refinements within those changes you made.
One style change I've noticed a few times (perhaps copied from my own bad habits) is making a separate variable to track a vector.size()
. This often just introduces more ways for things to go wrong. And I recently found from the docs that calling vector.resize()
to a smaller size is guaranteed not to cause extra allocations/frees, so even when using a vector
as just a memory buffer, it's always fine for performance to resize it down to the actual used size.
Other than that, I've been starting to get the hang of checking the generated assembly for what optimizations happened -- would be cool to demo that for you some time.
r/R/singlecell_utils.R
Outdated
|
||
|
||
#' Find the nth quantile value(s) of each row in a matrix. Only supports transposed matrices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that I can't add params in, or else the IterableMatrix methods
page gets all screwed up. It is probably a good time to start splitting that page up into a new section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I think splitting into at least a new page with the matrixStats
interface functions makes sense, though we can put that in a follow-up PR.
Few notes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall comments:
- I think we should ditch the clipping option. I think the ideal clipping for pseudobulks would measure per-gene quantiles within each pseudobulk, then apply the clipping within each pseudobulk.
- A non-ideal but acceptable option would be to clip along rows. This at least applies the clipping per-gene, but might cause issues when a rare cell population expresses a certain gene much higher than other cells. Then the clipping might eliminate the high expression of that marker.
- You should remember to worry about how invalid arguments get handled. You don't need to add unit tests for these, but I like to manually run a small example to check nothing bad happens if I pass a bad value to each argument in sequence. (See the code polishing checklist)
- In general, I find doing a little bit of manual testing is a good way to check error messages and edge cases more quickly. Those don't always require added tests (particularly edge cases that turn out to work correctly without any special handling in the code)
- Prefer col-major or row-major over saying "non-transposed" or "transposed". This is a little bit of a mismatch between the internal variable names BPCells uses and the external communication. Functions like
storage_order
andshow
return the row or col major language, and error messages should match that too.
I've also put these follow-up tasks into the initial description of this PR so we can remember to come back:
- Move row/colQuantiles into a different file. Perhaps make a
matrixStats.R
file that collects all of the matrixStats interface functions - Split function documentation to have the
matrixStats
interfaces as a page, and then maybe split up the arithmetic functions into a few groups too
r/tests/testthat/test-pseudobulk.R
Outdated
expect_error(pseudobulk_matrix(m1, cell_group, method = "nonexistent_method")) | ||
expect_error(pseudobulk_matrix(m1, cell_group, method = c("nonexistent_method", "sum"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these expect_error
tests, it's often helpful to fill in the 2nd argument (regex on what the error message should say). In this case, you did have an error but it was uncontrolled rather than printing out something useful so the tests gave some false confidence.
Co-authored-by: Ben Parks <[email protected]>
Co-authored-by: Ben Parks <[email protected]>
I think I have addressed all comments, I also gave it a pass with manually checking everything. As for the two points that you have put up, I put them into an issuue within the projects page, for a new PR. Thanks for being so patient and detailed with your review Ben |
@@ -21,26 +21,29 @@ namespace BPCells { | |||
template <typename T> | |||
PseudobulkStats pseudobulk_matrix(std::unique_ptr<MatrixLoader<T>> &&mat, | |||
const std::vector<uint32_t>& cell_groups, | |||
PseudobulkStatsMethod method, | |||
int method, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely sure if it is 100% worth the syntactic cleanliness, if we have to pass these in now as ints. I feel like this kind of obfuscates how to use the method argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree it's not obviously worth the trade-off. For the future if it seems like a review style suggestion isn't really working well, you can just not make the change and give a short reply with your rationale. (or ping me on slack if you want fast confirmation about ditching the suggestion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the slow review, almost all there.
The pseudobulk_matrix function looks completely good now. You're right that switching the C++ bitflag code was not obviously beneficial -- sorry for the in retrospect unhelpful suggestion.
There are a few small issues still on the rowQuantiles and colQuantiles interfaces which I've commented in place -- rowQuantiles isn't quite forwarding all args and colQuantiles is returning transposed relative to matrixStats
. As a reminder, you'll need to update the return type docs for colQuantiles once you fix it to match matrixStats
.
r/R/singlecell_utils.R
Outdated
colnames(res) <- colnames(x) | ||
if (length(probs) == 1) return(res[1,]) | ||
# `quantile()` from base R returns rownames as percentages, so we follow that convention | ||
rownames(res) <- paste0(format(100 * probs, trim = TRUE, digits = digits), "%") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still returning a matrix transposed relative to matrixStats::colQuantiles()
. rowQuantiles()
appears to be fine, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is my bad. I thought that the transposed case was specific to rowQuantiles()
. I think the obvious change is to make the test cases themselves use the matrixStats functions for non-iterablematrices so the behaviour is guaranteed to work as expected
r/R/singlecell_utils.R
Outdated
MatrixGenerics::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, digits = digits, ..., useNames = useNames, drop = drop) | ||
} else if (requireNamespace("matrixStats", quietly = TRUE)) { | ||
matrixStats::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, ...) | ||
matrixStats::rowQuantiles(x, probs = probs, na.rm = na.rm, type = type, digits = digits, ..., useNames = useNames, drop = drop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rowQuantiles is missing rows
and cols
argument forwarding. colQuantiles looks fine
@@ -21,26 +21,29 @@ namespace BPCells { | |||
template <typename T> | |||
PseudobulkStats pseudobulk_matrix(std::unique_ptr<MatrixLoader<T>> &&mat, | |||
const std::vector<uint32_t>& cell_groups, | |||
PseudobulkStatsMethod method, | |||
int method, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree it's not obviously worth the trade-off. For the future if it seems like a review style suggestion isn't really working well, you can just not make the change and give a short reply with your rationale. (or ping me on slack if you want fast confirmation about ditching the suggestion)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- For quantile tests, need a
skip_if_not_installed("matrixStats")
call sincematrixStats
is only a test-time dependency so might not necessarily be installed - With the changed orientation of
matrix_quantile_per_col()
, theQuantile.h
file comment needs updating to be accurate for the return value description - Line 50 of
Quantile.cpp
contains commented out code -
singlecell_utils.R
line 132 docstring description should say rowQuantiles for both references -
singlecell_utils.R
line 130, 193 need to prefix with- FunctionName():
to render properly in the website docs - Discussion: what file to put the quantile code in, and also what files to put tests in
- Quantile tests ->
test_matrix_stats.R
- Quantile R implemenation -> initialize
matrix_stats.R
- Pseudobulk tests ->
test_singlecell_utils.R
- Quantile tests ->
Description
Added function
matrix_quantile_per_cell()
to find the nth quantile value of each cell in a matrix. Allows for clipping usingmin_by_row()
andmin_by_col()
Added
pseudobulk_matrix()
with option to clip by quantile, and to aggregate by non-zeros, mean, sum, variance.Tests
pseudobulk_matrix()
, add tests for clipping, non-zeros, mean, sum, and variance in both transpose states.matrix_quantile_per_cell()
, add test comparing against R quantile function across transpose states and quantile valuesDetails
I've iterated on
pseudobulk_matrix()
a few times as shown in commit history. I tried to be a little bit smarter by using matrix multiplies to utilize multi-threading. However, the solution for non-zeros and variance is probably non-optimal due to requiring the additional iterative functions.These iterative functions are not multi-threaded, which makes me think I should have utilized a strategy like
computeMatrixStats()
inConcat{Cols,Rows}
. I think a better way would be to create a child class inheriting fromMatrixLoader
that finds matrix subsets based off of cell grouping. Then utilize the defaultMatrixLoader<T>::computeMatrixStats()
and manipulate it to fit the output matrix we're looking for. This would probably allow for use of threading, while also limiting the amount of duplicate code.Follow-up checklist:
(Added during code review)
matrixStats.R
file that collects all of the matrixStats interface functionsmatrixStats
interfaces as a page, and then maybe split up the arithmetic functions into a few groups too