-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[r] Discussion: R one-based indexing vs. soma_joinid
#1232
Comments
soma_joinid
soma_joinid
Expectations for this issue:
|
Attempting to distill the key dilemma:
|
This is true for sparse arrays; not true for dense ones If we go 1-up everywhere that's fine -- data generated before that will be readable by Python, and data generated after that will be readable by Python and R ... ... all would be well as long as we forever abandon any hope of any return to dense matrices, ever. |
I fully agree that Throwing out an idea for discussion:
E.g., If an obs query is performed to find a set of rows, and the returned rows are used to access a sparse X, the My rambling thoughts that led up to the above proposal: |
@atolopko-czi Thanks, I think you're right that the idea of range-slicing on either I need to push back on one point though,
We have the specific challenge that the R |
@atolopko-czi I like your idea. A priori I don't like having two columns which (in practice) will differ only by one. However, other options are also fraught. As compromises go, your My only ask is that for |
@atolopko-czi my reasoning:
|
I fully agree with Mike's concerns about implicitly adding one to Regarding the challenge of indexing R sparse matrices with Here's an example using the snippet @eddelbuettel shared in slack earlier: i <- c(0,2,4)
j <- c(1,3,5)
x <- c(100,200,300)
M <- sparseMatrix(i, j, x = x, index1=FALSE, repr="T")
dimnames(M) <- list(
as.character(seq.int(0, max(i))),
as.character(seq.int(0, max(j)))
)
M
## 5 x 6 sparse Matrix of class "dgTMatrix"
## 0 1 2 3 4 5
## 0 . 100 . . . .
## 1 . . . . . .
## 2 . . . 200 . .
## 3 . . . . . .
## 4 . . . . . 300 This explicitly treats M["0", "1"]
## [1] 100 Of course, there's still potential for making the same off-by-one error if they forget to make that conversion. M[0, 1]
## numeric(0) But we're still maintaining the integrity of |
I'd like to throw my support behind constraining Arrays in X layers, Note: I'm going to write in Seurat terminology (because it's easier for me and consistent with Seurat vignettes/help pages and additional documents I've prepared about v5); a Seurat <> SOMA dictionary is provided below the break Seurat <> SOMA terminology dictionary
In Seurat v3, every assay must have the same cells, in the same order, and every slot within each assay must have the same features. Features did not have to be consistent between assays, only within. All of this was done using character indexing, because it's easy. However, one major goal of Seurat v5 was to break both of these constraints on the For Seurat v5, we allow assays to not only have different features, but now different cells. We also allow layers within assays to have different cells and features from one another. We do this by defining a canonical set of cells that are present in the
Our resulting canonical set will be
And both In SOMA (much like Seurat v5), the sole purpose of I cannot see a reason for the user to care about Furthermore, because we will have the full space of As for whether or not the R API should automatically translate
Which I think is fine because this shouldn't be an issue with the already existing SOMAs and will formalize how we (likely) already treat |
Thanks @mojaveazure , I think your input will rightly be very influential to our direction here. I'm largely in agreement, but need to press us on this point --
I heard two ideas for this, neither of which is fully ideal:
LMK if I missed a better approach? If not, then these seem not much better to me than the status quo, which requires user to explicitly add one to (To be clear, I think it should be doable to have the R bindings implicitly add one to any dataframe column or array dimension named |
One possible way of automatically adjusting
The other possible place to do it would be in the @eddelbuettel @aaronwolen any POV on these ideas? I'll admit I'm a bit less bullish on rewriting the id's (as part of "Option 1") after looking into more here. I don't suppose it's possible to do the equivalent of |
Going into the The real trouble remains that there will be a difference between such a 'view' and what is do disk. Which may be the pill we have to swallow and accompany with lots of 'public service announcements' about the 'one delta' between representation and materilization. |
@eddelbuettel flipping this in the opposite direction, I think might be the best solution so far!
|
When are users going to be realizing a SOMA array as a matrix, then accessing via Moreover, when are users going to be using All in all, because > library(bit64)
> library(Matrix)
> sparse_mat <- sparseMatrix(i = c(0, 1, 2), j = c(0, 2, 1), x = c(3, 5, 1), index1 = FALSE, dims = c(4, 4))
> sparse_mat
4 x 4 sparse Matrix of class "dgCMatrix"
[1,] 3 . . .
[2,] . . 5 .
[3,] . 1 . .
[4,] . . . .
> soma_joinids <- seq.integer64(0, 2)
> soma_joinids
integer64
[1] 0 1 2
> sparse_mat[soma_joinids, soma_joinids]
Error in sparse_mat[soma_joinids, soma_joinids] :
invalid or not-yet-implemented 'Matrix' subsetting |
@mojaveazure It's a fair question...the original motivating example for filing this issue was this draft R vignette (a port of this python notebook), which accesses a SOMASparseNDArray indexing which genes are present in which datasets. The key operation illustrated is taking a subset of I defer to you on how realistic a use case this is, but, I was tasked to port the notebook 😅 (There are multiple things wrong with the current draft vignette right now; I need to revise it thoroughly once we resolve this) I'm going to keep hacking on |
For this task, I would wager that the presence matrix would be more useful if it was named rather than relying on integer indexing. Seurat v5 is introducing presence matrices (already available in v4, but not widely used at the moment), if you'd like to see how we handle this task I'm also not sure
And these can be either Example: (rough idea off the top of my head, haven't tested) # define for CsparseMatrix, RsparseMatrix, and TsparseMatrix
# define for: (or `methods::setClassUnion("anynumber", c("integer", "numeric"))`)
# - i = numeric, j = numeric
# - i = numeric, j = missing
# - i = missing, j = numeric
# - i = integer, j = integer
# - i = integer, j = missing
# - i = missing, j = integer
# - i = integer, j = numeric
# - i = numeric, j = integer
# Can (probably) get away with defining as a separate function and embedding that as `definition`
setMethod(
f = "[",
signature = c(x = "CsparseMatrix", i = "numeric", j = "numeric")
definition = function(x, i, j, ..., drop = FALSE, index0 = FALSE) {
if (isTRUE(index0)) {
i <- i + 1L
j <- j + 1L
}
return(callNextMethod(x, i, j, ..., drop = drop))
}
) |
@mojaveazure @eddelbuettel Here's a fleshing out of https://gist.github.com/mlin/03a3a93a4197bbfc464985eb5e0684a3#file-zerobasedsparsematrix-r (For those trying to follow along: this is a possible distinct solution the key dilemma by wrapping sparseMatrix so that it accepts zero-based indexes directly. Although that's unconventional for R, multiple commenters have noted that users ought not care about the specific values in the |
@eddelbuettel @mojaveazure @aaronwolen Here's the monstrosity fully formed with tests passing. I certainly have reservations, but I think at least it's viable... https://github.com/single-cell-data/TileDB-SOMA/pull/1306/files |
@eddelbuettel @mojaveazure @aaronwolen @johnkerl After sleeping on this again, I realized that the zero-based shim would be much better as a distinct wrapper object instead of subclassing all the My confidence is partly based on separate effort to implement the next-best alternative, excluding zero from the domain of |
Nice work. I quite like what you did there -- the 'view-only' shim does what we want. One somewhat trivial addition is to outlaw writing to such a matrix. I.e. using your test data > library(tiledbsoma)
> sm <- Matrix::sparseMatrix(i = 1:3, j = 1:3, x = c(41, 42, 43))
> mat <- matrixZeroBasedView(sm)
> mat[0:1, 0:1]
2 x 2 sparse Matrix of class "dgCMatrix"
[1,] 41 .
[2,] . 42
> mat[0, 0]
[1] 41
> mat[0, 0] <- 42
Error: I'm sorry, Dave. I'm afraid I can't do that.
> Not with the exact same error message, but you get the idea. My quick draft used modified apis/r/R/utils-matrixZeroBasedView.R
@@ -46,6 +46,22 @@ matrixZeroBasedView <- function(one_based_matrix) {
}
}
+
+#' Zero-based matrix element assigment
+#'
+#' This method errors as a read-only view is implemented.
+#'
+#' @param x The matrix viewed in zero-based mode.
+#' @param i Row index, zero-based
+#' @param j Column index, zero-based
+#' @param val The to-be assigned value
+#'
+#' @return Nothing as the method errors.
+#' @export
+`[<-.matrixZeroBasedView` <- function(x, i, j, value) {
+ stop("I'm sorry, Dave. I'm afraid I can't do that.", call. = FALSE)
+}
+
#' dim
#'
#' @param x (I assume here we would never want to assign into the matrix. I could be wrong on that.) |
…a_joinid lookup (#1313) To load a one-based `Matrix::sparseMatrix` with the contents of a zero-based `SOMASparseNDArray`, we add one to the offsets on each dimension. However, these dimensions are usually populated with `soma_joinid` intended to match the `soma_joinid` column in the `obs` & `var` data frames, and shifting them by one makes this join operation very error-prone. Here we introduce a minimal shim for `sparseMatrix` that provides matrix access with zero-based indexes. To make this explicit for the user, we rename `SOMASparseNDArray$read_sparse_matrix()` to `SOMASparseNDArray$read_sparse_matrix_zero_based()`. The shim supports only minimal access operations, which is intentional to prevent "mixing" it with conventional one-based objects. If needed, the fully-featured `sparseMatrix` is recovered by calling `as.one.based()` on the shim. Thus, the default behavior is to match `soma_joinid` but an advanced user can explicitly change to one-based indexing for further use in R. Extended discussion: #1232
This ticket is for capturing discussions & decisions!
In the R ecosystem all vector and array indexing is one-based, which users/devs are highly accustomed to.
The SOMA abstract specification defines both dense & sparse arrays to have zero-based offset indexing. Although we need to reinterpret that slightly for R, that's not inherently problematic because it's entirely natural to treat R array offsets/positions as one-based instead.
Bigger papercuts arise from
soma_joinid
, which we define as follows:Note that
soma_joinid
can take the value 0, but needn't -- in principle they can be any unique numbers throughout the specified domain (although in current practice they do usually equal zero-based row numbers in theobs
/var
dataframes).This will cause friction any time we have an NDArray dimension indexed by a
soma_joinid
, including allX
matrices and the Census' feature-dataset presence matrix. Becausesoma_joinid
can take the value zero, it seems we need to add one to it in order to populate a one-based dimension of an RsparseMatrix
:TileDB-SOMA/apis/r/R/SOMASparseNDArray.R
Lines 174 to 175 in 4e00906
Yet this is in some sense inappropriate, because
soma_joinid
is defined as an actual ID, not an offset/row number.Here's an example R vignette / upstream Python notebook; where, because the R
sparseMatrix
dimensions are effectivelysoma_joinid+1
, to access it bysoma_joinid
found from other dataframes, we need to explicitly add one to those IDs.A few strawperson ideas for discussion:
soma_joinid
everywhere, including dataframes as well assparseMatrix
dimensions.soma_joinid
is just the row number in theobs
/var
dataframe? (see final note below)soma_joinid
.soma_joinid
with row number in the var/obs dataframe. (This could be a pro or con depending on how you look at it!)Final note: this is also related to discussion/debate on the utility of
SOMADenseNDArray
. If we really want to support dense arrays (which the spec generally usessoma_joinid
for the dimensions) then it seems we'd need to tighten the definition ofsoma_joinid
to at least roughly correspond toobs
/var
dataframe row numbers. Otherwise I can say I have one (1) cell withsoma_joinid = 2^63-1
, and that's valid yet effectively unrepresentable with dense arrays.The text was updated successfully, but these errors were encountered: