
[r] Discussion: R one-based indexing vs. soma_joinid #1232

Closed

mlin opened this issue Apr 8, 2023 · 20 comments

Comments

@mlin
Member

mlin commented Apr 8, 2023

This ticket is for capturing discussions & decisions!


In the R ecosystem all vector and array indexing is one-based, which users/devs are highly accustomed to.

The SOMA abstract specification defines both dense & sparse arrays to have zero-based offset indexing. Although we need to reinterpret that slightly for R, that's not inherently problematic because it's entirely natural to treat R array offsets/positions as one-based instead.

Bigger papercuts arise from soma_joinid, which we define as follows:

Every SOMADataFrame must contain a column called soma_joinid, of type int64 and domain [0, 2^63-1]. The soma_joinid column contains a unique value for each row in the SOMADataFrame, and is intended to act as a join key for other objects, such as SOMASparseNDArray.

Note that soma_joinid can take the value 0, but needn't -- in principle they can be any unique numbers throughout the specified domain (although in current practice they do usually equal zero-based row numbers in the obs/var dataframes).

This will cause friction any time we have an NDArray dimension indexed by a soma_joinid, including all X matrices and the Census' feature-dataset presence matrix. Because soma_joinid can take the value zero, it seems we need to add one to it in order to populate a one-based dimension of an R sparseMatrix:

Matrix::sparseMatrix(i = 1 + as.numeric(tbl$GetColumnByName("soma_dim_0")),
                     j = 1 + as.numeric(tbl$GetColumnByName("soma_dim_1")),
                     x = as.numeric(tbl$GetColumnByName("soma_data")))

Yet this is in some sense inappropriate, because soma_joinid is defined as an actual ID, not an offset/row number.

Here's an example R vignette (ported from an upstream Python notebook) where, because the R sparseMatrix dimensions are effectively soma_joinid + 1, accessing it by soma_joinid values found in other dataframes requires explicitly adding one to those IDs.
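A minimal sketch of that friction (toy data; the joinid values and matrix contents are illustrative only):

library(Matrix)

# A presence-style matrix loaded with one-based dimensions, so that row r
# corresponds to soma_joinid r - 1 (toy values).
presence <- sparseMatrix(i = c(1, 2, 4), j = c(1, 3, 2), x = c(1, 1, 1), dims = c(4, 3))

# soma_joinids selected from an obs/var dataframe query (zero-based IDs)
joinids <- c(0, 3)

# The explicit +1 is required; forgetting it silently selects the wrong rows.
presence[joinids + 1, , drop = FALSE]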


A few strawperson ideas for discussion:

  1. Implicitly add one to soma_joinid everywhere, including dataframes as well as sparseMatrix dimensions.
    • Reservations about implicitly changing an ID, not an offset/row number.
    • Will there be various corner cases where we might miss the implicit adjustment?
    • Should we then admit that soma_joinid is just the row number in the obs/var dataframe? (see final note below)
  2. Amend SOMA spec to exclude zero from the domain of soma_joinid.
    • This is a breaking spec change after we declared v1.0, and it would also trigger code changes to at least the Census builder.
    • Would probably also reveal corners where we've conflated soma_joinid with row number in the var/obs dataframe. (This could be a pro or con depending on how you look at it!)
  3. Don't change code or spec; just rely on thorough documentation with prominent warnings & explanation.

Final note: this is also related to discussion/debate on the utility of SOMADenseNDArray. If we really want to support dense arrays (for which the spec generally uses soma_joinid as the dimensions) then it seems we'd need to tighten the definition of soma_joinid to at least roughly correspond to obs/var dataframe row numbers. Otherwise I can say I have one (1) cell with soma_joinid = 2^63-1, and that's valid yet effectively unrepresentable with dense arrays.

@johnkerl johnkerl changed the title [r] discussion- R one-based indexing vs. soma_joinid [r] Discussion: R one-based indexing vs. soma_joinid Apr 10, 2023
@johnkerl
Member

Expectations for this issue:

  • There is, of course, no magical, one-please-all solution
  • Let's get a write-up of the various choices available and the plusses & minuses for each
  • Schedule a team meeting
  • Get buy-in from the team
  • Formulate a plan

@mlin
Member Author

mlin commented Apr 17, 2023

Attempting to distill the key dilemma:

  1. Hopefully not too controversial: array & matrix offsets should be zero-based in Python API and one-based in R API.
  2. Important distinction: soma_joinid is not an offset, as the SOMA spec currently defines the base concept; it's just an id for each observable/variable (it need not start at zero/one, nor form a contiguous range).
    • Corollary: it's not really appropriate for our R stuff to implicitly add one to soma_joinid, since it's an id, not an offset.
  3. However, SOMA also uses soma_joinid as the axis indices/offsets for X matrices (and the Census feature/dataset presence matrix)
  4. Therefore, when populating an R sparseMatrix with one-based dimensions, we're caught between a rock (soma_joinid may be zero) and a hard place (it seems wrong to implicitly add one to it).
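To make the rock concrete, a minimal sketch (toy values; the exact error text varies by Matrix version):

library(Matrix)

# soma_dim_0 / soma_dim_1 values as read from a SOMASparseNDArray (hypothetical)
i <- c(0, 1, 2)
j <- c(0, 2, 1)

# One-based construction rejects the zero index outright, so some shift
# (or a zero-based shim) is unavoidable.
try(sparseMatrix(i = i, j = j, x = c(3, 5, 1)))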

Options with pros & cons:

Option 1: Tighten definition of soma_joinid to be the offset (row number) in the obs/var dataframe.
  Cons:
    • R bindings need to implicitly add one to soma_joinid in a few places
    • Precludes soma_joinid from being stable across database revisions (e.g. if datasets are withdrawn)
    • Theoretically (but not in practice) a breaking change to SOMA 1.0

Option 2: Exclude zero from the domain of soma_joinid.
  Pros:
    • soma_joinids would not differ between R & Python (as expected for an "id")
    • Continues allowing for stable IDs
    • Simpler R code
  Cons:
    • Breaking change to SOMA 1.0 (in both theory and practice, albeit not a huge one)
    • Punts questions about dense arrays

Option 3: Status quo.
  Pros:
    • Least timeline impact
  Cons:
    • R users have to expressly add one to soma_joinid when accessing X and similarly-indexed arrays
    • Otherwise prone to silent off-by-one bugs
    • Punts questions about dense arrays

@johnkerl
Member

johnkerl commented Apr 17, 2023

soma_joinid is not an offset, as the SOMA spec currently defines the base concept

This is true for sparse arrays, but not for dense ones.

If we go 1-up everywhere that's fine -- data generated before that will be readable by Python, and data generated after that will be readable by Python and R ...

... all would be well as long as we forever abandon any hope of any return to dense matrices, ever.

@atolopko-czi
Contributor

I fully agree that soma_joinid is not an offset; it's just an identifier (key).

Throwing out an idea for discussion:

  • The obs and var dataframes should maintain both a soma_joinid and offset.
  • The offset is the (TileDB) Dim and the soma_joinid becomes a (TileDB) Attr of the obs and var DataFrames, rather than their Dim. Since this would be newly introduced, we could have it be 1-based. Schema changes might look like this:
    • In obs and var: Dim(name='soma_offset', domain=(1, N)), Attr(name='soma_joinid', domain=(0, N-1))
    • In dense arrays: Dim(name='soma_offset_dim_{0,1}', domain=(1, N))
    • In sparse arrays: Dim(name='soma_joinid_dim_{0,1}', domain=(0, N-1))
  • The soma_joinid is used for accessing sparse arrays.
  • The soma_offset is used for accessing dense arrays.

E.g., if an obs query is performed to find a set of rows, and the returned rows are used to access a sparse X, the obs.soma_joinid values are used to retrieve data from X. If a dense obsm (embedding) is accessed, the obs.soma_offset values are used to retrieve data from obsm.
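A minimal sketch of how the two columns would divide the work (all values illustrative):

# Proposed obs layout: a one-based soma_offset alongside the zero-based soma_joinid.
obs <- data.frame(
  soma_offset = 1:4,                      # one-based row number, used for dense arrays
  soma_joinid = c(0, 1, 2, 3),            # zero-based join key, used for sparse arrays
  cell_type   = c("B", "T", "NK", "T")
)

selected <- obs[obs$cell_type == "T", ]
selected$soma_joinid   # pass straight into a sparse X query
selected$soma_offset   # index rows of a dense obsm matrix directly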

My rambling thoughts that led up to the above proposal:
Querying obs or var for a contiguous slice of rows [N:M] is biologically meaningless. Similarly, one never (meaningfully) needs to retrieve a contiguous slice along an X axis. Retrieving a contiguous range of obs, var, or a dense array is only useful if you're grabbing an arbitrary set of rows for testing or inspection. So I don't really understand the cases where an R or Python user needs to think about the "meaning" of an offset or soma_joinid value, or to explicitly specify a 0 or 1 value in a query.

If one is intent on querying a DataFrame or Matrix (either dense or sparse!) for an arbitrary range, you do so using the offsets. And if these offsets happen to be 0-based, then yes, R users need to consider this when making these arbitrary range queries. If an R user is querying obs to get offsets for a dense array, and the dense array is 0-based, then the R user needs to do the 1-biasing on all the offset values. But as proposed above, we can just go with 1-based Dims for dense arrays and make R users happy.

Querying sparse arrays should be straightforward, since soma_joinid values can just be blindly passed through from an obs query result to a sparse matrix query. I don't see that an R user would be "handling" these values in a way where they would care whether a 0 value existed in the set of soma_joinids; a 0 value would only be "seen" if you're debugging your code and inspect the id values.

@mlin
Member Author

mlin commented Apr 21, 2023

@atolopko-czi Thanks, I think you're right that the idea of range-slicing on either soma_joinid or offset is ill-conceived.

I need to push back on one point though,

I don't see that an R user would be "handling" these values in a way where they cared whether a 0 value existed in the set of soma_joinids; a 0 value would only be "seen" if you're debugging your code and inspect the id values.

We have the specific challenge that the R sparseMatrix axes begin at 1, so we simply cannot instantiate it with a soma_joinid == 0 unless we somehow change the ids. It's annoying that this "technicality" is spec-breaking for the (primary) sparse case, but I think, unfortunately, that is the case =(

@johnkerl
Member

johnkerl commented Apr 21, 2023

@atolopko-czi I like your idea.

A priori I don't like having two columns which (in practice) will differ only by one. However, other options are also fraught. As compromises go, your soma_offset/soma_joinid approach puts the burden of +1/-1 solely and squarely at creation time, and nothing downstream needs to remember (or risk forgetting) any shift-by-one logic. I think the risk-reduction justifies the data-storage cost of us having another column in obs and var.

My only ask is that for obs and var we keep soma_joinid as the dim and make soma_offset an attr, rather than the other way around -- as at present most of our arrays are sparse, and the soma_offset (which is destined for dense arrays) should be a bit off to the side, and the soma_joinid (which is for sparse arrays) should be more front-and-center.

@johnkerl
Member

johnkerl commented Apr 21, 2023

@atolopko-czi my reasoning:

  • Perhaps I should let go of ... using soma_offset as the dim on obs and var leaves open the possibility of someday storing obs and var as dense dataframes ...
  • However I stick with my position: better to have what we do currently use (soma_joinid) remain the dim before and after this change, in terms of backward compatibility, and in terms of simplifying the code.

@aaronwolen
Member

I fully agree with Mike's concerns about implicitly adding one to soma_joinid. As I mentioned in our meeting last week (but failed to write up here—my apologies), presenting a user with a value that differs from what's on disk—even by one—feels wrong and confusing.

Regarding the challenge of indexing R sparse matrices with soma_joinids: if we concede they are indeed IDs and not offsets (at least in the sparse case), we could consider populating the dimension labels of sparse matrices with the soma_joinids.

Here's an example using the snippet @eddelbuettel shared in slack earlier:

library(Matrix)

i <- c(0,2,4)
j <- c(1,3,5)
x <- c(100,200,300)
M <- sparseMatrix(i, j, x = x, index1=FALSE, repr="T")

dimnames(M) <- list(
  as.character(seq.int(0, max(i))),
  as.character(seq.int(0, max(j)))
)
M

## 5 x 6 sparse Matrix of class "dgTMatrix"
##   0   1 2   3 4   5
## 0 . 100 .   . .   .
## 1 .   . .   . .   .
## 2 .   . . 200 .   .
## 3 .   . .   . .   .
## 4 .   . .   . . 300

This explicitly treats soma_joinids as keys/IDs and provides a visual reminder that they're 0-based. It's also possible to index the matrix using the original soma_joinid values if they're passed as a string/character.

M["0", "1"]
## [1] 100

Of course, there's still potential for making the same off-by-one error if users forget to make that conversion.

M[0, 1]
## numeric(0)

But we're still maintaining the integrity of soma_joinid and allowing users to index by ID or by offsets.
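As a follow-up to the snippet above (reusing M; the joinid values here are hypothetical), a vector of soma_joinids returned from a dataframe query can index the matrix directly once converted to character:

joinids <- c(0, 2, 4)                      # hypothetical soma_joinids from an obs/var query
M[as.character(joinids), , drop = FALSE]   # rows keyed by ID; no +1 adjustment needed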

@mojaveazure
Member

mojaveazure commented Apr 21, 2023

I'd like to throw my support behind constraining soma_joinid to be an offset rather than an ID. Specifically, I'd like to require that obs has a soma_joinid ranging over [0, n_obs] sequentially with no gaps, and that each var has a soma_joinid ranging over [0, m_var] sequentially with no gaps. Both would be generated during SOMA creation and would be consistent across the various sub-arrays of a given SOMA.

Arrays in X layers, obsm, varm, obsp, and varp would have soma_joinid in the ranges [0, max(obs$soma_joinid)] and [0, max(var$soma_joinid)] sequentially, but with gaps allowed. This stems from Seurat v5 and how we handle gapped matrices.

Note: I'm going to write in Seurat terminology (because it's easier for me and consistent with Seurat vignettes/help pages and additional documents I've prepared about v5); a Seurat <> SOMA dictionary is provided below the break

Seurat <> SOMA terminology dictionary
  • cells/cell name: obs_id
  • features/feature name: var_id
  • cell-level meta data: obs
  • feature-level meta data: var
  • assay: measurement
  • slot (v3) or layer (v5): X layer
  • Seurat object: SOMA experiment

In Seurat v3, every assay must have the same cells, in the same order, and every slot within each assay must have the same features. Features did not have to be consistent between assays, only within. All of this was done using character indexing, because it's easy. However, one major goal of Seurat v5 was to break both of these constraints on the Seurat object

For Seurat v5, we allow assays to have not only different features, but now also different cells. We also allow layers within assays to have different cells and features from one another. We do this by defining a canonical set of cells present in the Seurat object, as well as a canonical set of features. When creating a Seurat object in v5, we gather a list of all unique cell and feature names across all matrices and assays provided and index them sequentially from 1 to N_cells/M_features. We then order each matrix and assay based on the sequential index of cells and features. Finally, we store the canonical cell and feature index and order in presence matrices and use this as our map between cell or feature name and index in the matrix. For example, if we have two matrices:

  • matrix m1 with cells cell1, cell2, cell4, and cell5
  • matrix m2 with cells cell2, cell3, and cell5

Our resulting canonical set will be

  1. cell1
  2. cell2
  3. cell4
  4. cell5
  5. cell3

And both m1 and m2 will be reordered based on those indexes. As such, Seurat index 4 will always refer to cell5, and we can refer back to our presence matrices to figure out where index 4 is for both m1 and m2; this index is valid for this Seurat object only. When creating a new Seurat object, including new versions of the same data, new indexes are generated for the set of cells and features provided. The sole purpose of the index is to join cell and feature names to physical location within the layer matrices and other sub-objects
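A base-R sketch of the canonical-index construction described above (cell names taken from the example; everything else is illustrative):

m1_cells <- c("cell1", "cell2", "cell4", "cell5")
m2_cells <- c("cell2", "cell3", "cell5")

# Canonical set: first-seen order across all matrices, indexed sequentially
canonical <- unique(c(m1_cells, m2_cells))       # cell1 cell2 cell4 cell5 cell3
index <- setNames(seq_along(canonical), canonical)

index["cell5"]               # 4 -- stable for this object only
match(m2_cells, canonical)   # where each m2 cell sits in the canonical order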

In SOMA (much like Seurat v5), the sole purpose of soma_joinid would be to match a meaningful term (obs_id, var_id, or another bit of metadata in obs or var) to an index that can be quickly queried in every other array within a SOMA. We can obviate the need for presence matrices as the soma_joinid will be present in each array, so we can simply find where the requested soma_joinids are and pull those dimensions.

I cannot see a reason for the user to care about soma_joinid, either under the status quo or under my proposal, so applying constraints shouldn't matter. Moreover, this constraint would allow gapped arrays in obsm/varm/X layers/etc, enabling full support for Seurat v5

Furthermore, because we will have the full space of soma_joinid for both obs and var, and we know that soma_joinid is sequential, it will allow us to easily pad gapped arrays into full matrices when needed for Seurat v3 and Scanpy/Anndata
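A sketch of that padding step (hypothetical values), relying on the guarantee that the canonical soma_joinid range is sequential with a known maximum:

library(Matrix)

n_obs <- 6                        # size of the canonical obs soma_joinid range
i <- c(0, 2, 5)                   # soma_dim_0 values present in this gapped layer
j <- c(0, 1, 1)
x <- c(1.5, 2.0, 7.0)

# Passing the full dims pads the gaps: rows for missing soma_joinids are simply empty.
full <- sparseMatrix(i = i, j = j, x = x, index1 = FALSE, dims = c(n_obs, 2))
dim(full)   # 6 x 2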

As for whether or not the R API should automatically translate soma_joinid from 0-based to 1-based, I don't think that's necessary. As pointed out by @eddelbuettel and @aaronwolen, we can create sparse matrices using zero-based indexes and soma_joinid is generally useless to the end-user. Under this proposal, the only con from @mlin's list is

Theoretically (but not in practice) a breaking change to SOMA 1.0

Which I think is fine because this shouldn't be an issue with the already existing SOMAs and will formalize how we (likely) already treat soma_joinid

@mlin
Member Author

mlin commented Apr 26, 2023

Thanks @mojaveazure , I think your input will rightly be very influential to our direction here. I'm largely in agreement, but need to press us on this point --

As for whether or not the R API should automatically translate soma_joinid from 0-based to 1-based, I don't think that's necessary. As pointed out by @eddelbuettel and @aaronwolen, we can create sparse matrices using zero-based indexes

I heard two ideas for this, neither of which is fully ideal:

  1. (due to @aaronwolen above) Set dimnames(sparseMatrix) to as.character(soma_joinid) -- but then end user has to always access the matrix in this unusual way, as numeric row/col indexing is still one-based.
  2. (due to @eddelbuettel in slack) Initialize with sparseMatrix(i,j,data,index1=FALSE) -- accepts zero-based inputs, but then one still accesses the matrix using one-based indexes.

LMK if I missed a better approach? If not, then these seem not much better to me than the status quo, which requires the user to explicitly add one to soma_joinid to access any sparseMatrix indexed thereby.
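To make the limitation of idea 2 concrete, a minimal sketch (toy values): construction accepts zero-based inputs, but subsequent access is still one-based, so a forgotten adjustment silently returns nothing rather than erroring.

library(Matrix)

m <- sparseMatrix(i = c(0, 1), j = c(0, 2), x = c(3, 5), index1 = FALSE, dims = c(2, 3))
m[1, 1]   # 3 -- the element written at zero-based (0, 0)
m[0, 1]   # numeric(0): the zero index silently selects nothing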

(To be clear, I think it should be doable to have the R bindings implicitly add one to any dataframe column or array dimension named soma_joinid, so that the values in the obs/var dataframe column match the indexes with which the sparseMatrix is naturally accessed; but, it's a complication to account for in the decision matrix here.)

@mlin
Member Author

mlin commented Apr 26, 2023

One possible way of automatically adjusting soma_joinid as we read out the data frame: rewrite the column in SOMADataFrame::soma_reader_transform(), something like:

soma_reader_transform = function(x) {
    tab <- arrow::as_arrow_table(arrow::RecordBatch$import_from_c(x[[1]], x[[2]]))
    tab$soma_joinid <- tab$soma_joinid + 1
    tab
}

The other possible place to do it would be in the as.data.frame(arrow::Table), but we'd need to make our own subclass of arrow::Table to override that implementation.

@eddelbuettel @aaronwolen any POV on these ideas? I'll admit I'm a bit less bullish on rewriting the ids (as part of "Option 1") after looking into it more here. I don't suppose it's possible to do the equivalent of SELECT soma_joinid+1 when we send the query into TileDB, before the data reaches the points above?

@eddelbuettel
Contributor

eddelbuettel commented Apr 26, 2023

Going into the transform function may work. I stumbled across the index1=FALSE option when I was contemplating such shims. "Conceptually" we could shield all access to sparseMatrix (an internal, non-user-facing detail) in the package behind a layer of indirection, giving the user a 'one-adjusted view' of the data. (Similar to your SELECT soma_joinid + 1.)

The real trouble remains that there will be a difference between such a 'view' and what is on disk. That may be the pill we have to swallow, accompanied by lots of 'public service announcements' about the 'one delta' between representation and materialization.

@mlin
Member Author

mlin commented Apr 26, 2023

"Conceptually" we could shiedl all access to sparseMatrix (an internal, non-user facing) package by a layer of indirection giving the user a 'one-adjusted view' of the data. (Similar to your SELECT soma_joindid + 1.)

@eddelbuettel flipping this in the opposite direction might, I think, be the best solution so far!

library(Matrix)

setClass("sparseMatrixZeroBased", contains = "dgCMatrix")
sparseMatrixZeroBased <- function (...) {
  new("sparseMatrixZeroBased", sparseMatrix(...))
}
setMethod("[", signature(x = "sparseMatrixZeroBased", i = "numeric", j = "numeric"),
          function (x, i, j) {
            as(x, "dgCMatrix")[i + 1, j + 1]
          })

sparse_mat <- sparseMatrixZeroBased(i = c(0, 1, 2),
                                    j = c(0, 2, 1),
                                    x = c(3, 5, 1),
                                    index1 = FALSE,
                                    dims = c(4, 4))
> print(sparse_mat)
4 x 4 sparse Matrix of class "sparseMatrixZeroBased"
            
[1,] 3 . . .
[2,] . . 5 .
[3,] . 1 . .
[4,] . . . .
> print(sparse_mat[0:1,0:1])
2 x 2 sparse Matrix of class "dgCMatrix"
        
[1,] 3 .
[2,] . .

@mojaveazure
Member

When are users going to be realizing a SOMA array as a matrix, then accessing via soma_joinid? AFAICT, we don't have a matrix-like interface to SOMA. We allow users to read arrays in as sparse matrices, but how often are users going to be doing that directly instead of reading a SOMA in as a Seurat or SingleCellExperiment object? Both of those objects have names instead of join IDs, so users will use names (or subset() queries or whatever SCE offers) to slice their data. Furthermore, because soma_joinid is so difficult to access, I can't see any user pulling it for use.

Moreover, when are users going to be using soma_joinid at all? At best, a user could perform a query similar to which() to get the index of obs/var that matches some filter, but we don't offer that functionality. We don't offer a way to slice SOMAs (eg. [.SOMAExperiment) in general; instead, we offer experiment queries, which handle all indexing themselves

All in all, because soma_joinid is generally useless to the end user, I don't think we need to worry about 1-based or 0-based. If we really are worried, we could ensure that soma_joinid is always realized as an integer64, which will cause errors when trying to subset a Matrix

> library(bit64)
> library(Matrix)
> sparse_mat <- sparseMatrix(i = c(0, 1, 2), j = c(0, 2, 1), x = c(3, 5, 1), index1 = FALSE, dims = c(4, 4))
> sparse_mat
4 x 4 sparse Matrix of class "dgCMatrix"
            
[1,] 3 . . .
[2,] . . 5 .
[3,] . 1 . .
[4,] . . . .
> soma_joinids <- seq.integer64(0, 2)
> soma_joinids
integer64
[1] 0 1 2
> sparse_mat[soma_joinids, soma_joinids]
Error in sparse_mat[soma_joinids, soma_joinids] : 
  invalid or not-yet-implemented 'Matrix' subsetting

@mlin
Member Author

mlin commented Apr 26, 2023

@mojaveazure It's a fair question...the original motivating example for filing this issue was this draft R vignette (a port of this python notebook), which accesses a SOMASparseNDArray indexing which genes are present in which datasets. The key operation illustrated is taking a subset of soma_joinids from the var/obs dataframe and using those to access the corresponding bits from the presence matrix -- highly prone to silent off-by-one errors currently.

I defer to you on how realistic a use case this is, but, I was tasked to port the notebook 😅 (There are multiple things wrong with the current draft vignette right now; I need to revise it thoroughly once we resolve this)

I'm going to keep hacking on sparseMatrixZeroBased today to see if it has some fatal flaw, but otherwise I think it may be the simplest solution for our key dilemma.

@mojaveazure
Member

mojaveazure commented Apr 27, 2023

For this task, I would wager that the presence matrix would be more useful if it were named rather than relying on integer indexing. Seurat v5 is introducing presence matrices (already available in v4, but not widely used at the moment); take a look if you'd like to see how we handle this task.

I'm also not sure sparseMatrixZeroBased is the correct solution; we allow users to pull matrices in three representations

  • TsparseMatrix (internal default)
  • RsparseMatrix (not commonly used)
  • CsparseMatrix (most common)

And these can be either dMatrix or lMatrix objects (I'm exploring the latter for use with v5 presence matrices). sparseMatrixZeroBased is a subclass of dgCMatrix, but doesn't account for the other permutations of sparse matrices. If you want to extend sparse matrices, I might (hesitantly) recommend overriding [ for Matrix objects.

Example: (rough idea off the top of my head, haven't tested)

# define for CsparseMatrix, RsparseMatrix, and TsparseMatrix
# define for the following i/j signatures (or use `methods::setClassUnion("anynumber", c("integer", "numeric"))` to collapse them):
# - i = numeric, j = numeric
# - i = numeric, j = missing
# - i = missing, j = numeric
# - i = integer, j = integer
# - i = integer, j = missing
# - i = missing, j = integer
# - i = integer, j = numeric
# - i = numeric, j = integer
# Can (probably) get away with defining as a separate function and embedding that as `definition`
setMethod(
  f = "[",
  signature = c(x = "CsparseMatrix", i = "numeric", j = "numeric"),
  definition = function(x, i, j, ..., drop = FALSE, index0 = FALSE) {
    if (isTRUE(index0)) {
      i <- i + 1L
      j <- j + 1L
    }
    return(callNextMethod(x, i, j, ..., drop = drop))
  }
)

@mlin
Member Author

mlin commented Apr 27, 2023

@mojaveazure @eddelbuettel Here's a fleshing out of sparseMatrixZeroBased based on extraordinary abuse of the dynamic class system. I'm both proud of and horrified by it. Going to sleep on this one =)

https://gist.github.com/mlin/03a3a93a4197bbfc464985eb5e0684a3#file-zerobasedsparsematrix-r

(For those trying to follow along: this is a possible distinct solution to the key dilemma, wrapping sparseMatrix so that it accepts zero-based indexes directly. Although that's unconventional for R, multiple commenters have noted that users ought not care about the specific values in the soma_joinid matrix dimensions, only that they're consistent with those found in the obs/var dataframe -- and this could provide that without changing the SOMA spec or transforming the soma_joinid values at runtime.)

@mlin
Member Author

mlin commented Apr 28, 2023

@eddelbuettel @mojaveazure @aaronwolen Here's the monstrosity fully formed with tests passing. I certainly have reservations, but I think at least it's viable...

https://github.com/single-cell-data/TileDB-SOMA/pull/1306/files

@mlin
Member Author

mlin commented Apr 30, 2023

@eddelbuettel @mojaveazure @aaronwolen @johnkerl After sleeping on this again, I realized that the zero-based shim would be much better as a distinct wrapper object instead of subclassing all the sparseMatrix implementations. This is probably what Dirk had in mind originally, but I went down a false path in #1306 from the initially-promising starting point above. The new approach in #1313 is much cleaner and less error-prone, and I'm now confident this is the best solution at last. Please review, officially!

My confidence is partly based on a separate effort to implement the next-best alternative, excluding zero from the domain of soma_joinid. That has to touch more parts of our collective system, including a Census rebuild, and is also technically a breaking spec change to SOMA 1.0. And I haven't gotten it to actually work yet =( For reference: https://github.com/chanzuckerberg/cellxgene-census/pull/426/files https://github.com/single-cell-data/TileDB-SOMA/pull/1312/files

cc @pablo-gar @maniarathi

@eddelbuettel
Contributor

eddelbuettel commented Apr 30, 2023

Nice work. I quite like what you did there -- the 'view-only' shim does what we want.

One somewhat trivial addition is to outlaw writing to such a matrix. I.e., using your test data:

> library(tiledbsoma)
> sm <- Matrix::sparseMatrix(i = 1:3, j = 1:3, x = c(41, 42, 43))
> mat <- matrixZeroBasedView(sm)
> mat[0:1, 0:1]
2 x 2 sparse Matrix of class "dgCMatrix"
          
[1,] 41  .
[2,]  . 42
> mat[0, 0]
[1] 41
> mat[0, 0] <- 42
Error: I'm sorry, Dave. I'm afraid I can't do that.
> 

Not with the exact same error message, but you get the idea. My quick draft used

modified   apis/r/R/utils-matrixZeroBasedView.R
@@ -46,6 +46,22 @@ matrixZeroBasedView <- function(one_based_matrix) {
   }
 }
 
+
+#' Zero-based matrix element assignment
+#'
+#' This method errors as a read-only view is implemented.
+#'
+#' @param x The matrix viewed in zero-based mode.
+#' @param i Row index, zero-based
+#' @param j Column index, zero-based
+#' @param value The to-be-assigned value
+#'
+#' @return Nothing as the method errors.
+#' @export
+`[<-.matrixZeroBasedView` <- function(x, i, j, value) {
+    stop("I'm sorry, Dave. I'm afraid I can't do that.", call. = FALSE)
+}
+
 #' dim
 #'
 #' @param x

(I assume here we would never want to assign into the matrix. I could be wrong on that.)

mlin added a commit that referenced this issue May 4, 2023
…a_joinid lookup (#1313)

To load a one-based `Matrix::sparseMatrix` with the contents of a zero-based `SOMASparseNDArray`, we add one to the offsets on each dimension. However, these dimensions are usually populated with `soma_joinid` intended to match the `soma_joinid` column in the `obs` & `var` data frames, and shifting them by one makes this join operation very error-prone. 

Here we introduce a minimal shim for `sparseMatrix` that provides matrix access with zero-based indexes. To make this explicit for the user, we rename `SOMASparseNDArray$read_sparse_matrix()` to `SOMASparseNDArray$read_sparse_matrix_zero_based()`. The shim supports only minimal access operations, which is intentional to prevent "mixing" it with conventional one-based objects. If needed, the fully-featured `sparseMatrix` is recovered by calling `as.one.based()` on the shim. Thus, the default behavior is to match `soma_joinid` but an advanced user can explicitly change to one-based indexing for further use in R.

Extended discussion: #1232
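For readers following along, a minimal sketch of the merged behaviour, using only names confirmed in the transcript and PR description above; everything else is illustrative:

library(tiledbsoma)
library(Matrix)

# Wrap a one-based sparseMatrix in the zero-based view so that indexes line up
# with soma_joinid values as stored on disk.
sm <- sparseMatrix(i = 1:3, j = 1:3, x = c(41, 42, 43))
view <- matrixZeroBasedView(sm)

view[0, 0]          # 41 -- indexes now match soma_joinid directly
as.one.based(view)  # recover the conventional one-based sparseMatrix when needed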
@mlin mlin closed this as completed May 4, 2023