Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runCol argument in readQFeatures() #216

Open
samgregoire opened this issue Jul 29, 2024 · 1 comment
Open

runCol argument in readQFeatures() #216

samgregoire opened this issue Jul 29, 2024 · 1 comment
Assignees

Comments

@samgregoire
Copy link

Hello,

I have an issue / question regarding the runCol argument from the readQFeatures() function.
My sample annotations are stored in a data.frame called meta. The column containing the name of the runs from which the samples come from is called fileID. I thought the purpose of the runCol argument was to identify the column where the names of the runs are stored. However, when I set runCol = "fileID", I get this error saying that the 'colData' must contain a column called 'runCol'.

> readQFeatures(assayData = psm,
+               colData = meta,
+               runCol = "fileID",
+               quantCols = "quant",
+               sep = "_",
+               verbose = TRUE)
Checking arguments.
Error in .checkRunCol(assayData, colData, runCol) : 
  When 'runCol' is not NULL, 'colData' must contain a column called 'runCol'.

If I change the column name in my sample annotations from "fileID" to "runCol", everything works and my colData are exactly as I want them to be.

> colnames(meta)[colnames(meta) == "fileID"] <- "runCol"
> readQFeatures(assayData = psm,
+               colData = meta,
+               runCol = "fileID",
+               quantCols = "quant",
+               sep = "_",
+               verbose = TRUE)
Checking arguments.
Loading data as a 'SummarizedExperiment' object.
Splitting data in runs.
Formatting sample annotations (colData).
Formatting data as a 'QFeatures' object.
An instance of class QFeatures containing 215 assays:

Since there is a runCol argument, shouldn't I be free to name my "run" column whatever I want in the metadata as long as I indicate it through the said runCol argument? With the current implementation of the readQFeatures() function, it seems that I need the column to be imperatively named runCol.

From what I found, the QFeatures:::.checkRunCol() function is responsible for the error because if (!"runCol" %in% colnames(colData)) induces a stop.

function (assayData, colData, runCol) 
{
    if (is.null(runCol)) 
        return(NULL)
    if (length(runCol) > 1) 
        stop("'runCol' must contain the name of a single column ", 
            "in 'assayData'.")
    if (!runCol %in% colnames(assayData)) 
        stop("'", runCol, "' (provided as 'runCol') not found ", 
            "in 'assayData'.")
    runs <- assayData[[runCol]]
    if (!is.null(colData)) {
        if (!"runCol" %in% colnames(colData)) 
            stop("When 'runCol' is not NULL, 'colData' must ", 
                "contain a column called 'runCol'.")
        mis <- !runs %in% colData$runCol
        if (any(mis)) {
            warning("Some runs are missing in 'colData': ", paste0(unique(runs[mis]), 
                collapse = ", "))
        }
    }
    assayData[[runCol]]
}

I think that removing the quotation marks in the condition inducing the stop and using colData[[runCol]] instead of colData$runCol would solve this issue, unless of course that's how runCol is supposed to behave.

PS. The QFeatures:::.formatColData() function also uses the colData$runCol notation.

@cvanderaa
Copy link
Collaborator

cvanderaa commented Jul 29, 2024

(ping @leopoldguyot who raised the same issue to us)

When implementing this, I had in mind that we have to impose a "runCol" in the colData, but reading again the documentation, I get the confusion. In the param description:

runCol: For the multi-set case, a numeric(1) or character(1) pointing to the column of assayData (and colData, is set) that contains the runs/batches. Make sure that the column name in both tables are identical and syntactically valid (if you supply a character) or have the same index (if you supply a numeric). Note that characters are converted to syntactically valid names using make.names

Indeed, the doc mentions here that runCol must be the same across the two tables, but if the argument runCol is something else than "runCol", then the two table (in the current implementation) cannot share the same columns...

Later in the Details section:

Multi-set case: the colData must contain a column named quantCols that provides the names of the columns in assayData with the quantitative values for each sample, and a column named runCol that provides the MS runs/batches in which each sample has been acquired. The entries in colData[["runCol"]] are matched against the entries provided by assayData[[runCol]].

The first sentence mentions runCol as a variable (so can be any character) while the second sentence indicates that the column named "runCol" is matched, so here the name is imposed.

I agree this is super confusing and must be solved indeed, 2 solutions:

  1. As you mention, adapt .checkRunCol() so to allow that runCol points to a column name that is shared across the two tables. For instance, if runCol == "FileID, then readQFeatures() should expect that assayData and colData both contain a column named FileID.
  2. Fix the documentation to clarify that colData must contain a column named "runCol"

After discussion with Léopold and now your issue, you have convinced me of going for 1., but @lgatto what's your opinion?

@cvanderaa cvanderaa self-assigned this Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants