adding metadata information to saving .npz files #229
Comments
Note this command is used here.
This is a very simple change, and it is implemented in branch The following is the new output format:
@gwaygenomics I'll send you example outputs to get your feedback.
this looks great! I see lots of potential in storing data this way - it creates a permanent link between the DeepProfiler features and metadata. Two clarification questions: How to use
Perhaps I can include both
I realized this discussion belongs in https://github.com/broadinstitute/DeepProfilerExperiments - I will cross post and link. We can continue the decisions there.
That's correct, this implementation is copying the record of the To clarify, the So to your questions:
I agree that things should be consistent. I think keeping the metadata in the I wonder if it would be easier to load all DeepProfiler feature files with one routine and then transform them into a format that is more compatible with cytominer-database. We have example code in DeepProfilerExperiments on how to do that, basically read the Thanks for linking the issue in DeepProfilerExperiments. I think we can discuss the format of DeepProfiler features here (which impacts this repository) and the integration with pycytominer there (once we agree on what the format should be).
For sure - I agree that the DeepProfiler output shouldn't be tinkered with, except maybe to add the metadata dictionary to the
In order for DeepProfiler aggregated profiles (level 3 data) to be compatible with other pycytominer tools, we will need to use pandas DataFrames (for aggregated profiles only, not single cell data) that are in the same format as current level 3 CellProfiler output (metadata columns, followed by feature columns). Am I understanding your concern correctly? Does this make sense?
My goal is to have this exact metadata information (Plate, Well, and Site) linked to the single cell profiles. I don't like them being embedded in the file name, but if this is what we decide, that is fine with me too! :) I also totally appreciate how difficult it is (and costly!) to introduce this sort of enhancement, so I am happy to go with what we decide is best, after weighing these factors that are outside my knowledge. Is there an alternative way to link single cell profiles to these three metadata features?
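To make the level 3 layout described above concrete (metadata columns, followed by feature columns), here is a minimal sketch. The function name `aggregate_site`, the `Metadata_` and `feature_` column prefixes, and the choice of median aggregation are all assumptions for illustration; none of this is taken from DeepProfiler or pycytominer code.

```python
import numpy as np
import pandas as pd


def aggregate_site(features, metadata, prefix="feature_"):
    """Collapse a (cells x features) array into a one-row profile.

    Hypothetical helper: median aggregation and the column prefixes
    are assumptions, not the DeepProfiler or pycytominer behavior.
    """
    # One aggregated value per feature column.
    profile = np.median(features, axis=0)

    # Metadata columns come first, then the feature columns.
    metadata_cols = {f"Metadata_{key}": value for key, value in metadata.items()}
    feature_cols = {f"{prefix}{i}": value for i, value in enumerate(profile)}

    return pd.DataFrame([{**metadata_cols, **feature_cols}])
```

Rows produced this way for each site could then be stacked with `pd.concat` into a single table of aggregated profiles.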
👍
Yes! Precisely! Except for this I'd like to sidestep cytominer-database. It is currently hard to use, and I don't see the upside to But I will look into it... it might be nice to be globally consistent
Thanks for clarifying, @gwaygenomics!
Got it. Yes, this makes sense.
Good point. Here are some alternatives to naming the files with Plate, Well and Site:
To be honest, I don't have a preference as long as the feature vectors don't change. I'm happy to keep all the metadata in the
Love it! Right now, the way I see it is to introduce the function How does DeepProfiler typically encode the network used? Is this info in the
Cool. My vote is to have the metadata (at least Plate, Well, Site) in the npz file and also include the So, to summarize, the pycytominer process will first look to the The reason to allow for both is for backwards compatibility with legacy DeepProfiler datasets that only have
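A rough sketch of that lookup order, for illustration only: read the metadata stored inside the npz file when it exists, and otherwise fall back to parsing the file name. The key name `metadata`, the function `read_site_metadata`, and the legacy file-name pattern (something like `plate_01_A01_1.npz`) are assumptions; the actual key and naming scheme are not spelled out in this thread.

```python
import os
import re
import tempfile

import numpy as np

# Assumed legacy naming scheme "<Plate>_<Well>_<Site>.npz" (hypothetical).
LEGACY_NAME = re.compile(
    r"^(?P<Plate>.+)_(?P<Well>[A-Za-z]+\d+)_(?P<Site>\d+)\.npz$"
)


def read_site_metadata(path, metadata_key="metadata"):
    """Return Plate/Well/Site for one feature file.

    Prefers metadata stored inside the archive; falls back to the
    file name for legacy files. Names and patterns are assumptions.
    """
    # allow_pickle=True is needed because a dict is stored as an object array.
    with np.load(path, allow_pickle=True) as data:
        if metadata_key in data.files:
            return dict(data[metadata_key].item())

    # Legacy fallback: recover the fields from the file name.
    match = LEGACY_NAME.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"No metadata in archive or file name: {path}")
    return match.groupdict()
```

Note that values recovered from a file name are always strings, while values stored in the archive keep their original types, so downstream code may need to normalize them.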
Sounds good! The updated format has been implemented in 913fb32
Note that the dictionary in the
I will share example data with this format!
@gwaygenomics note that there is a
Here is a compressed file with example features.
As mentioned in the other discussion thread, we can easily recompute features with the new format that we agree on. I didn't realize you recommended an index file with paths to features, and we can produce that as an output of the feature extraction process. The path to files can also be included in our current metadata file, the I will work on implementing this feature as an output of DeepProfiler and will report back here soon.
I assume the paths to feature files should be relative to a root folder, as features can be moved from one environment to another and absolute paths don't transfer well. Is this correct, @gwaygenomics?
I think that my comment in broadinstitute/DeepProfilerExperiments#2 (comment) (also pasted below) is the source of this recommendation:
If I'm understanding correctly, I don't actually recommend this approach. Linking file paths is also fragile for the reason you mention about absolute paths, and also because the absolute path can change without any consequence to the I think the right approach is to encode all metadata information in the .npz file so that we can remove the need to also pass along the index csv from the perspective of pycytominer. I totally see the value of retaining the index.csv file for other uses (for one, it's way easier to look at than an
OK. Yes, I think we want to make the output of DeepProfiler compatible with pycytominer and we also want to make it easy to integrate. So in conclusion, for the purposes of pycytominer integration:
This output will be the official format for DeepProfiler. I will create a PR to merge it and recompute the features in the datasets that we are working with at the moment. After closing this thread, we will continue the conversation of how to use pycytominer in the downstream analysis, which is maintained in the other repository.
Here is PR #240 with the changes.
@Arkkienkeli has merged the code and the implementation is now part of the master development!
One minor clarification question: Should this field be used as the DeepProfiler morphology feature prefix? (like
Slight modifications to parsing plate, well, site info from filenames fixed in cytomining/pycytominer#210
np.savez_compressed() can receive multiple arguments. We should consider saving metadata information here in addition to encoding the info in the file name. File names can be overwritten and can be tough to extract from.
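As a minimal sketch of the idea, assuming hypothetical key names `features` and `metadata` and made-up values (the format actually adopted may differ), several named arrays can be written to one archive with keyword arguments:

```python
import os
import tempfile

import numpy as np

# Hypothetical single-cell features for one site: 5 cells x 4 features.
features = np.random.rand(5, 4).astype(np.float32)

# Hypothetical metadata; the field names and values are illustrative only.
metadata = {"Plate": "plate_01", "Well": "A01", "Site": 1}

path = os.path.join(tempfile.mkdtemp(), "example.npz")

# Each keyword argument becomes a separately named entry in the archive.
np.savez_compressed(path, features=features, metadata=metadata)

# The dict is stored as a 0-d object array, so loading it back
# requires allow_pickle=True, and .item() recovers the dict.
with np.load(path, allow_pickle=True) as data:
    loaded_features = data["features"]
    loaded_metadata = data["metadata"].item()
```

Because the dictionary is pickled inside the archive, readers have to opt in with `allow_pickle=True`, which should only be done for files from a trusted source.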