-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEW: Begin adding label_seqs #49
Conversation
This is looking good to me so far @Oddant1. I like the label/delabel functionality being available from the same action. Things I noticed that are missing at this point are:
Let me know if I can do anything to help - happy to have another look this morning. |
@gregcaporaso it turns out in regards to request 2
It's just as easy to display every id that is present in the sequences and not in the metadata. Should we still only display the first? I'm in favor of displaying all of them because if we only display the first and there are a very large number missing we could lull the user into thinking it's a small problem when in reality their data is completely blown up. If we show them all the user will immediately know the scale of the problem. Additionally, do we want a separate error for ids present in the metadata but not in the sequences? Or do we want to ignore this case? It seems like if we're going to worry about the sequence and metadata ids matching at all we should be making sure they match fully, but I don't know. ALSO NOTE: The description text has not been written yet. |
Yep, I agree - let's display all of them if it's just as easy.
No, we usually just ignore that case since we don't want to encourage having multiple copies of metadata because they tend to get out of sync really easily. That'll be particularly important here since after the genome-sampler run the user will have a small set of context sequences, but still have the metadata for all of the context sequences. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me overall, just a few suggestions here and there on error messages and descriptions that I think will improve clarity.
Do we need a test that ensures sequence description fields are handled correctly? I'm not sure if those end up in the formats that we're working with here. For example, what happens if the sequence header line looks like:
>id1 hello world
In this case the hello world
shouldn't be considered part of the id, but I don't know if it ends up in the relabeled header or not. Seems like it wouldn't hurt to keep it, but I don't really have a strong feeling either way.
It should end up in the relabeled header. In this code it would basically be treated as a part of the id. |
That would probably be an issue since the metadata wouldn't have it, but I think that might not actually be what happens. I think the transformer to the Series differentiates the id and description fields and uses the id field only as the index. This is probably a non-issue - I think the description ends up getting stripped in this process, which is probably fine. |
Yep description is dropped in all cases here, so it isn't an issue. The transformer we are using also lacks a way to preserve it. This is consistent with some of the filter operations in q2-feature-table. |
actually on second thought, the transformer may be able to preserve it if we returned an skbio object, but I'm not especially motivated to preserve that... |
I agree - let's not worry about it now. |
Looks good to me, I just had one more minor comment. Good to go after that - I'm about to get on a call so I'll approve now but be sure to hit that last comment. |
@gregcaporaso the backticks in the description text don't underline. It's a bug in the cli. I will backtick it anyway to make it stand out. |
Is this one ready for merge, or is @ebolyen still reviewing? |
ready for merge! just got sidetracked
|
No description provided.