Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEW: Begin adding label_seqs #49

Merged
merged 17 commits into from
Jul 8, 2020
Merged

Conversation

Oddant1
Copy link
Contributor

@Oddant1 Oddant1 commented Jul 7, 2020

No description provided.

@gregcaporaso
Copy link
Member

gregcaporaso commented Jul 8, 2020

This is looking good to me so far @Oddant1. I like the label/delabel functionality being available from the same action. Things I noticed that are missing at this point are:

  1. descriptions for the new action in plugin_setup.py
  2. a test for the case where the metadata is missing an id that is present in the sequences (an error message indicating the first sequence id that is missing from the metadata would be ideal)
  3. a test for the case where a column requested by the user is not present in the metadata (an error message indicating the column name would be ideal)

Let me know if I can do anything to help - happy to have another look this morning.

@Oddant1
Copy link
Contributor Author

Oddant1 commented Jul 8, 2020

@gregcaporaso it turns out in regards to request 2

  1. a test for the case where the metadata is missing an id that is present in the sequences (an error message indicating the first sequence id that is missing from the metadata would be ideal)

It's just as easy to display every id that is present in the sequences and not in the metadata. Should we still only display the first? I'm in favor of displaying all of them because if we only display the first and there are a very large number missing we could lull the user into thinking it's a small problem when in reality their data is completely blown up. If we show them all the user will immediately know the scale of the problem.

Additionally, do we want a separate error for ids present in the metadata but not in the sequences? Or do we want to ignore this case? It seems like if we're going to worry about the sequence and metadata ids matching at all we should be making sure they match fully, but I don't know.

ALSO NOTE: The description text has not been written yet.

@gregcaporaso
Copy link
Member

It's just as easy to display every id that is present in the sequences and not in the metadata. Should we still only display the first? I'm in favor of displaying all of them...

Yep, I agree - let's display all of them if it's just as easy.

Additionally, do we want a separate error for ids present in the metadata but not in the sequences?

No, we usually just ignore that case since we don't want to encourage having multiple copies of metadata because they tend to get out of sync really easily. That'll be particularly important here since after the genome-sampler run the user will have a small set of context sequences, but still have the metadata for all of the context sequences.

@gregcaporaso gregcaporaso self-requested a review July 8, 2020 19:42
Copy link
Member

@gregcaporaso gregcaporaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me overall, just a few suggestions here and there on error messages and descriptions that I think will improve clarity.

Do we need a test that ensures sequence description fields are handled correctly? I'm not sure if those end up in the formats that we're working with here. For example, what happens if the sequence header line looks like:

>id1 hello world

In this case the hello world shouldn't be considered part of the id, but I don't know if it ends up in the relabeled header or not. Seems like it wouldn't hurt to keep it, but I don't really have a strong feeling either way.

@Oddant1
Copy link
Contributor Author

Oddant1 commented Jul 8, 2020

I don't know if it ends up in the relabeled header or not.

It should end up in the relabeled header. In this code it would basically be treated as a part of the id.

@gregcaporaso
Copy link
Member

In this code it would basically be treated as a part of the id.

That would probably be an issue since the metadata wouldn't have it, but I think that might not actually be what happens. I think the transformer to the Series differentiates the id and description fields and uses the id field only as the index. This is probably a non-issue - I think the description ends up getting stripped in this process, which is probably fine.

@ebolyen
Copy link
Member

ebolyen commented Jul 8, 2020

Yep description is dropped in all cases here, so it isn't an issue. The transformer we are using also lacks a way to preserve it. This is consistent with some of the filter operations in q2-feature-table.

@ebolyen
Copy link
Member

ebolyen commented Jul 8, 2020

actually on second thought, the transformer may be able to preserve it if we returned an skbio object, but I'm not especially motivated to preserve that...

@gregcaporaso
Copy link
Member

I'm not especially motivated to preserve that

I agree - let's not worry about it now.

@gregcaporaso
Copy link
Member

gregcaporaso commented Jul 8, 2020

Looks good to me, I just had one more minor comment. Good to go after that - I'm about to get on a call so I'll approve now but be sure to hit that last comment.

@Oddant1
Copy link
Contributor Author

Oddant1 commented Jul 8, 2020

@gregcaporaso the backticks in the description text don't underline. It's a bug in the cli. I will backtick it anyway to make it stand out.

@gregcaporaso
Copy link
Member

Is this one ready for merge, or is @ebolyen still reviewing?

@ebolyen
Copy link
Member

ebolyen commented Jul 8, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants