Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve consequence predictions #690

Merged
merged 24 commits into from
Feb 25, 2025
Merged

fix: improve consequence predictions #690

merged 24 commits into from
Feb 25, 2025

Conversation

tedil
Copy link
Contributor

@tedil tedil commented Feb 22, 2025

Changes

  • be specific whether a frameshift_variant shortens or lengthens by annotating SO terms frameshift_truncation and frameshift_elongation
  • handle cases where a (cDNA) frameshift effectively resolves to a missense on protein level (this requires re-obtaining the reference sequence and integrating the variant via the AltBuilder, as hgvs-rs does not provide all necessary information for this use-case)
  • define inframe_{insertion,deletion} in terms of CDS, as suggested in their SO term definition
  • fix some {5,3}_prime_UTR_variant cases
  • in the case of a delIns (on protein level):
    • always annotate protein_altering_variant (unless it resolves to a missense)
    • evaluate length change (between reference and alternative) to differentiate between inframe_deletion, inframe_insertion and missense_variant (previously would always report deletion)

New consequence terms

TODO

  • update test cases

Summary by CodeRabbit

  • Chores

    • Updated dependencies and API specification to enhance overall stability.
  • New Features

    • Expanded variant consequence reporting with additional annotations for frameshift events and protein-altering changes.
    • Enhanced variant analysis with improved normalization and detailed protein evaluations.
  • Bug Fixes

    • Improved sequence data retrieval with clearer, more robust error messaging.
    • Refined handling of optional location information for more consistent outputs.

@tedil tedil self-assigned this Feb 22, 2025
Copy link
Contributor

coderabbitai bot commented Feb 22, 2025

Walkthrough

This PR updates several components by revising dependency versions and enhancing variant consequence handling. The hgvs dependency is bumped in Cargo.toml, while the OpenAPI schema now supports additional consequence types. Updates in annotation modules introduce new enum variants for frameshift consequences and adjust the impact logic. The variant prediction and protein analysis flow in the code has been refined, and sequence retrieval now benefits from improved error handling. Additionally, the HgncEntry structure now permits missing location information, and test metadata tracked by Git LFS has been updated.

Changes

File(s) Change Summary
Cargo.toml Updated dependency hgvs from version 0.18.0 to 0.18.1.
openapi.schema.yaml Updated version from 0.30.1 to 0.31.0; added consequence types: frameshift_elongation, frameshift_truncation, and protein_altering_variant.
src/annotate/seqvars/{ann.rs, csq.rs, provider.rs} In ann.rs, new variants FrameshiftElongation and FrameshiftTruncation were added and ProteinAlteringVariant reinstated in the Consequence enum. In csq.rs, prediction logic was refined (including normalization and protein analysis updates), and in provider.rs, error handling in sequence retrieval was enhanced.
src/db/check/mod.rs Changed HgncEntry's location and location_sortable fields from String to Option<String>.
tests/.../vep.disagreement-cases.expected.vcf Git LFS metadata updated: new oid and increased file size (from 17,967 bytes to 18,657 bytes).

Sequence Diagram(s)

sequenceDiagram
    participant V as Variant Input
    participant CP as ConsequencePredictor
    participant PA as ProteinAnalyzer
    V->>CP: Request consequence prediction
    alt Alternative variant equals "N"
        CP->>V: Use reference allele for prediction
    else 
        CP->>PA: Analyze protein variant\n(includes tx_accession)
        PA-->>CP: Return detailed consequence (frameshift, missense, etc.)
    end
    CP->>V: Return final predicted consequence
Loading
sequenceDiagram
    participant C as Client
    participant P as Provider
    C->>P: Request sequence for accession (e.g., "NC...")
    P->>P: Map accession to chromosome & attempt sequence retrieval
    alt Sequence retrieved successfully
        P->>C: Return sequence data
    else
        P->>C: Fallback: retrieve using original accession
        alt Sequence found after fallback
            P->>C: Return sequence data
        else
            P->>C: Return error ("No sequence record found")
        end
    end
Loading

Possibly Related PRs

Suggested Reviewers

  • holtgrewe

Poem

I’m a rabbit in the code-field, hopping with delight,
New frameshifts and variants shining bright.
Dependencies updated, logic refined with care,
Each commit a sweet carrot, fresh and rare.
Through twists of code I joyfully roam,
Celebrating these changes, right at home.
🥕🐇 Happy coding under moonlit night!

✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@tedil tedil marked this pull request as ready for review February 25, 2025 15:58
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/annotate/seqvars/provider.rs (1)

578-591: Consider more specific error handling and unwrap safety.

  1. In line 579, calling unwrap() on reference_reader can cause a panic if reference_reader is unexpectedly None. Since you already check self.reference_available() before reaching this block, it might be safe in practice. Still, returning a more explicit error or using expect("...") could improve clarity.
  2. Lines 589-591 use Error::NoSequenceRecord when decoding UTF-8 fails. A separate error variant like Error::InvalidSequenceEncoding might help distinguish encoding issues from missing sequence data.
src/annotate/seqvars/csq.rs (3)

691-709: Refine frameshift vs. protein-level outcomes check.

This block removes the CDS-level frameshift consequence if a more specific protein-level frameshift or non-empty result is found. While logical, beware of scenarios in which a frameshift might have multiple parallel impacts (e.g., partial in-frame region). Consider tests ensuring no unintended consequence gets suppressed.


944-944: New 'conservative' parameter.

The signature includes a new boolean parameter to distinguish conservative from disruptive changes. Make sure to document its effect in function comments and confirm it’s correctly set by all callers.


991-1030: Comprehensive handling of in-frame vs. frameshift.

These lines thoroughly check reference/alternative lengths modulo 3, properly tagging frameshifts and in-frame edits. Consider test cases for empty strings or extremely large differences to confirm robust boundary handling (e.g., large deletions).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea2f40e and f51ecc0.

⛔ Files ignored due to path filters (4)
  • Cargo.lock is excluded by !**/*.lock
  • src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193311167-ATGT-T.snap is excluded by !**/*.snap
  • src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193311170-TGGCG-G.snap is excluded by !**/*.snap
  • src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193409913-ATAA-A.snap is excluded by !**/*.snap
📒 Files selected for processing (7)
  • Cargo.toml (1 hunks)
  • openapi.schema.yaml (3 hunks)
  • src/annotate/seqvars/ann.rs (4 hunks)
  • src/annotate/seqvars/csq.rs (15 hunks)
  • src/annotate/seqvars/provider.rs (1 hunks)
  • src/db/check/mod.rs (1 hunks)
  • tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Schema
  • GitHub Check: Testing
  • GitHub Check: build-and-push-image
🔇 Additional comments (15)
Cargo.toml (1)

48-48:

✅ Verification successful

Version update looks good.

The update of the hgvs dependency from 0.18.0 to 0.18.1 is appropriate. This minor version bump likely includes bug fixes related to variant consequence predictions.

Let's verify the changelog or release notes for this version to confirm the changes:


🏁 Script executed:

#!/bin/bash
# Check the latest release notes for hgvs crate

echo "Checking for release information for hgvs 0.18.1..."
curl -s "https://crates.io/api/v1/crates/hgvs/0.18.1" | jq '.version.crate_size, .version.created_at, .version.updated_at'

# Check git repository for release notes
echo "Checking git repository for release notes..."
curl -s "https://api.github.com/search/repositories?q=hgvs+language:rust" | jq '.items[0].html_url'

Length of output: 544


Update Verified: Dependency hgvs version update confirmed

  • Confirmed that Cargo.toml now specifies hgvs = "0.18.1" (line 48).
  • The output from crates.io shows the release was created on 2025-02-25, and the GitHub repository (https://github.com/varfish-org/hgvs-rs) confirms the updated release information.
  • The version bump is appropriate, likely incorporating the expected bug fixes for variant consequence predictions.
src/db/check/mod.rs (1)

157-158:

✅ Verification successful

Make location fields optional to handle missing data.

Good change to make both location and location_sortable optional, allowing the application to handle cases where HGNC entries don't have location information. This enhances robustness when dealing with incomplete data sources.

Let's verify if there are actual HGNC entries that might have missing location information:


🏁 Script executed:

#!/bin/bash
# Search for any possible instances where location information might be missing

echo "Searching for handling of missing location information in the codebase..."
rg "location: None" -A 2 -B 2
rg "location_sortable: None" -A 2 -B 2

echo "Checking how location fields are accessed in the code..."
rg "\.location" --type rust
rg "\.location_sortable" --type rust

Length of output: 1696


Verified: Optional Location Fields Handling Confirmed

The changes in src/db/check/mod.rs (lines 157–158) correctly update both location and location_sortable to be optional. Our investigation shows that similar handling of missing location data is already present elsewhere (for example, in src/db/create/mod.rs where map_location is set to None), which confirms that some HGNC entries indeed lack this information. Although the usage of record.location in src/verify/seqvars.rs still performs operations like splitting the string, the overall design now explicitly acknowledges that location data may be missing. Please ensure any direct operations on these fields properly account for a potential None value elsewhere in the codebase.

openapi.schema.yaml (2)

11-11: Version bump aligned with new features.

Version updated from 0.30.1 to 0.31.0 to reflect the addition of new consequence types. This follows semantic versioning principles correctly, using a minor version increment for backward-compatible feature additions.


415-416:

✅ Verification successful

New consequence types added to the schema.

The addition of frameshift_elongation, frameshift_truncation, and protein_altering_variant to the Consequence schema improves the precision of variant classification. These new terms directly support the PR objectives of enhancing consequence predictions by distinguishing between different types of frameshift outcomes.

Let's check if these new consequence types are properly implemented throughout the codebase:

Also applies to: 428-428


🏁 Script executed:

#!/bin/bash
# Check if the new consequence types are properly implemented in the codebase

echo "Checking for FrameshiftElongation in the codebase..."
rg "FrameshiftElongation" --type rust

echo "Checking for FrameshiftTruncation in the codebase..."
rg "FrameshiftTruncation" --type rust

echo "Checking for ProteinAlteringVariant in the codebase..."
rg "ProteinAlteringVariant" --type rust

echo "Checking for references to these variants in tests..."
rg "frameshift_elongation|frameshift_truncation|protein_altering_variant" -g "**/*test*"

Length of output: 2180


Verification Complete: New Consequence Types Properly Integrated

The verification confirms that the new consequence types—FrameshiftElongation, FrameshiftTruncation, and ProteinAlteringVariant—are correctly implemented across the codebase:

  • Implementation Check:
    • src/annotate/seqvars/ann.rs and src/annotate/seqvars/csq.rs properly reference the new variants.
    • Test snapshots also include references (e.g., frameshift_truncation) confirming their usage.

These changes successfully enhance the precision of variant classification as outlined in the PR objectives.

src/annotate/seqvars/ann.rs (4)

88-94: New frameshift variants improve classification precision.

The addition of FrameshiftElongation and FrameshiftTruncation variants with clear documentation enhances precision in frameshift variant classification. This directly addresses the PR objective to specify whether a frameshift results in shortening or lengthening of the protein.


142-144: Reinstated protein altering variant classification.

Good job reinstating the ProteinAlteringVariant enum variant with proper documentation. This further supports the PR objective of refining consequence definitions when the variant resolves to a protein-level change.


305-305: Re-enabled protein altering variant impact level.

The inclusion of ProteinAlteringVariant in the Moderate impact level assessment is consistent with variant annotation standards. This reinstated variant properly bridges the gap between high-impact frameshift variants and more localized missense changes.


294-295:

❓ Verification inconclusive

Impact level assignment for new frameshift variants.

The assignment of High impact level to both new frameshift variants is correct. Frameshifts that either elongate or truncate proteins frequently have significant functional consequences, warranting the high impact classification.

Let's check if the tests have been updated to cover these new variants:


🏁 Script executed:

#!/bin/bash
# Check if tests have been updated for the new frameshift variants

echo "Searching for tests involving the new frameshift variants..."
rg -g "**test**" "FrameshiftElongation|FrameshiftTruncation|ProteinAlteringVariant"

echo "Checking csq.rs file for implementation of the new variants..."
rg -g "src/annotate/seqvars/csq.rs" -A 5 -B 5 "frameshift|protein_altering"

Length of output: 7247


Impact Level & Test Coverage for New Frameshift Variants

  • The assignment of High impact to both FrameshiftElongation and FrameshiftTruncation is correct given their potential to drastically alter protein function.
  • The implementation in src/annotate/seqvars/csq.rs shows that the new frameshift logic is in place.
  • However, the search for test references (i.e. explicit mentions of FrameshiftElongation, FrameshiftTruncation, or ProteinAlteringVariant in test files) didn’t yield any results.
    Action: Please verify whether these cases are covered indirectly in existing integration tests or if dedicated tests should be added to ensure robust coverage.
src/annotate/seqvars/csq.rs (7)

9-10: Confirm that new imports are necessary.

These additional imports (AltSeqBuilder, NoRef, and ProteinEdit) expand the functionality for building alt sequences and parsing certain HGVS forms. Ensure that they are all used and don’t introduce unexpected overhead or conflicts.


132-138: Handle 'N' allele carefully and address TODO.

  1. Converting 'N' to the reference is valid under VCF ≥4.1, but consider commenting or tracking how older VCF versions would treat 'N'.
  2. The TODO on line 133 suggests verifying the VCF spec. Implementation or removal of this note would clarify maintenance requirements.

590-596: Double-check parameter order consistency.

The updated call to analyze_protein_variant() includes a &tx_record.tx_ac argument. Verify that the parameters match the updated function signature, especially the conservative boolean’s position, so there's no accidental mix-up.


975-979: 5' UTR intron vs. exon variant logic.

Lines 975–979 differentiate 5' UTR introns from exons. This is a valuable distinction and may reveal edge cases if the boundary between the UTR and coding sequence is incorrectly annotated. Confirm there’s coverage in your tests for intron/exon boundaries near the start codon.


1063-1109: Verify altseq length comparison and missing data scenarios.

  1. The length comparison logic for frameshift truncation vs. elongation is helpful. However, calling alt_data.first().unwrap() in lines 1079–1080 can panic if no alt data is built (edge cases?).
  2. Consider explicitly handling or warning if building altseq yields an empty vector.

1613-1613: Expanded frameshift truncation test coverage.

This new test covers a frameshift truncation scenario. It’s a welcome addition. Ensure that any borderline cases (e.g., same-length frameshifts or partial codon changes) are also tested.


1616-1616: Additional test for stop lost and inframe deletion.

The combination of StopLost + FeatureElongation + ConservativeInframeDeletion is a good example of complex variant outcomes. Keep verifying that multiple concurrent consequences reflect real-world biology.

@tedil tedil enabled auto-merge (squash) February 25, 2025 16:07
@tedil tedil merged commit 73d0fd3 into main Feb 25, 2025
12 checks passed
@tedil tedil deleted the csq-from-prot-seq branch February 25, 2025 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant