fix: improve consequence predictions #690

tedil · 2025-02-22T09:46:24Z

Changes

be specific whether a frameshift_variant shortens or lengthens by annotating SO terms frameshift_truncation and frameshift_elongation
handle cases where a (cDNA) frameshift effectively resolves to a missense on protein level (this requires re-obtaining the reference sequence and integrating the variant via the AltBuilder, as hgvs-rs does not provide all necessary information for this use-case)
define inframe_{insertion,deletion} in terms of CDS, as suggested in their SO term definition
fix some {5,3}_prime_UTR_variant cases
in the case of a delIns (on protein level):
- always annotate protein_altering_variant (unless it resolves to a missense)
- evaluate length change (between reference and alternative) to differentiate between inframe_deletion, inframe_insertion and missense_variant (previously would always report deletion)

New consequence terms

TODO

update test cases

Summary by CodeRabbit

Chores
- Updated dependencies and API specification to enhance overall stability.
New Features
- Expanded variant consequence reporting with additional annotations for frameshift events and protein-altering changes.
- Enhanced variant analysis with improved normalization and detailed protein evaluations.
Bug Fixes
- Improved sequence data retrieval with clearer, more robust error messaging.
- Refined handling of optional location information for more consistent outputs.

…detail

…hift prediction

coderabbitai · 2025-02-22T09:53:04Z

Walkthrough

This PR updates several components by revising dependency versions and enhancing variant consequence handling. The hgvs dependency is bumped in Cargo.toml, while the OpenAPI schema now supports additional consequence types. Updates in annotation modules introduce new enum variants for frameshift consequences and adjust the impact logic. The variant prediction and protein analysis flow in the code has been refined, and sequence retrieval now benefits from improved error handling. Additionally, the HgncEntry structure now permits missing location information, and test metadata tracked by Git LFS has been updated.

Changes

File(s)	Change Summary
Cargo.toml	Updated dependency `hgvs` from version `0.18.0` to `0.18.1`.
openapi.schema.yaml	Updated version from `0.30.1` to `0.31.0`; added consequence types: `frameshift_elongation`, `frameshift_truncation`, and `protein_altering_variant`.
src/annotate/seqvars/{ann.rs, csq.rs, provider.rs}	In `ann.rs`, new variants `FrameshiftElongation` and `FrameshiftTruncation` were added and `ProteinAlteringVariant` reinstated in the `Consequence` enum. In `csq.rs`, prediction logic was refined (including normalization and protein analysis updates), and in `provider.rs`, error handling in sequence retrieval was enhanced.
src/db/check/mod.rs	Changed `HgncEntry`'s `location` and `location_sortable` fields from `String` to `Option<String>`.
tests/.../vep.disagreement-cases.expected.vcf	Git LFS metadata updated: new oid and increased file size (from 17,967 bytes to 18,657 bytes).

Sequence Diagram(s)

sequenceDiagram
    participant V as Variant Input
    participant CP as ConsequencePredictor
    participant PA as ProteinAnalyzer
    V->>CP: Request consequence prediction
    alt Alternative variant equals "N"
        CP->>V: Use reference allele for prediction
    else 
        CP->>PA: Analyze protein variant\n(includes tx_accession)
        PA-->>CP: Return detailed consequence (frameshift, missense, etc.)
    end
    CP->>V: Return final predicted consequence

sequenceDiagram
    participant C as Client
    participant P as Provider
    C->>P: Request sequence for accession (e.g., "NC...")
    P->>P: Map accession to chromosome & attempt sequence retrieval
    alt Sequence retrieved successfully
        P->>C: Return sequence data
    else
        P->>C: Fallback: retrieve using original accession
        alt Sequence found after fallback
            P->>C: Return sequence data
        else
            P->>C: Return error ("No sequence record found")
        end
    end

Possibly Related PRs

feat: Add rare amino acid SO term #658: Updates the Consequence enum by adding new variants, aligning with changes made in this PR.
fix: inframe vs missense #659: Revises the analyze_protein_variant method to enhance protein consequence analysis, similar to the improvements introduced here.

Suggested Reviewers

holtgrewe

Poem

I’m a rabbit in the code-field, hopping with delight,
New frameshifts and variants shining bright.
Dependencies updated, logic refined with care,
Each commit a sweet carrot, fresh and rare.
Through twists of code I joyfully roam,
Celebrating these changes, right at home.
🥕🐇 Happy coding under moonlit night!

✨ Finishing Touches

📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

… ref still has one

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

src/annotate/seqvars/provider.rs (1)

578-591: Consider more specific error handling and unwrap safety.

In line 579, calling unwrap() on reference_reader can cause a panic if reference_reader is unexpectedly None. Since you already check self.reference_available() before reaching this block, it might be safe in practice. Still, returning a more explicit error or using expect("...") could improve clarity.

Lines 589-591 use Error::NoSequenceRecord when decoding UTF-8 fails. A separate error variant like Error::InvalidSequenceEncoding might help distinguish encoding issues from missing sequence data.

src/annotate/seqvars/csq.rs (3)

691-709: Refine frameshift vs. protein-level outcomes check.

This block removes the CDS-level frameshift consequence if a more specific protein-level frameshift or non-empty result is found. While logical, beware of scenarios in which a frameshift might have multiple parallel impacts (e.g., partial in-frame region). Consider tests ensuring no unintended consequence gets suppressed.

944-944: New 'conservative' parameter.

The signature includes a new boolean parameter to distinguish conservative from disruptive changes. Make sure to document its effect in function comments and confirm it’s correctly set by all callers.

991-1030: Comprehensive handling of in-frame vs. frameshift.

These lines thoroughly check reference/alternative lengths modulo 3, properly tagging frameshifts and in-frame edits. Consider test cases for empty strings or extremely large differences to confirm robust boundary handling (e.g., large deletions).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea2f40e and f51ecc0.

⛔ Files ignored due to path filters (4)

Cargo.lock is excluded by !**/*.lock
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193311167-ATGT-T.snap is excluded by !**/*.snap
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193311170-TGGCG-G.snap is excluded by !**/*.snap
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193409913-ATAA-A.snap is excluded by !**/*.snap

📒 Files selected for processing (7)

Cargo.toml (1 hunks)
openapi.schema.yaml (3 hunks)
src/annotate/seqvars/ann.rs (4 hunks)
src/annotate/seqvars/csq.rs (15 hunks)
src/annotate/seqvars/provider.rs (1 hunks)
src/db/check/mod.rs (1 hunks)
tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf (1 hunks)

✅ Files skipped from review due to trivial changes (1)

tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: Schema
GitHub Check: Testing
GitHub Check: build-and-push-image

🔇 Additional comments (15)

Cargo.toml (1)

48-48:
✅ Verification successful

Version update looks good.

The update of the hgvs dependency from 0.18.0 to 0.18.1 is appropriate. This minor version bump likely includes bug fixes related to variant consequence predictions.

Let's verify the changelog or release notes for this version to confirm the changes:

🏁 Script executed:
#!/bin/bash
# Check the latest release notes for hgvs crate

echo "Checking for release information for hgvs 0.18.1..."
curl -s "https://crates.io/api/v1/crates/hgvs/0.18.1" | jq '.version.crate_size, .version.created_at, .version.updated_at'

# Check git repository for release notes
echo "Checking git repository for release notes..."
curl -s "https://api.github.com/search/repositories?q=hgvs+language:rust" | jq '.items[0].html_url'
Length of output: 544
Update Verified: Dependency hgvs version update confirmed

Confirmed that Cargo.toml now specifies hgvs = "0.18.1" (line 48).

The output from crates.io shows the release was created on 2025-02-25, and the GitHub repository (https://github.com/varfish-org/hgvs-rs) confirms the updated release information.

The version bump is appropriate, likely incorporating the expected bug fixes for variant consequence predictions.
src/db/check/mod.rs (1)

157-158:
✅ Verification successful

Make location fields optional to handle missing data.

Good change to make both location and location_sortable optional, allowing the application to handle cases where HGNC entries don't have location information. This enhances robustness when dealing with incomplete data sources.

Let's verify if there are actual HGNC entries that might have missing location information:

🏁 Script executed:
#!/bin/bash
# Search for any possible instances where location information might be missing

echo "Searching for handling of missing location information in the codebase..."
rg "location: None" -A 2 -B 2
rg "location_sortable: None" -A 2 -B 2

echo "Checking how location fields are accessed in the code..."
rg "\.location" --type rust
rg "\.location_sortable" --type rust
Length of output: 1696
Verified: Optional Location Fields Handling Confirmed

The changes in src/db/check/mod.rs (lines 157–158) correctly update both location and location_sortable to be optional. Our investigation shows that similar handling of missing location data is already present elsewhere (for example, in src/db/create/mod.rs where map_location is set to None), which confirms that some HGNC entries indeed lack this information. Although the usage of record.location in src/verify/seqvars.rs still performs operations like splitting the string, the overall design now explicitly acknowledges that location data may be missing. Please ensure any direct operations on these fields properly account for a potential None value elsewhere in the codebase.
openapi.schema.yaml (2)

11-11: Version bump aligned with new features.

Version updated from 0.30.1 to 0.31.0 to reflect the addition of new consequence types. This follows semantic versioning principles correctly, using a minor version increment for backward-compatible feature additions.

415-416:
✅ Verification successful

New consequence types added to the schema.

The addition of frameshift_elongation, frameshift_truncation, and protein_altering_variant to the Consequence schema improves the precision of variant classification. These new terms directly support the PR objectives of enhancing consequence predictions by distinguishing between different types of frameshift outcomes.

Let's check if these new consequence types are properly implemented throughout the codebase:

Also applies to: 428-428

🏁 Script executed:
#!/bin/bash
# Check if the new consequence types are properly implemented in the codebase

echo "Checking for FrameshiftElongation in the codebase..."
rg "FrameshiftElongation" --type rust

echo "Checking for FrameshiftTruncation in the codebase..."
rg "FrameshiftTruncation" --type rust

echo "Checking for ProteinAlteringVariant in the codebase..."
rg "ProteinAlteringVariant" --type rust

echo "Checking for references to these variants in tests..."
rg "frameshift_elongation|frameshift_truncation|protein_altering_variant" -g "**/*test*"
Length of output: 2180
Verification Complete: New Consequence Types Properly Integrated

The verification confirms that the new consequence types—FrameshiftElongation, FrameshiftTruncation, and ProteinAlteringVariant—are correctly implemented across the codebase:

Implementation Check:

src/annotate/seqvars/ann.rs and src/annotate/seqvars/csq.rs properly reference the new variants.

Test snapshots also include references (e.g., frameshift_truncation) confirming their usage.

These changes successfully enhance the precision of variant classification as outlined in the PR objectives.
src/annotate/seqvars/ann.rs (4)

88-94: New frameshift variants improve classification precision.

The addition of FrameshiftElongation and FrameshiftTruncation variants with clear documentation enhances precision in frameshift variant classification. This directly addresses the PR objective to specify whether a frameshift results in shortening or lengthening of the protein.

142-144: Reinstated protein altering variant classification.

Good job reinstating the ProteinAlteringVariant enum variant with proper documentation. This further supports the PR objective of refining consequence definitions when the variant resolves to a protein-level change.

305-305: Re-enabled protein altering variant impact level.

The inclusion of ProteinAlteringVariant in the Moderate impact level assessment is consistent with variant annotation standards. This reinstated variant properly bridges the gap between high-impact frameshift variants and more localized missense changes.

294-295:
❓ Verification inconclusive

Impact level assignment for new frameshift variants.

The assignment of High impact level to both new frameshift variants is correct. Frameshifts that either elongate or truncate proteins frequently have significant functional consequences, warranting the high impact classification.

Let's check if the tests have been updated to cover these new variants:

🏁 Script executed:
#!/bin/bash
# Check if tests have been updated for the new frameshift variants

echo "Searching for tests involving the new frameshift variants..."
rg -g "**test**" "FrameshiftElongation|FrameshiftTruncation|ProteinAlteringVariant"

echo "Checking csq.rs file for implementation of the new variants..."
rg -g "src/annotate/seqvars/csq.rs" -A 5 -B 5 "frameshift|protein_altering"
Length of output: 7247
Impact Level & Test Coverage for New Frameshift Variants

The assignment of High impact to both FrameshiftElongation and FrameshiftTruncation is correct given their potential to drastically alter protein function.

The implementation in src/annotate/seqvars/csq.rs shows that the new frameshift logic is in place.

However, the search for test references (i.e. explicit mentions of FrameshiftElongation, FrameshiftTruncation, or ProteinAlteringVariant in test files) didn’t yield any results.
→ Action: Please verify whether these cases are covered indirectly in existing integration tests or if dedicated tests should be added to ensure robust coverage.
src/annotate/seqvars/csq.rs (7)

9-10: Confirm that new imports are necessary.

These additional imports (AltSeqBuilder, NoRef, and ProteinEdit) expand the functionality for building alt sequences and parsing certain HGVS forms. Ensure that they are all used and don’t introduce unexpected overhead or conflicts.

132-138: Handle 'N' allele carefully and address TODO.

Converting 'N' to the reference is valid under VCF ≥4.1, but consider commenting or tracking how older VCF versions would treat 'N'.

The TODO on line 133 suggests verifying the VCF spec. Implementation or removal of this note would clarify maintenance requirements.

590-596: Double-check parameter order consistency.

The updated call to analyze_protein_variant() includes a &tx_record.tx_ac argument. Verify that the parameters match the updated function signature, especially the conservative boolean’s position, so there's no accidental mix-up.

975-979: 5' UTR intron vs. exon variant logic.

Lines 975–979 differentiate 5' UTR introns from exons. This is a valuable distinction and may reveal edge cases if the boundary between the UTR and coding sequence is incorrectly annotated. Confirm there’s coverage in your tests for intron/exon boundaries near the start codon.

1063-1109: Verify altseq length comparison and missing data scenarios.

The length comparison logic for frameshift truncation vs. elongation is helpful. However, calling alt_data.first().unwrap() in lines 1079–1080 can panic if no alt data is built (edge cases?).

Consider explicitly handling or warning if building altseq yields an empty vector.

1613-1613: Expanded frameshift truncation test coverage.

This new test covers a frameshift truncation scenario. It’s a welcome addition. Ensure that any borderline cases (e.g., same-length frameshifts or partial codon changes) are also tested.

1616-1616: Additional test for stop lost and inframe deletion.

The combination of StopLost + FeatureElongation + ConservativeInframeDeletion is a good example of complex variant outcomes. Keep verifying that multiple concurrent consequences reflect real-world biology.

tedil added 8 commits February 17, 2025 17:59

in the case of ProteinEdit::Fs, fetch ref+alt seq to compare in more …

df65314

…detail

include protein altering variant in delins cases

d8c3b9a

remove unused import

5893b6f

swap order of expected vs actual for readability

53433d3

wip: add special case fixes for CDS frameshift but not protein frames…

5ea7d28

…hift prediction

fix utr consequence predictions

fd9126d

define frameshift and inframe insertion/deletion in terms of CDS

3eb9aed

only report inframe ins/del when exonic

c31791b

tedil self-assigned this Feb 22, 2025

tedil added 2 commits February 22, 2025 10:47

fmt

368b6a2

update openapi schema

6ad558a

tedil added 14 commits February 22, 2025 10:57

temporarily point to fix-ter-subst branch of hgvs-rs

6ac6d90

newer hgnc_complete_set json entries do not always have a location

6c4842b

fix: according to vcf spec: ALT=N → REF=ALT

df69b91

fix: try both chrom and acc for reference readers. slightly ugly.

bda664e

bump to hgvs-rs 0.18.1

d0abf8e

annotate stoplost for p delins if alternative doesn't have a stop but…

9270d74

… ref still has one

fmt

dcca9a7

update Cargo.lock

32c6816

include conservative inframe deletion csq in annotate_del_opa1_csqs test

2e303b8

update insta snapshots

716c52f

include frameshift truncation in annotate_del_opa1_csqs

6105598

update vep disagreement cases expected csqs

d221314

update snapshots

e28aee0

refactor testing code, skip one specific var

f51ecc0

tedil marked this pull request as ready for review February 25, 2025 15:58

coderabbitai bot reviewed Feb 25, 2025

View reviewed changes

tedil enabled auto-merge (squash) February 25, 2025 16:07

tedil merged commit 73d0fd3 into main Feb 25, 2025
12 checks passed

tedil deleted the csq-from-prot-seq branch February 25, 2025 16:22

varfish-bot mentioned this pull request Feb 17, 2025

chore(main): release 0.32.0 #685

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: improve consequence predictions #690

fix: improve consequence predictions #690

tedil commented Feb 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 22, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

fix: improve consequence predictions #690

fix: improve consequence predictions #690

Conversation

tedil commented Feb 22, 2025 • edited by coderabbitai bot Loading

Changes

New consequence terms

TODO

Summary by CodeRabbit

coderabbitai bot commented Feb 22, 2025 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly Related PRs

Suggested Reviewers

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

tedil commented Feb 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 22, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)