-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: improve consequence predictions #690
Conversation
WalkthroughThis PR updates several components by revising dependency versions and enhancing variant consequence handling. The Changes
Sequence Diagram(s)sequenceDiagram
participant V as Variant Input
participant CP as ConsequencePredictor
participant PA as ProteinAnalyzer
V->>CP: Request consequence prediction
alt Alternative variant equals "N"
CP->>V: Use reference allele for prediction
else
CP->>PA: Analyze protein variant\n(includes tx_accession)
PA-->>CP: Return detailed consequence (frameshift, missense, etc.)
end
CP->>V: Return final predicted consequence
sequenceDiagram
participant C as Client
participant P as Provider
C->>P: Request sequence for accession (e.g., "NC...")
P->>P: Map accession to chromosome & attempt sequence retrieval
alt Sequence retrieved successfully
P->>C: Return sequence data
else
P->>C: Fallback: retrieve using original accession
alt Sequence found after fallback
P->>C: Return sequence data
else
P->>C: Return error ("No sequence record found")
end
end
Possibly Related PRs
Suggested Reviewers
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
… ref still has one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/annotate/seqvars/provider.rs (1)
578-591
: Consider more specific error handling and unwrap safety.
- In line 579, calling
unwrap()
onreference_reader
can cause a panic ifreference_reader
is unexpectedlyNone
. Since you already checkself.reference_available()
before reaching this block, it might be safe in practice. Still, returning a more explicit error or usingexpect("...")
could improve clarity.- Lines 589-591 use
Error::NoSequenceRecord
when decoding UTF-8 fails. A separate error variant likeError::InvalidSequenceEncoding
might help distinguish encoding issues from missing sequence data.src/annotate/seqvars/csq.rs (3)
691-709
: Refine frameshift vs. protein-level outcomes check.This block removes the CDS-level frameshift consequence if a more specific protein-level frameshift or non-empty result is found. While logical, beware of scenarios in which a frameshift might have multiple parallel impacts (e.g., partial in-frame region). Consider tests ensuring no unintended consequence gets suppressed.
944-944
: New 'conservative' parameter.The signature includes a new boolean parameter to distinguish conservative from disruptive changes. Make sure to document its effect in function comments and confirm it’s correctly set by all callers.
991-1030
: Comprehensive handling of in-frame vs. frameshift.These lines thoroughly check reference/alternative lengths modulo 3, properly tagging frameshifts and in-frame edits. Consider test cases for empty strings or extremely large differences to confirm robust boundary handling (e.g., large deletions).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
Cargo.lock
is excluded by!**/*.lock
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193311167-ATGT-T.snap
is excluded by!**/*.snap
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193311170-TGGCG-G.snap
is excluded by!**/*.snap
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_del_opa1_csqs@3-193409913-ATAA-A.snap
is excluded by!**/*.snap
📒 Files selected for processing (7)
Cargo.toml
(1 hunks)openapi.schema.yaml
(3 hunks)src/annotate/seqvars/ann.rs
(4 hunks)src/annotate/seqvars/csq.rs
(15 hunks)src/annotate/seqvars/provider.rs
(1 hunks)src/db/check/mod.rs
(1 hunks)tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Schema
- GitHub Check: Testing
- GitHub Check: build-and-push-image
🔇 Additional comments (15)
Cargo.toml (1)
48-48
:✅ Verification successful
Version update looks good.
The update of the
hgvs
dependency from0.18.0
to0.18.1
is appropriate. This minor version bump likely includes bug fixes related to variant consequence predictions.Let's verify the changelog or release notes for this version to confirm the changes:
🏁 Script executed:
#!/bin/bash # Check the latest release notes for hgvs crate echo "Checking for release information for hgvs 0.18.1..." curl -s "https://crates.io/api/v1/crates/hgvs/0.18.1" | jq '.version.crate_size, .version.created_at, .version.updated_at' # Check git repository for release notes echo "Checking git repository for release notes..." curl -s "https://api.github.com/search/repositories?q=hgvs+language:rust" | jq '.items[0].html_url'Length of output: 544
Update Verified: Dependency hgvs version update confirmed
- Confirmed that
Cargo.toml
now specifieshgvs = "0.18.1"
(line 48).- The output from crates.io shows the release was created on 2025-02-25, and the GitHub repository (https://github.com/varfish-org/hgvs-rs) confirms the updated release information.
- The version bump is appropriate, likely incorporating the expected bug fixes for variant consequence predictions.
src/db/check/mod.rs (1)
157-158
:✅ Verification successful
Make location fields optional to handle missing data.
Good change to make both
location
andlocation_sortable
optional, allowing the application to handle cases where HGNC entries don't have location information. This enhances robustness when dealing with incomplete data sources.Let's verify if there are actual HGNC entries that might have missing location information:
🏁 Script executed:
#!/bin/bash # Search for any possible instances where location information might be missing echo "Searching for handling of missing location information in the codebase..." rg "location: None" -A 2 -B 2 rg "location_sortable: None" -A 2 -B 2 echo "Checking how location fields are accessed in the code..." rg "\.location" --type rust rg "\.location_sortable" --type rustLength of output: 1696
Verified: Optional Location Fields Handling Confirmed
The changes in
src/db/check/mod.rs
(lines 157–158) correctly update bothlocation
andlocation_sortable
to be optional. Our investigation shows that similar handling of missing location data is already present elsewhere (for example, insrc/db/create/mod.rs
wheremap_location
is set toNone
), which confirms that some HGNC entries indeed lack this information. Although the usage ofrecord.location
insrc/verify/seqvars.rs
still performs operations like splitting the string, the overall design now explicitly acknowledges that location data may be missing. Please ensure any direct operations on these fields properly account for a potentialNone
value elsewhere in the codebase.openapi.schema.yaml (2)
11-11
: Version bump aligned with new features.Version updated from
0.30.1
to0.31.0
to reflect the addition of new consequence types. This follows semantic versioning principles correctly, using a minor version increment for backward-compatible feature additions.
415-416
:✅ Verification successful
New consequence types added to the schema.
The addition of
frameshift_elongation
,frameshift_truncation
, andprotein_altering_variant
to the Consequence schema improves the precision of variant classification. These new terms directly support the PR objectives of enhancing consequence predictions by distinguishing between different types of frameshift outcomes.Let's check if these new consequence types are properly implemented throughout the codebase:
Also applies to: 428-428
🏁 Script executed:
#!/bin/bash # Check if the new consequence types are properly implemented in the codebase echo "Checking for FrameshiftElongation in the codebase..." rg "FrameshiftElongation" --type rust echo "Checking for FrameshiftTruncation in the codebase..." rg "FrameshiftTruncation" --type rust echo "Checking for ProteinAlteringVariant in the codebase..." rg "ProteinAlteringVariant" --type rust echo "Checking for references to these variants in tests..." rg "frameshift_elongation|frameshift_truncation|protein_altering_variant" -g "**/*test*"Length of output: 2180
Verification Complete: New Consequence Types Properly Integrated
The verification confirms that the new consequence types—
FrameshiftElongation
,FrameshiftTruncation
, andProteinAlteringVariant
—are correctly implemented across the codebase:
- Implementation Check:
src/annotate/seqvars/ann.rs
andsrc/annotate/seqvars/csq.rs
properly reference the new variants.- Test snapshots also include references (e.g.,
frameshift_truncation
) confirming their usage.These changes successfully enhance the precision of variant classification as outlined in the PR objectives.
src/annotate/seqvars/ann.rs (4)
88-94
: New frameshift variants improve classification precision.The addition of
FrameshiftElongation
andFrameshiftTruncation
variants with clear documentation enhances precision in frameshift variant classification. This directly addresses the PR objective to specify whether a frameshift results in shortening or lengthening of the protein.
142-144
: Reinstated protein altering variant classification.Good job reinstating the
ProteinAlteringVariant
enum variant with proper documentation. This further supports the PR objective of refining consequence definitions when the variant resolves to a protein-level change.
305-305
: Re-enabled protein altering variant impact level.The inclusion of
ProteinAlteringVariant
in theModerate
impact level assessment is consistent with variant annotation standards. This reinstated variant properly bridges the gap between high-impact frameshift variants and more localized missense changes.
294-295
:❓ Verification inconclusive
Impact level assignment for new frameshift variants.
The assignment of
High
impact level to both new frameshift variants is correct. Frameshifts that either elongate or truncate proteins frequently have significant functional consequences, warranting the high impact classification.Let's check if the tests have been updated to cover these new variants:
🏁 Script executed:
#!/bin/bash # Check if tests have been updated for the new frameshift variants echo "Searching for tests involving the new frameshift variants..." rg -g "**test**" "FrameshiftElongation|FrameshiftTruncation|ProteinAlteringVariant" echo "Checking csq.rs file for implementation of the new variants..." rg -g "src/annotate/seqvars/csq.rs" -A 5 -B 5 "frameshift|protein_altering"Length of output: 7247
Impact Level & Test Coverage for New Frameshift Variants
- The assignment of
High
impact to bothFrameshiftElongation
andFrameshiftTruncation
is correct given their potential to drastically alter protein function.- The implementation in
src/annotate/seqvars/csq.rs
shows that the new frameshift logic is in place.- However, the search for test references (i.e. explicit mentions of
FrameshiftElongation
,FrameshiftTruncation
, orProteinAlteringVariant
in test files) didn’t yield any results.
→ Action: Please verify whether these cases are covered indirectly in existing integration tests or if dedicated tests should be added to ensure robust coverage.src/annotate/seqvars/csq.rs (7)
9-10
: Confirm that new imports are necessary.These additional imports (
AltSeqBuilder
,NoRef
, andProteinEdit
) expand the functionality for building alt sequences and parsing certain HGVS forms. Ensure that they are all used and don’t introduce unexpected overhead or conflicts.
132-138
: Handle 'N' allele carefully and address TODO.
- Converting
'N'
to the reference is valid under VCF ≥4.1, but consider commenting or tracking how older VCF versions would treat'N'
.- The TODO on line 133 suggests verifying the VCF spec. Implementation or removal of this note would clarify maintenance requirements.
590-596
: Double-check parameter order consistency.The updated call to
analyze_protein_variant()
includes a&tx_record.tx_ac
argument. Verify that the parameters match the updated function signature, especially theconservative
boolean’s position, so there's no accidental mix-up.
975-979
: 5' UTR intron vs. exon variant logic.Lines 975–979 differentiate 5' UTR introns from exons. This is a valuable distinction and may reveal edge cases if the boundary between the UTR and coding sequence is incorrectly annotated. Confirm there’s coverage in your tests for intron/exon boundaries near the start codon.
1063-1109
: Verify altseq length comparison and missing data scenarios.
- The length comparison logic for frameshift truncation vs. elongation is helpful. However, calling
alt_data.first().unwrap()
in lines 1079–1080 can panic if no alt data is built (edge cases?).- Consider explicitly handling or warning if building altseq yields an empty vector.
1613-1613
: Expanded frameshift truncation test coverage.This new test covers a frameshift truncation scenario. It’s a welcome addition. Ensure that any borderline cases (e.g., same-length frameshifts or partial codon changes) are also tested.
1616-1616
: Additional test for stop lost and inframe deletion.The combination of
StopLost + FeatureElongation + ConservativeInframeDeletion
is a good example of complex variant outcomes. Keep verifying that multiple concurrent consequences reflect real-world biology.
Changes
frameshift_variant
shortens or lengthens by annotating SO termsframeshift_truncation
andframeshift_elongation
inframe_{insertion,deletion}
in terms of CDS, as suggested in their SO term definition{5,3}_prime_UTR_variant
casesprotein_altering_variant
(unless it resolves to a missense)inframe_deletion
,inframe_insertion
andmissense_variant
(previously would always reportdeletion
)New consequence terms
frameshift_truncation
frameshift_elongation
protein_altering_variant
TODO
Summary by CodeRabbit
Chores
New Features
Bug Fixes