-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow specifying reference FASTA #684
Conversation
Warning Rate limit exceeded@tedil has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 5 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughThis pull request updates dependencies and integrates new features to support reference genome handling in the annotation process. It modifies dependency versions in the package manifest, refactors mapping logic in consequence predictors, adds support for providing a reference genome (with an option for in-memory loading), and updates command-line argument parsing. A new module for FASTA reference reading has also been introduced, affecting various modules in the annotator and provider. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as Command Line
participant Server as Server Runner
participant Annotator as ConsequenceAnnotator
participant Provider as MehariProvider
participant RefReader as ReferenceReader
CLI->>Server: Pass arguments (reference, in_memory_reference)
Server->>Annotator: setup_seqvars_annotator(reference, in_memory_reference)
Annotator->>Provider: Initialize provider(new reference parameters)
Provider->>RefReader: Initialize ReferenceReader (if reference provided)
RefReader-->>Provider: Return reference access handle
Provider-->>Annotator: Return configured provider
Annotator-->>Server: Annotator ready
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
…ered random access
…embly can be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (15)
src/annotate/seqvars/reference.rs (3)
1-15
: Consider avoiding.expect()
for production-grade code.While using
.expect("Failed to...")
is acceptable for rapid prototyping, returning an error instead of panicking is often preferable for robust libraries and applications. This approach will allow upstream error handling and avoid abrupt terminations on recoverable failures.- let reference_reader = bio::io::fasta::Reader::from_file(&reference_path) - .expect("Failed to create FASTA reader"); + let reference_reader = bio::io::fasta::Reader::from_file(&reference_path) + .map_err(|e| anyhow!("Failed to create FASTA reader: {}", e))?;
17-36
: Use a more descriptive error on read failures.Currently, the code uses
.expect("Failed to read FASTA record")
, which can cause a panic. A better practice is to collect the error and return via?
, preserving the original cause and improving debuggability.- let record = r.expect("Failed to read FASTA record"); + let record = r.map_err(|e| anyhow!("Error reading FASTA record: {}", e))?;
58-96
: Return errors instead of panics for file operations.In
from_path()
, certain checks rely onexpect
. Swapping those out for graceful error handling will make the code more resilient and user-friendly. This ensures an I/O error does not crash the entire application.src/annotate/seqvars/provider.rs (7)
27-30
: Consolidate import statements for clarity.These imports appear in consecutive lines. Consider grouping them into a single block or reusing existing blocks if your style guidelines allow it, to keep the module more organized.
183-192
: Document the new fields.
reference_reader
,chrom_to_acc
, andacc_to_chrom
are added. Add doc comments explaining their intended usage (on-demand reference retrieval, chromosome-accession mapping) for future maintainers.
305-314
: Watch out for large memory usage.Loading large references in memory (line 311) can lead to high memory usage. This might be addressed in future tasks (like on-demand reading). For now, consider adding a warning in the docs or logs when using
InMemoryFastaAccess
.
334-449
: Transcript picking logic.The picked-genes approach is thorough. However, keep an eye on performance if the gene list grows very large. A small micro-optimization might include short-circuiting once you find a suitable pick in
TranscriptPickMode::First
.
525-529
: Consider caching the built chromosome-to-acc map.If
build_chrom_to_acc()
is repeatedly called, it might be beneficial to cache the result. If it’s only called once, the current approach is fine.
543-545
: Publicly exposing_get_assembly_map
might improve composability.Having a single place where assembly data is retrieved fosters consistency. If there's potential reuse in other modules, re-export or rename
_get_assembly_map
to reflect a more general usage.
887-894
: Assembly map building.This function is essential for capturing sequence name mappings. The usage of
IndexMap
for preserving insertion order is a nice touch if order matters, otherwise a regularHashMap
may suffice.src/server/run/mod.rs (1)
119-126
: Optional reference and in-memory flag.These added CLI options look good.
#[arg(long, requires = "reference")]
ensures correctness ifin_memory_reference
is set. Just make sure help messages highlight the memory impact of loading a large reference.src/verify/seqvars.rs (3)
36-38
: Improve field documentation for memory implications.The
in_memory_reference
field should include documentation about memory requirements, especially since loading a human reference genome requires approximately 3GB of memory./// Read the reference genome into memory. +/// Note: Loading a human reference genome requires approximately 3GB of memory. #[arg(long, requires = "reference")] pub in_memory_reference: bool,
149-150
: Remove commented out code.This commented out code is no longer needed and should be removed to maintain code cleanliness.
- // let reference = noodles::fasta::io::indexed_reader::Builder::default() - // .build_from_path(&args.path_reference_fasta)?;
134-137
: Enhance error message for FASTA file opening.The error message could be more specific about potential issues when opening the reference FASTA file.
tracing::info!( - "Opening reference FASTA file: {}", + "Opening reference FASTA file (requires FAI index): {}", &args.path_reference_fasta );src/annotate/seqvars/csq.rs (1)
1251-1255
: Add test cases for reference FASTA handling.Consider adding test cases that verify the behavior with and without a reference FASTA file.
Add test cases like:
#[test] fn test_with_reference_fasta() { // Test prediction with reference FASTA } #[test] fn test_without_reference_fasta() { // Test prediction without reference FASTA }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (8)
Cargo.toml
(2 hunks)src/annotate/seqvars/csq.rs
(10 hunks)src/annotate/seqvars/mod.rs
(15 hunks)src/annotate/seqvars/provider.rs
(10 hunks)src/annotate/seqvars/reference.rs
(1 hunks)src/annotate/strucvars/csq.rs
(1 hunks)src/server/run/mod.rs
(3 hunks)src/verify/seqvars.rs
(4 hunks)
🔇 Additional comments (19)
src/annotate/seqvars/provider.rs (10)
4-6
: Imports look consistent.
No issues found. The addition of reference-related modules seems correct.
199-202
: Enum usage is appropriate.
This approach of wrapping each struct variant is a clean way to unify multiple implementations under one trait object.
204-216
: Consistent trait forwarding is good.Allowing
ReferenceReaderImpl
to proxy calls to eitherInMemoryFastaAccess
orUnbufferedIndexedFastaAccess
is a tidy design. No immediate issues.
240-245
: Support for optional reference is handled well.The new parameters for the
Provider::new()
method gracefully handle both in-memory and unbuffered reference usage. Nicely done.
287-287
: Ensure all impacted code paths honor this new picking logic.Whenever
picked_gene_to_tx_map
is used, confirm that you do not inadvertently override or skip references if new transcript picks are required. This new logic looks fine but double-check usage in downstream code.
316-318
: Chromosome-accession mapping is a great addition.This mapping will help unify the naming conventions. No direct issues found.
521-523
: Getter is straightforward.
No concerns regardingreference_available()
, as it cleanly checks presence ofreference_reader
.
572-587
: Reference-based retrieval logic.The fallback to
reference_reader
is sound for “NC|NT|NW” accessions. Verify that partial range requests for these reference sequences are validated. A quick boundary check prior to line 583 would avoid empty or out-of-bounds slices.
588-611
: Transcript-based fallback logic is clear.Using transcript DB if the reference is not available is sensible. Just ensure the fallback performs well for large queries, especially in server scenarios.
[approve]
896-908
: chrom → acc mapping extension.Appending both “chrX” and “X” keys fosters robust queries. Potentially watch out for collisions if non-standard contigs are present (e.g., “chrUn”). Using a fallback for unknown contigs might help.
src/server/run/mod.rs (1)
17-17
: ImportPathBuf
is consistent with usage.
No issues identified.src/annotate/strucvars/csq.rs (1)
454-454
: LGTM! Improved code maintainability.The change to use
provider.build_chrom_to_acc
encapsulates the chromosome-to-accession mapping logic, making the code more maintainable.src/annotate/seqvars/csq.rs (1)
88-96
: LGTM! Improved mapper configuration based on reference availability.The mapper configuration is now properly set based on reference availability, which is crucial for accurate HGVS nomenclature.
src/annotate/seqvars/mod.rs (4)
62-62
: LGTM!The new
reference
module is appropriately scoped and named.
85-91
: LGTM!The new fields are well-documented and properly constrained. The
requires
attribute ensures correct usage of the in-memory option.
1834-1839
: LGTM!The function signature is updated consistently with the new configuration options.
1842-1850
: LGTM!The provider initialization correctly incorporates the new reference parameters while maintaining existing configuration.
Cargo.toml (2)
81-81
: LGTM!The
memmap2
dependency is appropriate for efficient memory-mapped file access to reference FASTA files.
48-48
:❓ Verification inconclusive
Verify compatibility with HGVS 0.18.0.
The HGVS dependency has been updated to a new minor version, which may include breaking changes.
🌐 Web query:
What are the breaking changes in HGVS 0.18.0 compared to 0.17.5?
💡 Result:
The provided search results do not contain information about an "HGVS 0.18.0" release or breaking changes between versions 0.17.5 and 0.18.0 for any software named "HGVS." This appears to be a potential misunderstanding:
HGVS Nomenclature (Human Genome Variation Society) uses version numbers like 21.x.x ([source][2]), with no mention of a 0.x.x version scheme in any of the provided resources.
The only reference to version 0.18.0 comes from an unrelated project called wlroots ([source][10]), which describes breaking changes for compositors in its backend system.
Before providing further assistance, please verify:
- Whether this refers to HGVS nomenclature guidelines (human genome variant descriptions)
- Or a software/library named "HGVS" with semantic versioning (e.g.,
hgvs
Python package)Let me know how you'd like to proceed!
Citations:
- 1: https://hgvs-nomenclature.org/versions/
- 2: https://hgvs-nomenclature.org/versions/21.0/
- 3: https://hgvs-nomenclature.org/recommendations/general/
- 4: https://hgvs-nomenclature.org/versions/older-versions/
- 5: https://www.ncbi.nlm.nih.gov/clinvar/variation/638385
- 6: https://archive.hgvs-nomenclature.org/versioning/
- 7: https://www.hgvs.org/mutnomen/version-list.html
- 8: https://github.com/biocommons/hgvs/blob/main/CHANGELOG
- 9: https://www.hgvs.org/mutnomen/FAQ.html
- 10: https://gitlab.freedesktop.org/wlroots/wlroots/-/releases/0.18.0
HGVS 0.18.0 Compatibility Verification
The dependency update in Cargo.toml (line 48) changes HGVS from version 0.17.5 to 0.18.0. A web query for documented breaking changes did not return any official notes or changelogs indicating modifications that would impact our usage. However, since the version bump might signal subtle differences, please ensure that:
- All functionality relying on HGVS is thoroughly tested.
- Integration tests are run to confirm that the new version’s behavior matches expectations.
- If any undocumented changes are discovered during testing, consider reviewing the dependency’s repository or contacting its maintainers for clarification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/annotate/seqvars/mod.rs
(18 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: Testing
- GitHub Check: Schema
- GitHub Check: build-and-push-image
🔇 Additional comments (3)
src/annotate/seqvars/mod.rs (3)
62-62
: LGTM!The new
reference
module is correctly declared with appropriate visibility and follows Rust naming conventions.
85-91
: LGTM!The new fields are well-designed with:
- Appropriate types for their purposes
- Clear documentation
- Correct dependency enforcement between fields using the
requires
attribute
1834-1839
: LGTM!The function signature changes are consistent and properly propagate the new reference parameters throughout the codebase.
Also applies to: 1996-2002, 2077-2083
To properly shift genomic variants in the 3' direction as enforced by HGVS nomenclature, reference sequences have to be available. This is especially important for variants close to / crossing / shuffled towards intron/exon boundaries (about ~40k in clinvar VCF), as otherwise the hgvs descriptions will be incorrect.
Therefore, users can now optionally specify a genomic reference FASTA file via
--reference
.Right now, we read the reference sequences into memory completely, which requires ~3GB of memory for a standard human reference genome.
TODO:
reading on-demand from FASTA.BGZ + FAI/GZI(postponed)explore caching only 1 contig at a time, which would be fine for sorted VCFsSummary by CodeRabbit
ConsequencePredictor
for better initialization efficiency.