-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/cram stream #31
Feat/cram stream #31
Conversation
… versatile region requests (e.g., ‘chr1’, ‘*’, ‘chr1:100-’, ‘chr1:-1000’, ect.)
…more versatile region requests (e.g., ‘chr1’, ‘*’, ‘chr1:100-’, ‘chr1:-1000’, ect. (2) updated slice_cram function to accept values from refactored parse_region function.)
…on after refactoring.
…mand line request, used for cram slicing and streaming.
…et_command function.
…t_command function.
…me_from_url function as more than one function uses this functionality.
…ces and streams the CRAM file or writes it to a local file, if given an output filename.
…m_cram will be the class method name.
…CRAM files and stream or write them to a local file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the PR. I left some questions and suggestions on mainly minor things, but I have one main point:
- using an
os.system
call to run htseq makes the code more complicated and inconsistent. I think we should use python instead, as this is what the htseq cli calls anyways. We could either use the htseq-api withoutput= sys.stdout.buffer
(as they do in https://github.com/ga4gh/htsget/blob/ac3ba43fd017fc1db173d323a5de2987a354fdb6/htsget/cli.py#L72) or simply write our own requests? I did not manage to fully unravel their code, but it is rather short and looks like little extra to the request call. The code looks slightly smelly to me (not addressed todos from >2 years, hard coded values, https://github.com/ga4gh/htsget/blob/master/htsget/io.py). So maybe it is worth to implement by ourselves?
… htsget_endpoint not given, it is asumed to have the same domain as s3_endpoint, or None (if no s3_endpoint is given)
…bles. If htsget_endpoint not given, it is asumed to have the same domain as s3_endpoint, or None (if no s3_endpoint is given)" This reverts commit cc86952.
… htsget_endpoint not given, it is asumed to have the same domain as s3_endpoint, or None (if no s3_endpoint is given)
…nd to take advantage of the added self.s3_endpoint and self.htsget_endpoint to point to the requested CRAM file.
…_name, and to take advantage of the added self.s3_endpoint and self.htsget_endpoint to point to the requested CRAM file." This reverts commit 2fd9ac4.
…nd to take advantage of the added self.s3_endpoint and self.htsget_endpoint to point to the requested CRAM file.
…s they are no longer used.
Thanks for the great comments and suggestions @almutlue ! Do you @cmdoret have any further comments ... or disagree with any of the above resolutions ... or can I go ahead and merge the branch with main (and if yes, which one should I use 'create a merge commit', 'squash and merge', or 'rebase and merge'?) |
fix: change 'cram_name' to 'cram_path' in stream_cram() method variables. Co-authored-by: almutlue <[email protected]>
Co-authored-by: almutlue <[email protected]>
Co-authored-by: almutlue <[email protected]>
data/modo1/data.zarr/.zgroup
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Do we need another example dataset? Can we remove test files before merging?
… end[int, optional]). Leadintg and trailing whitspace is trimmed.
…o feat/cram_stream
…utput (str, Optional[int], Optional[int]). Add 'reference_filename' argument, for cases where correct reference file location is not given in the CRAM header.
…ion() output (str, Optional[int], Optional[int]).
…m slice is written to a local file. Otherwise, the function returns an iterator.
…uts data stream or iterator, and save_cram(), saves output to local file.
….AlignmentSegment object for both local and remote CRAM files. (1) add helper function to write BytesIO buffer to a temporary pysam.AlignmentFile, yielding alignments from it. (2) updated slice_remote_cram() to uses BytesIO buffer, convert it to an iterator and return it. (3) updated MODO.stream_cram() to supply reference_filename in remote case, save the slice_remote_cram() output to iterator, and return the iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Assaf.
This looks good to me and works on my side as well.
I needed to specify a local reference file to make it work. It would be nice, if it could use the remote reference as default, but I think that is for a new PR.
Aim: Add stream_cram method to MODO, that allows to stream both local and remote CRAM files.##
PR elements:
1. Refactor of the region parser to allow a wider range of range request formats:
The old region parser did not allow the user to ask for an entire chromosome (e.g., "chr1"), a one side open-ended region (e.g., "chr1:10000" or "chr1:-10500"), or un-mapped reads (e.g., "*"). The refactored parser deals with such requests. Due to the possibility of start and/or end to be None, the output was changed to tuple[chrom: str, start: str, end: str].
2. Refactor slice_cram() to handle the updated output of parse_region():
The old version of slice_cram() worked only with a "reference:start-end" region format, expecting start and end as integers. The New version handles start and end being strings or None.
3. Add slice_remote_cram() to handle remote cram files:
The new function streams or writes to a local file the region of a remote CRAM file. The function uses the helper functions:
4. Add htsget_url function:
Currently not used, but might be useful in the future. Similar to htsget_command(), it returns a valid htsget URL that can be used in command line use (e.g., 'samtools view -H http://localhost/htsget/reads/ex/demo1?referenceName=chr1&start=10000&end=10500&format=CRAM')
5. Add stream_cram method to MODO
The method format is identical for local and remote files. First, it checks that the CRAM file actually exists in the MODO. If it does, based on the existence of s3_endpoint it either calls slice_cram() or slice_remote_cram(). Currently local files are always streamed. For remote files, if an output filename is given, the data is written to a local file. If not, the data is streamed.