Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/cram stream #31

Merged
merged 42 commits into from
Apr 30, 2024
Merged

Feat/cram stream #31

merged 42 commits into from
Apr 30, 2024

Conversation

AssafSternberg
Copy link
Contributor

@AssafSternberg AssafSternberg commented Mar 28, 2024

Aim: Add stream_cram method to MODO, that allows to stream both local and remote CRAM files.##

PR elements:

1. Refactor of the region parser to allow a wider range of range request formats:

The old region parser did not allow the user to ask for an entire chromosome (e.g., "chr1"), a one side open-ended region (e.g., "chr1:10000" or "chr1:-10500"), or un-mapped reads (e.g., "*"). The refactored parser deals with such requests. Due to the possibility of start and/or end to be None, the output was changed to tuple[chrom: str, start: str, end: str].

2. Refactor slice_cram() to handle the updated output of parse_region():

The old version of slice_cram() worked only with a "reference:start-end" region format, expecting start and end as integers. The New version handles start and end being strings or None.

3. Add slice_remote_cram() to handle remote cram files:

The new function streams or writes to a local file the region of a remote CRAM file. The function uses the helper functions:

  • filename_from_url(): extracts the filename from the given URL.
  • hstget_command(): formats a valid htsget command line request.

4. Add htsget_url function:

Currently not used, but might be useful in the future. Similar to htsget_command(), it returns a valid htsget URL that can be used in command line use (e.g., 'samtools view -H http://localhost/htsget/reads/ex/demo1?referenceName=chr1&start=10000&end=10500&format=CRAM')

5. Add stream_cram method to MODO

The method format is identical for local and remote files. First, it checks that the CRAM file actually exists in the MODO. If it does, based on the existence of s3_endpoint it either calls slice_cram() or slice_remote_cram(). Currently local files are always streamed. For remote files, if an output filename is given, the data is written to a local file. If not, the data is streamed.

… versatile region requests (e.g., ‘chr1’, ‘*’, ‘chr1:100-’, ‘chr1:-1000’, ect.)
…more versatile region requests (e.g., ‘chr1’, ‘*’, ‘chr1:100-’, ‘chr1:-1000’, ect. (2) updated slice_cram function to accept values from refactored parse_region function.)
…mand line request, used for cram slicing and streaming.
…me_from_url function as more than one function uses this functionality.
…ces and streams the CRAM file or writes it to a local file, if given an output filename.
…CRAM files and stream or write them to a local file.
Copy link
Contributor

@almutlue almutlue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR. I left some questions and suggestions on mainly minor things, but I have one main point:

… htsget_endpoint not given, it is asumed to have the same domain as s3_endpoint, or None (if no s3_endpoint is given)
…bles. If htsget_endpoint not given, it is asumed to have the same domain as s3_endpoint, or None (if no s3_endpoint is given)"

This reverts commit cc86952.
… htsget_endpoint not given, it is asumed to have the same domain as s3_endpoint, or None (if no s3_endpoint is given)
…nd to take advantage of the added self.s3_endpoint and self.htsget_endpoint to point to the requested CRAM file.
…_name, and to take advantage of the added self.s3_endpoint and self.htsget_endpoint to point to the requested CRAM file."

This reverts commit 2fd9ac4.
…nd to take advantage of the added self.s3_endpoint and self.htsget_endpoint to point to the requested CRAM file.
@AssafSternberg
Copy link
Contributor Author

AssafSternberg commented Apr 24, 2024

Thanks for the great comments and suggestions @almutlue !

Do you @cmdoret have any further comments ... or disagree with any of the above resolutions ... or can I go ahead and merge the branch with main (and if yes, which one should I use 'create a merge commit', 'squash and merge', or 'rebase and merge'?)

@almutlue almutlue self-requested a review April 24, 2024 09:49
fix: change 'cram_name' to 'cram_path' in stream_cram() method variables.

Co-authored-by: almutlue <[email protected]>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Do we need another example dataset? Can we remove test files before merging?

AssafSternberg and others added 10 commits April 25, 2024 13:56
… end[int, optional]). Leadintg and trailing whitspace is trimmed.
…utput (str, Optional[int], Optional[int]). Add 'reference_filename' argument, for cases where correct reference file location is not given in the CRAM header.
…ion() output (str, Optional[int], Optional[int]).
…m slice is written to a local file. Otherwise, the function returns an iterator.
…uts data stream or iterator, and save_cram(), saves output to local file.
….AlignmentSegment object for both local and remote CRAM files. (1) add helper function to write BytesIO buffer to a temporary pysam.AlignmentFile, yielding alignments from it. (2) updated slice_remote_cram() to uses BytesIO buffer, convert it to an iterator and return it. (3) updated MODO.stream_cram() to supply reference_filename in remote case, save the slice_remote_cram() output to iterator, and return the iterator.
Copy link
Contributor

@almutlue almutlue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Assaf.
This looks good to me and works on my side as well.
I needed to specify a local reference file to make it work. It would be nice, if it could use the remote reference as default, but I think that is for a new PR.

@AssafSternberg AssafSternberg merged commit 320d117 into main Apr 30, 2024
2 checks passed
@cmdoret cmdoret deleted the feat/cram_stream branch April 30, 2024 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants