-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
address memory issue with large GISAID files #94
Conversation
Wow, even if you call |
PS - this plugin will need a dev-bump in order to get CI passing. |
Yep, just confirmed again locally. This is one of the four test failures that I get in that case (all result in the same ValueError):
|
I think that makes sense, as This is all pretty confusing, but I like the temp-file approach. The extra disk-IO is a bummer, but it beats the memory requirements for the original. The only other alternative would be to create the skbio.DNA objects in the generator (by re-implementing |
Oh nice, yeah I didn't realize that was consuming a generator - makes sense! |
Another approach would be to return the |
So this might be a silly question, and I'm not super familiar with this part of the code in this project, but why not just refactor diff --git a/genome_sampler/plugin_setup.py b/genome_sampler/plugin_setup.py
index b7f4ec4..9ecb9fe 100644
--- a/genome_sampler/plugin_setup.py
+++ b/genome_sampler/plugin_setup.py
@@ -112,47 +112,49 @@ def _3(fmt: IDSelectionDirFmt) -> IDSelection:
def _read_gisaid_dna_fasta(path):
- def _cleanup_gen():
- with open(path) as input_f:
- lines = None
- for line in input_f:
- if line.startswith('>'):
- if lines is not None:
- yield from lines
- lines = [line]
- elif lines is not None:
- # Due to a bug in skbio 0.5.5, the lowercase option can't
- # be used with skbio.io.read for reading DNA sequences.
- # Convert sequences to uppercase here.
- line = line.upper()
- # Spaces and gap characters can appear in unaligned GISAID
- # sequence records, so we strip those. U characters are
- # additionally replaced with T characters.
- line = line.replace('-', '')
- line = line.replace('.', '')
- line = line.replace(' ', '')
- line = line.replace('U', 'T')
-
- observed_chars = set(line.strip())
- disallowed_chars = observed_chars - skbio.DNA.alphabet
- if disallowed_chars:
- # Spaces and gap characters can appear in unaligned GISAID
- # sequence records, so we strip those. U characters are
- # additionally replaced with T characters.
- line = line.replace('-', '')
- line = line.replace('.', '')
- line = line.replace(' ', '')
- line = line.replace('U', 'T')
-
- observed_chars = set(line.strip())
- disallowed_chars = observed_chars - skbio.DNA.alphabet
- if disallowed_chars:
- print('Note: Non-IUPAC DNA characters (%s) in '
- 'sequence record %s. This record will be '
- 'excluded from the output.' %
- (' '.join(disallowed_chars),
- lines[0][1:].split()[0]),
- file=sys.stderr)
- lines = None
- else:
- lines.append(line)
+ with open(path) as input_f, \
+ with tempfile.TemporaryFile(mode='w+') as output_f:
+ lines = None
+ for line in input_f:
+ if line.startswith('>'):
+ if lines is not None:
+ for l in lines:
+ output_f.write(l)
+ lines = [line]
+ elif lines is not None:
+ # Due to a bug in skbio 0.5.5, the lowercase option can't
+ # be used with skbio.io.read for reading DNA sequences.
+ # Convert sequences to uppercase here.
+ line = line.upper()
+ # Spaces and gap characters can appear in unaligned GISAID
+ # sequence records, so we strip those. U characters are
+ # additionally replaced with T characters.
+ line = line.replace('-', '')
+ line = line.replace('.', '')
+ line = line.replace(' ', '')
+ line = line.replace('U', 'T')
+
+ observed_chars = set(line.strip())
+ disallowed_chars = observed_chars - skbio.DNA.alphabet
+ if disallowed_chars:
+ print('Note: Non-IUPAC DNA characters (%s) in '
+ 'sequence record %s. This record will be '
+ 'excluded from the output.' %
+ (' '.join(disallowed_chars),
+ lines[0][1:].split()[0]),
+ file=sys.stderr)
+ lines = None
else:
- continue
+ lines.append(line)
+ else:
+ continue
- if lines is not None:
- yield from lines
+ if lines is not None:
+ for l in lines:
+ output_f.write(l)
- result = skbio.io.read(_cleanup_gen(), verify=False,
- format='fasta', constructor=skbio.DNA)
+ result = skbio.io.read(output_f, verify=False,
+ format='fasta', constructor=skbio.DNA)
return result The patch above is untested, but basically just gets rid of the closure and writes each "cleaned up" line to the tempfile. Might be a little easier to reason about in the code, but again, I might be missing something. |
@thermokarst, your suggestion still runs into the file I/O on a closed file (at least if I got it right when testing locally): def _read_gisaid_dna_fasta(path):
with open(path) as input_f, \
tempfile.TemporaryFile(mode='w+') as output_f:
lines = None
for line in input_f:
if line.startswith('>'):
if lines is not None:
output_f.writelines(lines)
lines = [line]
elif lines is not None:
# Due to a bug in skbio 0.5.5, the lowercase option can't
# be used with skbio.io.read for reading DNA sequences.
# Convert sequences to uppercase here.
line = line.upper()
# Spaces and gap characters can appear in unaligned GISAID
# sequence records, so we strip those. U characters are
# additionally replaced with T characters.
line = line.replace('-', '')
line = line.replace('.', '')
line = line.replace(' ', '')
line = line.replace('U', 'T')
observed_chars = set(line.strip())
disallowed_chars = observed_chars - skbio.DNA.alphabet
if disallowed_chars:
print('Note: Non-IUPAC DNA characters (%s) in '
'sequence record %s. This record will be '
'excluded from the output.' %
(' '.join(disallowed_chars),
lines[0][1:].split()[0]),
file=sys.stderr)
lines = None
else:
lines.append(line)
else:
continue
if lines is not None:
output_f.writelines(lines)
output_f.seek(0)
result = skbio.io.read(output_f, verify=False,
format='fasta', constructor=skbio.DNA)
return result |
That should work if the helper is inlined into the transformer. Then the |
Solved it in this latest commit in a different way (@ebolyen, I think this is what you were going for when we discussed in Basecamp yesterday). This seems better than my initial version, since the tempfile is being managed with a context manager.
@thermokarst, are the instructions for doing this the ones in RELEASING.md in this repo? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! This looks good to me!
I don't think so - I believe that document is currently focused on discussing cutting a new production release, not cutting a development version. I might need to think about this for a minute. |
Okay, I think I remember what we need to do, I'll do it right now, and update the RELEASE document. |
I misunderstood where the issue you were running into was occurring - I thought you were bumping into issues inside the helper, not inside the transformer. It all makes sense to me now, sorry for sending you off on a tangent. I dev bumped the repo, and am waiting on CI to finish. Once green I'll update you here. |
* master: ci: trigger gha VER: 2021.4.0.dev0 doc: updating dev cycle docs
Okay, dev bump was successful, and CI actions are passing here in this PR. Please note, I pushed up a few commits to your feature branch to get things green. |
Awesome, thanks for the help @thermokarst and @ebolyen! |
This does address the memory issue but is probably not the most elegant approach. Note that explicitly closing
fh
or using a context manager results in the file being closed too early, so I'm relying ontempfile
to ensure thatfh
is closed. Open to suggestions on how to better address this.