address memory issue with large GISAID files #94

gregcaporaso · 2021-03-10T17:01:24Z

This does address the memory issue but is probably not the most elegant approach. Note that explicitly closing fh or using a context manager results in the file being closed too early, so I'm relying on tempfile to ensure that fh is closed. Open to suggestions on how to better address this.

thermokarst · 2021-03-10T17:20:31Z

> Note that explicitly closing fh or using a context manager results in the file being closed too early

Wow, even if you call fh.close() after the skbio.io.read call?

thermokarst · 2021-03-10T17:21:18Z

PS - this plugin will need a dev-bump in order to get CI passing.

gregcaporaso · 2021-03-10T17:56:57Z

> Wow, even if you call fh.close() after the skbio.io.read call?

Yep, just confirmed again locally. This is one of the four test failures that I get in that case (all result in the same ValueError):

__________ GISAIDDNAFASTAFormatTransformerTests.test_transformer_sequence_exclusion_last_record ___________

self = <genome_sampler.tests.test_transformers.GISAIDDNAFASTAFormatTransformerTests testMethod=test_transformer_sequence_exclusion_last_record>

    def test_transformer_sequence_exclusion_last_record(self):
        input, obs = self.transform_format(GISAIDDNAFASTAFormat,
                                           DNASequencesDirectoryFormat,
>                                          filename='gisaid4.fasta')

genome_sampler/tests/test_transformers.py:79:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/qiime2/plugin/testing.py:244: in transform_format
    obs = transformer(input)
genome_sampler/plugin_setup.py:174: in _4
    skbio.io.write(data, format='fasta', into=file)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1166: in write
    return io_registry.write(obj, format, into, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:619: in write
    writer(obj, into, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1082: in wrapped_writer
    writer_function(obj, fhs[-1], **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:774: in _generator_to_fasta
    for header, seq_str, qual_scores in formatted_records:
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/_base.py:147: in _format_fasta_like_records
    for idx, seq in enumerate(generator):
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:506: in <genexpr>
    return (x for x in itertools.chain([next(gen)], gen))
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:531: in _read_gen
    yield from reader(file, **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/registry.py:1008: in wrapped_reader
    yield from reader_function(fhs[-1], **kwargs)
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:675: in _fasta_to_generator
    FASTAFormatError):
../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/fasta.py:853: in _parse_fasta_raw
    for line in _line_generator(fh, skip_blanks=False):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fh = <_io.TextIOWrapper name=13 mode='w+' encoding='UTF-8'>, skip_blanks = False, strip = True

    def _line_generator(fh, skip_blanks=False, strip=True):
>       for line in fh:
E       ValueError: I/O operation on closed file.

../../miniconda3/envs/genome-sampler-dev/lib/python3.6/site-packages/skbio/io/format/_base.py:193: ValueError
------------------------------------------ Captured stderr call -------------------------------------------

ebolyen · 2021-03-10T18:07:32Z

I think that makes sense: io.read returns a generator (which then gets written), so it doesn't read from the file right away. If close is called first, the generator has nothing to read from when io.write finally consumes it.

This is all pretty confusing, but I like the temp-file approach. The extra disk-IO is a bummer, but it beats the memory requirements for the original.

The only other alternative would be to create the skbio.DNA objects in the generator (by re-implementing io.read for fasta) which doesn't sound ideal.
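
For illustration, the failure mode described above can be reproduced without skbio. This is a minimal sketch: `read_records` and `consume_after_close` are hypothetical names, and the generator only stands in for what `skbio.io.read` returns.

```python
import tempfile


def read_records(fh):
    """Lazily yield stripped lines; nothing is read until iteration starts."""
    for line in fh:
        yield line.strip()


def consume_after_close():
    """Create the generator while the file is open, consume it after close."""
    with tempfile.TemporaryFile(mode='w+') as fh:
        fh.write('>seq1\nACGT\n')
        fh.seek(0)
        records = read_records(fh)  # no I/O has happened yet
    # fh is closed here, so the first read raises ValueError
    try:
        return list(records)
    except ValueError as err:
        return 'ValueError: %s' % err


print(consume_after_close())
```

The generator object is created inside the `with` block, but its body does not run until `list()` iterates it, by which point the handle is already closed.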

thermokarst · 2021-03-10T18:08:48Z

Oh nice, yeah I didn't realize that was consuming a generator - makes sense!

ebolyen · 2021-03-10T18:08:56Z

Another approach would be to return the fh from the helper and then close it after io.write in the transformer, then you don't have to "rely" on the garbage collection.
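
A rough sketch of that idea, assuming a hypothetical helper `open_cleaned` that hands back both the lazy records and the open handle. The `transform` function and the `io.StringIO` sink are stand-ins for the real transformer and `skbio.io.write`, not the plugin's actual code.

```python
import io
import tempfile


def open_cleaned(text):
    """Hypothetical helper: return (lazy record generator, open file handle)."""
    fh = tempfile.TemporaryFile(mode='w+')
    fh.write(text)
    fh.seek(0)
    records = (line.rstrip('\n') for line in fh)
    return records, fh


def transform(text):
    """Stand-in for the transformer: consume the records, then close."""
    records, fh = open_cleaned(text)
    out = io.StringIO()
    try:
        for record in records:  # stand-in for skbio.io.write
            out.write(record + '\n')
    finally:
        fh.close()  # explicit close, no reliance on garbage collection
    return out.getvalue(), fh.closed


print(transform('>seq1\nACGT\n'))
```

The caller owns the handle here, so the close happens at a known point after the generator has been fully consumed.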

thermokarst · 2021-03-10T19:55:29Z

So this might be a silly question, and I'm not super familiar with this part of the code in this project, but why not just refactor _cleanup_gen to directly write to a new tempfile, line by line, and then pass that new tempfile into the final skbio read?

diff --git a/genome_sampler/plugin_setup.py b/genome_sampler/plugin_setup.py
index b7f4ec4..9ecb9fe 100644
--- a/genome_sampler/plugin_setup.py
+++ b/genome_sampler/plugin_setup.py
@@ -112,47 +112,49 @@ def _3(fmt: IDSelectionDirFmt) -> IDSelection:


 def _read_gisaid_dna_fasta(path):
-    def _cleanup_gen():
-        with open(path) as input_f:
-            lines = None
-            for line in input_f:
-                if line.startswith('>'):
-                    if lines is not None:
-                        yield from lines
-                    lines = [line]
-                elif lines is not None:
-                    # Due to a bug in skbio 0.5.5, the lowercase option can't
-                    # be used with skbio.io.read for reading DNA sequences.
-                    # Convert sequences to uppercase here.
-                    line = line.upper()
-                    # Spaces and gap characters can appear in unaligned GISAID
-                    # sequence records, so we strip those. U characters are
-                    # additionally replaced with T characters.
-                    line = line.replace('-', '')
-                    line = line.replace('.', '')
-                    line = line.replace(' ', '')
-                    line = line.replace('U', 'T')
-
-                    observed_chars = set(line.strip())
-                    disallowed_chars = observed_chars - skbio.DNA.alphabet
-                    if disallowed_chars:
-                        print('Note: Non-IUPAC DNA characters (%s) in '
-                              'sequence record %s. This record will be '
-                              'excluded from the output.' %
-                              (' '.join(disallowed_chars),
-                               lines[0][1:].split()[0]),
-                              file=sys.stderr)
-                        lines = None
-                    else:
-                        lines.append(line)
+    with open(path) as input_f, \
+            tempfile.TemporaryFile(mode='w+') as output_f:
+        lines = None
+        for line in input_f:
+            if line.startswith('>'):
+                if lines is not None:
+                    for l in lines:
+                        output_f.write(l)
+                lines = [line]
+            elif lines is not None:
+                # Due to a bug in skbio 0.5.5, the lowercase option can't
+                # be used with skbio.io.read for reading DNA sequences.
+                # Convert sequences to uppercase here.
+                line = line.upper()
+                # Spaces and gap characters can appear in unaligned GISAID
+                # sequence records, so we strip those. U characters are
+                # additionally replaced with T characters.
+                line = line.replace('-', '')
+                line = line.replace('.', '')
+                line = line.replace(' ', '')
+                line = line.replace('U', 'T')
+
+                observed_chars = set(line.strip())
+                disallowed_chars = observed_chars - skbio.DNA.alphabet
+                if disallowed_chars:
+                    print('Note: Non-IUPAC DNA characters (%s) in '
+                          'sequence record %s. This record will be '
+                          'excluded from the output.' %
+                          (' '.join(disallowed_chars),
+                           lines[0][1:].split()[0]),
+                          file=sys.stderr)
+                    lines = None
                 else:
-                    continue
+                    lines.append(line)
+            else:
+                continue

-            if lines is not None:
-                yield from lines
+        if lines is not None:
+            for l in lines:
+                output_f.write(l)

-    result = skbio.io.read(_cleanup_gen(), verify=False,
-                           format='fasta', constructor=skbio.DNA)
+        result = skbio.io.read(output_f, verify=False,
+                               format='fasta', constructor=skbio.DNA)
     return result

The patch above is untested, but basically just gets rid of the closure and writes each "cleaned up" line to the tempfile. Might be a little easier to reason about in the code, but again, I might be missing something.
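
As a side note, the per-line "clean up" step in the patch can be looked at in isolation. The sketch below is only an approximation: `clean_sequence_line` is a hypothetical name, and `IUPAC_DNA` is an assumed plain character set standing in for `skbio.DNA.alphabet`, which this example does not import.

```python
# Assumption: a plain IUPAC DNA character set, used here in place of
# skbio.DNA.alphabet so the example runs without skbio installed.
IUPAC_DNA = set('ACGTRYSWKMBDHVN')


def clean_sequence_line(line):
    """Uppercase, drop gaps and spaces, map U to T.

    Returns (cleaned_line, disallowed_characters).
    """
    line = line.upper()
    for char in '-. ':
        line = line.replace(char, '')
    line = line.replace('U', 'T')
    disallowed = set(line.strip()) - IUPAC_DNA
    return line, disallowed


print(clean_sequence_line('ac-gu .n\n'))
```

A non-empty second element corresponds to the case where the patch prints the note to stderr and drops the whole record.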

gregcaporaso · 2021-03-11T20:33:33Z

@thermokarst, your suggestion still runs into the file I/O on a closed file (at least if I got it right when testing locally):

def _read_gisaid_dna_fasta(path):
    with open(path) as input_f, \
         tempfile.TemporaryFile(mode='w+') as output_f:
        lines = None
        for line in input_f:
            if line.startswith('>'):
                if lines is not None:
                    output_f.writelines(lines)
                lines = [line]
            elif lines is not None:
                # Due to a bug in skbio 0.5.5, the lowercase option can't
                # be used with skbio.io.read for reading DNA sequences.
                # Convert sequences to uppercase here.
                line = line.upper()
                # Spaces and gap characters can appear in unaligned GISAID
                # sequence records, so we strip those. U characters are
                # additionally replaced with T characters.
                line = line.replace('-', '')
                line = line.replace('.', '')
                line = line.replace(' ', '')
                line = line.replace('U', 'T')

                observed_chars = set(line.strip())
                disallowed_chars = observed_chars - skbio.DNA.alphabet
                if disallowed_chars:
                    print('Note: Non-IUPAC DNA characters (%s) in '
                          'sequence record %s. This record will be '
                          'excluded from the output.' %
                          (' '.join(disallowed_chars),
                           lines[0][1:].split()[0]),
                          file=sys.stderr)
                    lines = None
                else:
                    lines.append(line)
            else:
                continue

        if lines is not None:
            output_f.writelines(lines)

        output_f.seek(0)
        result = skbio.io.read(output_f, verify=False,
                               format='fasta', constructor=skbio.DNA)
    return result
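
The `output_f.seek(0)` call in this version is needed regardless of the closed-file problem: after writing, a `'w+'` handle is positioned at end-of-file, so a read without rewinding returns nothing. A small sketch with a hypothetical `read_back` function:

```python
import tempfile


def read_back(rewind):
    """Write two lines to a 'w+' temp file and read them back."""
    with tempfile.TemporaryFile(mode='w+') as fh:
        fh.writelines(['>seq1\n', 'ACGT\n'])
        if rewind:
            fh.seek(0)  # move from end-of-file back to the start
        return fh.readlines()


print(read_back(rewind=True))
print(read_back(rewind=False))
```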

ebolyen · 2021-03-11T20:37:16Z

That should work if the helper is inlined into the transformer. Then the with statement can have io.write inside of it.
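
Roughly, that would mean the lazy records are consumed before the `with` block exits. A minimal sketch, where `transform_inline` is a hypothetical name and `io.StringIO` stands in for the output file that `skbio.io.write` would target:

```python
import io
import tempfile


def transform_inline(lines):
    """Stand-in for the transformer with the helper logic inlined."""
    out = io.StringIO()
    with tempfile.TemporaryFile(mode='w+') as tmp:
        tmp.writelines(lines)
        tmp.seek(0)
        records = (line for line in tmp)  # lazy, like skbio.io.read
        # Stand-in for skbio.io.write: the generator is drained here,
        # while tmp is still open.
        out.writelines(records)
    return out.getvalue()


print(transform_inline(['>seq1\n', 'ACGT\n']))
```

Because the write happens inside the `with` statement, the temp file is closed only after every record has been read from it.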

gregcaporaso · 2021-03-11T20:54:21Z

Solved it in this latest commit in a different way (@ebolyen, I think this is what you were going for when we discussed in Basecamp yesterday). This seems better than my initial version, since the tempfile is being managed with a context manager.

> PS - this plugin will need a dev-bump in order to get CI passing.

@thermokarst, are the instructions for doing this the ones in RELEASING.md in this repo?

ebolyen

Yep! This looks good to me!

thermokarst · 2021-03-11T20:58:59Z

> @thermokarst, are the instructions for doing this the ones in RELEASING.md in this repo?

I don't think so - I believe that document is currently focused on discussing cutting a new production release, not cutting a development version. I might need to think about this for a minute.

thermokarst · 2021-03-11T21:00:39Z

Okay, I think I remember what we need to do, I'll do it right now, and update the RELEASE document.

thermokarst · 2021-03-11T21:10:44Z

> @thermokarst, your suggestion still runs into the file I/O on a closed file (at least if I got it right when testing locally):

I misunderstood where the issue you were running into was occurring - I thought you were bumping into issues inside the helper, not inside the transformer. It all makes sense to me now, sorry for sending you off on a tangent.

I dev bumped the repo, and am waiting on CI to finish. Once green I'll update you here.

thermokarst · 2021-03-11T22:10:32Z

Okay, dev bump was successful, and CI actions are passing here in this PR. Please note, I pushed up a few commits to your feature branch to get things green.

gregcaporaso · 2021-03-11T22:12:16Z

Awesome, thanks for the help @thermokarst and @ebolyen!

address memory issue with large GISAID files

c69de2c

gregcaporaso assigned ebolyen Mar 10, 2021

squash

114b160

squash

a2fd21c

ebolyen approved these changes Mar 11, 2021

thermokarst added 2 commits March 11, 2021 14:41

Merge branch 'master' into import-memory

1cbb47e

* master:
  ci: trigger gha
  VER: 2021.4.0.dev0
  doc: updating dev cycle docs

squash: change out gha

496c0ef

ebolyen marked this pull request as ready for review March 11, 2021 23:00

ebolyen merged commit 92a7acb into caporaso-lab:master Mar 11, 2021

gregcaporaso deleted the import-memory branch March 15, 2021 19:16
