blobtoolkit pipeline error related to parsing of diamond_blastp output (hits.py, "end"=int(end) fails) #221

estolle · 2024-10-15T12:57:51Z

Hi

I am using blobtoolkit as a pipeline on a server and ran it successfully on 2 genomes but now I am getting errors related to the parsing of the blastp results.

My blobtoolkit install was created like this in a mamba environment (following https://blobtoolkit.genomehubs.org/install/):
pip3 install blobtoolkit[full]
mamba install -c tolkit blobtk

the file where the error pops up:
blobtoolkit/lib/python3.9/site-packages/blobtools/lib/hits.py

the log: data//blobtools/logs//run_blobtools_create.log

Reading all TSV files in ../window_stats
Loading parsed taxdump
Traceback (most recent call last):
  File "/home/ek/progz/conda_envs/blobtoolkit/bin/blobtools", line 8, in <module>
    sys.exit(cli())
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/blobtools.py", line 105, in cli
    sys.exit(subcommand())
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/add.py", line 203, in cli
    main(args)
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/add.py", line 149, in main
    parsed = field["module"].parse(
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/hits.py", line 541, in parse
    blast = parse_blast(
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/hits.py", line 59, in parse_blast
    "end": int(re.sub(r'\d', '', end)),
ValueError: invalid literal for int() with base 10: '|+'

Diamond_blastp output file (part) causing the error:
Only 2 lines are shown. The first line causes no problem; in the second and following lines the query has the form "Contig:start-end" followed by "|+", e.g. AxFerruginea009:9075748-9076881|+=1890943at2=single.
If I remove the "|+", other errors are thrown.

+       441921  1115    +       tr|A0A6J0BTN4|A0A6J0BTN4_NEOLC  92.3    622     44      1       1       618     10      631     0.0     1115
AxFerruginea009:9075748-9076881|+=1890943at2=single     7460    643     AxFerruginea009:9075748-9076881|+=1890943at2=single     tr|A0A7M7IES9|A0A7M7IES9_APIME     99.7    317     1       0       20      336     710     1026    1.49e-218       643

Other species/genomes have the same error due to similar-looking blastp outputs.

In another species, which eventually worked, our workaround was to change the parsing from "end": int(end), to "end": int(re.sub(r'\d', '', end)),

        if ":" in query and "=" in query:
            # parse blastp
            parts = query.split("=")
            if query in bitscores and score <= bitscores[query]:
                continue
            if len(parts) == 3 and parts[2] == "fragmented":
                continue
            bitscores[query] = score
            seq_id, start, end = re.split(r"[:-]", parts[0])
            hit = {
                "subject": row[cols["sseqid"]],
                "score": score,
                "start": int(start),
                "end": int(re.sub(r'\d', '', end)),
                "file": index,
                "title": parts[1],
            }
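
A note on the workaround above: `re.sub(r'\d', '', end)` removes the digits and keeps everything else, which matches the `'|+'` reported in the ValueError. Stripping non-digits (`\D`) or matching the coordinates explicitly is probably closer to what was intended. Below is a minimal sketch, assuming the query always looks like `seq_id:start-end`, optionally followed by a suffix such as `|+`. `parse_busco_query` and `QUERY_RE` are hypothetical names for illustration; they are not part of blobtools and not necessarily how the released fix works.

```python
import re

# Hypothetical pattern: seq_id, ":start-end", then any trailing suffix (e.g. "|+").
QUERY_RE = re.compile(r"^(?P<seq_id>.+):(?P<start>\d+)-(?P<end>\d+)(?P<suffix>.*)$")


def parse_busco_query(query):
    """Split 'seq_id:start-end[suffix]=busco_id=status' into its parts."""
    parts = query.split("=")
    match = QUERY_RE.match(parts[0])
    if match is None:
        raise ValueError(f"unexpected query format: {query!r}")
    return {
        "seq_id": match.group("seq_id"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),  # trailing "|+" is not part of the group
        "title": parts[1] if len(parts) > 1 else None,
    }


# \d deletes the digits (leaving the suffix); \D deletes everything else.
digits_removed = re.sub(r"\d", "", "9076881|+")
non_digits_removed = re.sub(r"\D", "", "9076881|+")

hit = parse_busco_query("AxFerruginea009:9075748-9076881|+=1890943at2=single")
print(digits_removed, non_digits_removed, hit)
```

Matching the whole `seq_id:start-end` pattern, rather than splitting on `[:-]`, also avoids unpacking errors if a contig name happens to contain a hyphen.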

I am stuck with the above error now. Any ideas how to fix this?

rjchallis · 2024-12-11T09:40:29Z

Sorry, I'd let this one slip by for a while. Based on the comments in #223, it looks like this is being introduced by changes in the newer version of BUSCO. I've pushed a potential fix; I just need to test it in a container build before making a new pip release.

rjchallis · 2025-01-22T09:01:12Z

The fix is now in the 4.4.2 release, so hopefully this will work if you grab the latest version.
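
To check whether an environment already has a release with the fix, something like the following sketch may help. It assumes the installed distribution is named `blobtoolkit` (as in the `pip3 install blobtoolkit[full]` command above) and that versions are plain dotted numbers; `version_tuple` and `has_fix` are hypothetical helpers, not part of blobtoolkit.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '4.4.2' into (4, 4, 2); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def has_fix(installed, minimum="4.4.2"):
    """Return True if the installed version is at least the release with the fix."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    # Assumption: the pip distribution name is "blobtoolkit".
    installed = version("blobtoolkit")
    print(installed, "includes fix:", has_fix(installed))
except PackageNotFoundError:
    print("blobtoolkit is not installed in this environment")
```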

estolle changed the title ~~blobtoolkit pipeline error related to parsing of diamond_blatp output (hits.py, "end"=int(end) fails)~~ blobtoolkit pipeline error related to parsing of diamond_blastp output (hits.py, "end"=int(end) fails) Oct 16, 2024

rjchallis added a commit that referenced this issue Dec 11, 2024

remove trailing characters from coordinates when parsing blastp results

00da903
