blobtoolkit pipeline error related to parsing of diamond_blastp output (hits.py, "end"=int(end) fails) #221

Open
estolle opened this issue Oct 15, 2024 · 2 comments


estolle commented Oct 15, 2024

Hi

I am using blobtoolkit as a pipeline on a server and ran it successfully on 2 genomes but now I am getting errors related to the parsing of the blastp results.

My blobtoolkit install was created like this in a mamba environment (following https://blobtoolkit.genomehubs.org/install/):
pip3 install blobtoolkit[full]
mamba install -c tolkit blobtk

the file where the error pops up:
blobtoolkit/lib/python3.9/site-packages/blobtools/lib/hits.py

the log: data//blobtools/logs//run_blobtools_create.log

Reading all TSV files in ../window_stats
Loading parsed taxdump
Traceback (most recent call last):
  File "/home/ek/progz/conda_envs/blobtoolkit/bin/blobtools", line 8, in <module>
    sys.exit(cli())
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/blobtools.py", line 105, in cli
    sys.exit(subcommand())
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/add.py", line 203, in cli
    main(args)
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/add.py", line 149, in main
    parsed = field["module"].parse(
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/hits.py", line 541, in parse
    blast = parse_blast(
  File "/home/ek/progz/conda_envs/blobtoolkit/lib/python3.9/site-packages/blobtools/lib/hits.py", line 59, in parse_blast
    "end": int(re.sub(r'\d', '', end)),
ValueError: invalid literal for int() with base 10: '|+'
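
The `'|+'` in the message can be reproduced in isolation. This is only a minimal sketch: it assumes `end` holds `9076881|+`, i.e. the text after the last `-` in the example query ID shown further down, and the variable names are made up for illustration. In a regular expression, `\d` matches digits, so substituting them away keeps exactly the non-digit suffix; `\D` (non-digits) does the opposite.

```python
import re

# Assumed value of `end` for the query ID AxFerruginea009:9075748-9076881|+
end = "9076881|+"

# r"\d" matches digits: removing them leaves only the "|+" suffix,
# which is the string int() rejects in the traceback above
digits_removed = re.sub(r"\d", "", end)
print(repr(digits_removed))

# r"\D" matches non-digits: removing them leaves only the coordinate
coordinate = int(re.sub(r"\D", "", end))
print(coordinate)
```

This would also explain why deleting `|+` from the file by hand leads to a different error: with `\d`, a purely numeric `end` becomes an empty string, which `int()` cannot convert either.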

Diamond_blastp output file (part) causing the error:
Only 2 lines are shown. The first line causes no problem. The second and all following lines have a query ID that starts with "Contig:start-end" followed by "|+", e.g. AxFerruginea009:9075748-9076881|+=1890943at2=single.
If I remove the "|+", other errors are thrown.

+       441921  1115    +       tr|A0A6J0BTN4|A0A6J0BTN4_NEOLC  92.3    622     44      1       1       618     10      631     0.0     1115
AxFerruginea009:9075748-9076881|+=1890943at2=single     7460    643     AxFerruginea009:9075748-9076881|+=1890943at2=single     tr|A0A7M7IES9|A0A7M7IES9_APIME     99.7    317     1       0       20      336     710     1026    1.49e-218       643

Other species/genomes have the same error due to similar-looking blastp outputs.

In another species, which eventually worked, our workaround was to change the parsing line "end": int(end), to "end": int(re.sub(r'\d', '', end)),

        if ":" in query and "=" in query:
            # parse blastp
            parts = query.split("=")
            if query in bitscores and score <= bitscores[query]:
                continue
            if len(parts) == 3 and parts[2] == "fragmented":
                continue
            bitscores[query] = score
            seq_id, start, end = re.split(r"[:-]", parts[0])
            hit = {
                "subject": row[cols["sseqid"]],
                "score": score,
                "start": int(start),
                "end": int(re.sub(r'\d', '', end)),
                "file": index,
                "title": parts[1],
            }
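
A more defensive way to pull the coordinates out of such a query ID might look like the sketch below. It is not the code from blobtools: `QUERY_RE` and `parse_query_id` are hypothetical names, and the pattern assumes the ID has the form `seq_id:start-end`, optionally followed by `|+` or `|-`, then `=title=status` as in the example line above. Note that with `re.split(r"[:-]", ...)`, a `|-` suffix (if it occurs) would yield four pieces and break the three-way unpacking, so the strand is matched explicitly here.

```python
import re

# Hypothetical pattern: "<seq_id>:<start>-<end>" with an optional "|+" or "|-" suffix
QUERY_RE = re.compile(
    r"^(?P<seq_id>.+):(?P<start>\d+)-(?P<end>\d+)(?:\|(?P<strand>[+-]))?$"
)


def parse_query_id(query):
    """Split a blastp query ID into its parts; return None if it does not match."""
    # Everything before the first "=" is the location, the rest is title/status
    location, _, rest = query.partition("=")
    match = QUERY_RE.match(location)
    if match is None:
        return None
    return {
        "seq_id": match.group("seq_id"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "strand": match.group("strand"),  # None when there is no suffix
        "title": rest.split("=")[0],
    }


print(parse_query_id("AxFerruginea009:9075748-9076881|+=1890943at2=single"))
```

Returning `None` for IDs that do not match leaves it to the caller to decide whether to skip the row or raise, rather than failing inside `int()`.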

I am stuck with the above error now. Any ideas how to fix this?

estolle changed the title on Oct 16, 2024, correcting "diamond_blatp" to "diamond_blastp".
@rjchallis
Contributor

Sorry, I'd let this one slip by for a while. Based on the comments in #223, it looks like this is being introduced by changes in the newer version of BUSCO. I've pushed a potential fix and just need to test it in a container build before making a new pip release.

@rjchallis
Contributor

The fix is now in the 4.4.2 release, so hopefully this will work if you grab the latest version.
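
To see whether an environment already has that release, something like the following could be run inside it. This is a rough sketch: it assumes the pip distribution is named `blobtoolkit` (as in the install command earlier in the thread) and that the version string is plain dotted numbers; `version_tuple` and `has_fix` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '4.4.2' into (4, 4, 2); assumes a plain dotted numeric version."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, minimum="4.4.2"):
    """True if the installed version is at least the release named above."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    # Assumed distribution name, taken from `pip3 install blobtoolkit[full]`
    installed = version("blobtoolkit")
    print(installed, "includes the fix:", has_fix(installed))
except PackageNotFoundError:
    print("blobtoolkit is not installed in this environment")
except ValueError:
    print("could not parse the installed version string")
```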
