Program parameters

Satria A Kautsar edited this page Oct 25, 2020 · 6 revisions

Run modes

[Mode 1] Clustering analysis:

Parse the input datasets, then build GCF models (BIRCH clustering) and assign memberships according to the specified threshold (T).

  • -i|--input_folder <folder_path>: Path to the input folder containing a 'datasets.tsv' file and the dataset subfolders.
  • --resume: Continue the last clustering run (do not specify --input_folder in combination with this parameter).
  • --complete: When building GCF models, use only complete BGCs (antiSMASH > 4.2 BGCs annotated with 'on_contig_edge' = False).
  • --threshold <T>: Clustering threshold (T) used in GCF model building (BIRCH algorithm) and membership assignment. Mutually exclusive with --threshold_pct; use '-1' to turn this parameter off (default: 300).
  • --threshold_pct <N>: Calculate the clustering threshold (T) from a random sampling of pairwise distances between the data, taking the N-th percentile value as the threshold. Mutually exclusive with --threshold; use '-1' to turn this parameter off (default: -1).
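
A minimal clustering invocation might look like the following (a sketch only: the `bigslice` executable name and the trailing output folder argument are assumptions not listed above, and all paths are placeholders):

```shell
# Build GCF models from the datasets listed in ./input_folder/datasets.tsv,
# using only complete BGCs and a clustering threshold of T=300
bigslice -i ./input_folder --complete --threshold 300 ./output_folder
```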

[Mode 2] GCF queries:

Given existing GCF models from [Mode 1], perform feature extraction and membership assignment for a set of BGC GenBank files in the input folder.

  • --query <folder_path>: Path to the input folder containing the query BGC GenBank files (each needs to be either a clusterXXX.gbk file from antiSMASH4, a regionXXX.gbk file from antiSMASH5, or a BGCXXXXXXX.gbk file from MIBiG >= 2.0).
  • --query_name <name>: Give the query run a unique name so that it is easier to trace within the output visualization.
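
For example, querying a folder of GenBank files against the existing models could look like this (a sketch; the `bigslice` executable name, the output folder argument, and all paths are assumptions):

```shell
# Assign BGCs in ./my_bgcs to the GCF models of the latest clustering run,
# labeling the run for easier lookup in the output visualization
bigslice --query ./my_bgcs --query_name my_query_run ./output_folder
```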

[Mode 1+2]:

Parameters relevant for both the Clustering and Query mode.

  • --run_id <run_id>: Rather than taking the latest run, perform the query (or resume clustering) on a specific run id (the list of run ids can be checked in the output visualization).
  • --n_ranks <N_RANKS>: Takes N-best GCF hits for each BGC's membership assignment procedure (default: 5).
  • --program_db_folder <PROGRAM_DB_FOLDER>: Path to the HMM libraries (default: /mnt/local_scratch/kauts001/general-env/bin/bigslice-models).
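
Combining both groups, a query against an older run might look like the following (illustrative only: the run id, the `bigslice` executable name, the output folder argument, and the paths are all hypothetical):

```shell
# Query ./my_bgcs against the GCF models of run id 2 (not the latest run),
# reporting the 10 best GCF hits per BGC
bigslice --query ./my_bgcs --run_id 2 --n_ranks 10 ./output_folder
```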

CPU/RAM usage optimization

  • -t|--num_threads <N>: The number of parallel jobs to run (default: 56).
  • --hmmscan_chunk_size <N>: Split biosyn_pfam scanning into chunks of N BGCs (default: 100).
  • --subpfam_chunk_size <N>: Split sub_pfam scanning into chunks of N BGCs (default: 100).
  • --extraction_chunk_size <N>: Split feature extraction into chunks of N BGCs (default: 100).
  • --scratch: Don't load the SQLite3 database into memory (lower RAM usage, but potentially slower).
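
A resource-constrained run might combine these flags like so (a sketch; the `bigslice` executable name, the trailing output folder argument, and the paths are assumptions):

```shell
# Limit to 8 parallel jobs, use smaller hmmscan chunks, and keep the
# SQLite3 database on disk instead of in memory to reduce RAM usage
bigslice -i ./input_folder -t 8 --hmmscan_chunk_size 50 --scratch ./output_folder
```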