Program parameters

Satria A Kautsar edited this page Oct 25, 2020 · 6 revisions

Run modes

[Mode 1] Clustering analysis:

Parse the input datasets, then build GCF models (BIRCH clustering) and assign memberships according to the specified threshold (T).

  • -i|--input_folder <folder_path>: Path to the input folder containing a 'datasets.tsv' file and the dataset subfolders.
  • --resume: Continue the last clustering run (do not specify --input_folder in combination with this parameter).
  • --complete: When building GCF models, use only complete BGCs (antiSMASH > 4.2 BGCs annotated with 'on_contig_edge' = False).
  • --threshold <T>: Clustering threshold (T) used in GCF model building (BIRCH algorithm) and membership assignment. Mutually exclusive with --threshold_pct; use '-1' to turn this parameter off (default: 300).
  • --threshold_pct <N>: Calculate the clustering threshold (T) from a random sampling of pairwise distances between the data, taking the N-th percentile value as the threshold. Mutually exclusive with --threshold; use '-1' to turn this parameter off (default: -1).
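
A minimal clustering invocation might look like the following (a sketch only: the `bigslice` executable name and the trailing output folder argument are assumptions not listed above, and all paths are placeholders):

```shell
# Build GCF models from the datasets listed in ./input_folder/datasets.tsv,
# using only complete BGCs and a clustering threshold of T=300
bigslice -i ./input_folder --complete --threshold 300 ./output_folder
```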

[Mode 2] GCF queries:

Given existing GCF models from [Mode 1], perform feature extraction and membership assignment for a set of BGC GenBank files in the input folder.

  • --query <folder_path>: Path to the input folder containing the query BGC GenBank files (each needs to be either a clusterXXX.gbk file from antiSMASH4, a regionXXX.gbk file from antiSMASH5, or a BGCXXXXXXX.gbk file from MIBiG >= 2.0).
  • --query_name <name>: Give the query run a unique name so that it is easier to trace within the output visualization.
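
For example, querying a folder of GenBank files against the existing models could look like this (a sketch; the `bigslice` executable name, the output folder argument, and all paths are assumptions):

```shell
# Assign BGCs in ./my_bgcs to the GCF models of the latest clustering run,
# labeling the run for easier lookup in the output visualization
bigslice --query ./my_bgcs --query_name my_query_run ./output_folder
```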

[Mode 1+2]:

Parameters relevant for both the Clustering and Query mode.

  • --run_id <run_id>: Rather than taking the latest run, perform the query (or resume clustering) on a specific run id (the list of run ids can be checked in the output visualization).
  • --n_ranks <N_RANKS>: Takes N-best GCF hits for each BGC's membership assignment procedure (default: 5).
  • --program_db_folder <PROGRAM_DB_FOLDER>: Path to the HMM libraries (default: /mnt/local_scratch/kauts001/general-env/bin/bigslice-models).
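
Combining both groups, a query against an older run might look like the following (illustrative only: the run id, the `bigslice` executable name, the output folder argument, and the paths are all hypothetical):

```shell
# Query ./my_bgcs against the GCF models of run id 2 (not the latest run),
# reporting the 10 best GCF hits per BGC
bigslice --query ./my_bgcs --run_id 2 --n_ranks 10 ./output_folder
```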

CPU/RAM usage optimization

  • -t|--num_threads <N>: The number of parallel jobs to run (default: 56).
  • --hmmscan_chunk_size <N>: Split biosyn_pfam scanning into chunks of N BGCs (default: 100).
  • --subpfam_chunk_size <N>: Split sub_pfam scanning into chunks of N BGCs (default: 100).
  • --extraction_chunk_size <N>: Split feature extraction into chunks of N BGCs (default: 100).
  • --scratch: Don't load the SQLite3 database into memory (lower RAM usage, but potentially slower).
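
A resource-constrained run might combine these flags like so (a sketch; the `bigslice` executable name, the trailing output folder argument, and the paths are assumptions):

```shell
# Limit to 8 parallel jobs, use smaller hmmscan chunks, and keep the
# SQLite3 database on disk instead of in memory to reduce RAM usage
bigslice -i ./input_folder -t 8 --hmmscan_chunk_size 50 --scratch ./output_folder
```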