Program parameters
Satria A Kautsar edited this page Oct 25, 2020
Mode 1 (Clustering): Parse input datasets, then perform GCF model building (BIRCH clustering) and membership assignment according to the specified threshold (T).
- -i|--input_folder <folder_path>: Path to input folder containing 'datasets.tsv' file and dataset subfolders.
- --resume: Continue the last clustering run (do not specify --input_folder in combination with this parameter).
- --complete: When building GCF models, use only complete BGCs (antiSMASH > 4.2 BGCs annotated with 'on_contig_edge' = False).
- --threshold: Clustering threshold (T) used in GCF model building (BIRCH algorithm) and membership assignment. Mutually exclusive with --threshold_pct; use '-1' to turn this parameter off (default: 300).
- --threshold_pct: Calculate the clustering threshold (T) from a random sampling of pairwise distances between the data, taking the N-th percentile value as the threshold. Mutually exclusive with --threshold; use '-1' to turn this parameter off (default: -1).
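To illustrate how these parameters combine, here is a sketch of a Mode 1 invocation. The folder names are placeholders, and it is assumed that `bigslice` is installed and on your PATH:

```shell
# Mode 1: build GCF models from the datasets described in
# input_folder_example/datasets.tsv, using only complete BGCs
# and the default clustering threshold of 300.
# (input_folder_example/ and output_folder_example/ are placeholder paths)
bigslice -i input_folder_example/ \
    --complete \
    --threshold 300 \
    output_folder_example/
```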
Mode 2 (Query): Given existing GCF models from Mode 1, perform feature extraction and membership assignment for a set of BGC GenBank files in the input folder.
- --query <folder_path>: Path to the input folder containing all query GenBank files (each needs to be either a clusterXXX.gbk file from antiSMASH 4, a regionXXX.gbk file from antiSMASH 5, or a BGCXXXXXXX.gbk file from MIBiG >= 2.0).
- --query_name : Give a unique name to the query run so that it will be easier to trace within the output visualization.
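A query run against previously built models might look like the following sketch. The folder names and the query name are placeholders:

```shell
# Mode 2: query the GenBank files in query_bgcs_example/ against
# the GCF models of the latest run in output_folder_example/
# (both paths and the query name are placeholder values)
bigslice --query query_bgcs_example/ \
    --query_name my_query_run \
    output_folder_example/
```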
Parameters relevant to both the Clustering and Query modes.
- --run_id : Rather than taking the latest run, perform the query (or resume clustering) on a specific run id (you can check the list of run ids in the output visualization).
- --n_ranks <N_RANKS>: Take the N best GCF hits for each BGC's membership assignment (default: 5).
- --program_db_folder <PROGRAM_DB_FOLDER>: Path to the HMM libraries (default: /mnt/local_scratch/kauts001/general-env/bin/bigslice-models).
- -t|--num_threads : The number of parallel jobs to run (default: 56).
- --hmmscan_chunk_size : Split biosyn_pfam scanning into chunks of N BGCs (default: 100).
- --subpfam_chunk_size : Split sub_pfam scanning into chunks of N BGCs (default: 100).
- --extraction_chunk_size <EXTRACTION_CHUNK_SIZE>: Split features extraction into chunks of N BGCs (default: 100).
- --scratch: Don't load the SQLite3 database into memory (lower RAM usage, but potentially slower).
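These shared parameters can be combined with either mode. For example, resuming an earlier clustering run on a specific run id might look like this sketch (the path, run id, and thread count are placeholder values):

```shell
# Resume clustering on run id 5 instead of the latest run,
# limiting the work to 8 parallel jobs and keeping the SQLite3
# database on disk to reduce RAM usage
# (output_folder_example/, run id 5, and -t 8 are example values)
bigslice --resume \
    --run_id 5 \
    -t 8 \
    --scratch \
    output_folder_example/
```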