Improve runtime #44
Conversation
sambenfredj commented on Mar 2, 2022
- Expose the max workers parameter to the CLI
- Update the triqler package with the fix for issue #13 (calculating PEPs may be extremely slow for some runs)
- Make the grid search run with multi-threading
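The multi-threaded grid search described above can be sketched roughly as follows. This is a hedged illustration, not mokapot's actual implementation: `grid_search`, `evaluate`, and the candidate list are hypothetical names, and only `max_workers` mirrors the new CLI option.

```python
# Hypothetical sketch of a grid search parallelized with a thread pool.
# None of these names come from mokapot; max_workers mirrors the new CLI flag.
from concurrent.futures import ThreadPoolExecutor


def grid_search(candidates, evaluate, max_workers=1):
    """Score every candidate, optionally in parallel, and return the best one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with candidates.
        scores = list(pool.map(evaluate, candidates))
    best_score, best_candidate = max(zip(scores, candidates))
    return best_candidate
```

For example, `grid_search([0.1, 1.0, 10.0], lambda c: -abs(c - 1.0), max_workers=2)` picks `1.0`, the candidate with the highest score.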
Hi @sambenfredj - thanks for the great PR! 🎉
It looks like the code currently fails our lint tests. One way to fix this is to use Black as a pre-commit hook. To do this, first install pre-commit:
# With pip:
pip install pre-commit
# With conda:
conda install -c conda-forge pre-commit
Once you have pre-commit installed, you can set up the hook with:
pre-commit install
Now Black will run automatically on each new commit.
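For reference, a minimal `.pre-commit-config.yaml` wiring up Black might look like this. The `rev` shown here is an example, not necessarily the version pinned in mokapot's repository:

```yaml
# Sketch of a .pre-commit-config.yaml with Black as the only hook.
# The rev below is illustrative; pin whatever version the project uses.
repos:
  - repo: https://github.com/psf/black
    rev: 22.1.0
    hooks:
      - id: black
```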
Lastly, we should add a test that incorporates the new max_workers CLI argument. I think it may be too difficult to test whether the argument is truly working, so my suggestion is merely to add it to our basic CLI system test here:
mokapot/tests/system_tests/test_cli.py
Lines 45 to 72 in bed4946
def test_cli_options(tmp_path, scope_files):
    """Test non-defaults"""
    cmd = [
        "mokapot",
        scope_files[0],
        scope_files[1],
        "--dest_dir",
        tmp_path,
        "--file_root",
        "blah",
        "--train_fdr",
        "0.2",
        "--test_fdr",
        "0.1",
        "--seed",
        "100",
        "--direction",
        "RefactoredXCorr",
        "--folds",
        "2",
        "-v",
        "1",
        "--max_iter",
        "1",
        "--keep_decoys",
        "--subset_max_train",
        "50000",
    ]
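The suggested change boils down to appending the new flag to that cmd list. A sketch, with an abbreviated stand-in for the full list above (the worker count of 3 is arbitrary; any small integer exercises the argument):

```python
# Sketch: extend the cmd list from test_cli_options with the new flag.
# "input.pin" stands in for the real scope_files fixtures; 3 is arbitrary.
cmd = ["mokapot", "input.pin"]
cmd += ["--max_workers", "3"]
```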
Please don't hesitate to reach out if you have any questions!
mokapot/config.py
Outdated
"--max_workers",
default=1,
type=int,
help="The number of processes to use for model training."
This line needs a trailing comma to pass our lint tests.
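With the trailing comma added, the argument definition would look like this. A self-contained sketch: the `ArgumentParser` setup here is assumed, since the real parser lives elsewhere in mokapot/config.py.

```python
import argparse

# Minimal stand-in for mokapot's real parser setup.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_workers",
    default=1,
    type=int,
    help="The number of processes to use for model training.",  # trailing comma added
)
args = parser.parse_args(["--max_workers", "4"])
```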
setup.cfg
Outdated
matplotlib>=3.1.3
lxml>=4.6.2
triqler @ git+ssh://[email protected]/statisticalbiotechnology/triqler.git
I don't like having a non-PyPI dependency here, although it seems to make a large difference. I'll look into why this update isn't on PyPI yet.
Add cli max_workers test
It looks like tests are currently failing because our GitHub Actions workflows are not set up to clone repos with SSH; hence, installing mokapot with the changed Triqler dependency fails. I've reached out to the Triqler folks to get an ETA on when the recent patch will be in a release. If it is relatively soon, we should revert the dependency change.
That totally makes sense. We heard that @MatthewThe is currently on vacation and were just excited to get this out as quickly as possible ;)
I updated Triqler on PyPI (v0.6.2), let me know if you run into any other issues.
Thanks @MatthewThe! @sambenfredj and @gessulat - if you change the triqler dependency to
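Once the fix is on PyPI, the setup.cfg entry can point at a released version instead of the git URL. A sketch of what the dependency section might become; the exact version specifier is up to the maintainers, and `triqler>=0.6.2` simply reflects the release mentioned above:

```ini
# setup.cfg (sketch) - replace the git+ssh line with a PyPI version pin.
[options]
install_requires =
    matplotlib>=3.1.3
    lxml>=4.6.2
    triqler>=0.6.2
```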
Update triqler version
Thanks @wfondrie! I just updated the triqler dependency to the latest release.
Codecov Report
@@ Coverage Diff @@
## master #44 +/- ##
==========================================
+ Coverage 82.57% 82.58% +0.01%
==========================================
Files 18 18
Lines 1544 1545 +1
==========================================
+ Hits 1275 1276 +1
Misses 269 269
Continue to review full report at Codecov.
Hey @wfondrie,
I'm a bit stumped as to why the Ubuntu test hangs for hours. I'll try kicking it off again and see if it completes this time. Either way, I'll merge this into master. However, I'm going to hold off on creating a new release if it doesn't work, so that I can troubleshoot it a little more.
mokapot v0.8.0 is now on PyPI 🎉 It should be up on Bioconda as well within the next few days.
Hi @sambenfredj, @tkschmidt, and @gessulat - I'm planning to tweet about recent updates soon, and I wanted to mention your contribution here. I can mention each of you individually (@sambenfredj - do you have a Twitter handle?), MSAID, or both - whatever your preference 😄. It's also fine if you don't want to be mentioned - please just let me know.
Thank you @wfondrie, I am glad to be mentioned. You can find me under the Twitter handle SamiaFredj.
Hi @wfondrie,
* expose max workers parameter to cli
* update triqler package to fix the issue with calculating PEPs which is extremely slow
* make grid search run with multithreading
* test max_workers cli argument
* fix format
* update triqler

Co-authored-by: Siegfried Gessulat <[email protected]>
Co-authored-by: Tobias <[email protected]>