Improve runtime #44
Conversation
sambenfredj commented on Mar 2, 2022
- Expose the max workers parameter to the CLI
- Update the triqler package with the fix for issue #13 (calculating PEPs may be extremely slow for some runs)
- Make the grid search run with multi-threading
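The multi-threaded grid search described above can be sketched roughly as follows. This is a hedged illustration, not mokapot's actual implementation: `grid_search`, `evaluate`, and the candidate list are hypothetical names, and only `max_workers` mirrors the new CLI option.

```python
# Hypothetical sketch of a grid search parallelized with a thread pool.
# None of these names come from mokapot; max_workers mirrors the new CLI flag.
from concurrent.futures import ThreadPoolExecutor


def grid_search(candidates, evaluate, max_workers=1):
    """Score every candidate, optionally in parallel, and return the best one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with candidates.
        scores = list(pool.map(evaluate, candidates))
    best_score, best_candidate = max(zip(scores, candidates))
    return best_candidate
```

For example, `grid_search([0.1, 1.0, 10.0], lambda c: -abs(c - 1.0), max_workers=2)` picks `1.0`, the candidate with the highest score.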
Hi @sambenfredj - thanks for the great PR! 🎉
It looks like the code currently fails our lint tests. One way to fix this is to use Black as a pre-commit hook. To do this, first install pre-commit:
# With pip:
pip install pre-commit
# With conda:
conda install -c conda-forge pre-commit
Once you have pre-commit installed, you can set up the hook with:
pre-commit install
Now Black will run automatically on each new commit.
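For reference, a minimal `.pre-commit-config.yaml` wiring up Black might look like this. The `rev` shown here is an example, not necessarily the version pinned in mokapot's repository:

```yaml
# Sketch of a .pre-commit-config.yaml with Black as the only hook.
# The rev below is illustrative; pin whatever version the project uses.
repos:
  - repo: https://github.com/psf/black
    rev: 22.1.0
    hooks:
      - id: black
```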
Lastly, we should add a test that incorporates the new max_workers CLI argument. I think it may be too difficult to test whether the argument is truly working, so my suggestion is merely to add it to our basic CLI system test here:
mokapot/tests/system_tests/test_cli.py
Lines 45 to 72 in bed4946
def test_cli_options(tmp_path, scope_files):
    """Test non-defaults"""
    cmd = [
        "mokapot",
        scope_files[0],
        scope_files[1],
        "--dest_dir",
        tmp_path,
        "--file_root",
        "blah",
        "--train_fdr",
        "0.2",
        "--test_fdr",
        "0.1",
        "--seed",
        "100",
        "--direction",
        "RefactoredXCorr",
        "--folds",
        "2",
        "-v",
        "1",
        "--max_iter",
        "1",
        "--keep_decoys",
        "--subset_max_train",
        "50000",
    ]
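The suggested change boils down to appending the new flag to that cmd list. A sketch, with an abbreviated stand-in for the full list above (the worker count of 3 is arbitrary; any small integer exercises the argument):

```python
# Sketch: extend the cmd list from test_cli_options with the new flag.
# "input.pin" stands in for the real scope_files fixtures; 3 is arbitrary.
cmd = ["mokapot", "input.pin"]
cmd += ["--max_workers", "3"]
```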
Please don't hesitate to reach out if you have any questions!
mokapot/config.py
Outdated
"--max_workers",
default=1,
type=int,
help="The number of processes to use for model training."
This line needs a trailing comma to pass our lint tests.
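With the trailing comma added, the argument definition would look like this. A self-contained sketch: the `ArgumentParser` setup here is assumed, since the real parser lives elsewhere in mokapot/config.py.

```python
import argparse

# Minimal stand-in for mokapot's real parser setup.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--max_workers",
    default=1,
    type=int,
    help="The number of processes to use for model training.",  # trailing comma added
)
args = parser.parse_args(["--max_workers", "4"])
```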
setup.cfg
Outdated
matplotlib>=3.1.3
lxml>=4.6.2
triqler @ git+ssh://[email protected]/statisticalbiotechnology/triqler.git
I don't like having a non-PyPI dependency here, although it seems to make a large difference. I'll look into why this update isn't on PyPI yet.
Add cli max_workers test
It looks like tests are currently failing because our GitHub Actions workflows are not set up to clone repos with SSH; hence, installing mokapot with the changed Triqler dependency fails. I've reached out to the Triqler folks to get an ETA on when the recent patch will be in a release. If it is relatively soon, we should revert the dependency change.
That totally makes sense. We heard that @MatthewThe is currently on vacation and were just excited to get this out as quickly as possible ;)
I updated Triqler on PyPI (v0.6.2), let me know if you run into any other issues.
Thanks @MatthewThe! @sambenfredj and @gessulat - if you change the triqler dependency to
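Once the fix is on PyPI, the setup.cfg entry can point at a released version instead of the git URL. A sketch of what the dependency section might become; the exact version specifier is up to the maintainers, and `triqler>=0.6.2` simply reflects the release mentioned above:

```ini
# setup.cfg (sketch) - replace the git+ssh line with a PyPI version pin.
[options]
install_requires =
    matplotlib>=3.1.3
    lxml>=4.6.2
    triqler>=0.6.2
```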
Update triqler version
Thanks @wfondrie! I just updated the triqler dependency to the latest release.
Codecov Report
@@ Coverage Diff @@
## master #44 +/- ##
==========================================
+ Coverage 82.57% 82.58% +0.01%
==========================================
Files 18 18
Lines 1544 1545 +1
==========================================
+ Hits 1275 1276 +1
Misses 269 269
Continue to review full report at Codecov.
Hey @wfondrie,
I'm a bit stumped as to why the Ubuntu test hangs for hours. I'll try kicking it off again and see if it completes this time. Either way, I'll merge this into master. However, I'm going to hold off on creating a new release if it doesn't work, so that I can troubleshoot it a little more.
mokapot v0.8.0 is now on PyPI 🎉 It should be up on Bioconda as well within the next few days.
Hi @sambenfredj, @tkschmidt, and @gessulat - I'm planning to tweet about recent updates soon, and I wanted to mention your contribution here. I can mention each of you individually (@sambenfredj - do you have a Twitter handle?), MSAID, or both - whatever your preference 😄. It's also fine if you don't want to be mentioned - please just let me know.
Thank you @wfondrie, I am glad to be mentioned. You can find me under the Twitter handle SamiaFredj.
Hi @wfondrie,
* expose max workers parameter to cli
* update triqler package to fix the issue with calculating PEPs which is extremely slow
* make grid search run with multithreading
* test max_workers cli argument
* fix format
* update triqler

Co-authored-by: Siegfried Gessulat <[email protected]>
Co-authored-by: Tobias <[email protected]>