Merge pull request #23 from compomics/various-fixes

Various fixes and improvements (v0.2.0)
compomics · Nov 12, 2022 · b231291
2 parents 986b696 + 9d7cf78
commit b231291
Showing 32 changed files with 536 additions and 166 deletions.
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -1,5 +1,5 @@
 {
-    "esbonio.sphinx.confDir": "docs/source",
+    "esbonio.sphinx.confDir": "${workspaceFolder}/docs/source",
     "[python]": {
         "editor.formatOnSave": true,
         "editor.codeActionsOnSave": {"source.organizeImports": true},
@@ -11,4 +11,4 @@
     ],
     "python.testing.unittestEnabled": false,
     "python.testing.pytestEnabled": true
-}
+}
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,9 +1,59 @@
 # Changelog
+
 All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.2.0] - 2022-11-12
+
+### Added
+
+- `PSM`: Add `ion_mobility` field
+- `PSMList`: Allow slicing with bool arrays (e.g., `psm_list[psm_list["retention_time"] < 2000]`)
+- `rename_modifications`: Add support for fixed modifications
+- Add example files
+- Online: Add support for GZipped files
+- Online: Add support for logarithmic score (e.g. e-values)
+- Docs: Extend contributing with example contributions
+- Docs: Add notes to `PSM.get_usi()` method
+- Docs: Extend quickstart on PSMList
+- Docs: Add "psm_utils tags" for file formats, as used in high-level read/write/convert functions
+- Docs: Peptide Record: add notes on unsupported modification types; add example for C-terminal modification
+- Docs: More clearly document conversion to DataFrame
+- Docs: Add bioconda install instructions
+- Docs: Add citation for preprint
+- Tests: Add tests for PSMList `set_ranks` and `get_rank1_psms` methods
+
+### Changed
+
+- `PSMList`: Refactor `set_ranks` and `get_rank1_psms` methods
+- Update `.vscode/settings.json`
+- Typing: Replace Union with OR operator `|`
+- Online: Use percentiles instead of randomly sampling for PP plot
+- Docs: Force TOC-tree max depth
+- Tests: Expand unit tests in general
+
+### Fixed
+
+- `PSMList`: Truncate `__repr__` to the first five entries only, to avoid crashing notebook output
+- `Peptidoform`: Minor typing fix
+- `add_fixed_modifications`: Allow input as dict as well as list of tuples
+- `io`: Fix issue where the `NamedTemporaryFile` for `_supports_write_psm` was seen as invalid Percolator file
+- `io.convert`: Pass `progressbar` argument to class, not `write_file`
+- `io.mzid`: Add more supported MS-GF score names, make SpecEValue default
+- `io.peptide_record`: `spec_id` is now a required column (`spectrum_id` is also required in PSM)
+- `io.peptide_record`: Fix parsing of C-terminal modifications from proforma to peprec
+- `io.percolator`: Fix Percolator peptide notation writing (fixes #18)
+- `io.tsv`: Fix issue where `TSVReader` would not use string type for metadata
+- `io.xtandem`: Fix issue where optional arguments were not accepted by `XTandemReader`
+- `io.xtandem`: Do not split spectrum title on space
+- Online: Fix pi-0 diagonal calculation
+- Remove obsolete to do comments in code
+
 ## [0.1.0] - 2022-10-14
+
 ### Added
+
 - Initial version
diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst
@@ -16,6 +16,16 @@ thread in
 `GitHub Discussions <https://github.com/compomics/psm_utils/discussions>`_
 or in `GitHub Issues <https://github.com/compomics/psm_utils/issues>`_.
 
+Not sure where to start? Great contributions to
+`psm_utils <https://github.com/compomics/psm_utils>`_ include:
+
+- Adding support for more file formats.
+- Adding functionality to the ``PSMList``, ``PSM``, or ``Peptidoform`` classes.
+- Improving the performance of existing functions, e.g. parsing peptidoforms or
+  reading and writing PSM files.
+
+Also check out the `open issues <https://github.com/compomics/psm_utils/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22+label%3A%22help+wanted%22>`_
+that carry the ``good first issue`` or ``help wanted`` labels.
 
 
 Development setup

diff --git a/README.rst b/README.rst
@@ -14,6 +14,10 @@ engine results in Python.
    :alt: PyPI
    :target: https://pypi.org/project/psm-utils/
 
+.. image:: https://img.shields.io/conda/vn/bioconda/psm-utils?style=flat-square
+   :alt: Bioconda
+   :target: http://bioconda.github.io/recipes/psm-utils/README.html
+
 .. image:: https://img.shields.io/github/workflow/status/compomics/psm_utils/Test?label=tests&style=flat-square
    :alt: GitHub Actions tests status
    :target: https://github.com/compomics/psm_utils/actions/workflows/test.yml
@@ -82,18 +86,18 @@ Goals and non-goals
 Supported file formats
 **********************
 
-===================================================================================================================== =============== ===============
- File format                                                                                                           Read support    Write support
-===================================================================================================================== =============== ===============
- `OpenMS idXML <https://www.openms.de/>`_                                                                              ✅              ❌
- `MaxQuant msms.txt <https://www.maxquant.org/>`_                                                                      ✅              ❌
- `MS Amanda CSV <https://ms.imp.ac.at/?goto=msamanda>`_                                                                ✅              ❌
- `mzIdentML <https://psidev.info/mzidentml>`_                                                                          ✅              ✅
- `Peptide Record <https://psm-utils.readthedocs.io/en/latest/api/psm_utils.io/#module-psm_utils.io.peptide_record>`_   ✅              ✅
- `Percolator tab <https://github.com/percolator/percolator/wiki/Interface>`_                                           ✅              ✅
- `TSV <https://psm-utils.readthedocs.io/en/latest/api/psm_utils.io/#module-psm_utils.io.tsv>`_                         ✅              ✅
- `X!Tandem XML <https://www.thegpm.org/tandem/>`_                                                                      ✅              ❌
-===================================================================================================================== =============== ===============
+===================================================================================================================== =============== =============== ===============
+ File format                                                                                                           psm_utils tag   Read support    Write support
+===================================================================================================================== =============== =============== ===============
+ `OpenMS idXML <https://www.openms.de/>`_                                                                              ``idxml``       ✅              ❌
+ `MaxQuant msms.txt <https://www.maxquant.org/>`_                                                                      ``msms``        ✅              ❌
+ `MS Amanda CSV <https://ms.imp.ac.at/?goto=msamanda>`_                                                                ``msamanda``    ✅              ❌
+ `mzIdentML <https://psidev.info/mzidentml>`_                                                                          ``mzid``        ✅              ✅
+ `Peptide Record <https://psm-utils.readthedocs.io/en/stable/api/psm_utils.io/#module-psm_utils.io.peptide_record>`_   ``peprec``      ✅              ✅
+ `Percolator tab <https://github.com/percolator/percolator/wiki/Interface>`_                                           ``percolator``  ✅              ✅
+ `TSV <https://psm-utils.readthedocs.io/en/stable/api/psm_utils.io/#module-psm_utils.io.tsv>`_                         ``tsv``         ✅              ✅
+ `X!Tandem XML <https://www.thegpm.org/tandem/>`_                                                                      ``xtandem``     ✅              ❌
+===================================================================================================================== =============== =============== ===============
 
 Legend: ✅ Supported, ❌ Unsupported, 🔧 Work in progress
 
@@ -117,15 +121,22 @@ above to get started!
 Installation
 ############
 
-Install with pip:
+.. image:: https://img.shields.io/badge/install%20with-pip-brightgreen?style=flat-square
+   :alt: Install with pip
+   :target: https://pypi.org/project/psm-utils/
 
 .. code-block:: sh
 
     pip install psm-utils
 
 
-Note: In the PyPI package name, a hyphen is used instead of an underscore, as
-per `PEP8 convention <https://peps.python.org/pep-0008/#package-and-module-names>`_.
+.. image:: https://img.shields.io/badge/install%20with-bioconda-blue?style=flat-square
+   :alt: Install with Bioconda
+   :target: http://bioconda.github.io/recipes/psm-utils/README.html
+
+.. code-block:: sh
+
+    conda install -c bioconda psm-utils
 
 
 
@@ -134,3 +145,13 @@ Full documentation
 
 The full documentation, including a quickstart guide and Python API reference
 is available on `psm_utils.readthedocs.io <https://psm-utils.readthedocs.io>`_.
+
+
+Citation
+########
+
+If you use psm_utils for your research, please cite the following publication:
+
+   | **psm_utils: A high level Python API for parsing and handling peptide-spectrum-matches and proteomics search results.**
+   | Ralf Gabriels, Arthur Declercq, Robbin Bouwmeester, Sven Degroeve, Lennart Martens.
+   | ChemRxiv (2022). `doi:10.26434/chemrxiv-2022-9506r <https://doi.org/10.26434/chemrxiv-2022-9506r>`_
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -5,6 +5,7 @@
    :caption: About
    :hidden:
    :includehidden:
+   :maxdepth: 2
 
    Readme <self>
    quickstart
@@ -17,6 +18,7 @@
    :hidden:
    :includehidden:
    :glob:
+   :maxdepth: 2
 
    api/*
 
@@ -25,5 +27,6 @@
    :caption: Command line interface
    :hidden:
    :includehidden:
+   :maxdepth: 2
 
    cli/cli
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
@@ -60,10 +60,68 @@ match information:
 
 The spectrum can be retrieved by the USI through the ProteomeXchange USI aggregator:
 http://proteomecentral.proteomexchange.org/usi/?usi=mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2
+Note that this is only possible because the spectrum has been fully indexed in one of
+the ProteomeXchange partner repositories (in this case both MassIVE and PeptideAtlas).
 
 
-PSMList and psm_utils.io
-########################
+PSMList
+#######
+
+:py:class:`~psm_utils.psm.PSMList` is a simple list-like object that represents a
+group of PSMs, from one or more mass spectrometry runs or collections. This
+Pythonic data structure can be used flexibly in various contexts.
+
+.. code-block:: python
+
+   >>> psm_list = PSMList(psm_list=[
+   ...     PSM(peptidoform="ACDK", spectrum_id=1, score=140.2, retention_time=600.2),
+   ...     PSM(peptidoform="CDEFR", spectrum_id=2, score=132.9, retention_time=1225.4),
+   ...     PSM(peptidoform="DEM[Oxidation]K", spectrum_id=3, score=55.7, retention_time=3389.1),
+   ... ])
+
+:py:class:`PSMList` directly supports iteration:
+
+.. code-block:: python
+
+   >>> for psm in psm_list:
+   ...     print(psm.score)
+   140.2
+   132.9
+   55.7
+
+A given :py:class:`PSM` property can be retrieved for all PSMs as a single NumPy array:
+
+.. code-block:: python
+
+   >>> psm_list["score"]
+   array([140.2, 132.9, 55.7], dtype=object)
+
+:py:class:`PSMList` supports indexing and slicing:
+
+.. code-block:: python
+
+   >>> psm_list_subset = psm_list[0:2]
+   >>> psm_list_subset["score"]
+   array([140.2, 132.9], dtype=object)
+
+   >>> psm_list_subset = psm_list[0, 2]
+   >>> psm_list_subset["score"]
+   array([140.2, 55.7], dtype=object)
+
+For more advanced and efficient vectorized access, converting the
+:py:class:`PSMList` to a Pandas DataFrame is highly recommended:
+
+.. code-block:: python
+
+   >>> psm_df = psm_list.to_dataframe()
+   >>> psm_df[(psm_df["retention_time"] < 2000) & (psm_df["score"] > 10)]
+      peptidoform  spectrum_id   run collection spectrum is_decoy  score qvalue   pep precursor_mz  retention_time protein_list  rank source provenance_data metadata rescoring_features
+   0        ACDK            1  None       None     None     None  140.2   None  None         None           600.2         None  None   None            None     None               None
+   1       CDEFR            2  None       None     None     None  132.9   None  None         None          1225.4         None  None   None            None     None               None
+
+
+psm_utils.io
+############
 
 The :py:mod:`psm_utils.io` subpackage contains readers and writers for various
 PSM file formats (see :ref:`Supported file formats`). Each reader parses the
@@ -173,7 +231,7 @@ Using a molecular formula or mass shift:
 621.24282637892
 
 
-A drawback of using the mass shift is that the composition is not be resolvable:
+A drawback of using the mass shift is that the composition is not resolvable:
 
 >>> Peptidoform("AC[+57.021464]DEK/2").composition
 [...]

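Reviewer note on the new `PSMList` quickstart section: the indexing behaviour documented above (integer, slice, field name, tuple of indices, plus the bool-array slicing listed in the changelog) can be illustrated with a small stand-alone sketch. `MiniPSM` and `MiniPSMList` are hypothetical stand-ins written for this note. They use plain lists rather than NumPy arrays and are not the psm_utils implementation.

```python
from dataclasses import dataclass


@dataclass
class MiniPSM:
    """Hypothetical stand-in for a PSM, reduced to three fields."""
    peptidoform: str
    spectrum_id: int
    score: float


class MiniPSMList:
    """Hypothetical list-like container; not the psm_utils implementation."""

    def __init__(self, psm_list):
        self.psm_list = list(psm_list)

    def __len__(self):
        return len(self.psm_list)

    def __iter__(self):
        return iter(self.psm_list)

    def __getitem__(self, item):
        # bool is a subclass of int, so a lone bool must be rejected first
        if isinstance(item, bool):
            raise TypeError("a single bool is not a valid index")
        if isinstance(item, int):
            return self.psm_list[item]                # single PSM
        if isinstance(item, slice):
            return MiniPSMList(self.psm_list[item])   # sub-list
        if isinstance(item, str):
            return [getattr(psm, item) for psm in self.psm_list]  # one field
        item = list(item)
        if item and all(isinstance(i, bool) for i in item):
            # Boolean mask: keep the entries whose flag is True
            if len(item) != len(self.psm_list):
                raise IndexError("boolean mask length must match the list length")
            return MiniPSMList(p for p, keep in zip(self.psm_list, item) if keep)
        # Otherwise treat the input as a sequence of integer positions
        return MiniPSMList(self.psm_list[i] for i in item)


psms = MiniPSMList([
    MiniPSM("ACDK", 1, 140.2),
    MiniPSM("CDEFR", 2, 132.9),
    MiniPSM("DEM[Oxidation]K", 3, 55.7),
])

high_scoring = psms[[score > 100 for score in psms["score"]]]
print(high_scoring["peptidoform"])  # ['ACDK', 'CDEFR']
print(psms[0, 2]["score"])          # [140.2, 55.7]
```

Dispatching on the index type keeps a single `__getitem__` entry point, which is one plausible way to support all of the access patterns shown in the quickstart.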
diff --git a/example_files/HeLa_1ug_isow2_1hgradient_datasetA_rep1_output_2.csv.gz b/example_files/HeLa_1ug_isow2_1hgradient_datasetA_rep1_output_2.csv.gz
diff --git a/...es/PXD001077-Velos005137_swissprot-pfuriosus-canonical-crap_msgfplus-phospho.msgf.mzid.gz b/...es/PXD001077-Velos005137_swissprot-pfuriosus-canonical-crap_msgfplus-phospho.msgf.mzid.gz
diff --git a/...os005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho-ms2rescore.pout.gz b/...os005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho-ms2rescore.pout.gz
diff --git a/...Velos005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho.t.peprec.txt.gz b/...Velos005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho.t.peprec.txt.gz
diff --git a/...001077-Velos005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho.t.xml.gz b/...001077-Velos005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho.t.xml.gz
diff --git a/...77-Velos005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho.t.xml.tsv.gz b/...77-Velos005137_swissprot-pfuriosus-canonical-crap_xtandem-refinement-phospho.t.xml.tsv.gz
diff --git a/example_files/QExHF04054_tandem.idXML.gz b/example_files/QExHF04054_tandem.idXML.gz
diff --git a/example_files/msms.txt.gz b/example_files/msms.txt.gz
diff --git a/online/_base.py b/online/_base.py
@@ -26,6 +26,7 @@ def __init__(self) -> None:
             "output_filetype",
             "fdr_threshold",
             "reverse",
+            "log_scale",
             "file_state",
             "psm_list",
             "psm_df",

diff --git a/online/_utils.py b/online/_utils.py
@@ -57,15 +57,21 @@ def score_histogram(psm_df):
 
 def pp_plot(psm_df):
     """Generate PP plot for given PSM dataframe."""
-    decoy_ratio = np.count_nonzero(psm_df["is_decoy"]) / len(psm_df)
-    if decoy_ratio == 0:
+    n_decoys = np.count_nonzero(psm_df["is_decoy"])
+    n_targets = len(psm_df) - n_decoys
+    pi_zero = n_decoys / n_targets
+    if n_decoys == 0:
         raise ValueError("No decoy PSMs found in PSM file.")
     target_scores = psm_df["score"][~psm_df["is_decoy"]]
     decoy_scores = psm_df["score"][psm_df["is_decoy"]]
-    if len(psm_df) > 5000:
-        target_scores_sample = psm_df["score"][~psm_df["is_decoy"]].sample(5000)
-    target_ecdf = ECDF(target_scores)(target_scores_sample)
-    decoy_ecdf = ECDF(decoy_scores)(target_scores_sample)
+    if len(psm_df) > 1000:
+        target_scores_quantiles = psm_df["score"][~psm_df["is_decoy"]].quantile(
+            np.linspace(0, 1, 1000)
+        )
+    else:
+        target_scores_quantiles = target_scores
+    target_ecdf = ECDF(target_scores)(target_scores_quantiles)
+    decoy_ecdf = ECDF(decoy_scores)(target_scores_quantiles)
 
     fig = go.Figure()
     fig.add_trace(
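Reviewer note on the `pp_plot` hunk above: a minimal, dependency-free sketch of the idea behind this change, assuming the goal is to evaluate both ECDFs on a fixed grid of target-score quantiles rather than on a random sample. The helper names (`ecdf`, `quantiles`, `pp_points`) are hypothetical. The quantiles use nearest-rank selection rather than the interpolating pandas `quantile` call in the diff, and the size threshold is applied to the number of targets rather than to the full table. The sketch also rejects an empty target set before dividing, which the diff does not do.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function giving the fraction of `sample` that is <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def quantiles(sample, n_points):
    """Evenly spaced quantiles of `sample` (nearest rank, no interpolation)."""
    ordered = sorted(sample)
    last = len(ordered) - 1
    return [ordered[round(i * last / (n_points - 1))] for i in range(n_points)]


def pp_points(scores, is_decoy, max_points=1000):
    """Return (decoy ECDF values, target ECDF values, pi_zero)."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2.")
    targets = [s for s, d in zip(scores, is_decoy) if not d]
    decoys = [s for s, d in zip(scores, is_decoy) if d]
    if not decoys:
        raise ValueError("No decoy PSMs found.")
    if not targets:
        raise ValueError("No target PSMs found.")
    # Slope of the reference diagonal, as in the diff: decoys over targets
    pi_zero = len(decoys) / len(targets)
    # Thin out large inputs deterministically instead of sampling at random
    if len(targets) > max_points:
        grid = quantiles(targets, max_points)
    else:
        grid = sorted(targets)
    target_ecdf, decoy_ecdf = ecdf(targets), ecdf(decoys)
    return [decoy_ecdf(x) for x in grid], [target_ecdf(x) for x in grid], pi_zero


scores = [1, 2, 3, 4, 5, 6]
is_decoy = [True, False, True, False, False, False]
decoy_values, target_values, pi_zero = pp_points(scores, is_decoy)
print(decoy_values)   # [0.5, 1.0, 1.0, 1.0]
print(target_values)  # [0.25, 0.5, 0.75, 1.0]
print(pi_zero)        # 0.5
```

Because the grid is derived from sorted data, repeated calls on the same input give the same curve, which is the practical benefit over random sampling.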
@@ -78,7 +84,7 @@ def pp_plot(psm_df):
     fig.add_trace(
         go.Scatter(
             x=[0, 1],
-            y=[0, decoy_ratio],
+            y=[0, pi_zero],
             mode="lines",
             line=go.scatter.Line(color="red"),
             showlegend=True,