First steps using mokapot in Python

In this vignette, we’ll look at the basics of how to use mokapot as a Python package. We’ve performed these analyses within a Jupyter notebook, which is available using the link at the top of the page.

Following along locally

To run this notebook, you’ll need to have mokapot installed. Additionally, you’ll need to have a file in the Percolator tab-delimited format on hand. The example we’ll be using comes from running tide-search on a single phosphoproteomics experiment from:

Hogrebe, Alexander et al. “Benchmarking common quantification strategies for large-scale phosphoproteomics.” Nature communications vol. 9,1 1045. 13 Mar. 2018, doi:10.1038/s41467-018-03309-6

If you need it, you can download it from the mokapot repository here (phospho_rep1.pin) and set the path to your input file:

[1]:

pin_file = "../../../data/phospho_rep1.pin"

Additionally, we’ll need the FASTA file used for the database search to perform protein-level confidence estimates. Critically, this file must include both target and decoy protein sequences. The correct FASTA file for the above example can also be downloaded from the mokapot repository (human_sp_td.fasta). Once downloaded, you can define the path to your FASTA file to follow along:

[2]:

fasta_file = "../../../data/human_sp_td.fasta"

Step 1: Setup our Python environment

Before we can perform an anlyses we need to import the Python packages that we’ll be using. Additionally, it’s a good idea to set the random seed for reproducibility.

[3]:

import os
import mokapot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set the random seed:
np.random.seed(42)

# Create an output directory
out_dir = "basic_python_api_output"
os.makedirs(out_dir, exist_ok=True)

If we want messages about the mokapot’s progress throughout the analysis, then we need to enable it using the logging module:

[4]:

import logging

# Change to True enable messages and nicely format them:
log = False
if log:
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s: %(message)s",
    )

Step 2: Read the PSMs

We’ll now use mokapot to read the PSMs from the provided input file. The read_pin() function returns LinearPsmDataset object, which stores the PSMs and their associated features for analysis.

[5]:

psms = mokapot.read_pin(pin_file)

Optional: Add proteins

mokapot uses the picked-protein approach to assign accurate protein-level confidence estimates. To do this, you’ll need to provide the FASTA file used for your database search (including decoy sequences) and supply the parameters that match the digestion conditions you searched.

[6]:

psms.add_proteins(fasta_file)

Step 3: Analyze the PSMs

After that the PSMs have been loaded, we can use the brew() function to run the analysis and re-score the PSMs. This returns a LinearConfidence object, which calculates confidences estimates and stores them.

[7]:

moka_conf, models = mokapot.brew(psms)
moka_conf.psms.head()

[7]:

	SpecId	Label	ScanNr	ExpMass	CalcMass	Peptide	mokapot score	mokapot q-value	mokapot PEP	Proteins
0	target_0_48845_5_-1	True	48845	5269.5756	5269.5728	R.RGNVAGDSKNDPPMEAAGFTAQVIILNHPGQISAGYAPVLDCHT...	11.215498	0.000053	6.305117e-16	sp\|P68104\|EF1A1_HUMAN
1	target_0_45243_4_-1	True	45243	3945.8759	3945.8706	R.CSDAAGYPHATHDLEGPPLDAYSIQGQHTISPLDLAK.L	10.601063	0.000053	6.305117e-16	sp\|Q15365\|PCBP1_HUMAN
2	target_0_51371_4_-1	True	51371	4051.1223	4051.1086	K.KLHEEEIQELQAQIQEQHVQIDVDVSKPDLTAALR.D	10.550855	0.000053	6.305117e-16	sp\|P08670\|VIME_HUMAN
3	target_0_41715_3_-1	True	41715	4473.8359	4473.8286	K.ALGKYGPADVEDTTGSGATDSKDDDDIDLFGS[79.97]DDEEE...	9.964699	0.000053	6.305117e-16	sp\|P24534\|EF1B_HUMAN
4	target_0_48913_5_-1	True	48913	5269.5737	5269.5728	R.RGNVAGDSKNDPPMEAAGFTAQVIILNHPGQISAGYAPVLDCHT...	9.874374	0.000053	6.305117e-16	sp\|P68104\|EF1A1_HUMAN

[8]:

moka_conf.peptides.head()

[8]:

	SpecId	Label	ScanNr	ExpMass	CalcMass	Peptide	mokapot score	mokapot q-value	mokapot PEP	Proteins
0	target_0_48845_5_-1	True	48845	5269.5756	5269.5728	R.RGNVAGDSKNDPPMEAAGFTAQVIILNHPGQISAGYAPVLDCHT...	11.215498	0.000073	6.305117e-16	sp\|P68104\|EF1A1_HUMAN
1	target_0_45243_4_-1	True	45243	3945.8759	3945.8706	R.CSDAAGYPHATHDLEGPPLDAYSIQGQHTISPLDLAK.L	10.601063	0.000073	6.305117e-16	sp\|Q15365\|PCBP1_HUMAN
2	target_0_51371_4_-1	True	51371	4051.1223	4051.1086	K.KLHEEEIQELQAQIQEQHVQIDVDVSKPDLTAALR.D	10.550855	0.000073	6.305117e-16	sp\|P08670\|VIME_HUMAN
3	target_0_41715_3_-1	True	41715	4473.8359	4473.8286	K.ALGKYGPADVEDTTGSGATDSKDDDDIDLFGS[79.97]DDEEE...	9.964699	0.000073	6.305117e-16	sp\|P24534\|EF1B_HUMAN
4	target_0_31886_3_-1	True	31886	3450.7612	3450.7544	R.AAAAVAAAASSCRPLGSGAGPGPTGAAPVSAPAPGPGPAGK.G	9.778672	0.000073	6.305117e-16	sp\|Q9NRL3\|STRN4_HUMAN

[9]:

moka_conf.proteins.head()

[9]:

	mokapot protein group	best peptide	stripped sequence	mokapot score	mokapot q-value	mokapot PEP
0	sp\|P68104\|EF1A1_HUMAN	R.RGNVAGDSKNDPPMEAAGFTAQVIILNHPGQISAGYAPVLDCHT...	RGNVAGDSKNDPPMEAAGFTAQVIILNHPGQISAGYAPVLDCHTAH...	11.215498	0.000291	6.305117e-16
1	sp\|Q15365\|PCBP1_HUMAN	R.CSDAAGYPHATHDLEGPPLDAYSIQGQHTISPLDLAK.L	CSDAAGYPHATHDLEGPPLDAYSIQGQHTISPLDLAK	10.601063	0.000291	6.305117e-16
2	sp\|P08670\|VIME_HUMAN	K.KLHEEEIQELQAQIQEQHVQIDVDVSKPDLTAALR.D	KLHEEEIQELQAQIQEQHVQIDVDVSKPDLTAALR	10.550855	0.000291	6.305117e-16
3	sp\|P24534\|EF1B_HUMAN	K.ALGKYGPADVEDTTGSGATDSKDDDDIDLFGS[79.97]DDEEE...	ALGKYGPADVEDTTGSGATDSKDDDDIDLFGSDDEEESEEAK	9.964699	0.000291	6.305117e-16
4	sp\|Q9NRL3\|STRN4_HUMAN	R.AAAAVAAAASSCRPLGSGAGPGPTGAAPVSAPAPGPGPAGK.G	AAAAVAAAASSCRPLGSGAGPGPTGAAPVSAPAPGPGPAGK	9.778672	0.000291	6.305117e-16

We can also use mokapot assign confidence estimates based on the best original feature—the Tide combined p-value in this case—instead of using the learned scores:

[10]:

tide_conf = psms.assign_confidence()

Step 4: Plot and save the results

We’ve included some simple plotting utilities to help assess mokapot’s perfomance. Let’s see if the mokapot model improves upon the best feature, which is Tide’s combined p-value:

[11]:

fig, axs = plt.subplots(1, 3, figsize=(12, 4))
colors = ("#343131", "#24B8A0")

# Plot the performance:
for ax, level in zip(axs, tide_conf.levels):
    tide_conf.plot_qvalues(level=level, c=colors[0], ax=ax,
                           label="Tide combined p-value")
    moka_conf.plot_qvalues(level=level, c=colors[1], ax=ax,
                           label="mokapot")
    ax.legend(frameon=False)

plt.tight_layout()
plt.show()

../_images/vignettes_basic_python_api_19_0.png

Excellent. It looks like mokapot increased our power to detect a few hundred more PSMs and peptides at 1% FDR. We figure out the exact improvement by looking at the data:

[12]:

# PSMs
moka_psms = (moka_conf.psms["mokapot q-value"] <= 0.01).sum()
tide_psms = (tide_conf.psms["mokapot q-value"] <= 0.01).sum()
print(f"PSMs gained by mokapot: {moka_psms - tide_psms}")

# Peptides
moka_peps = (moka_conf.peptides["mokapot q-value"] <= 0.01).sum()
tide_peps = (tide_conf.peptides["mokapot q-value"] <= 0.01).sum()
print(f"Peptides gained by mokapot: {moka_peps - tide_peps}")

# Proteins
moka_prots = (moka_conf.proteins["mokapot q-value"] <= 0.01).sum()
tide_prots = (tide_conf.proteins["mokapot q-value"] <= 0.01).sum()
print(f"Proteins gained by mokapot: {moka_prots - tide_prots}")

PSMs gained by mokapot: 1149
Peptides gained by mokapot: 872
Proteins gained by mokapot: 89

Finally, we will save the results as tab-delimited text files:

[13]:

result_files = moka_conf.to_txt(dest_dir=out_dir)
result_files

[13]:

['basic_python_api_output/mokapot.psms.txt',
 'basic_python_api_output/mokapot.peptides.txt',
 'basic_python_api_output/mokapot.proteins.txt']

Wrapping up

This vignette demonstrated the basic usage of mokapot’s Python API. For more detail about any of the mokapot functions and classes that we used, see the mokapot Python API documentation. Finally, check out the other vignettes for examples of advanced usage of mokapot’s Python API.