The mokapot cookbook
This page contains recipes (code snippets) to accomplish a variety of tasks with mokapot. These examples illustrate the use of mokapot for common experimental designs and provide starting points for conducting your own customized analyses.
Note
These recipes often make use of Python list comprehensions for brevity.
# This list comprehension:
y = [x for x in range(10)]
# Is equivalent to:
y = []
for x in range(10):
    y.append(x)
Python API workflows
Analyze PSMs from a single file with mokapot
The simplest case is when we have a single file containing the PSMs identified
by a database search engine. If our file, psms.pin, is in the
Percolator tab-delimited format, we can perform a mokapot analysis and save the
PSM and peptide confidence estimates as tab-delimited files with:
import mokapot
# Read the PSMs from the PIN file:
psms = mokapot.read_pin("psms.pin")
# Conduct the mokapot analysis:
results, models = mokapot.brew(psms)
# Save the results to two tab-delimited files
# "mokapot.psms.txt" and "mokapot.peptides.txt"
result_files = results.to_txt()
# Another way to save the results is:
result_files = mokapot.to_txt(results)
We can do the same if our input PSMs are in the PepXML format, psms.pep.xml:
import mokapot
# Read the PSMs from the PepXML file:
psms = mokapot.read_pepxml("psms.pep.xml")
# Conduct the mokapot analysis:
results, models = mokapot.brew(psms)
# Save the results to two tab-delimited files
# "mokapot.psms.txt" and "mokapot.peptides.txt"
result_files = results.to_txt()
Analyze PSMs from a single file using only the best feature
It is often good to determine if there is a significant benefit from using
mokapot’s machine learning approach. One way we can do this is by comparing the
results from a full mokapot analysis against the results from ranking PSMs by
the best feature (often the primary database search engine score). If our file,
psms.pin, is in the Percolator tab-delimited format, we can use the
best feature, then save the PSM and peptide confidence estimates as
tab-delimited files with:
import mokapot
# Read the PSMs from the PIN file:
psms = mokapot.read_pin("psms.pin")
# Calculate confidence estimates using the best feature:
results = psms.assign_confidence()
# Save the results to two tab-delimited files
# "mokapot.psms.txt" and "mokapot.peptides.txt"
result_files = results.to_txt()
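One simple way to make this comparison is to count how many PSMs each approach accepts at a chosen q-value threshold. The sketch below is a minimal illustration using made-up q-values. The count_accepted helper is hypothetical and not part of mokapot; in practice the q-values would come from the confidence estimates of each analysis.

```python
def count_accepted(qvalues, threshold=0.01):
    """Count how many q-values fall at or below the threshold."""
    return sum(q <= threshold for q in qvalues)


# Hypothetical q-values, for illustration only. In a real comparison these
# would be taken from the full mokapot analysis and the best-feature analysis.
full_qvalues = [0.001, 0.004, 0.008, 0.02, 0.3]
best_feature_qvalues = [0.002, 0.009, 0.05, 0.1, 0.4]

n_full = count_accepted(full_qvalues)
n_best = count_accepted(best_feature_qvalues)
print(f"Full analysis: {n_full} accepted; best feature: {n_best} accepted")
```

If the full analysis accepts substantially more PSMs at the same threshold, the machine learning step is likely providing a benefit for that dataset.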
Analyze PSMs from a single file with protein-level results
We often want confidence estimates for proteins as well as PSMs and peptides.
In mokapot, we use the picked-protein approach to group proteins and assign
their confidence estimates. To enable these protein confidence estimates, we
need to provide the FASTA file and the digestion settings that we used for our
database search. If our file, psms.pin, is in the Percolator tab-delimited
format and we obtained these PSMs using the human.fasta protein database with a
full tryptic digest, we can perform our analysis with:
import mokapot
# Read the PSMs from the PIN file:
psms = mokapot.read_pin("psms.pin")
# Provide the protein sequences:
psms.add_proteins(
    "human.fasta",
    enzyme="[KR]",
    decoy_prefix="decoy_",
    missed_cleavages=0,
)
# Conduct the mokapot analysis:
results, models = mokapot.brew(psms)
# Save the results to three tab-delimited files
# "mokapot.psms.txt", "mokapot.peptides.txt", and "mokapot.proteins.txt"
result_files = results.to_txt()
Analyze PSMs from a single fractionated sample in multiple files
Offline fractionation is typically performed to increase the detectable
proteome depth for a sample. Sometimes these types of analyses will yield
multiple files for the detected PSMs, each corresponding to a single mass
spectrometry run of the different biochemical fractions. If we have the PSMs
from three fractions, fraction_1.pin, fraction_2.pin, and fraction_3.pin, we can
analyze them together in mokapot with:
import mokapot
# Create a list with our file names:
psm_files = ["fraction_1.pin", "fraction_2.pin", "fraction_3.pin"]
# Read the PSMs from all of the files:
psms = mokapot.read_pin(psm_files)
# Conduct the mokapot analysis:
results, models = mokapot.brew(psms)
# Save the results to two tab-delimited files
# "mokapot.psms.txt" and "mokapot.peptides.txt"
result_files = results.to_txt()
Analyze PSMs from multiple experiments using a joint model
We often want to compare the detected peptides and proteins between multiple
biological samples or experiments. One way to conduct this type of analysis
with mokapot is to use a joint model, such that the model learned by mokapot is
consistent across experiments. If we have PSMs from three experiments,
exp_1.pin, exp_2.pin, and exp_3.pin, we can analyze them using a joint model with:
import mokapot
# Create a list with our file names:
psm_files = ["exp_1.pin", "exp_2.pin", "exp_3.pin"]
# Read the PSMs from each file separately:
psm_list = [mokapot.read_pin(f) for f in psm_files]
# Conduct the mokapot analysis:
result_list, models = mokapot.brew(psm_list)
# Save the results to two tab-delimited files for each experiment:
# "exp_1.mokapot.psms.txt", "exp_1.mokapot.peptides.txt", ...
labels = ["exp_1", "exp_2", "exp_3"]
result_files = [r.to_txt(file_root=l) for l, r in zip(labels, result_list)]
Analyze PSMs from multiple experiments using independent models
As an alternative to the joint model above, we can analyze PSMs from multiple
experiments, each with its own model. If we have PSMs from three experiments,
exp_1.pin, exp_2.pin, and exp_3.pin, we can analyze them using independent
models with:
import mokapot
# Create a list with our file names:
psm_files = ["exp_1.pin", "exp_2.pin", "exp_3.pin"]
# Read the PSMs from each file separately:
psm_list = [mokapot.read_pin(f) for f in psm_files]
# Conduct the mokapot analyses separately:
# This returns a list of (result, models) pairs: [(exp_1_result, exp_1_models), ...]
results_and_models = [mokapot.brew(p) for p in psm_list]
# Unnest the nested list:
result_list, models = list(zip(*results_and_models))
# Save the results to two tab-delimited files for each experiment:
# "exp_1.mokapot.psms.txt", "exp_1.mokapot.peptides.txt", ...
labels = ["exp_1", "exp_2", "exp_3"]
result_files = [r.to_txt(file_root=l) for l, r in zip(labels, result_list)]
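The unnesting step in this recipe relies on the zip(*...) idiom, which transposes a list of pairs into two sequences. The sketch below shows the idiom on its own, using placeholder strings as hypothetical stand-ins for the results and models that each analysis would return.

```python
# Placeholder strings stand in for the (results, models) pair from each
# experiment; they are hypothetical and only illustrate the idiom.
results_and_models = [
    ("exp_1_results", "exp_1_models"),
    ("exp_2_results", "exp_2_models"),
    ("exp_3_results", "exp_3_models"),
]

# zip(*pairs) groups the first elements together and the second elements together:
result_list, models = zip(*results_and_models)

print(result_list)
print(models)
```

Note that zip yields tuples rather than lists, which is fine for iterating over the results afterwards.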
Analyze PSMs from multiple experiments with multiple fractions
The previous cases of multiple experiments and multiple fractions are
frequently combined for deep proteomics datasets. Let’s assume we have PSMs
from two experiments, each with two fractions: exp_1-fraction_1.pin,
exp_1-fraction_2.pin, exp_2-fraction_1.pin, and exp_2-fraction_2.pin. We can then
analyze them using a joint model in mokapot with:
import mokapot
# Create a nested list with our file names:
psm_file_groups = [
    ["exp_1-fraction_1.pin", "exp_1-fraction_2.pin"],  # exp_1
    ["exp_2-fraction_1.pin", "exp_2-fraction_2.pin"],  # exp_2
]
# Read the PSMs from each experiment group separately:
psm_list = [mokapot.read_pin(f) for f in psm_file_groups]
# Conduct the mokapot analysis:
result_list, models = mokapot.brew(psm_list)
# Save the results to two tab-delimited files for each experiment:
# "exp_1.mokapot.psms.txt", "exp_1.mokapot.peptides.txt", ...
labels = ["exp_1", "exp_2"]
result_files = [r.to_txt(file_root=l) for l, r in zip(labels, result_list)]
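With many experiments and fractions, writing the nested list and labels by hand becomes tedious. The sketch below shows one possible way to build them from a flat list of file names using only the Python standard library. It assumes the naming scheme used in this recipe, where the experiment label precedes the first "-"; the group_by_experiment helper is hypothetical and not part of mokapot.

```python
from collections import defaultdict
from pathlib import Path


def group_by_experiment(psm_files, sep="-"):
    """Group file names by the text before the first separator."""
    groups = defaultdict(list)
    for psm_file in psm_files:
        # "exp_1-fraction_1.pin" -> "exp_1"
        label = Path(psm_file).name.split(sep, 1)[0]
        groups[label].append(psm_file)
    return dict(groups)


psm_files = [
    "exp_1-fraction_1.pin",
    "exp_1-fraction_2.pin",
    "exp_2-fraction_1.pin",
    "exp_2-fraction_2.pin",
]

groups = group_by_experiment(psm_files)
labels = list(groups)                    # one label per experiment
psm_file_groups = list(groups.values())  # one list of files per experiment
print(labels)
print(psm_file_groups)
```

The resulting labels and psm_file_groups can then take the place of the hand-written lists in the recipe.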
Save results for label-free quantitation with FlashLFQ
FlashLFQ is an open-source
tool for label-free quantitation of peptides and proteins. Unfortunately, input
files in the Percolator tab-delimited format typically do not contain enough
information to create an input file for FlashLFQ. Although this information can
be added to the file and specified through the optional parameters of
mokapot.read_pin(), we often find it easier to use a PepXML file, which already
contains this information. If we have PSMs from two experiments, exp_1.pep.xml
and exp_2.pep.xml, we can analyze them with mokapot using a joint model and save
the detected peptides in a format for
input to FlashLFQ with the following. Note that the protein groups reported by
FlashLFQ will be most accurate if protein-level confidence estimates have been
enabled.
import mokapot
# Create a list with our file names:
psm_files = ["exp_1.pep.xml", "exp_2.pep.xml"]
# Read the PSMs from each experiment separately:
psm_list = [mokapot.read_pepxml(f) for f in psm_files]
# Read the proteins from a FASTA file and add them to each experiment:
proteins = mokapot.read_fasta("human.fasta")
[p.add_proteins(proteins) for p in psm_list]
# Conduct the mokapot analysis:
result_list, models = mokapot.brew(psm_list)
# Save results as tab-delimited files:
labels = ["exp_1", "exp_2"]
result_files = [r.to_txt(file_root=l) for l, r in zip(labels, result_list)]
# Create an input for FlashLFQ:
flashlfq_file = mokapot.to_flashlfq(result_list)
The final command will create a file, mokapot.flashlfq.txt, that we can
use to obtain quantitative results for the peptides and proteins using
FlashLFQ.
Python API Tips and Tricks
Turn on logging messages
By default, mokapot will only print warnings and errors when using the Python API. However, information about mokapot’s progress can be enabled by adding the following to the beginning of your script or notebook:
import logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
When the same FASTA file is used for multiple experiments
One way to add proteins from a FASTA file (human.fasta) to a collection
of PSMs (psms.pin) is:
import mokapot
psms = mokapot.read_pin("psms.pin")
psms.add_proteins("human.fasta")
Correspondingly, when we have PSMs from multiple experiments
(exp_1.pin, exp_2.pin, exp_3.pin), we could do:
import mokapot
psm_files = ["exp_1.pin", "exp_2.pin", "exp_3.pin"]
psm_list = [mokapot.read_pin(f) for f in psm_files]
# Add proteins to each independently:
[p.add_proteins("human.fasta") for p in psm_list]
This will work; however, the protein and peptide sequences from
human.fasta will be stored three separate times, despite containing the
same information.
Instead, it is much more memory efficient to use mokapot.read_fasta() once
and add the resulting Proteins object to each of the experiments:
import mokapot
psm_files = ["exp_1.pin", "exp_2.pin", "exp_3.pin"]
psm_list = [mokapot.read_pin(f) for f in psm_files]
# Read the FASTA file:
proteins = mokapot.read_fasta("human.fasta")
# Add the proteins to the experiments:
[p.add_proteins(proteins) for p in psm_list]
Enzyme Regular Expressions
For maximum flexibility, mokapot uses regular expressions (regex for short) to define the patterns that govern enzymatic cleavage in protein sequences. However, these expressions can be frustratingly difficult to write from scratch. In the table below, we list regular expressions for some common enzymes used in proteomics experiments. If the one you need is not listed, we recommend using mokapot.digest() to test a new one on a sample sequence.
In mokapot, the end of the sequence matching the regex is used to define the cleavage site.
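As a rough illustration of this rule, the sketch below uses only Python's built-in re module, not mokapot itself, to split a sequence after the end of each regex match. The cleave helper and the example sequence are hypothetical, and mokapot's own digestion handles additional details, such as missed cleavages and peptide length limits, that are ignored here.

```python
import re


def cleave(sequence, enzyme_regex):
    """Split a sequence after the end of every regex match (illustration only)."""
    # The end of each match marks a cleavage site:
    sites = [m.end() for m in re.finditer(enzyme_regex, sequence)]
    # Add the sequence boundaries and drop duplicate positions:
    bounds = sorted({0, *sites, len(sequence)})
    return [sequence[i:j] for i, j in zip(bounds, bounds[1:])]


# Every K or R is a cleavage site:
print(cleave("MKPEPRTIDEK", "[KR]"))
# A K or R followed by P is no longer a cleavage site:
print(cleave("MKPEPRTIDEK", "[KR](?!P)"))
```

Here the lookahead (?!P) prevents a match at the lysine in "KP", so the first two fragments from the plain "[KR]" pattern stay joined.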
Enzyme | Regex |
---|---|
Trypsin (without proline suppression) | [KR] |
Trypsin (with proline suppression) | [KR](?!P) |
Lys-C | |
Lys-N | |
Arg-C | |
Asp-N | |
CNBr | |
Glu-C | |
PepsinA | |
Chymotrypsin | [FWYL](?!P) |
To indicate more than one enzyme, we can use regex alternations with |. For example, we could specify trypsin and chymotrypsin with:
"([KR](?!P)|[FWYL](?!P))"
In this case, we also have the option to simplify the regex to:
"[KRFWYL](?!P)"
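To check that a simplification like this behaves the same as the original pattern, we can compare where the two regexes match on a test sequence. The sketch below does this with Python's built-in re module; the match_ends helper and the test sequence are hypothetical, and agreement on one sequence is a sanity check rather than a proof of equivalence.

```python
import re

combined = "([KR](?!P)|[FWYL](?!P))"
simplified = "[KRFWYL](?!P)"


def match_ends(regex, sequence):
    """Return the end position of every regex match in the sequence."""
    return [m.end() for m in re.finditer(regex, sequence)]


# A made-up sequence containing each residue of interest, some followed by P:
sequence = "MAKPFWRLYPKE"

combined_sites = match_ends(combined, sequence)
simplified_sites = match_ends(simplified, sequence)
print(combined_sites == simplified_sites)
```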
Questions?
Still have questions? Post them on our discussion board.
If you find a mistake—such as a typo or code that doesn’t run—please let us know by filing an issue. Also, please consider contributing if you know how to fix it.