Functions

Primary Functions

mokapot.read_pin(pin_files, group_column=None, filename_column=None, calcmass_column=None, expmass_column=None, rt_column=None, charge_column=None, to_df=False, copy_data=False)

Read Percolator input (PIN) tab-delimited files.

Read PSMs from one or more Percolator input (PIN) tab-delimited files, aggregating them into a single LinearPsmDataset. For more details about the PIN file format, see the Percolator documentation.

Specifically, mokapot requires the following columns in the tab-delimited files: specid, scannr, peptide, proteins, and label. These column names are case insensitive. Beyond these required columns of the PIN format, mokapot also looks for optional columns that specify the MS data file names, the theoretical monoisotopic peptide masses, the measured masses, the retention times, and the charge states. These are needed to create output formats for downstream tools, such as FlashLFQ.

In addition to PIN tab-delimited files, the pin_files argument can be a pandas.DataFrame containing the above columns.

Finally, mokapot does not currently support specifying a default direction or feature weights in the PIN file itself. If these are present, they will be ignored.

Parameters:

pin_files (str, tuple of str, or pandas.DataFrame): One or more PIN files to read or a pandas.DataFrame.
group_column (str, optional): A factor by which to group PSMs for grouped confidence estimation.
filename_column (str, optional): The column specifying the MS data file. If None, mokapot will look for a column called “filename” (case insensitive). This is required for some output formats, such as FlashLFQ.
calcmass_column (str, optional): The column specifying the theoretical monoisotopic mass of the peptide including modifications. If None, mokapot will look for a column called “calcmass” (case insensitive). This is required for some output formats, such as FlashLFQ.
expmass_column (str, optional): The column specifying the measured neutral precursor mass. If None, mokapot will look for a column called “expmass” (case insensitive). This is required for some output formats.
rt_column (str, optional): The column specifying the retention time in seconds. If None, mokapot will look for a column called “ret_time” (case insensitive). This is required for some output formats, such as FlashLFQ.
charge_column (str, optional): The column specifying the charge state of each peptide. If None, mokapot will look for a column called “charge” (case insensitive). This is required for some output formats, such as FlashLFQ.
to_df (bool, optional): Return a pandas.DataFrame instead of a LinearPsmDataset.
copy_data (bool, optional): If True, a deep copy of the data is created. This uses more memory, but is safer because it prevents accidental modification of the underlying data. This argument only has an effect when pin_files is a pandas.DataFrame.

Returns:

LinearPsmDataset or pandas.DataFrame: A LinearPsmDataset object containing the PSMs from all of the PIN files, or a pandas.DataFrame if to_df is True.
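
The required-column rule described above can be checked before handing a file to mokapot. The following is a minimal sketch using only the standard library; the helper name missing_pin_columns is hypothetical and is not part of mokapot.

```python
# Hypothetical pre-flight check; not part of the mokapot API.
REQUIRED = {"specid", "scannr", "peptide", "proteins", "label"}


def missing_pin_columns(header_line):
    """Return the required PIN columns absent from a tab-delimited header.

    Column names are compared case-insensitively, as the text above describes.
    """
    present = {name.strip().lower() for name in header_line.rstrip("\n").split("\t")}
    return sorted(REQUIRED - present)


header = "SpecId\tLabel\tScanNr\tExpMass\tCalcMass\tPeptide\tProteins"
print(missing_pin_columns(header))  # prints []
```

An empty list means every required column is present; any names returned would need to be added to the file first.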

mokapot.read_pepxml(pepxml_files, decoy_prefix='decoy_', exclude_features=None, open_modification_bin_size=None, to_df=False)

Read PepXML files.

Read peptide-spectrum matches (PSMs) from one or more PepXML files, aggregating them into a single LinearPsmDataset.

Specifically, mokapot will extract the search engine scores as a set of features (found under the search_scores tag). Additionally, mokapot will add the peptide lengths, mass error, the number of enzymatic termini and the number of missed cleavages as features.

Parameters:

pepxml_files (str or tuple of str): One or more PepXML files to read.
decoy_prefix (str, optional): The prefix used to indicate a decoy protein in the description lines of the FASTA file.
exclude_features (str or tuple of str, optional): One or more features to exclude from the dataset. This is useful in the case that a search engine score may be biased against decoy PSMs/CSMs.
open_modification_bin_size (float, optional): If specified, modification masses are binned according to the value. The binned mass difference is appended to the end of the peptide and will be used when grouping peptides for peptide-level confidence estimation. Use this option for open modification search results. We recommend 0.01 as a good starting point.
to_df (bool, optional): Return a pandas.DataFrame instead of a LinearPsmDataset.

Returns:

LinearPsmDataset or pandas.DataFrame: A LinearPsmDataset or pandas.DataFrame containing the parsed PSMs.
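
To illustrate what binning with open_modification_bin_size means, here is a rough sketch. Both helpers are hypothetical, and the bracketed label format is an assumption made for illustration; mokapot's actual rounding rule and label format may differ.

```python
# Illustrative only: the label format "[+15.99]" is an assumption,
# not the format that mokapot writes.
def bin_mass_shift(mass_diff, bin_size):
    """Snap a mass difference to the nearest multiple of bin_size."""
    return round(mass_diff / bin_size) * bin_size


def label_peptide(peptide, mass_diff, bin_size=0.01):
    """Append the binned mass difference to the peptide sequence.

    Two decimal places are shown, which suits the suggested bin size of 0.01.
    """
    binned = bin_mass_shift(mass_diff, bin_size)
    return f"{peptide}[{binned:+.2f}]"


# Two slightly different measured shifts fall into the same bin, so the
# two PSMs would be grouped as the same peptide.
print(label_peptide("PEPTIDEK", 15.9949))
print(label_peptide("PEPTIDEK", 15.9912))
```

The point of the binning is that small measurement differences in an open search do not split one modified peptide into many groups.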

mokapot.read_fasta(fasta_files, enzyme='[KR]', missed_cleavages=2, clip_nterm_methionine=False, min_length=6, max_length=50, semi=False, decoy_prefix='decoy_')

Parse a FASTA file, storing a mapping of peptides and proteins.

Protein sequence information from the FASTA file is required to compute protein-level confidence estimates using the picked-protein approach. Decoy proteins must be included and must have a description in the format of <prefix><protein ID> for valid confidence estimates to be calculated.

If you need to generate an appropriate FASTA file with decoy sequences for your database search, see mokapot.make_decoys().

Importantly, the parameters below should match the conditions under which the PSMs were assigned as closely as possible. Enzyme specificity is provided using a regular expression. A table of common enzymes can be found in the mokapot cookbook.

Parameters:

fasta_files (str or tuple of str): The FASTA file(s) used for assigning the PSMs.
decoy_prefix (str, optional): The prefix used to indicate a decoy protein in the description lines of the FASTA file.
enzyme (str or compiled regex, optional): A regular expression defining the enzyme specificity that was used when assigning PSMs. The cleavage site is interpreted as the end of the match. The default is trypsin, without proline suppression: “[KR]”.
missed_cleavages (int, optional): The allowed number of missed cleavages.
clip_nterm_methionine (bool, optional): Remove methionine residues that occur at the protein N-terminus.
min_length (int, optional): The minimum peptide length to consider.
max_length (int, optional): The maximum peptide length to consider.
semi (bool, optional): Was a semi-enzymatic digest used to assign PSMs? If True, the protein database will likely contain many shared peptides and yield unhelpful protein-level confidence estimates.

Returns:

Proteins object: The parsed proteins as a Proteins object.
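
The <prefix><protein ID> requirement can be sanity-checked on a list of protein identifiers before parsing. This is a small sketch with a hypothetical helper, unpaired_targets, that is not part of mokapot.

```python
# Hypothetical check of the <prefix><protein ID> decoy naming convention.
def unpaired_targets(protein_ids, decoy_prefix="decoy_"):
    """Return target protein IDs that have no matching decoy entry."""
    ids = set(protein_ids)
    targets = {p for p in ids if not p.startswith(decoy_prefix)}
    return sorted(t for t in targets if decoy_prefix + t not in ids)


print(unpaired_targets(["P1", "decoy_P1", "P2"]))  # prints ['P2']
```

A non-empty result suggests the FASTA file lacks decoys for some targets, or uses a different prefix than the one passed as decoy_prefix.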

mokapot.brew(psms, model=None, test_fdr=0.01, folds=3, max_workers=1, rng=None)

Re-score one or more collections of PSMs.

The provided PSMs are analyzed using the semi-supervised learning algorithm that was introduced by Percolator. Cross-validation is used to ensure that the learned models do not overfit to the PSMs used for model training. If multiple collections of PSMs are provided, they are aggregated for model training, but the confidence estimates are calculated separately for each collection.

A list of previously trained models can be provided to the model argument to rescore the PSMs in each fold. Note that the number of models must match folds. Furthermore, it is valid to use the learned models on the same dataset on which they were trained, but they must be provided in the same order, such that the relationship of the cross-validation folds is maintained.

Parameters:

psms (PsmDataset object or list of PsmDataset objects): One or more collections of PSMs. PSMs are aggregated across all of the collections for model training, but the confidence estimates are calculated and returned separately.
model (Model object or list of Model objects, optional): The mokapot.Model object to be fit. The default is None, which attempts to mimic the support vector machine models used by Percolator. If a list of mokapot.Model objects is provided, they are assumed to be previously trained models, and one will be used to rescore each fold.
test_fdr (float, optional): The false-discovery rate threshold at which to evaluate the learned models.
folds (int, optional): The number of cross-validation folds to use. PSMs originating from the same mass spectrum are always in the same fold.
max_workers (int, optional): The number of processes to use for model training. More workers will require more memory, but will typically decrease the total run time. An integer exceeding the number of folds will have no additional effect. Note that logging messages will be garbled if more than one worker is enabled.
rng (int or np.random.Generator, optional): A seed or generator used to generate splits, or None to use the default random number generator state.

Returns:

Confidence object or list of Confidence objects: An object or a list of objects containing the confidence estimates at various levels (i.e. PSMs, peptides) when assessed using the learned score. If a list, they will be in the same order as provided in the psms parameter.
list of Model objects: The learned Model objects, one for each fold.
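
The constraint that PSMs from the same mass spectrum always share a fold can be pictured with a short sketch. This is not mokapot's implementation; assign_folds is a hypothetical helper that uses the standard library's random module in place of a NumPy generator.

```python
import random


def assign_folds(spectrum_ids, folds=3, seed=None):
    """Return one fold index per PSM, keeping each spectrum in a single fold.

    spectrum_ids holds one (hashable) spectrum identifier per PSM.
    """
    rng = random.Random(seed)
    spectra = sorted(set(spectrum_ids))  # sort first so a seed is reproducible
    rng.shuffle(spectra)
    # Deal the shuffled spectra out to the folds in round-robin order.
    fold_of = {spec: i % folds for i, spec in enumerate(spectra)}
    return [fold_of[spec] for spec in spectrum_ids]


psm_spectra = [1, 1, 2, 3, 3, 4, 5, 6]
print(assign_folds(psm_spectra, folds=3, seed=42))
```

Splitting by spectrum rather than by PSM keeps competing matches to one spectrum out of both the training and the test side of a fold.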

mokapot.to_txt(conf, dest_dir=None, file_root=None, sep='\t', decoys=False)

Save confidence estimates to delimited text files.

Write the confidence estimates for each of the available levels (i.e. PSMs, peptides, proteins) to separate flat text files using the specified delimiter. If more than one collection of confidence estimates is provided, they will be combined, yielding a single file for each level specified by either dataset.

Parameters:

conf (Confidence object or tuple of Confidence objects): One or more LinearConfidence objects.
dest_dir (str or None, optional): The directory in which to save the files. None will use the current working directory.
file_root (str or None, optional): An optional prefix for the confidence estimate files. The suffix will always be “mokapot.{level}.txt” where “{level}” indicates the level at which confidence estimation was performed (i.e. PSMs, peptides, proteins).
sep (str, optional): The delimiter to use.
decoys (bool, optional): Save decoy confidence estimates as well?

Returns:

list of str: The paths to the saved files.
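
The naming scheme described for dest_dir and file_root can be sketched as follows. The helper confidence_paths is hypothetical, and two details are assumptions: that file_root is joined to the suffix with a period, and that the level names are lowercase.

```python
import os


# Hypothetical helper; the "." separator and the lowercase level names
# are assumptions, not confirmed mokapot behavior.
def confidence_paths(levels, dest_dir=None, file_root=None):
    """Build one output path per confidence level."""
    paths = []
    for level in levels:
        name = f"mokapot.{level}.txt"
        if file_root is not None:
            name = f"{file_root}.{name}"
        if dest_dir is not None:
            name = os.path.join(dest_dir, name)
        paths.append(name)
    return paths


print(confidence_paths(["psms", "peptides"]))
print(confidence_paths(["psms"], dest_dir="results", file_root="run1"))
```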

mokapot.to_flashlfq(conf, out_file='mokapot.flashlfq.txt')

Save confident peptides for quantification with FlashLFQ.

FlashLFQ is an open-source tool for label-free quantification. For mokapot to save results in a compatible format, a few extra columns are required to be present, which specify the MS data file name, the theoretical peptide monoisotopic mass, the retention time, and the charge for each PSM. If these are not present, saving to the FlashLFQ format is disabled.

Note that protein grouping in the FlashLFQ results will be more accurate if proteins were added for analysis with mokapot.

Parameters:

conf (Confidence object or tuple of Confidence objects): One or more LinearConfidence objects.
out_file (str, optional): The output file to write.

Returns:

str: The path to the saved file.
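
The primary functions documented above chain together naturally. The sketch below uses only the signatures listed in this section; the wrapper export_for_flashlfq is hypothetical, and it assumes the PIN file already contains the extra filename, mass, retention time, and charge columns that FlashLFQ output requires.

```python
def export_for_flashlfq(pin_file, out_file="mokapot.flashlfq.txt"):
    """Rescore the PSMs in a PIN file and save confident peptides for FlashLFQ.

    Hypothetical convenience wrapper, not part of mokapot.
    """
    import mokapot  # imported here so the sketch loads even without mokapot

    psms = mokapot.read_pin(pin_file)
    # brew() returns the confidence estimates and the per-fold models.
    conf, models = mokapot.brew(psms)
    return mokapot.to_flashlfq(conf, out_file=out_file)
```

Calling export_for_flashlfq("my_search.pin") with a real PIN file would return the path of the written file; the file name here is only a placeholder.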

Utility Functions

mokapot.save_model(model, out_file)

Save a mokapot.model.Model object to a file.

Parameters:

out_file (str): The name of the file for the saved model.

Returns:

str: The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

mokapot.load_model(model_file)

Load a saved model for mokapot.

The saved model can either be a saved Model object or the output model weights from Percolator. In Percolator, these can be obtained using the --weights argument.

Parameters:

model_file (str): The name of the file from which to load the model.

Returns:

mokapot.model.Model: The loaded mokapot.model.Model object.

Warning

Unpickling data in Python is unsafe. Make sure that the model is from a source that you trust.

mokapot.read_percolator(perc_file)

Read a Percolator tab-delimited file.

Percolator input format (PIN) files and the Percolator result files are tab-delimited, but also have a tab-delimited protein list as the final column. This function parses the file and returns a DataFrame.

Parameters:

perc_file (str): The file to parse.

Returns:

pandas.DataFrame: A DataFrame of the parsed data.
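
The parsing difficulty mentioned above is that the final column holds a variable number of tab-separated proteins, so rows have more fields than the header. A minimal sketch of one way to handle it follows; parse_percolator_lines is a hypothetical helper that returns plain dictionaries rather than a DataFrame.

```python
def parse_percolator_lines(lines):
    """Parse rows whose final column is itself a tab-delimited protein list."""
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    n_fixed = len(header) - 1  # every column except the trailing protein list
    records = []
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        record = dict(zip(header[:-1], fields[:n_fixed]))
        # Everything past the fixed columns belongs to the protein list.
        record[header[-1]] = fields[n_fixed:]
        records.append(record)
    return records


example = [
    "SpecId\tLabel\tPeptide\tProteins",
    "s1\t1\tK.PEPK.A\tP1\tP2",
    "s2\t-1\tK.AAAK.A\tdecoy_P3",
]
for rec in parse_percolator_lines(example):
    print(rec["SpecId"], rec["Proteins"])
```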

mokapot.plot_qvalues(qvalues, threshold=0.1, ax=None, **kwargs)

Plot the cumulative number of discoveries over a range of q-values.

Parameters:

qvalues (numpy.ndarray): The q-values to plot.
threshold (float, optional): Indicates the maximum q-value to plot.
ax (matplotlib.pyplot.Axes, optional): The matplotlib Axes on which to plot. If None the current Axes instance is used.
**kwargs (dict, optional): Arguments passed to matplotlib.axes.Axes.plot().

Returns:

matplotlib.pyplot.Axes: A matplotlib.axes.Axes with the cumulative number of accepted target PSMs or peptides.
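
The curve being plotted is the number of discoveries accepted at each q-value up to the threshold. A dependency-free sketch of that calculation is shown below; cumulative_discoveries is a hypothetical helper and does not reproduce how mokapot draws the plot.

```python
def cumulative_discoveries(qvalues, threshold=0.1):
    """Return (q-value, count) pairs, where count is the number of
    entries with a q-value less than or equal to that value."""
    kept = sorted(q for q in qvalues if q <= threshold)
    curve = []
    for count, q in enumerate(kept, start=1):
        if curve and curve[-1][0] == q:
            curve[-1] = (q, count)  # tied q-values share one point
        else:
            curve.append((q, count))
    return curve


print(cumulative_discoveries([0.01, 0.01, 0.05, 0.2, 0.08]))
# prints [(0.01, 2), (0.05, 3), (0.08, 4)]
```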

mokapot.make_decoys(fasta, out_file, decoy_prefix='decoy_', enzyme='[KR]', reverse=False, concatenate=True)

Create a FASTA file with decoy sequences.

Decoy sequences are generated by shuffling or reversing each enzymatic peptide in a sequence, preserving the first and last amino acids.

Parameters:

fasta (str or list of str): One or more FASTA files containing target sequences.
out_file (str): The name of the output FASTA file.
enzyme (str or compiled regex, optional): A regular expression defining the enzyme specificity that was used when assigning PSMs. The cleavage site is interpreted as the end of the match. The default is trypsin, without proline suppression: “[KR]”.
decoy_prefix (str, optional): The prefix used to indicate a decoy protein.
reverse (bool, optional): Use reversed instead of shuffled sequences? Note that the difference here is arbitrary, because reversing can be thought of as a specific instance of shuffling.
concatenate (bool, optional): Concatenate decoy sequences to the provided target sequences? True creates a FASTA file with target and decoy sequences; False creates a FASTA file with only decoy sequences.

Returns:

str: The output FASTA file.
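
The decoy construction described above (rearranging each enzymatic peptide while keeping its first and last residues) can be sketched in a few lines. This is an illustration under that description, not mokapot's code; decoy_sequence and its seed argument are hypothetical.

```python
import random
import re


def decoy_sequence(sequence, enzyme="[KR]", reverse=False, seed=1):
    """Shuffle or reverse the interior of each enzymatic peptide."""
    rng = random.Random(seed)
    # Cleavage sites sit at the end of each regex match.
    sites = [0] + [m.end() for m in re.finditer(enzyme, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    pieces = []
    for start, end in zip(sites, sites[1:]):
        peptide = sequence[start:end]
        if len(peptide) > 3:  # shorter peptides have nothing to rearrange
            interior = list(peptide[1:-1])
            if reverse:
                interior.reverse()
            else:
                rng.shuffle(interior)
            peptide = peptide[0] + "".join(interior) + peptide[-1]
        pieces.append(peptide)
    return "".join(pieces)


print(decoy_sequence("MAGICKPEPTIDERSEQ", reverse=True))
# prints MCIGAKPEDITPERSEQ
```

Keeping the terminal residues fixed means each decoy peptide has the same length, composition, and cleavage sites as its target counterpart.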

mokapot.digest(sequence, enzyme_regex='[KR]', missed_cleavages=0, clip_nterm_methionine=False, min_length=6, max_length=50, semi=False)

Digest a protein sequence into its constituent peptides.

Parameters:

sequence (str): A protein sequence to digest.
enzyme_regex (str or compiled regex, optional): A regular expression defining the enzyme specificity. The end of the match should indicate the cleavage site.
missed_cleavages (int, optional): The maximum number of allowed missed cleavages.
clip_nterm_methionine (bool, optional): Remove methionine residues that occur at the protein N-terminus.
min_length (int, optional): The minimum peptide length.
max_length (int, optional): The maximum peptide length.
semi (bool): Allow semi-enzymatic cleavage.

Returns:

peptides (set of str): The peptides resulting from the digested sequence.
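
To make the roles of enzyme_regex and missed_cleavages concrete, here is a simplified digestion sketch. The function simple_digest is hypothetical; it cleaves at the end of each regex match as described above, and it leaves out the clip_nterm_methionine and semi options.

```python
import re


def simple_digest(sequence, enzyme_regex="[KR]", missed_cleavages=0,
                  min_length=6, max_length=50):
    """Return the set of fully enzymatic peptides from a protein sequence."""
    # Cleavage sites sit at the end of each regex match.
    sites = [0] + [m.end() for m in re.finditer(enzyme_regex, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    peptides = set()
    for i, start in enumerate(sites[:-1]):
        # Extend each peptide across up to missed_cleavages internal sites.
        last = min(i + 2 + missed_cleavages, len(sites))
        for j in range(i + 1, last):
            peptide = sequence[start:sites[j]]
            if min_length <= len(peptide) <= max_length:
                peptides.add(peptide)
    return peptides


print(sorted(simple_digest("MAGICKPEPTIDERSEQ", min_length=3)))
# prints ['MAGICK', 'PEPTIDER', 'SEQ']
```

Raising missed_cleavages to 1 would also yield the peptides that span one uncleaved site, here MAGICKPEPTIDER and PEPTIDERSEQ.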