Functions

Primary Functions

mokapot.read_pin(pin_files, group_column=None, filename_column=None, calcmass_column=None, expmass_column=None, rt_column=None, charge_column=None, to_df=False, copy_data=False)

Read Percolator input (PIN) tab-delimited files.

Read PSMs from one or more Percolator input (PIN) tab-delimited files, aggregating them into a single LinearPsmDataset. For more details about the PIN file format, see the Percolator documentation.

mokapot requires the following columns in the tab-delimited files: specid, scannr, peptide, proteins, and label. These column names are case-insensitive. In addition to these special columns defined for the PIN format, mokapot also looks for columns that specify the MS data file names, theoretical monoisotopic peptide masses, measured masses, retention times, and charge states. These are necessary to create specific output formats for downstream tools, such as FlashLFQ.

In addition to PIN tab-delimited files, the pin_files argument can be a pandas.DataFrame containing the above columns.

Finally, mokapot does not currently support specifying a default direction or feature weights in the PIN file itself. If these are present, they will be ignored.

Parameters:
pin_files : str, tuple of str, or pandas.DataFrame

One or more PIN files to read or a pandas.DataFrame.

group_column : str, optional

A factor by which to group PSMs for grouped confidence estimation.

filename_column : str, optional

The column specifying the MS data file. If None, mokapot will look for a column called “filename” (case-insensitive). This is required for some output formats, such as FlashLFQ.

calcmass_column : str, optional

The column specifying the theoretical monoisotopic mass of the peptide including modifications. If None, mokapot will look for a column called “calcmass” (case-insensitive). This is required for some output formats, such as FlashLFQ.

expmass_column : str, optional

The column specifying the measured neutral precursor mass. If None, mokapot will look for a column called “expmass” (case-insensitive). This is required for some output formats.

rt_column : str, optional

The column specifying the retention time in seconds. If None, mokapot will look for a column called “ret_time” (case-insensitive). This is required for some output formats, such as FlashLFQ.

charge_column : str, optional

The column specifying the charge state of each peptide. If None, mokapot will look for a column called “charge” (case-insensitive). This is required for some output formats, such as FlashLFQ.

to_df : bool, optional

Return a pandas.DataFrame instead of a LinearPsmDataset.

copy_data : bool, optional

If True, a deep copy of the data is created. This uses more memory, but is safer because it prevents accidental modification of the underlying data. This argument only has an effect when pin_files is a pandas.DataFrame.

Returns:
LinearPsmDataset or pandas.DataFrame

A LinearPsmDataset object containing the PSMs from all of the PIN files, or a pandas.DataFrame if to_df is True.
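
The required-column rule described above can be checked before a file is handed to read_pin. The following is a minimal sketch using only the standard library; the helper name missing_pin_columns and the example header are hypothetical and are not part of mokapot.

```python
# Required PIN columns, as listed in the description above (case-insensitive).
REQUIRED_COLUMNS = {"specid", "scannr", "peptide", "proteins", "label"}


def missing_pin_columns(header_line):
    """Return the required columns absent from a tab-delimited header line.

    Hypothetical helper for illustration; mokapot performs its own checks.
    """
    present = {name.strip().lower() for name in header_line.split("\t")}
    return sorted(REQUIRED_COLUMNS - present)


# A made-up header that mixes upper and lower case.
header = "SpecId\tLabel\tScanNr\tExpMass\tCalcMass\tPeptide\tProteins"
print(missing_pin_columns(header))  # an empty list means nothing is missing
```

Lower-casing both sides mirrors the case-insensitive matching that the description promises, so "SpecId" and "specid" are treated as the same column.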

mokapot.read_pepxml(pepxml_files, decoy_prefix='decoy_', exclude_features=None, open_modification_bin_size=None, to_df=False)

Read PepXML files.

Read peptide-spectrum matches (PSMs) from one or more PepXML files, aggregating them into a single LinearPsmDataset.

Specifically, mokapot will extract the search engine scores as a set of features (found under the search_scores tag). Additionally, mokapot will add the peptide lengths, mass error, the number of enzymatic termini and the number of missed cleavages as features.

Parameters:
pepxml_files : str or tuple of str

One or more PepXML files to read.

decoy_prefix : str, optional

The prefix used to indicate a decoy protein in the description lines of the FASTA file.

exclude_features : str or tuple of str, optional

One or more features to exclude from the dataset. This is useful in the case that a search engine score may be biased against decoy PSMs/CSMs.

open_modification_bin_size : float, optional

If specified, modification masses are binned according to the value. The binned mass difference is appended to the end of the peptide and will be used when grouping peptides for peptide-level confidence estimation. Use this option for open modification search results. We recommend 0.01 as a good starting point.

to_df : bool, optional

Return a pandas.DataFrame instead of a LinearPsmDataset.

Returns:
LinearPsmDataset or pandas.DataFrame

A LinearPsmDataset or pandas.DataFrame containing the parsed PSMs.
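
To illustrate what binning modification masses means for peptide grouping, here is a rough sketch. It assumes bins are formed by rounding the mass difference to the nearest multiple of the bin size; the function open_mod_group_key and the (peptide, bin index) key are hypothetical, and mokapot's actual binning and label format may differ.

```python
def open_mod_group_key(peptide, mass_diff, bin_size=0.01):
    """Build a grouping key from a peptide and its binned mass difference.

    Assumption: the bin is the nearest multiple of bin_size. An integer bin
    index is used so that keys compare exactly, without float formatting.
    """
    bin_index = round(mass_diff / bin_size)
    return (peptide, bin_index)


# Two slightly different measured mass shifts land in the same bin...
key_a = open_mod_group_key("PEPTIDEK", 15.9949)
key_b = open_mod_group_key("PEPTIDEK", 15.9921)
# ...while a clearly different shift on the same peptide does not.
key_c = open_mod_group_key("PEPTIDEK", 0.9840)
print(key_a == key_b, key_a == key_c)
```

A larger bin size merges more mass shifts into one peptide group, while a smaller one keeps them apart.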

mokapot.read_fasta(fasta_files, enzyme='[KR]', missed_cleavages=2, clip_nterm_methionine=False, min_length=6, max_length=50, semi=False, decoy_prefix='decoy_')

Parse a FASTA file, storing a mapping of peptides and proteins.

Protein sequence information from the FASTA file is required to compute protein-level confidence estimates using the picked-protein approach. Decoy proteins must be included and must have a description in the format <prefix><protein ID> for valid confidence estimates to be calculated.

If you need to generate an appropriate FASTA file with decoy sequences for your database search, see mokapot.make_decoys().

Importantly, the parameters below should match the conditions under which the PSMs were assigned as closely as possible. Enzyme specificity is provided using a regular expression. A table of common enzymes can be found in the mokapot cookbook.

Parameters:
fasta_files : str or tuple of str

The FASTA file(s) used for assigning the PSMs.

decoy_prefix : str, optional

The prefix used to indicate a decoy protein in the description lines of the FASTA file.

enzyme : str or compiled regex, optional

A regular expression defining the enzyme specificity that was used when assigning PSMs. The cleavage site is interpreted as the end of the match. The default is trypsin, without proline suppression: “[KR]”.

missed_cleavages : int, optional

The allowed number of missed cleavages.

clip_nterm_methionine : bool, optional

Remove methionine residues that occur at the protein N-terminus.

min_length : int, optional

The minimum peptide length to consider.

max_length : int, optional

The maximum peptide length to consider.

semi : bool, optional

Was a semi-enzymatic digest used to assign PSMs? If True, the protein database will likely contain many shared peptides and yield unhelpful protein-level confidence estimates.

Returns:
Proteins object

The parsed proteins as a Proteins object.
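
The <prefix><protein ID> naming requirement can be sanity-checked on a list of protein identifiers before parsing. This is a small sketch under that naming assumption; targets_missing_decoys and the example identifiers are hypothetical and not part of mokapot.

```python
def targets_missing_decoys(protein_ids, decoy_prefix="decoy_"):
    """List target proteins that have no '<prefix><protein ID>' counterpart."""
    ids = set(protein_ids)
    decoys = {pid for pid in ids if pid.startswith(decoy_prefix)}
    targets = ids - decoys
    return sorted(t for t in targets if decoy_prefix + t not in decoys)


# Made-up identifiers: the second target lacks a decoy entry.
example_ids = ["sp|P1", "decoy_sp|P1", "sp|P2"]
print(targets_missing_decoys(example_ids))
```

An empty result suggests that every target has a matching decoy under the chosen prefix.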

mokapot.brew(psms, model=None, test_fdr=0.01, folds=3, max_workers=1, rng=None)

Re-score one or more collections of PSMs.

The provided PSMs are analyzed using the semi-supervised learning algorithm that was introduced by Percolator. Cross-validation is used to ensure that the learned models do not overfit to the PSMs used for model training. If multiple collections of PSMs are provided, they are aggregated for model training, but the confidence estimates are calculated separately for each collection.

A list of previously trained models can be provided to the model argument to rescore the PSMs in each fold. Note that the number of models must match folds. Furthermore, it is valid to use the learned models on the same dataset from which they were trained, but they must be provided in the same order, such that the relationship of the cross-validation folds is maintained.

Parameters:
psms : PsmDataset object or list of PsmDataset objects

One or more collections of PSMs. PSMs are aggregated across all of the collections for model training, but the confidence estimates are calculated and returned separately.

model : Model object or list of Model objects, optional

The mokapot.Model object to be fit. The default is None, which attempts to mimic the same support vector machine models used by Percolator. If a list of mokapot.Model objects is provided, they are assumed to be previously trained models, and one will be used to rescore each fold.

test_fdr : float, optional

The false-discovery rate threshold at which to evaluate the learned models.

folds : int, optional

The number of cross-validation folds to use. PSMs originating from the same mass spectrum are always in the same fold.

max_workers : int, optional

The number of processes to use for model training. More workers will require more memory, but will typically decrease the total run time. An integer exceeding the number of folds will have no additional effect. Note that logging messages will be garbled if more than one worker is enabled.

rng : int or np.random.Generator, optional

A seed or generator used to generate splits, or None to use the default random number generator state.

Returns:
Confidence object or list of Confidence objects

An object or a list of objects containing the confidence estimates at various levels (i.e. PSMs, peptides) when assessed using the learned score. If a list, they will be in the same order as provided in the psms parameter.

list of Model objects

The learned Model objects, one for each fold.
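
The constraint that PSMs from the same mass spectrum always share a fold can be pictured with a short standard-library sketch. It is only an illustration of the idea, assuming spectra are shuffled and dealt out round-robin; assign_folds is a hypothetical helper and does not reproduce mokapot's splitting code.

```python
import random


def assign_folds(spectrum_ids, folds=3, seed=None):
    """Assign each PSM a fold so that PSMs sharing a spectrum share a fold."""
    rng = random.Random(seed)
    unique = sorted(set(spectrum_ids))  # sort first so the seed fully decides the order
    rng.shuffle(unique)
    # Deal the shuffled spectra out to folds one at a time.
    fold_of = {spec: i % folds for i, spec in enumerate(unique)}
    return [fold_of[spec] for spec in spectrum_ids]


# Made-up spectrum identifiers; spectra 1 and 3 each have two PSMs.
psm_spectra = [1, 1, 2, 3, 3, 4, 5, 6]
psm_folds = assign_folds(psm_spectra, folds=3, seed=1)
print(psm_folds)
```

Splitting at the spectrum level, not the PSM level, keeps competing matches to one spectrum from appearing in both the training and the test portion of a fold.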

mokapot.to_txt(conf, dest_dir=None, file_root=None, sep='\t', decoys=False)

Save confidence estimates to delimited text files.

Write the confidence estimates for each of the available levels (i.e. PSMs, peptides, proteins) to separate flat text files using the specified delimiter. If more than one collection of confidence estimates is provided, they will be combined, yielding a single file for each level specified by either dataset.

Parameters:
conf : Confidence object or tuple of Confidence objects

One or more LinearConfidence objects.

dest_dir : str or None, optional

The directory in which to save the files. None will use the current working directory.

file_root : str or None, optional

An optional prefix for the confidence estimate files. The suffix will always be “mokapot.{level}.txt” where “{level}” indicates the level at which confidence estimation was performed (i.e. PSMs, peptides, proteins).

sep : str, optional

The delimiter to use.

decoys : bool, optional

Save decoy confidence estimates as well?

Returns:
list of str

The paths to the saved files.
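
The three primary steps documented so far (read, rescore, save) might be chained as in the sketch below. It relies only on the signatures and return values listed on this page; the wrapper rescore_pin, its default file_root, and the seed value are assumptions chosen for illustration, and the PIN file path is supplied by the caller.

```python
def rescore_pin(pin_file, file_root="example"):
    """Read a PIN file, rescore its PSMs, and write the results as text.

    Hypothetical wrapper; returns the list of paths reported by to_txt.
    """
    # Imported lazily so this definition loads even where mokapot is absent.
    import mokapot

    psms = mokapot.read_pin(pin_file)
    # brew returns the confidence estimates and the per-fold models.
    results, models = mokapot.brew(psms, test_fdr=0.01, folds=3, rng=42)
    return mokapot.to_txt(results, file_root=file_root)
```

Passing an integer to rng is intended to make the cross-validation splits repeatable between runs.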

mokapot.to_flashlfq(conf, out_file='mokapot.flashlfq.txt')

Save confident peptides for quantification with FlashLFQ.

FlashLFQ is an open-source tool for label-free quantification. For mokapot to save results in a compatible format, a few extra columns are required to be present, which specify the MS data file name, the theoretical peptide monoisotopic mass, the retention time, and the charge for each PSM. If these are not present, saving to the FlashLFQ format is disabled.

Note that protein grouping in the FlashLFQ results will be more accurate if proteins were added for analysis with mokapot.

Parameters:
conf : Confidence object or tuple of Confidence objects

One or more LinearConfidence objects.

out_file : str, optional

The output file to write.

Returns:
str

The path to the saved file.

Utility Functions

mokapot.save_model(model, out_file)

Save a mokapot.model.Model object to a file.

Parameters:
model : mokapot.model.Model

The model to save.

out_file : str

The name of the file for the saved model.

Returns:
str

The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

mokapot.load_model(model_file)

Load a saved model for mokapot.

The saved model can either be a saved Model object or the output model weights from Percolator. In Percolator, these can be obtained using the --weights argument.

Parameters:
model_file : str

The name of the file from which to load the model.

Returns:
mokapot.model.Model

The loaded mokapot.model.Model object.

Warning

Unpickling data in Python is unsafe. Make sure that the model is from a source that you trust.

mokapot.read_percolator(perc_file)

Read a Percolator tab-delimited file.

Percolator input format (PIN) files and the Percolator result files are tab-delimited, but also have a tab-delimited protein list as the final column. This function parses the file and returns a DataFrame.

Parameters:
perc_file : str

The file to parse.

Returns:
pandas.DataFrame

A DataFrame of the parsed data.

mokapot.plot_qvalues(qvalues, threshold=0.1, ax=None, **kwargs)

Plot the cumulative number of discoveries over a range of q-values.

Parameters:
qvalues : numpy.ndarray

The q-values to plot.

threshold : float, optional

Indicates the maximum q-value to plot.

ax : matplotlib.pyplot.Axes, optional

The matplotlib Axes on which to plot. If None, the current Axes instance is used.

**kwargs : dict, optional

Arguments passed to matplotlib.axes.Axes.plot().

Returns:
matplotlib.pyplot.Axes

A matplotlib.axes.Axes with the cumulative number of accepted target PSMs or peptides.
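
The quantity on the vertical axis, the cumulative number of discoveries at each q-value, can be computed without any plotting library. The sketch below shows one way to do it, assuming a discovery is any item whose q-value is at or below the level in question; cumulative_discoveries and the example values are hypothetical.

```python
from bisect import bisect_right


def cumulative_discoveries(qvalues, threshold=0.1):
    """Return (q-value, number of items with q <= q-value) pairs up to threshold."""
    kept = sorted(q for q in qvalues if q <= threshold)
    levels = sorted(set(kept))
    # bisect_right counts every item at or below each level, including ties.
    return [(q, bisect_right(kept, q)) for q in levels]


# Made-up q-values; 0.2 exceeds the threshold and is dropped.
example_qvalues = [0.01, 0.01, 0.05, 0.2, 0.08]
curve = cumulative_discoveries(example_qvalues, threshold=0.1)
print(curve)
```

Plotting these pairs as a step curve gives the kind of figure the function description refers to.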

mokapot.make_decoys(fasta, out_file, decoy_prefix='decoy_', enzyme='[KR]', reverse=False, concatenate=True)

Create a FASTA file with decoy sequences.

Decoy sequences are generated by shuffling or reversing each enzymatic peptide in a sequence, preserving the first and last amino acids.

Parameters:
fasta : str or list of str

One or more FASTA files containing target sequences.

out_file : str

The name of the output FASTA file.

enzyme : str or compiled regex, optional

A regular expression defining the enzyme specificity that was used when assigning PSMs. The cleavage site is interpreted as the end of the match. The default is trypsin, without proline suppression: “[KR]”.

decoy_prefix : str, optional

The prefix used to indicate a decoy protein.

reverse : bool, optional

Use reversed instead of shuffled sequences? Note that the difference here is arbitrary, because reversing can be thought of as a specific instance of shuffling.

concatenate : bool, optional

Concatenate decoy sequences to the provided target sequences? True creates a FASTA file with target and decoy sequences; False creates a FASTA file with only decoy sequences.

Returns:
str

The output FASTA file.
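
Reversing each enzymatic peptide while preserving its first and last amino acids can be sketched in a few lines. This is an illustration of the idea described above, not mokapot's implementation; reverse_decoy and the toy sequence are hypothetical, and cleavage is taken to occur at the end of each regex match.

```python
import re


def reverse_decoy(sequence, enzyme="[KR]"):
    """Reverse the interior of each enzymatic peptide, keeping its end residues."""
    # Cleavage sites are the positions just after each enzyme match.
    sites = [0] + [m.end() for m in re.finditer(enzyme, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    decoy = []
    for start, end in zip(sites, sites[1:]):
        peptide = sequence[start:end]
        if len(peptide) > 2:
            # Keep the first and last residues; reverse everything in between.
            peptide = peptide[0] + peptide[-2:0:-1] + peptide[-1]
        decoy.append(peptide)
    return "".join(decoy)


# Toy sequence with two tryptic peptides: MABCK and DEFGR.
toy_target = "MABCKDEFGR"
toy_decoy = reverse_decoy(toy_target)
print(toy_decoy)
```

Because the terminal residues stay in place, the decoy keeps the same cleavage sites, peptide lengths, and amino acid composition as the target.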

mokapot.digest(sequence, enzyme_regex='[KR]', missed_cleavages=0, clip_nterm_methionine=False, min_length=6, max_length=50, semi=False)

Digest a protein sequence into its constituent peptides.

Parameters:
sequence : str

A protein sequence to digest.

enzyme_regex : str or compiled regex, optional

A regular expression defining the enzyme specificity. The end of the match should indicate the cleavage site.

missed_cleavages : int, optional

The maximum number of allowed missed cleavages.

clip_nterm_methionine : bool, optional

Remove methionine residues that occur at the protein N-terminus.

min_length : int, optional

The minimum peptide length.

max_length : int, optional

The maximum peptide length.

semi : bool, optional

Allow semi-enzymatic cleavage.

Returns:
peptides : set of str

The peptides resulting from the digested sequence.
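
How the enzyme regex, missed cleavages, and length limits interact can be seen in a simplified digestion sketch. It assumes cleavage at the end of each match and omits the clip_nterm_methionine and semi options; simple_digest and the toy sequence are hypothetical and are not the library's code.

```python
import re


def simple_digest(sequence, enzyme_regex="[KR]", missed_cleavages=0,
                  min_length=6, max_length=50):
    """Return the set of fully enzymatic peptides within the length limits."""
    # Cleavage sites are the positions just after each enzyme match.
    sites = [0] + [m.end() for m in re.finditer(enzyme_regex, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    peptides = set()
    for i, start in enumerate(sites[:-1]):
        # Spanning k extra sites corresponds to k missed cleavages.
        for end in sites[i + 1 : i + 2 + missed_cleavages]:
            if min_length <= end - start <= max_length:
                peptides.add(sequence[start:end])
    return peptides


# Toy protein made of three cleavage products.
toy_protein = "AAAAAAKGGGGGGRLLLLLL"
print(sorted(simple_digest(toy_protein)))
print(sorted(simple_digest(toy_protein, missed_cleavages=1)))
```

Raising missed_cleavages adds longer peptides that span one or more uncut sites, which is why the value should match the setting used in the database search.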