Functions
Primary Functions
- mokapot.read_pin(pin_files, group_column=None, filename_column=None, calcmass_column=None, expmass_column=None, rt_column=None, charge_column=None, to_df=False, copy_data=False)
Read Percolator input (PIN) tab-delimited files.
Read PSMs from one or more Percolator input (PIN) tab-delimited files, aggregating them into a single LinearPsmDataset. For more details about the PIN file format, see the Percolator documentation.
Specifically, mokapot requires the following columns in the tab-delimited files: specid, scannr, peptide, proteins, and label. These column names are case insensitive. In addition to these special columns defined for the PIN format, mokapot also looks for additional columns that specify the MS data file names, theoretical monoisotopic peptide masses, measured masses, retention times, and charge states, which are necessary to create specific output formats for downstream tools, such as FlashLFQ.
In addition to PIN tab-delimited files, the pin_files argument can be a pandas.DataFrame containing the above columns.
Finally, mokapot does not currently support specifying a default direction or feature weights in the PIN file itself. If these are present, they will be ignored.
- Parameters:
- pin_files: str, tuple of str, or pandas.DataFrame
One or more PIN files to read or a pandas.DataFrame.
- group_column: str, optional
A factor by which to group PSMs for grouped confidence estimation.
- filename_column: str, optional
The column specifying the MS data file. If None, mokapot will look for a column called “filename” (case insensitive). This is required for some output formats, such as FlashLFQ.
- calcmass_column: str, optional
The column specifying the theoretical monoisotopic mass of the peptide including modifications. If None, mokapot will look for a column called “calcmass” (case insensitive). This is required for some output formats, such as FlashLFQ.
- expmass_column: str, optional
The column specifying the measured neutral precursor mass. If None, mokapot will look for a column called “expmass” (case insensitive). This is required for some output formats.
- rt_column: str, optional
The column specifying the retention time in seconds. If None, mokapot will look for a column called “ret_time” (case insensitive). This is required for some output formats, such as FlashLFQ.
- charge_column: str, optional
The column specifying the charge state of each peptide. If None, mokapot will look for a column called “charge” (case insensitive). This is required for some output formats, such as FlashLFQ.
- to_df: bool, optional
Return a pandas.DataFrame instead of a LinearPsmDataset.
- copy_data: bool, optional
If True, a deep copy of the data is created. This uses more memory, but is safer because it prevents accidental modification of the underlying data. This argument only has an effect when pin_files is a pandas.DataFrame.
- Returns:
- LinearPsmDataset or pandas.DataFrame
A LinearPsmDataset object containing the PSMs from all of the PIN files, or a pandas.DataFrame if to_df is True.
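As a rough illustration of the case-insensitive column requirement described above, the following sketch checks the header of a tiny, made-up PIN file using only the standard library. The helper name missing_pin_columns, the "score" feature column, and the example rows are assumptions for illustration; this is not how mokapot itself validates its input.

```python
import csv
import io

# The five columns that the documentation lists as required (case insensitive).
REQUIRED = {"specid", "scannr", "peptide", "proteins", "label"}


def missing_pin_columns(header):
    """Return the required PIN columns absent from `header`, ignoring case."""
    present = {name.lower() for name in header}
    return sorted(REQUIRED - present)


# A made-up two-PSM PIN file; "score" stands in for a search-engine feature.
pin_text = (
    "SpecId\tLabel\tScanNr\tscore\tPeptide\tProteins\n"
    "target_1\t1\t101\t3.2\tK.PEPTIDEK.A\tsp|P1|PROT1\n"
    "decoy_1\t-1\t102\t1.1\tK.EDITPEPK.A\tdecoy_sp|P1|PROT1\n"
)

header = next(csv.reader(io.StringIO(pin_text), delimiter="\t"))
print(missing_pin_columns(header))                 # []
print(missing_pin_columns(["SpecId", "Peptide"]))  # ['label', 'proteins', 'scannr']
```

A check like this can be run before calling mokapot.read_pin() to get a clearer message about which column is missing.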
- mokapot.read_pepxml(pepxml_files, decoy_prefix='decoy_', exclude_features=None, open_modification_bin_size=None, to_df=False)
Read PepXML files.
Read peptide-spectrum matches (PSMs) from one or more PepXML files, aggregating them into a single LinearPsmDataset.
Specifically, mokapot will extract the search engine scores as a set of features (found under the search_scores tag). Additionally, mokapot will add the peptide lengths, mass error, the number of enzymatic termini, and the number of missed cleavages as features.
- Parameters:
- pepxml_files: str or tuple of str
One or more PepXML files to read.
- decoy_prefix: str, optional
The prefix used to indicate a decoy protein in the description lines of the FASTA file.
- exclude_features: str or tuple of str, optional
One or more features to exclude from the dataset. This is useful in the case that a search engine score may be biased against decoy PSMs/CSMs.
- open_modification_bin_size: float, optional
If specified, modification masses are binned according to the value. The binned mass difference is appended to the end of the peptide and will be used when grouping peptides for peptide-level confidence estimation. Use this option for open modification search results. We recommend 0.01 as a good starting point.
- to_df: bool, optional
Return a pandas.DataFrame instead of a LinearPsmDataset.
- Returns:
- LinearPsmDataset or pandas.DataFrame
A LinearPsmDataset or pandas.DataFrame containing the parsed PSMs.
- mokapot.read_fasta(fasta_files, enzyme='[KR]', missed_cleavages=2, clip_nterm_methionine=False, min_length=6, max_length=50, semi=False, decoy_prefix='decoy_')
Parse a FASTA file, storing a mapping of peptides and proteins.
Protein sequence information from the FASTA file is required to compute protein-level confidence estimates using the picked-protein approach. Decoy proteins must be included and must have a description in the format <prefix><protein ID> for valid confidence estimates to be calculated.
If you need to generate an appropriate FASTA file with decoy sequences for your database search, see mokapot.make_decoys().
Importantly, the parameters below should match the conditions in which the PSMs were assigned as closely as possible. Enzyme specificity is provided using a regular expression. A table of common enzymes can be found in the mokapot cookbook.
- Parameters:
- fasta_files: str or tuple of str
The FASTA file(s) used for assigning the PSMs.
- decoy_prefix: str, optional
The prefix used to indicate a decoy protein in the description lines of the FASTA file.
- enzyme: str or compiled regex, optional
A regular expression defining the enzyme specificity that was used when assigning PSMs. The cleavage site is interpreted as the end of the match. The default is trypsin, without proline suppression: “[KR]”.
- missed_cleavages: int, optional
The allowed number of missed cleavages.
- clip_nterm_methionine: bool, optional
Remove methionine residues that occur at the protein N-terminus.
- min_length: int, optional
The minimum peptide length to consider.
- max_length: int, optional
The maximum peptide length to consider.
- semi: bool, optional
Was a semi-enzymatic digest used to assign PSMs? If True, the protein database will likely contain many shared peptides and yield unhelpful protein-level confidence estimates.
- Returns:
- Proteins object
The parsed proteins as a Proteins object.
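The <prefix><protein ID> naming requirement for decoys can be checked before parsing. The sketch below scans FASTA header lines and reports target proteins that have no matching decoy entry. The helper name unpaired_targets and the example sequences are hypothetical, and taking the first whitespace-delimited token of the header as the protein ID is an assumption, not a statement about mokapot's own parsing.

```python
def unpaired_targets(fasta_text, decoy_prefix="decoy_"):
    """List target protein IDs that lack a `<prefix><protein ID>` decoy entry."""
    ids = [
        line[1:].split()[0]  # assumed: the ID is the first token after ">"
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]
    decoys = {i[len(decoy_prefix):] for i in ids if i.startswith(decoy_prefix)}
    targets = [i for i in ids if not i.startswith(decoy_prefix)]
    return [t for t in targets if t not in decoys]


# A made-up FASTA: the second target has no decoy counterpart.
fasta_text = """>sp|P1|A first protein
MKAAR
>decoy_sp|P1|A
MAAKR
>sp|P2|B
MKLLR
"""

print(unpaired_targets(fasta_text))  # ['sp|P2|B']
```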
- mokapot.brew(psms, model=None, test_fdr=0.01, folds=3, max_workers=1, rng=None)
Re-score one or more collections of PSMs.
The provided PSMs are analyzed using the semi-supervised learning algorithm that was introduced by Percolator. Cross-validation is used to ensure that the learned models do not overfit to the PSMs used for model training. If multiple collections of PSMs are provided, they are aggregated for model training, but the confidence estimates are calculated separately for each collection.
A list of previously trained models can be provided to the model argument to rescore the PSMs in each fold. Note that the number of models must match folds. Furthermore, it is valid to use the learned models on the same dataset from which they were trained, but they must be provided in the same order, such that the relationship of the cross-validation folds is maintained.
- Parameters:
- psms: PsmDataset object or list of PsmDataset objects
One or more collections of PSMs. PSMs are aggregated across all of the collections for model training, but the confidence estimates are calculated and returned separately.
- model: Model object or list of Model objects, optional
The mokapot.Model object to be fit. The default is None, which attempts to mimic the same support vector machine models used by Percolator. If a list of mokapot.Model objects is provided, they are assumed to be previously trained models, and one will be used to rescore each fold.
- test_fdr: float, optional
The false-discovery rate threshold at which to evaluate the learned models.
- folds: int, optional
The number of cross-validation folds to use. PSMs originating from the same mass spectrum are always in the same fold.
- max_workers: int, optional
The number of processes to use for model training. More workers will require more memory, but will typically decrease the total run time. An integer exceeding the number of folds will have no additional effect. Note that logging messages will be garbled if more than one worker is enabled.
- rng: int, np.random.Generator, optional
A seed or generator used to generate splits, or None to use the default random number generator state.
- Returns:
- Confidence object or list of Confidence objects
An object or a list of objects containing the confidence estimates at various levels (i.e. PSMs, peptides) when assessed using the learned score. If a list, they will be in the same order as provided in the psms parameter.
- list of Model objects
The learned Model objects, one for each fold.
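The constraint that PSMs from the same mass spectrum always share a fold can be illustrated with a small standard-library sketch. It shuffles the unique spectrum identifiers with a seeded generator and deals them out to folds in turn. The function assign_folds and the spectrum labels are hypothetical, and this is only one simple way to satisfy the constraint, not necessarily the splitting scheme mokapot uses.

```python
import random


def assign_folds(spectrum_ids, folds=3, seed=None):
    """Give every PSM a fold index so that PSMs sharing a spectrum share a fold."""
    rng = random.Random(seed)
    unique = sorted(set(spectrum_ids))  # sort first so the seed fully fixes the split
    rng.shuffle(unique)
    fold_of = {spec: i % folds for i, spec in enumerate(unique)}
    return [fold_of[spec] for spec in spectrum_ids]


# Six PSMs from four spectra; "s1" and "s3" each produced two PSMs.
spectra = ["s1", "s1", "s2", "s3", "s3", "s4"]
psm_folds = assign_folds(spectra, folds=3, seed=1)

print(psm_folds[0] == psm_folds[1])  # True: both PSMs come from spectrum "s1"
print(psm_folds[3] == psm_folds[4])  # True: both PSMs come from spectrum "s3"
```

Passing the same seed reproduces the same split, which mirrors the role of the rng argument described above.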
- mokapot.to_txt(conf, dest_dir=None, file_root=None, sep='\t', decoys=False)
Save confidence estimates to delimited text files.
Write the confidence estimates for each of the available levels (i.e. PSMs, peptides, proteins) to separate flat text files using the specified delimiter. If more than one collection of confidence estimates is provided, they will be combined, yielding a single file for each level specified by either dataset.
- Parameters:
- conf: Confidence object or tuple of Confidence objects
One or more LinearConfidence objects.
- dest_dir: str or None, optional
The directory in which to save the files. None will use the current working directory.
- file_root: str or None, optional
An optional prefix for the confidence estimate files. The suffix will always be “mokapot.{level}.txt” where “{level}” indicates the level at which confidence estimation was performed (i.e. PSMs, peptides, proteins).
- sep: str, optional
The delimiter to use.
- decoys: bool, optional
Save decoy confidence estimates as well?
- Returns:
- list of str
The paths to the saved files.
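A minimal end-to-end sketch of the primary functions might look like the following, using only the calls and arguments documented on this page. The wrapper run_mokapot, the helper level_from_path, and the file names "run1.mokapot.psms.txt" and "notes.txt" are hypothetical; how mokapot joins file_root to the fixed suffix is an assumption here. mokapot is imported inside the function so that the helper below can be tried without it installed.

```python
import re


def run_mokapot(pin_path, dest_dir=None):
    """Read a PIN file, re-score its PSMs, and write the results as text files."""
    import mokapot  # lazy import: only needed when this function is called

    psms = mokapot.read_pin(pin_path)
    # brew() returns the confidence estimates and the models learned per fold.
    confidence, models = mokapot.brew(psms, test_fdr=0.01, folds=3, rng=1)
    return mokapot.to_txt(confidence, dest_dir=dest_dir)


def level_from_path(path):
    """Recover the confidence level from a file name ending in mokapot.{level}.txt."""
    match = re.search(r"mokapot\.([^.]+)\.txt$", path)
    return match.group(1) if match else None


print(level_from_path("mokapot.peptides.txt"))   # peptides
print(level_from_path("run1.mokapot.psms.txt"))  # psms (hypothetical file_root "run1")
print(level_from_path("notes.txt"))              # None
```

Because to_txt() returns the paths it wrote, a helper like level_from_path can map each returned path back to its level.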
- mokapot.to_flashlfq(conf, out_file='mokapot.flashlfq.txt')
Save confident peptides for quantification with FlashLFQ.
FlashLFQ is an open-source tool for label-free quantification. For mokapot to save results in a compatible format, a few extra columns are required to be present, which specify the MS data file name, the theoretical peptide monoisotopic mass, the retention time, and the charge for each PSM. If these are not present, saving to the FlashLFQ format is disabled.
Note that protein grouping in the FlashLFQ results will be more accurate if proteins were added for analysis with mokapot.
- Parameters:
- conf: Confidence object or tuple of Confidence objects
One or more LinearConfidence objects.
- out_file: str, optional
The output file to write.
- Returns:
- str
The path to the saved file.
Utility Functions
- mokapot.save_model(model, out_file)
Save a mokapot.model.Model object to a file.
- Parameters:
- out_file: str
The name of the file for the saved model.
- Returns:
- str
The output file name.
Notes
Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.
- mokapot.load_model(model_file)
Load a saved model for mokapot.
The saved model can either be a saved Model object or the output model weights from Percolator. In Percolator, these can be obtained using the --weights argument.
- Parameters:
- model_file: str
The name of the file from which to load the model.
- Returns:
- mokapot.model.Model
The loaded mokapot.model.Model object.
Warning
Unpickling data in Python is unsafe. Make sure that the model is from a source that you trust.
- mokapot.read_percolator(perc_file)
Read a Percolator tab-delimited file.
Percolator input format (PIN) files and the Percolator result files are tab-delimited, but also have a tab-delimited protein list as the final column. This function parses the file and returns a DataFrame.
- Parameters:
- perc_file: str
The file to parse.
- Returns:
- pandas.DataFrame
A DataFrame of the parsed data.
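The reason these files need special handling is that a row can contain more tab-separated fields than the header has columns, because the proteins in the final column are themselves separated by tabs. The sketch below shows one way to cope with that using a bounded split. The function parse_percolator and the sample rows are hypothetical, and it returns plain dictionaries rather than a DataFrame; it is not mokapot's implementation.

```python
def parse_percolator(text):
    """Parse Percolator-style text, keeping the trailing protein list in one field."""
    lines = text.rstrip("\n").split("\n")
    header = lines[0].split("\t")
    n_splits = len(header) - 1  # split only this many times per row
    rows = []
    for line in lines[1:]:
        # Everything after the last fixed column stays together as the protein list.
        fields = line.split("\t", n_splits)
        rows.append(dict(zip(header, fields)))
    return rows


# A made-up file: the first PSM maps to two proteins, P1 and P2.
text = (
    "SpecId\tLabel\tPeptide\tProteins\n"
    "t1\t1\tPEPK\tP1\tP2\n"
    "d1\t-1\tKPEP\tdecoy_P1\n"
)

rows = parse_percolator(text)
print(rows[0]["Proteins"].split("\t"))  # ['P1', 'P2']
print(rows[1]["Proteins"])              # decoy_P1
```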
- mokapot.plot_qvalues(qvalues, threshold=0.1, ax=None, **kwargs)
Plot the cumulative number of discoveries over a range of q-values.
- Parameters:
- qvalues: numpy.ndarray
The q-values to plot.
- threshold: float, optional
Indicates the maximum q-value to plot.
- ax: matplotlib.pyplot.Axes, optional
The matplotlib Axes on which to plot. If None, the current Axes instance is used.
- **kwargs: dict, optional
Arguments passed to matplotlib.axes.Axes.plot().
- Returns:
- matplotlib.pyplot.Axes
A matplotlib.axes.Axes with the cumulative number of accepted target PSMs or peptides.
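To make concrete what is being plotted, the sketch below computes the curve's points by hand: for each distinct q-value up to the threshold, it counts how many q-values are less than or equal to it. The function cumulative_discoveries and the example q-values are hypothetical, and the sketch uses plain Python lists rather than NumPy arrays or matplotlib.

```python
def cumulative_discoveries(qvalues, threshold=0.1):
    """Return (q-value, number of discoveries at or below it) pairs up to `threshold`."""
    kept = sorted(q for q in qvalues if q <= threshold)
    points = []
    for count, q in enumerate(kept, start=1):
        if points and points[-1][0] == q:
            points[-1] = (q, count)  # tied q-values collapse into one point
        else:
            points.append((q, count))
    return points


# Five made-up q-values; the one at 0.2 exceeds the threshold and is dropped.
points = cumulative_discoveries([0.01, 0.01, 0.05, 0.2, 0.08], threshold=0.1)
print(points)  # [(0.01, 2), (0.05, 3), (0.08, 4)]
```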
- mokapot.make_decoys(fasta, out_file, decoy_prefix='decoy_', enzyme='[KR]', reverse=False, concatenate=True)
Create a FASTA file with decoy sequences.
Decoy sequences are generated by shuffling or reversing each enzymatic peptide in a sequence, preserving the first and last amino acids.
- Parameters:
- fasta: str or list of str
One or more FASTA files containing target sequences.
- out_file: str
The name of the output FASTA file.
- enzyme: str or compiled regex, optional
A regular expression defining the enzyme specificity that was used when assigning PSMs. The cleavage site is interpreted as the end of the match. The default is trypsin, without proline suppression: “[KR]”.
- decoy_prefix: str, optional
The prefix used to indicate a decoy protein.
- reverse: bool, optional
Use reversed instead of shuffled sequences? Note that the difference here is arbitrary, because reversing can be thought of as a specific instance of shuffling.
- concatenate: bool, optional
Concatenate decoy sequences to the provided target sequences? True creates a FASTA file with target and decoy sequences; False creates a FASTA file with only decoy sequences.
- Returns:
- str
The output FASTA file.
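The idea of reversing or shuffling each enzymatic peptide while preserving its first and last amino acids can be sketched as follows. The functions cleave, decoy_peptide, and decoy_sequence and the example sequence are hypothetical, and details such as how very short peptides are treated are assumptions rather than a description of mokapot's implementation.

```python
import random
import re


def cleave(sequence, enzyme="[KR]"):
    """Split a sequence after every enzyme match (the end of the match is the site)."""
    sites = [m.end() for m in re.finditer(enzyme, sequence)]
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def decoy_peptide(peptide, reverse, rng):
    """Reverse or shuffle the interior of a peptide, keeping both terminal residues."""
    if len(peptide) < 3:
        return peptide  # nothing between the first and last residue to rearrange
    middle = list(peptide[1:-1])
    if reverse:
        middle.reverse()
    else:
        rng.shuffle(middle)
    return peptide[0] + "".join(middle) + peptide[-1]


def decoy_sequence(sequence, enzyme="[KR]", reverse=False, seed=None):
    """Build a decoy protein by rearranging each enzymatic peptide in place."""
    rng = random.Random(seed)
    return "".join(decoy_peptide(p, reverse, rng) for p in cleave(sequence, enzyme))


print(cleave("MAGCKDEFGRHIL"))                        # ['MAGCK', 'DEFGR', 'HIL']
print(decoy_sequence("MAGCKDEFGRHIL", reverse=True))  # MCGAKDGFERHIL
```

Because the terminal residues of each peptide stay in place, the cleavage sites of the decoy protein match those of the target.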
- mokapot.digest(sequence, enzyme_regex='[KR]', missed_cleavages=0, clip_nterm_methionine=False, min_length=6, max_length=50, semi=False)
Digest a protein sequence into its constituent peptides.
- Parameters:
- sequence: str
A protein sequence to digest.
- enzyme_regex: str or compiled regex, optional
A regular expression defining the enzyme specificity. The end of the match should indicate the cleavage site.
- missed_cleavages: int, optional
The maximum number of allowed missed cleavages.
- clip_nterm_methionine: bool, optional
Remove methionine residues that occur at the protein N-terminus.
- min_length: int, optional
The minimum peptide length.
- max_length: int, optional
The maximum peptide length.
- semi: bool, optional
Allow semi-enzymatic cleavage.
- Returns:
- peptides: set of str
The peptides resulting from the digested sequence.
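How the enzyme regular expression, missed cleavages, and length limits interact can be seen in a small regex-based digestion sketch. The function simple_digest and the example sequence are hypothetical; it leaves out the clip_nterm_methionine and semi options and is not mokapot's implementation.

```python
import re


def simple_digest(sequence, enzyme_regex="[KR]", missed_cleavages=0,
                  min_length=6, max_length=50):
    """Return the set of fully enzymatic peptides within the length limits."""
    # Cleavage happens at the end of each regex match.
    sites = [0] + [m.end() for m in re.finditer(enzyme_regex, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    peptides = set()
    for i, start in enumerate(sites[:-1]):
        # Spanning k extra sites corresponds to k missed cleavages.
        for end in sites[i + 1 : i + 2 + missed_cleavages]:
            if min_length <= end - start <= max_length:
                peptides.add(sequence[start:end])
    return peptides


seq = "MAGCKDEFGRHIL"
print(sorted(simple_digest(seq, min_length=3)))       # ['DEFGR', 'HIL', 'MAGCK']
print(sorted(simple_digest(seq, missed_cleavages=1)))  # ['DEFGRHIL', 'MAGCKDEFGR']
```

In the second call the default min_length of 6 filters out the three fully cleaved peptides, leaving only those that span a missed cleavage.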