Collections of PSMs

The LinearPsmDataset class defines a collection of peptide-spectrum matches (PSMs). It is suitable for most types of data-dependent acquisition proteomics experiments.

Although the class can be constructed from a pandas.DataFrame, it is often easier to load the PSMs directly from a file in the Percolator tab-delimited format (also known as the Percolator input format, or “PIN”) using the read_pin() function or from a PepXML file using the read_pepxml() function. If protein-level confidence estimates are desired, make sure to use the add_proteins() method.

One or more instances of this class are required to use the brew() function.
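The loading workflow described above can be sketched as follows. This is a minimal illustration rather than a definitive recipe: the wrapper name run_mokapot and its arguments are hypothetical, and the sketch assumes that brew() returns a pair of confidence estimates and fitted models. mokapot is imported inside the function so the definition itself does not require the package.

```python
def run_mokapot(pin_file, fasta_file=None):
    """Load PSMs from a PIN file and analyze them with brew().

    `run_mokapot` is a hypothetical helper written for this example.
    """
    import mokapot  # imported lazily; requires mokapot to be installed

    # Build the collection of PSMs from a Percolator input (PIN) file.
    psms = mokapot.read_pin(pin_file)

    # Protein-level confidence estimates need protein information first.
    if fasta_file is not None:
        psms.add_proteins(fasta_file)

    # Assumption: brew() returns (confidence estimates, fitted models).
    results, models = mokapot.brew(psms)
    return results, models
```

A call such as run_mokapot("psms.pin", "proteins.fasta") uses placeholder file names; substitute your own paths.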

class mokapot.dataset.LinearPsmDataset(psms, target_column, spectrum_columns, peptide_column, protein_column=None, group_column=None, feature_columns=None, filename_column=None, scan_column=None, calcmass_column=None, expmass_column=None, rt_column=None, charge_column=None, copy_data=True, rng=None)

Store and analyze a collection of PSMs.

Store a collection of PSMs from data-dependent acquisition proteomics experiments and prepare them for mokapot analysis.

Parameters:

psms (pandas.DataFrame): A collection of PSMs, where the rows are PSMs and the columns are features or metadata describing them.
target_column (str): The column specifying whether each PSM is a target (True) or a decoy (False). This column will be coerced to boolean, so specifying targets as 1 and decoys as -1 will not work correctly.
spectrum_columns (str or tuple of str): The column(s) that collectively identify unique mass spectra. Multiple columns can be useful to avoid combining scans from multiple mass spectrometry runs.
peptide_column (str): The column that defines a unique peptide. Modifications should be indicated either in square brackets [] or parentheses (). The exact modification format within these entities does not matter, so long as it is consistent.
protein_column (str, optional): The column that specifies which protein(s) the detected peptide might have originated from. This column is not used to compute protein-level confidence estimates (see add_proteins()).
group_column (str, optional): A factor by which to group PSMs for grouped confidence estimation.
feature_columns (str or tuple of str, optional): The column(s) specifying the feature(s) for mokapot analysis. If None, these are assumed to be all of the columns that were not specified in the other parameters.
filename_column (str, optional): The column specifying the mass spectrometry data file (e.g. mzML) containing each spectrum. This is required for some output formats, such as mzTab and FlashLFQ.
scan_column (str, optional): The column specifying the scan number for each spectrum. Each value in the column should be an integer. This is required for some output formats, such as mzTab.
calcmass_column (str, optional): The column specifying the theoretical monoisotopic mass of each peptide. This is required for some output formats, such as mzTab and FlashLFQ.
expmass_column (str, optional): The column specifying the measured neutral precursor mass. This is required for some output formats, such as mzTab.
rt_column (str, optional): The column specifying the retention time of each spectrum, in seconds. This is required for some output formats, such as mzTab and FlashLFQ.
charge_column (str, optional): The column specifying the charge state of each PSM. This is required for some output formats, such as mzTab and FlashLFQ.
copy_data (bool, optional): If true, a deep copy of psms is created, so that changes to the original collection of PSMs are not propagated to this object. This uses more memory, but is safer since it prevents accidental modification of the underlying data.
rng (int or np.random.Generator, optional): A seed or generator used for cross-validation split creation and to break ties, or None to use the default random number generator state.
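As a minimal sketch of how these parameters fit together, the example below builds a toy pandas.DataFrame and wraps the constructor call in a helper. The column names, the values, and the helper build_dataset are all hypothetical and chosen only for illustration. It also shows why a 1/-1 label column needs an explicit conversion: any nonzero integer is truthy, so a plain boolean cast would mark every PSM as a target.

```python
import pandas as pd

# Hypothetical toy PSMs; none of these column names are required by mokapot.
psms = pd.DataFrame(
    {
        "label": [1, -1, 1, -1],  # 1 = target, -1 = decoy
        "file": ["run1.mzML", "run1.mzML", "run2.mzML", "run2.mzML"],
        "scan": [101, 101, 101, 102],
        "peptide": ["PEPTIDEK", "KEDITPEP", "LESLIEK", "KEILSEL"],
        "score": [3.2, 1.1, 2.7, 0.4],
        "mass_error": [0.5, -2.1, 1.3, 4.0],
    }
)

# A plain boolean cast treats -1 as True, so every row looks like a target.
naive_targets = psms["label"].astype(bool)

# Convert explicitly so that decoys become False.
psms["is_target"] = psms["label"] == 1


def build_dataset(df):
    """Hypothetical helper that wraps the constructor documented above."""
    from mokapot.dataset import LinearPsmDataset  # requires mokapot

    return LinearPsmDataset(
        psms=df,
        target_column="is_target",
        # Two columns keep scan 101 of run1 and run2 as distinct spectra.
        spectrum_columns=("file", "scan"),
        peptide_column="peptide",
        # Listed explicitly so the leftover "label" column is not a feature.
        feature_columns=("score", "mass_error"),
        copy_data=True,
        rng=42,
    )
```

With mokapot installed, build_dataset(psms) would create the dataset. Naming feature_columns explicitly matters here because the default of None treats every column not otherwise assigned, including the original label column, as a feature.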

Attributes:

data (pandas.DataFrame): The full collection of PSMs as a pandas.DataFrame.
metadata (pandas.DataFrame): A pandas.DataFrame of the metadata.
features (pandas.DataFrame): A pandas.DataFrame of the features.
spectra (pandas.DataFrame): A pandas.DataFrame of the columns that uniquely identify a mass spectrum.
peptides (pandas.Series): A pandas.Series of the peptide column.
groups (pandas.Series): A pandas.Series of the groups for confidence estimation.
targets (numpy.ndarray): A numpy.ndarray indicating whether each PSM is a target sequence.
columns (list of str): The columns of the dataset.
has_proteins (bool): Has a FASTA file been added?
rng (numpy.random.Generator): The random number generator for model training.

Methods

`add_proteins`(proteins, **kwargs)	Add protein information to the dataset.
`assign_confidence`([scores, desc, eval_fdr])	Assign confidence to PSMs, peptides, and optionally, proteins.

property targets: A numpy.ndarray indicating whether each PSM is a target sequence.

property peptides: A pandas.Series of the peptide column.

assign_confidence(scores=None, desc=True, eval_fdr=0.01)

Assign confidence to PSMs, peptides, and optionally, proteins.

Two forms of confidence estimates are calculated: q-values—the minimum false discovery rate (FDR) at which a given PSM would be accepted—and posterior error probabilities (PEPs)—the probability that a given PSM is incorrect. For more information see the Confidence Estimation page.
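The q-value definition above can be made concrete with a small NumPy sketch. This is a simplified target-decoy estimate and not mokapot's implementation: the function name tdc_qvalues, the +1 pseudo-count on the decoy tally, and the decision to ignore tied scores are all assumptions made for illustration. PEPs are not computed here.

```python
import numpy as np


def tdc_qvalues(scores, targets, desc=True):
    """Estimate q-values from target/decoy labels (illustrative sketch)."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=bool)

    # Rank PSMs from best to worst score.
    order = np.argsort(-scores if desc else scores, kind="stable")
    is_target = targets[order]

    # Running counts of targets and decoys at each score threshold.
    n_targets = np.cumsum(is_target)
    n_decoys = np.cumsum(~is_target)

    # Assumed estimator: (decoys + 1) / targets, capped at 1.
    fdr = np.minimum((n_decoys + 1) / np.maximum(n_targets, 1), 1.0)

    # q-value: the smallest FDR at any threshold that still accepts the PSM.
    qvals_sorted = np.minimum.accumulate(fdr[::-1])[::-1]

    # Return the q-values in the original PSM order.
    qvals = np.empty_like(qvals_sorted)
    qvals[order] = qvals_sorted
    return qvals
```

The reversed running minimum is what turns a raw FDR curve into q-values: it guarantees that a better-ranked PSM never receives a larger q-value than a worse-ranked one.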

Parameters:

scores (numpy.ndarray): The scores by which to rank the PSMs. The default, None, uses the feature that accepts the most PSMs at an FDR threshold of eval_fdr.
desc (bool): Are higher scores better?
eval_fdr (float): The FDR threshold at which to report and evaluate performance. If scores is not None, this parameter has no effect on the analysis itself, but does affect logging messages and the FDR threshold applied for some output formats, such as FlashLFQ.

Returns:

LinearConfidence: A LinearConfidence object storing the confidence estimates for the collection of PSMs.
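As a brief sketch of how these parameters might be used, the helper below ranks PSMs by a single feature column and passes it as scores. The helper name confidence_from_feature and the idea of picking one feature by name are assumptions made for illustration; only the features attribute and the assign_confidence() signature come from this page.

```python
def confidence_from_feature(dataset, feature, desc=True, eval_fdr=0.01):
    """Assign confidence using one feature column as the score.

    `dataset` is expected to behave like a LinearPsmDataset; this helper
    is hypothetical and not part of mokapot.
    """
    # Pull the chosen feature out as a numpy.ndarray of scores.
    scores = dataset.features[feature].to_numpy()

    # desc=True means that higher scores indicate better PSMs.
    return dataset.assign_confidence(scores=scores, desc=desc, eval_fdr=eval_fdr)
```

For a feature where lower values are better, such as an E-value, pass desc=False.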

add_proteins(proteins, **kwargs)

Add protein information to the dataset.

Protein sequence information is required to compute protein-level confidence estimates using the picked-protein approach.

Parameters:

proteins (a Proteins object or str): The Proteins object defines the mapping of peptides to proteins and the mapping of decoy proteins to their corresponding target proteins. Alternatively, a string giving the path to a FASTA file can be provided, which will be parsed to define these mappings.
**kwargs (dict): Keyword arguments to be passed to the mokapot.read_fasta() function.
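A short sketch of adding protein information from a FASTA file follows. The helper name add_protein_info is hypothetical, and the keyword names decoy_prefix and missed_cleavages are assumed to be accepted by mokapot.read_fasta(); check them against the read_fasta() documentation for your installed version before relying on them.

```python
def add_protein_info(dataset, fasta_file, decoy_prefix="decoy_", missed_cleavages=2):
    """Attach protein information and report whether it was added.

    Hypothetical helper; `dataset` should behave like a LinearPsmDataset.
    """
    # Extra keyword arguments are forwarded to mokapot.read_fasta().
    # Assumption: read_fasta() accepts these two keyword names.
    dataset.add_proteins(
        fasta_file,
        decoy_prefix=decoy_prefix,
        missed_cleavages=missed_cleavages,
    )

    # has_proteins reports whether a FASTA file has been added.
    return dataset.has_proteins
```

The digestion settings should mirror those used by the search engine that produced the PSMs, so that peptides map to the same proteins.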

property columns: The columns of the dataset.

property data: The full collection of PSMs as a pandas.DataFrame.

property features: A pandas.DataFrame of the features.

property groups: A pandas.Series of the groups for confidence estimation.

property has_proteins: Has a FASTA file been added?

property metadata: A pandas.DataFrame of the metadata.

property rng: The random number generator for model training.

property spectra: A pandas.DataFrame of the columns that uniquely identify a mass spectrum.