Collections of PSMs
The LinearPsmDataset class is used to define a collection of peptide-spectrum matches (PSMs). It is suitable for most types of data-dependent acquisition proteomics experiments.
Although the class can be constructed from a pandas.DataFrame, it is often easier to load the PSMs directly from a file in the Percolator tab-delimited format (also known as the Percolator input format, or “PIN”) using the read_pin() function or from a PepXML file using the read_pepxml() function. If protein-level confidence estimates are desired, make sure to use the add_proteins() method.
One or more instances of this class are required to use the brew() function.
- class mokapot.dataset.LinearPsmDataset(psms, target_column, spectrum_columns, peptide_column, protein_column=None, group_column=None, feature_columns=None, filename_column=None, scan_column=None, calcmass_column=None, expmass_column=None, rt_column=None, charge_column=None, copy_data=True, rng=None)
Store and analyze a collection of PSMs.
Store a collection of PSMs from data-dependent acquisition proteomics experiments and prepare them for mokapot analysis.
- Parameters:
- psms : pandas.DataFrame
A collection of PSMs, where the rows are PSMs and the columns are features or metadata describing them.
- target_column : str
The column specifying whether each PSM is a target (True) or a decoy (False). This column will be coerced to boolean, so specifying targets as 1 and decoys as -1 will not work correctly, because -1 also coerces to True.
- spectrum_columns : str or tuple of str
The column(s) that collectively identify unique mass spectra. Multiple columns can be useful to avoid combining scans from multiple mass spectrometry runs.
- peptide_column : str
The column that defines a unique peptide. Modifications should be indicated either in square brackets [] or parentheses (). The exact modification format within these entities does not matter, so long as it is consistent.
- protein_column : str, optional
The column that specifies which protein(s) the detected peptide might have originated from. This column is not used to compute protein-level confidence estimates (see add_proteins()).
- group_column : str, optional
A factor by which to group PSMs for grouped confidence estimation.
- feature_columns : str or tuple of str, optional
The column(s) specifying the feature(s) for mokapot analysis. If None, these are assumed to be all of the columns that were not specified in the other parameters.
- filename_column : str, optional
The column specifying the mass spectrometry data file (e.g. mzML) containing each spectrum. This is required for some output formats, such as mzTab and FlashLFQ.
- scan_column : str, optional
The column specifying the scan number for each spectrum. Each value in the column should be an integer. This is required for some output formats, such as mzTab.
- calcmass_column : str, optional
The column specifying the theoretical monoisotopic mass of each peptide. This is required for some output formats, such as mzTab and FlashLFQ.
- expmass_column : str, optional
The column specifying the measured neutral precursor mass. This is required for some output formats, such as mzTab.
- rt_column : str, optional
The column specifying the retention time of each spectrum, in seconds. This is required for some output formats, such as mzTab and FlashLFQ.
- charge_column : str, optional
The column specifying the charge state of each PSM. This is required for some output formats, such as mzTab and FlashLFQ.
- copy_data : bool, optional
If True, a deep copy of psms is created, so that changes to the original collection of PSMs are not propagated to this object. This uses more memory, but is safer since it prevents accidental modification of the underlying data.
- rng : int or np.random.Generator, optional
A seed or generator used for cross-validation split creation and to break ties, or None to use the default random number generator state.
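The warning about the target column can be illustrated with plain Python truthiness. This is a minimal sketch, not mokapot code: it assumes the coercion behaves like Python's bool(), and the names raw_labels, coerced, and fixed are hypothetical.

```python
# Labels as some search engines emit them: 1 = target, -1 = decoy.
raw_labels = [1, -1, 1, -1]

# Boolean coercion treats every nonzero value as True,
# so the decoys (-1) are silently relabeled as targets.
coerced = [bool(value) for value in raw_labels]

# Comparing against the target value yields the intended labels.
fixed = [value == 1 for value in raw_labels]

print(coerced)
print(fixed)
```

Converting such labels to True/False before building the dataset avoids the problem.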
- Attributes:
- data : pandas.DataFrame
The full collection of PSMs as a pandas.DataFrame.
- metadata : pandas.DataFrame
A pandas.DataFrame of the metadata.
- features : pandas.DataFrame
A pandas.DataFrame of the features.
- spectra : pandas.DataFrame
A pandas.DataFrame of the columns that uniquely identify a mass spectrum.
- peptides : pandas.Series
A pandas.Series of the peptide column.
- groups : pandas.Series
A pandas.Series of the groups for confidence estimation.
- targets : numpy.ndarray
A numpy.ndarray indicating whether each PSM is a target sequence.
- columns : list of str
The columns of the dataset.
- has_proteins : bool
Has a FASTA file been added?
- rng : numpy.random.Generator
The random number generator for model training.
Methods
- add_proteins(proteins, **kwargs)
Add protein information to the dataset.
- assign_confidence([scores, desc, eval_fdr])
Assign confidence to PSMs, peptides, and optionally, proteins.
- property targets
A numpy.ndarray indicating whether each PSM is a target sequence.
- property peptides
A pandas.Series of the peptide column.
- assign_confidence(scores=None, desc=True, eval_fdr=0.01)
Assign confidence to PSMs, peptides, and optionally, proteins.
Two forms of confidence estimates are calculated: q-values—the minimum false discovery rate (FDR) at which a given PSM would be accepted—and posterior error probabilities (PEPs)—the probability that a given PSM is incorrect. For more information see the Confidence Estimation page.
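The q-value definition above can be made concrete with a small sketch. It assumes the simple target-decoy estimate FDR = decoys / targets among all PSMs scoring at least as well as a given PSM, and it ignores ties; the function name tdc_qvalues is hypothetical, and mokapot's own estimator may differ in detail.

```python
def tdc_qvalues(scores, targets, desc=True):
    """Estimate a q-value for each PSM from its score and target/decoy label."""
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=desc)

    # Estimated FDR if the acceptance threshold were placed at each rank.
    fdr = []
    n_targets = n_decoys = 0
    for i in order:
        if targets[i]:
            n_targets += 1
        else:
            n_decoys += 1
        fdr.append(min(1.0, n_decoys / max(n_targets, 1)))

    # The q-value is the minimum FDR over this rank and every worse rank,
    # i.e. the smallest FDR at which the PSM would still be accepted.
    qvalues = [0.0] * len(scores)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr[rank])
        qvalues[order[rank]] = running_min
    return qvalues


scores = [9.0, 8.0, 7.0, 6.0, 5.0]
targets = [True, True, False, True, False]
qvalues = tdc_qvalues(scores, targets)
print(qvalues)
```

Taking the running minimum from the worst rank upward is what makes q-values monotone in the score, unlike the raw FDR estimates.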
- Parameters:
- scores : numpy.ndarray
The scores by which to rank the PSMs. The default, None, uses the feature that accepts the most PSMs at an FDR threshold of eval_fdr.
- desc : bool
Are higher scores better?
- eval_fdr : float
The FDR threshold at which to report and evaluate performance. If scores is not None, this parameter has no effect on the analysis itself, but does affect logging messages and the FDR threshold applied for some output formats, such as FlashLFQ.
- Returns:
- LinearConfidence
A LinearConfidence object storing the confidence estimates for the collection of PSMs.
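The default behavior when scores is None, choosing the feature that accepts the most PSMs at eval_fdr, might look roughly like the sketch below. It is an illustration only: it assumes higher values are better for every feature and uses the simple decoys / targets FDR estimate, whereas mokapot's actual selection may differ (for example, in how score direction is handled). The names count_accepted and best_feature are hypothetical.

```python
def count_accepted(scores, targets, eval_fdr=0.01):
    """Count target PSMs accepted at eval_fdr, assuming higher scores are better."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    n_targets = n_decoys = 0
    accepted = 0
    for _, is_target in ranked:
        if is_target:
            n_targets += 1
        else:
            n_decoys += 1
        # The last rank whose estimated FDR is within eval_fdr sets the cutoff.
        if n_targets and n_decoys / n_targets <= eval_fdr:
            accepted = n_targets
    return accepted


def best_feature(features, targets, eval_fdr=0.01):
    """Return the feature name accepting the most targets, plus all counts."""
    counts = {
        name: count_accepted(values, targets, eval_fdr)
        for name, values in features.items()
    }
    return max(counts, key=counts.get), counts


targets = [True, True, True, False, True, False]
features = {
    "xcorr": [5.0, 4.0, 3.0, 2.0, 1.0, 0.5],
    "delta": [1.0, 6.0, 2.0, 5.0, 3.0, 4.0],
}
best, counts = best_feature(features, targets, eval_fdr=0.25)
print(best, counts)
```

In this toy data the "xcorr" feature separates targets from decoys better, so it would be the one used for ranking.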
- add_proteins(proteins, **kwargs)
Add protein information to the dataset.
Protein sequence information is required to compute protein-level confidence estimates using the picked-protein approach.
- Parameters:
- proteins : a Proteins object or str
The Proteins object defines the mapping of peptides to proteins and the mapping of decoy proteins to their corresponding target proteins. Alternatively, a string giving the path to a FASTA file can be provided, which will be parsed to define these mappings.
- **kwargs : dict
Keyword arguments to be passed to the mokapot.read_fasta() function.
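The decoy-to-target mapping matters because the picked-protein approach compares each target protein with its paired decoy and keeps only the better-scoring one. A minimal sketch of that picking step follows; it assumes each protein is already summarized by its best peptide score, resolves ties in favor of the target (an arbitrary choice), and uses the hypothetical name pick_proteins. It is not mokapot's implementation.

```python
def pick_proteins(best_scores, decoy_to_target):
    """Keep the higher-scoring member of each target/decoy protein pair.

    best_scores maps a protein name to its best peptide score.
    decoy_to_target maps each decoy protein name to its target protein name.
    """
    picked = {}
    for decoy, target in decoy_to_target.items():
        target_score = best_scores.get(target)
        decoy_score = best_scores.get(decoy)
        if target_score is None and decoy_score is None:
            continue  # Neither protein of this pair was detected.
        if decoy_score is None or (
            target_score is not None and target_score >= decoy_score
        ):
            picked[target] = target_score
        else:
            picked[decoy] = decoy_score
    return picked


best_scores = {"P1": 8.0, "decoy_P1": 3.0, "P2": 2.0, "decoy_P2": 5.0, "P3": 4.0}
decoy_to_target = {"decoy_P1": "P1", "decoy_P2": "P2", "decoy_P3": "P3"}
picked = pick_proteins(best_scores, decoy_to_target)
print(picked)
```

The surviving targets and decoys can then be ranked by score for protein-level confidence estimation.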
- property columns
The columns of the dataset.
- property data
The full collection of PSMs as a pandas.DataFrame.
- property features
A pandas.DataFrame of the features.
- property groups
A pandas.Series of the groups for confidence estimation.
- property has_proteins
Has a FASTA file been added?
- property metadata
A pandas.DataFrame of the metadata.
- property rng
The random number generator for model training.
- property spectra
A pandas.DataFrame of the columns that uniquely identify a mass spectrum.