Machine Learning Models
mokapot implements an iterative algorithm for training machine learning models to
distinguish high-scoring target peptide-spectrum matches (PSMs) from decoy PSMs.
The Model class contains this logic. A Model instance can be created from any
object with a scikit-learn estimator interface, allowing a wide variety of models
to be used. Once initialized, the Model.fit() method trains the underlying
classifier on a collection of PSMs using this iterative approach.
Additional subclasses of the Model class are available for typical use cases.
For example, use PercolatorModel if you want to emulate the behavior of Percolator.
- class mokapot.model.Model(estimator, scaler=None, train_fdr=0.01, max_iter=10, direction=None, override=False, subset_max_train=None, shuffle=True, rng=None)[source]
A machine learning model to re-score PSMs.
Any classifier with a scikit-learn estimator interface can be used. This class also supports hyperparameter optimization using classes from the sklearn.model_selection module, such as the GridSearchCV and RandomizedSearchCV classes.
- Parameters:
- estimator : classifier object
A classifier that is assumed to implement the scikit-learn estimator interface. To emulate Percolator (an SVM model), use PercolatorModel instead.
- scaler : scaler object or "as-is", optional
Defines how features are normalized before model fitting and prediction. The default, None, subtracts the mean and scales to unit variance using sklearn.preprocessing.StandardScaler. Other scalers should follow the scikit-learn transformer interface, implementing fit_transform() and transform() methods. Alternatively, the string "as-is" leaves the features in their original scale.
- train_fdr : float, optional
The maximum false discovery rate at which to consider a target PSM as a positive example.
- max_iter : int, optional
The number of iterations to perform.
- direction : str or None, optional
The name of the feature to use as the initial direction for ranking PSMs. The default, None, automatically selects the feature that finds the most PSMs below the train_fdr. This is ignored if the model is already trained.
- override : bool, optional
If the learned model performs worse than the best feature, should the model still be used?
- subset_max_train : int or None, optional
Use only a random subset of the PSMs for training. This is useful for very large datasets or models that scale poorly with the number of PSMs. The default, None, uses all of the PSMs.
- shuffle : bool, optional
Should the order of PSMs be randomized for training? For deterministic algorithms, this will have no effect.
- rng : int or numpy.random.Generator, optional
The seed or generator used for model training.
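The scaler parameter accepts any object that provides fit_transform() and transform(). As a rough illustration of that interface, the sketch below implements a hypothetical ColumnStandardScaler that mirrors the default behavior (subtract the mean, scale to unit variance). It works on plain lists for readability; a scaler passed to mokapot would presumably need to accept and return NumPy arrays, which is an assumption not covered here.

```python
import math


class ColumnStandardScaler:
    """Hypothetical scaler: zero mean and unit variance per feature column."""

    def fit(self, rows):
        n = len(rows)
        columns = list(zip(*rows))
        self.mean_ = [sum(col) / n for col in columns]
        self.scale_ = []
        for col, mean in zip(columns, self.mean_):
            # Population standard deviation (divide by n, not n - 1).
            std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
            # Constant columns are left unscaled to avoid dividing by zero.
            self.scale_.append(std if std > 0 else 1.0)
        return self

    def transform(self, rows):
        # Reuse the statistics learned in fit(); never re-estimate them here.
        return [
            [(v - m) / s for v, m, s in zip(row, self.mean_, self.scale_)]
            for row in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


scaler = ColumnStandardScaler()
train_scaled = scaler.fit_transform([[1.0, 10.0], [3.0, 10.0]])
new_scaled = scaler.transform([[2.0, 11.0]])
```

The point of the two-method split is that the statistics are learned once on the training PSMs and then reused unchanged when scoring new PSMs.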
- Attributes:
- estimator : classifier object
The classifier used to re-score PSMs.
- scaler : scaler object
The scaler used to normalize features.
- features : list of str or None
The names of the features used to fit the model. None if the model has yet to be trained.
- is_trained : bool
Indicates if the model has been trained.
- train_fdr : float
The maximum false discovery rate at which to consider a target PSM as a positive example.
- max_iter : int
The number of iterations to perform.
- direction : str or None
The name of the feature to use as the initial direction for ranking PSMs.
- override : bool
If the learned model performs worse than the best feature, should the model still be used?
- subset_max_train : int
The number of PSMs used for training.
- shuffle : bool
Is the order of PSMs shuffled for training?
- fold : int or None
The CV fold on which this model was fit, if any.
- rng : numpy.random.Generator
The random number generator for model training.
Methods
- decision_function(psms): Score a collection of PSMs.
- fit(psms): Fit the model using the Percolator algorithm.
- predict(psms): Alias for decision_function().
- save(out_file): Save the model to a file.
- property rng
The random number generator for model training.
- save(out_file)[source]
Save the model to a file.
- Parameters:
- out_file : str
The name of the file for the saved model.
- Returns:
- str
The output file name.
Notes
Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.
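One way to notice such a version change before loading is to record the installed package versions next to the saved model. The sketch below is a hypothetical helper, not part of mokapot: the function names and the ".versions.json" sidecar file are assumptions made for illustration, and it relies only on the standard library.

```python
import json
from importlib import metadata


def installed_versions(packages=("mokapot", "scikit-learn")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


def write_version_sidecar(model_file, packages=("mokapot", "scikit-learn")):
    """Write the current versions to '<model_file>.versions.json'."""
    sidecar = model_file + ".versions.json"
    with open(sidecar, "w") as handle:
        json.dump(installed_versions(packages), handle, indent=2)
    return sidecar


def version_mismatches(model_file):
    """Return {package: (saved, current)} for every version that differs."""
    with open(model_file + ".versions.json") as handle:
        saved = json.load(handle)
    current = installed_versions(tuple(saved))
    return {
        name: (saved[name], current[name])
        for name in saved
        if saved[name] != current[name]
    }
```

Calling write_version_sidecar() right after saving a model and version_mismatches() before loading it gives an early, readable warning instead of an obscure unpickling error.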
- decision_function(psms)[source]
Score a collection of PSMs.
- Parameters:
- psms : PsmDataset object
A collection of PSMs to score.
- Returns:
- numpy.ndarray
A numpy.ndarray containing the score for each PSM.
- predict(psms)[source]
Alias for decision_function().
- fit(psms)[source]
Fit the model using the Percolator algorithm.
The model is trained by iteratively learning to separate decoy PSMs from high-scoring target PSMs. By default, an initial direction is chosen as the feature that best separates target from decoy PSMs. A false discovery rate threshold is used to define how high a target must score to be used as a positive example in the next training iteration.
- Parameters:
- psms : PsmDataset object
A collection of PSMs from which to train the model.
- Returns:
- self
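The iterative procedure described for fit() can be sketched in plain Python. This is a simplified illustration under stated assumptions, not mokapot's implementation: the FDR is estimated as decoys divided by targets, ties in score are not grouped, higher scores are assumed to be better, and a small perceptron (train_perceptron) stands in for the scikit-learn estimator. All function names here are hypothetical.

```python
def qvalues(scores, is_target):
    """Simplified target-decoy q-values; higher scores are better."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    targets = decoys = 0
    fdrs = []
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        fdrs.append(decoys / max(targets, 1))  # assumed estimator: decoys / targets
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for rank in reversed(range(len(order))):
        # q-value: the lowest FDR at which this PSM is still accepted.
        running_min = min(running_min, fdrs[rank])
        qvals[order[rank]] = running_min
    return qvals


def train_perceptron(rows, labels, epochs=50):
    """Stand-in linear classifier; a real Model would call estimator.fit()."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            activation = sum(w * v for w, v in zip(weights, x)) + bias
            if (1 if activation > 0 else -1) != y:  # update only on mistakes
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
    return weights, bias


def iterative_fit(features, is_target, train_fdr=0.01, max_iter=10):
    """Return (weights, bias, scores) after max_iter rounds of re-labelling."""
    columns = list(zip(*features))

    def accepted(scores):
        q = qvalues(scores, is_target)
        return sum(1 for t, qv in zip(is_target, q) if t and qv <= train_fdr)

    # Initial direction: the single feature that accepts the most targets.
    scores = max(columns, key=accepted)
    for _ in range(max_iter):
        q = qvalues(scores, is_target)
        rows, labels = [], []
        for x, t, qv in zip(features, is_target, q):
            if not t:
                rows.append(x)
                labels.append(-1)  # every decoy is a negative example
            elif qv <= train_fdr:
                rows.append(x)
                labels.append(1)  # only confident targets are positives
        weights, bias = train_perceptron(rows, labels)
        # Re-score all PSMs; these scores define the next round's positives.
        scores = [sum(w * v for w, v in zip(weights, x)) + bias for x in features]
    return weights, bias, scores


# Toy data: four good targets, two poor targets, then four decoys.
features = [(5, 1), (6, 0), (7, 1), (8, 0), (1, 0), (0, 1), (1, 1), (0, 0), (2, 0), (1, 0)]
is_target = [True] * 6 + [False] * 4
weights, bias, scores = iterative_fit(features, is_target, max_iter=3)
```

Targets that fail the threshold are left out of training entirely rather than treated as negatives, since some of them may be correct matches that the current scores simply rank too low.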
- class mokapot.model.PercolatorModel(scaler=None, train_fdr=0.01, max_iter=10, direction=None, override=False, subset_max_train=None, n_jobs=-1, rng=None)[source]
A model that emulates Percolator. Creates a linear support vector machine (SVM) model similar to the one used by Percolator. This is the default model used by mokapot.
- Parameters:
- scaler : scaler object or "as-is", optional
Defines how features are normalized before model fitting and prediction. The default, None, subtracts the mean and scales to unit variance using sklearn.preprocessing.StandardScaler. Other scalers should follow the scikit-learn transformer interface, implementing fit_transform() and transform() methods. Alternatively, the string "as-is" leaves the features in their original scale.
- train_fdr : float, optional
The maximum false discovery rate at which to consider a target PSM as a positive example.
- max_iter : int, optional
The number of iterations to perform.
- direction : str or None, optional
The name of the feature to use as the initial direction for ranking PSMs. The default, None, automatically selects the feature that finds the most PSMs below the train_fdr. This is ignored if the model is already trained.
- override : bool, optional
If the learned model performs worse than the best feature, should the model still be used?
- subset_max_train : int or None, optional
Use only a random subset of the PSMs for training. This is useful for very large datasets or models that scale poorly with the number of PSMs. The default, None, uses all of the PSMs.
- n_jobs : int, optional
The number of jobs used to parallelize the hyperparameter grid search.
- rng : int or numpy.random.Generator, optional
The seed or generator used for model training.
- Attributes:
- estimator : classifier object
The classifier used to re-score PSMs.
- scaler : scaler object
The scaler used to normalize features.
- features : list of str or None
The names of the features used to fit the model. None if the model has yet to be trained.
- is_trained : bool
Indicates if the model has been trained.
- train_fdr : float
The maximum false discovery rate at which to consider a target PSM as a positive example.
- max_iter : int
The number of iterations to perform.
- direction : str or None
The name of the feature to use as the initial direction for ranking PSMs.
- override : bool
If the learned model performs worse than the best feature, should the model still be used?
- subset_max_train : int or None
The number of PSMs used for training.
- n_jobs : int
The number of jobs used to parallelize the hyperparameter grid search.
- rng : numpy.random.Generator
The random number generator for model training.
Methods
- decision_function(psms): Score a collection of PSMs.
- fit(psms): Fit the model using the Percolator algorithm.
- predict(psms): Alias for decision_function().
- save(out_file): Save the model to a file.
- decision_function(psms)
Score a collection of PSMs.
- Parameters:
- psms : PsmDataset object
A collection of PSMs to score.
- Returns:
- numpy.ndarray
A numpy.ndarray containing the score for each PSM.
- fit(psms)
Fit the model using the Percolator algorithm.
The model is trained by iteratively learning to separate decoy PSMs from high-scoring target PSMs. By default, an initial direction is chosen as the feature that best separates target from decoy PSMs. A false discovery rate threshold is used to define how high a target must score to be used as a positive example in the next training iteration.
- Parameters:
- psms : PsmDataset object
A collection of PSMs from which to train the model.
- Returns:
- self
- predict(psms)
Alias for decision_function().
- property rng
The random number generator for model training.
- save(out_file)
Save the model to a file.
- Parameters:
- out_file : str
The name of the file for the saved model.
- Returns:
- str
The output file name.
Notes
Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.
- mokapot.model.save_model(model, out_file)[source]
Save a mokapot.model.Model object to a file.
- Parameters:
- out_file : str
The name of the file for the saved model.
- Returns:
- str
The output file name.
Notes
Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.
- mokapot.model.load_model(model_file)[source]
Load a saved model for mokapot.
The saved model can be either a saved Model object or the output model weights from Percolator. In Percolator, these can be obtained using the --weights argument.
- Parameters:
- model_file : str
The name of the file from which to load the model.
- Returns:
- mokapot.model.Model
The loaded mokapot.model.Model object.
Warning
Unpickling data in Python is unsafe. Make sure that the model is from a source that you trust.
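To make the warning concrete, the following sketch shows why loading a pickle amounts to running code chosen by whoever created the file. It assumes, as the warning suggests, that saved models are pickle files. The Payload class is a harmless, hypothetical stand-in: it asks pickle to call sorted() on load, where a malicious file could request any importable callable instead.

```python
import pickle


class Payload:
    """Harmless stand-in for a malicious object stored in a pickle file."""

    def __reduce__(self):
        # Tell pickle to rebuild this object by calling sorted([3, 1, 2]).
        # The file author chooses both the callable and its arguments.
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())

# The callable runs here, during loading, before any of our own code
# has a chance to inspect what was loaded.
result = pickle.loads(blob)
print(result)
```

Note that result is not a Payload instance at all: it is whatever the embedded call returned, which is why checking the type of the loaded object afterwards offers no protection.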