Machine Learning Models

mokapot implements an iterative algorithm for training machine learning models to distinguish high-scoring target peptide-spectrum matches (PSMs) from decoy PSMs. The Model class contains this logic. A Model instance can be created from any object with a scikit-learn estimator interface, allowing a wide variety of models to be used. Once initialized, the Model.fit() method trains the underlying classifier on a collection of PSMs using this iterative approach.

Additional subclasses of the Model class are available for typical use cases. For example, use PercolatorModel if you want to emulate the behavior of Percolator.

class mokapot.model.Model(estimator, scaler=None, train_fdr=0.01, max_iter=10, direction=None, override=False, subset_max_train=None, shuffle=True, rng=None)

A machine learning model to re-score PSMs.

Any classifier with a scikit-learn estimator interface can be used. This class also supports hyperparameter optimization using classes from the sklearn.model_selection module, such as GridSearchCV and RandomizedSearchCV.

Parameters:

estimator (classifier object): A classifier that is assumed to implement the scikit-learn estimator interface. To emulate Percolator (an SVM model), use PercolatorModel instead.
scaler (scaler object or "as-is", optional): Defines how features are normalized before model fitting and prediction. The default, None, subtracts the mean and scales to unit variance using sklearn.preprocessing.StandardScaler. Other scalers should follow the scikit-learn transformer interface, implementing fit_transform() and transform() methods. Alternatively, the string "as-is" leaves the features in their original scale.
train_fdr (float, optional): The maximum false discovery rate at which to consider a target PSM as a positive example.
max_iter (int, optional): The number of iterations to perform.
direction (str or None, optional): The name of the feature to use as the initial direction for ranking PSMs. The default, None, automatically selects the feature that finds the most PSMs below the train_fdr. This is ignored if the model is already trained.
override (bool, optional): If the learned model performs worse than the best feature, should the model still be used?
subset_max_train (int or None, optional): Use only a random subset of the PSMs for training. This is useful for very large datasets or models that scale poorly with the number of PSMs. The default, None, uses all of the PSMs.
shuffle (bool, optional): Should the order of PSMs be randomized for training? For deterministic algorithms, this will have no effect.
rng (int or numpy.random.Generator, optional): The seed or generator used for model training.
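To illustrate what the default direction=None means, the following standard-library sketch picks the feature that accepts the most target PSMs at a given FDR, trying each feature in both sort orders. This is not mokapot's implementation: the function names, the toy data, the feature names, and the (decoys + 1) / targets FDR estimate are all assumptions made for the example, and ties between scores are ignored.

```python
def count_accepted(scores, is_target, fdr, higher_is_better=True):
    """Count target PSMs accepted at the given FDR when ranking by `scores`."""
    order = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=higher_is_better,
    )
    targets = decoys = accepted = 0
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        # Conservative target-decoy FDR estimate at this score cutoff.
        if targets and (decoys + 1) / targets <= fdr:
            accepted = targets
    return accepted


def best_direction(features, is_target, fdr):
    """Return (feature name, higher_is_better, accepted targets)."""
    best = (None, True, 0)
    for name, scores in features.items():
        for higher_is_better in (True, False):
            n = count_accepted(scores, is_target, fdr, higher_is_better)
            if n > best[2]:
                best = (name, higher_is_better, n)
    return best


# Hypothetical toy data: six confident targets, one weak target, three decoys.
is_target = [True, True, True, True, True, True, False, True, False, False]
features = {
    "score_a": [9, 8, 7, 6, 5, 4, 3.5, 3, 2, 1],
    "score_b": [0.1, 0.2, 0.3, 0.9, 0.4, 0.5, 0.45, 0.95, 0.8, 0.7],
}

# A relaxed threshold, because ten PSMs cannot pass a 1% FDR.
print(best_direction(features, is_target, fdr=0.2))
# ('score_a', True, 6)
```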

Attributes:

estimator (classifier object): The classifier used to re-score PSMs.
scaler (scaler object): The scaler used to normalize features.
features (list of str or None): The names of the features used to fit the model. None if the model has yet to be trained.
is_trained (bool): Indicates if the model has been trained.
train_fdr (float): The maximum false discovery rate at which to consider a target PSM as a positive example.
max_iter (int): The number of iterations to perform.
direction (str or None): The name of the feature to use as the initial direction for ranking PSMs.
override (bool): If the learned model performs worse than the best feature, should the model still be used?
subset_max_train (int or None): The number of PSMs for training.
shuffle (bool): Is the order of PSMs shuffled for training?
fold (int or None): The CV fold on which this model was fit, if any.
rng (numpy.random.Generator): The random number generator for model training.

Methods

`decision_function`(psms)	Score a collection of PSMs
`fit`(psms)	Fit the model using the Percolator algorithm.
`predict`(psms)	Alias for `decision_function()`.
`save`(out_file)	Save the model to a file.

property rng: The random number generator for model training.

save(out_file)

Save the model to a file.

Parameters:

out_file (str): The name of the file for the saved model.

Returns:

str: The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

decision_function(psms)

Score a collection of PSMs.

Parameters:

psms (PsmDataset object): A collection of PSMs to score.

Returns:

numpy.ndarray: A numpy.ndarray containing the score for each PSM.

predict(psms): Alias for decision_function().

fit(psms)

Fit the model using the Percolator algorithm.

The model is trained by iteratively learning to separate decoy PSMs from high-scoring target PSMs. By default, an initial direction is chosen as the feature that best separates target from decoy PSMs. A false discovery rate threshold is used to define how high a target must score to be used as a positive example in the next training iteration.
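The loop described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not mokapot's code: the classifier is a stand-in (weights equal to the difference of class means) rather than a scikit-learn estimator, the FDR is estimated as (decoys + 1) / targets, there is no feature scaling or cross-validation, and all function names and toy data are hypothetical.

```python
def label_psms(scores, is_target, train_fdr):
    """Return 1 for confident targets, 0 for other targets, -1 for decoys."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    targets = decoys = n_top = 0
    for rank, i in enumerate(order, start=1):
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        if targets and (decoys + 1) / targets <= train_fdr:
            n_top = rank  # everything ranked this high is inside the cutoff
    confident = set(order[:n_top])
    return [
        -1 if not target else (1 if i in confident else 0)
        for i, target in enumerate(is_target)
    ]


def column_means(rows):
    """Mean of each feature column."""
    return [sum(col) / len(col) for col in zip(*rows)]


def fit_linear(rows, labels):
    """Stand-in classifier: weights are the difference of the class means."""
    pos = column_means([r for r, y in zip(rows, labels) if y == 1])
    neg = column_means([r for r, y in zip(rows, labels) if y == -1])
    return [p - n for p, n in zip(pos, neg)]


def iterative_fit(rows, is_target, init_feature, train_fdr, max_iter):
    scores = [row[init_feature] for row in rows]  # initial direction
    weights = None
    for _ in range(max_iter):
        labels = label_psms(scores, is_target, train_fdr)
        if 1 not in labels:
            raise ValueError("No target PSMs passed the FDR threshold.")
        weights = fit_linear(rows, labels)
        # Re-score every PSM with the new model for the next iteration.
        scores = [sum(w * x for w, x in zip(weights, row)) for row in rows]
    return weights, scores


# Hypothetical toy data: two features per PSM.
rows = [(9, 3), (8, 4), (7, 3), (6, 5), (5, 4), (4, 3),
        (3.5, 1), (3, 1), (2, 0), (1, 1)]
is_target = [True, True, True, True, True, True, False, True, False, False]

weights, scores = iterative_fit(
    rows, is_target, init_feature=0, train_fdr=0.2, max_iter=3
)
print([round(w, 3) for w in weights])
# [4.333, 3.0]
```

Targets labeled 0 are left out of training in each iteration, so only confident targets and decoys shape the model.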

Parameters:

psmsPsmDataset object: A collection of PSMs from which to train the model.

Returns:

self

class mokapot.model.PercolatorModel(scaler=None, train_fdr=0.01, max_iter=10, direction=None, override=False, subset_max_train=None, n_jobs=-1, rng=None)

A model that emulates Percolator. Creates a linear support vector machine (SVM) model that is similar to the one used by Percolator. This is the default model used by mokapot.

Parameters:

scaler (scaler object or "as-is", optional): Defines how features are normalized before model fitting and prediction. The default, None, subtracts the mean and scales to unit variance using sklearn.preprocessing.StandardScaler. Other scalers should follow the scikit-learn transformer interface, implementing fit_transform() and transform() methods. Alternatively, the string "as-is" leaves the features in their original scale.
train_fdr (float, optional): The maximum false discovery rate at which to consider a target PSM as a positive example.
max_iter (int, optional): The number of iterations to perform.
direction (str or None, optional): The name of the feature to use as the initial direction for ranking PSMs. The default, None, automatically selects the feature that finds the most PSMs below the train_fdr. This is ignored if the model is already trained.
override (bool, optional): If the learned model performs worse than the best feature, should the model still be used?
subset_max_train (int or None, optional): Use only a random subset of the PSMs for training. This is useful for very large datasets or models that scale poorly with the number of PSMs. The default, None, uses all of the PSMs.
n_jobs (int, optional): The number of jobs used to parallelize the hyperparameter grid search.
rng (int or numpy.random.Generator, optional): The seed or generator used for model training.
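The effect of subset_max_train together with a seed can be pictured with a small sketch. It uses the standard-library random module as a stand-in for the numpy.random.Generator named above, and the function name is hypothetical; it only illustrates the idea of a reproducible random subset, not how mokapot selects PSMs internally.

```python
import random


def subsample_indices(n_psms, subset_max_train=None, seed=None):
    """Return the indices of the PSMs that would be used for training."""
    indices = list(range(n_psms))
    # None (the default) or a limit at least as large as the data keeps all PSMs.
    if subset_max_train is None or subset_max_train >= n_psms:
        return indices
    rng = random.Random(seed)  # seeded, so the subset is reproducible
    return sorted(rng.sample(indices, subset_max_train))


first = subsample_indices(1000, subset_max_train=100, seed=42)
second = subsample_indices(1000, subset_max_train=100, seed=42)
print(len(first), first == second)
# 100 True
```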

Attributes:

estimator (classifier object): The classifier used to re-score PSMs.
scaler (scaler object): The scaler used to normalize features.
features (list of str or None): The names of the features used to fit the model. None if the model has yet to be trained.
is_trained (bool): Indicates if the model has been trained.
train_fdr (float): The maximum false discovery rate at which to consider a target PSM as a positive example.
max_iter (int): The number of iterations to perform.
direction (str or None): The name of the feature to use as the initial direction for ranking PSMs.
override (bool): If the learned model performs worse than the best feature, should the model still be used?
subset_max_train (int or None): The number of PSMs for training.
n_jobs (int): The number of jobs to use for parallelizing the hyperparameter grid search.
rng (numpy.random.Generator): The random number generator for model training.

Methods

`decision_function`(psms)	Score a collection of PSMs
`fit`(psms)	Fit the model using the Percolator algorithm.
`predict`(psms)	Alias for `decision_function()`.
`save`(out_file)	Save the model to a file.

decision_function(psms)

Score a collection of PSMs.

Parameters:

psms (PsmDataset object): A collection of PSMs to score.

Returns:

numpy.ndarray: A numpy.ndarray containing the score for each PSM.

fit(psms)

Fit the model using the Percolator algorithm.

The model is trained by iteratively learning to separate decoy PSMs from high-scoring target PSMs. By default, an initial direction is chosen as the feature that best separates target from decoy PSMs. A false discovery rate threshold is used to define how high a target must score to be used as a positive example in the next training iteration.

Parameters:

psms (PsmDataset object): A collection of PSMs from which to train the model.

Returns:

self

predict(psms): Alias for decision_function().

property rng: The random number generator for model training.

save(out_file)

Save the model to a file.

Parameters:

out_file (str): The name of the file for the saved model.

Returns:

str: The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

mokapot.model.save_model(model, out_file)

Save a mokapot.model.Model object to a file.

Parameters:

model (mokapot.model.Model): The model to save.
out_file (str): The name of the file for the saved model.

Returns:

str: The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

mokapot.model.load_model(model_file)

Load a saved model for mokapot.

The saved model can either be a saved Model object or the output model weights from Percolator. In Percolator, these can be obtained using the --weights argument.

Parameters:

model_file (str): The name of the file from which to load the model.

Returns:

mokapot.model.Model: The loaded mokapot.model.Model object.

Warning

Unpickling data in Python is unsafe. Make sure that the model is from a source that you trust.
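The reason for this warning can be shown with the standard-library pickle module alone. In the sketch below, a hypothetical Payload class tells pickle which callable to run when the data is loaded. The callable here is the harmless built-in int, but the author of a pickle file can name any importable function, which is why loading a file amounts to running code chosen by whoever created it.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs a chosen callable on load."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call int('42')".
        # A malicious file could name a far less harmless callable.
        return (int, ("42",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it calls int("42") instead.
result = pickle.loads(blob)
print(type(result).__name__, result)
# int 42
```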