Machine Learning Models

mokapot implements an iterative algorithm for training machine learning models to distinguish high-scoring target peptide-spectrum matches (PSMs) from decoy PSMs. This logic lives in the Model class. A Model instance can be created from any object with a scikit-learn estimator interface, allowing a wide variety of models to be used. Once initialized, the Model.fit() method trains the underlying classifier on a collection of PSMs using this iterative approach.

Additional subclasses of the Model class are available for typical use cases. For example, use PercolatorModel if you want to emulate the behavior of Percolator.
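As a minimal sketch of the two routes just described, the hypothetical helper below returns a PercolatorModel when no estimator is supplied and otherwise wraps the given scikit-learn-style classifier in a Model. The helper name and its defaults are illustrative assumptions; the constructor arguments are taken from the signatures documented on this page.

```python
def build_model(estimator=None, train_fdr=0.01, max_iter=10, rng=0):
    """Return a PercolatorModel, or a Model wrapping `estimator` if given.

    `build_model` is a hypothetical helper, not part of mokapot.
    """
    # Imported here so the helper can be defined without mokapot installed.
    from mokapot.model import Model, PercolatorModel

    if estimator is None:
        # Emulate Percolator with the built-in linear SVM model.
        return PercolatorModel(train_fdr=train_fdr, max_iter=max_iter, rng=rng)
    # Any object with a scikit-learn estimator interface can be wrapped.
    return Model(estimator, train_fdr=train_fdr, max_iter=max_iter, rng=rng)
```

For instance, build_model(LogisticRegression()), with LogisticRegression imported from sklearn.linear_model, would yield a Model around a logistic regression classifier that can then be trained with fit(psms).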

class mokapot.model.Model(estimator, scaler=None, train_fdr=0.01, max_iter=10, direction=None, override=False, subset_max_train=None, shuffle=True, rng=None)

A machine learning model to re-score PSMs.

Any classifier with a scikit-learn estimator interface can be used. This class also supports hyperparameter optimization using classes from the sklearn.model_selection module, such as GridSearchCV and RandomizedSearchCV.
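To illustrate the hyperparameter optimization support, this sketch wraps a linear SVM in GridSearchCV before handing it to Model. The choice of LinearSVC, the grid over C, and the helper name are assumptions made for the example, not mokapot defaults.

```python
def build_grid_search_model(c_values=(0.1, 1.0, 10.0), cv=3, rng=0):
    """Wrap a grid-searched linear SVM in a mokapot Model (hypothetical helper)."""
    # Imported lazily so the helper can be defined without these packages.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC
    from mokapot.model import Model

    # The search object itself follows the scikit-learn estimator
    # interface, which is why it can be passed to Model directly.
    search = GridSearchCV(
        LinearSVC(dual=False),
        param_grid={"C": list(c_values)},
        cv=cv,
    )
    return Model(search, rng=rng)
```

The returned Model is used like any other: calling fit(psms) runs the iterative procedure with the search object as the classifier.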

Parameters:
estimator : classifier object

A classifier that is assumed to implement the scikit-learn estimator interface. To emulate Percolator (an SVM model), use PercolatorModel instead.

scaler : scaler object or "as-is", optional

Defines how features are normalized before model fitting and prediction. The default, None, subtracts the mean and scales to unit variance using sklearn.preprocessing.StandardScaler. Other scalers should follow the scikit-learn transformer interface, implementing fit_transform() and transform() methods. Alternatively, the string "as-is" leaves the features in their original scale.

train_fdr : float, optional

The maximum false discovery rate at which to consider a target PSM as a positive example.

max_iter : int, optional

The number of iterations to perform.

direction : str or None, optional

The name of the feature to use as the initial direction for ranking PSMs. The default, None, automatically selects the feature that finds the most PSMs below the train_fdr. This is ignored if the model is already trained.

override : bool, optional

Whether to use the learned model even if it performs worse than the best feature.

subset_max_train : int or None, optional

Use only a random subset of the PSMs for training. This is useful for very large datasets or models that scale poorly with the number of PSMs. The default, None, uses all of the PSMs.

shuffle : bool, optional

Whether to randomize the order of PSMs for training. For deterministic algorithms, this has no effect.

rng : int or numpy.random.Generator, optional

The seed or generator used for model training.
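The interplay of train_fdr and direction can be pictured with a small, pure-Python sketch. It is not mokapot's implementation: the function names are hypothetical, the FDR is estimated as (decoys + 1) / targets among PSMs at or above each score (an assumption), higher feature values are assumed to be better, and tied scores get no special treatment.

```python
def accepted_targets(scores, is_target, train_fdr=0.01):
    """Return indices of target PSMs whose q-value is at most `train_fdr`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    targets = decoys = 0
    fdrs = []
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        # Assumed estimator: (decoys + 1) / targets at this score threshold
        fdrs.append((decoys + 1) / max(targets, 1))
    # q-value: the lowest FDR at this threshold or any more permissive one
    for k in range(len(fdrs) - 2, -1, -1):
        fdrs[k] = min(fdrs[k], fdrs[k + 1])
    return [i for k, i in enumerate(order) if is_target[i] and fdrs[k] <= train_fdr]


def best_direction(features, is_target, train_fdr=0.01):
    """Pick the feature name that accepts the most targets at `train_fdr`."""
    return max(
        features,
        key=lambda name: len(accepted_targets(features[name], is_target, train_fdr)),
    )


# Toy data: "score" separates targets from decoys better than "length".
is_target = [True, True, True, True, False, True, False, False]
features = {
    "score": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0],
    "length": [7.0, 12.0, 9.0, 8.0, 11.0, 10.0, 13.0, 6.0],
}
positives = accepted_targets(features["score"], is_target, train_fdr=0.3)
direction = best_direction(features, is_target, train_fdr=0.3)
```

On a toy set this small, the pseudocount keeps any PSM from passing at the default 1% threshold, hence the looser value of 0.3 used here.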

Attributes:
estimator : classifier object

The classifier used to re-score PSMs.

scaler : scaler object

The scaler used to normalize features.

features : list of str or None

The names of the features used to fit the model. None if the model has yet to be trained.

is_trained : bool

Indicates if the model has been trained.

train_fdr : float

The maximum false discovery rate at which to consider a target PSM as a positive example.

max_iter : int

The number of iterations to perform.

direction : str or None

The name of the feature to use as the initial direction for ranking PSMs.

override : bool

Whether to use the learned model even if it performs worse than the best feature.

subset_max_train : int or None

The maximum number of PSMs to use for training. None if all PSMs are used.

shuffle : bool

Whether the order of PSMs is shuffled for training.

fold : int or None

The cross-validation fold on which this model was fit, if any.

rng : numpy.random.Generator

The random number generator for model training.

Methods

decision_function(psms)

Score a collection of PSMs.

fit(psms)

Fit the model using the Percolator algorithm.

predict(psms)

Alias for decision_function().

save(out_file)

Save the model to a file.

property rng

The random number generator for model training.

save(out_file)

Save the model to a file.

Parameters:
out_file : str

The name of the file for the saved model.

Returns:
str

The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

decision_function(psms)

Score a collection of PSMs.

Parameters:
psms : PsmDataset object

A collection of PSMs to score.

Returns:
numpy.ndarray

A numpy.ndarray containing the score for each PSM.
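A short sketch of how the returned scores might be used: the hypothetical helper below ranks PSMs by score. It relies only on the documented decision_function(psms) call and assumes that the model has already been fit, that higher scores are better, and that the scores follow the order of the PSMs in psms.

```python
def top_scoring(model, psms, n=10):
    """Return (index, score) pairs for the `n` best-scoring PSMs."""
    scores = model.decision_function(psms)  # one score per PSM
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, float(scores[i])) for i in order[:n]]
```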

predict(psms)

Alias for decision_function().

fit(psms)

Fit the model using the Percolator algorithm.

The model is trained by iteratively learning to separate decoy PSMs from high-scoring target PSMs. By default, an initial direction is chosen as the feature that best separates target from decoy PSMs. A false discovery rate threshold is used to define how high a target must score to be used as a positive example in the next training iteration.
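The loop described above can be sketched in plain Python. This is an illustration of the idea, not mokapot's code: the function and class names are hypothetical, the estimator is duck-typed with fit() and decision_function(), and the FDR cutoff is passed in as a caller-supplied function (here replaced by a simple stand-in that keeps the two best targets).

```python
def percolator_style_fit(estimator, features, is_target, initial_scores,
                         select_positives, max_iter=10):
    """Iteratively retrain `estimator` on confident targets versus all decoys.

    `select_positives(scores, is_target)` is a caller-supplied (hypothetical)
    function returning the indices of targets that pass the FDR threshold.
    """
    decoys = [i for i, target in enumerate(is_target) if not target]
    scores = list(initial_scores)  # e.g. the values of the initial direction
    for _ in range(max_iter):
        positives = select_positives(scores, is_target)
        if not positives:
            break  # no confident targets left to learn from
        rows = positives + decoys
        labels = [1] * len(positives) + [0] * len(decoys)
        estimator.fit([features[i] for i in rows], labels)
        # Re-score every PSM; these scores pick the next round's positives.
        scores = list(estimator.decision_function(features))
    return estimator, scores


class MidpointClassifier:
    """Toy 1-D classifier: scores are distances from the class-mean midpoint."""

    def fit(self, X, y):
        pos = [row[0] for row, label in zip(X, y) if label == 1]
        neg = [row[0] for row, label in zip(X, y) if label == 0]
        self.cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def decision_function(self, X):
        return [row[0] - self.cut for row in X]


def two_best_targets(scores, is_target):
    """Stand-in for an FDR cutoff: keep the two best-scoring targets."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if is_target[i]][:2]


features = [[5.0], [4.0], [3.0], [1.0], [0.5], [0.0]]
is_target = [True, True, True, False, True, False]
model, scores = percolator_style_fit(
    MidpointClassifier(), features, is_target,
    initial_scores=[row[0] for row in features],
    select_positives=two_best_targets, max_iter=3,
)
```

Each pass trains on the currently confident targets against all decoys and then re-scores every PSM, so the set of positive examples can change as the classifier improves.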

Parameters:
psms : PsmDataset object

A collection of PSMs from which to train the model.

Returns:
self

class mokapot.model.PercolatorModel(scaler=None, train_fdr=0.01, max_iter=10, direction=None, override=False, subset_max_train=None, n_jobs=-1, rng=None)

A model that emulates Percolator. Creates a linear support vector machine (SVM) model that is similar to the one used by Percolator. This is the default model used by mokapot.

Parameters:
scaler : scaler object or "as-is", optional

Defines how features are normalized before model fitting and prediction. The default, None, subtracts the mean and scales to unit variance using sklearn.preprocessing.StandardScaler. Other scalers should follow the scikit-learn transformer interface, implementing fit_transform() and transform() methods. Alternatively, the string "as-is" leaves the features in their original scale.

train_fdr : float, optional

The maximum false discovery rate at which to consider a target PSM as a positive example.

max_iter : int, optional

The number of iterations to perform.

direction : str or None, optional

The name of the feature to use as the initial direction for ranking PSMs. The default, None, automatically selects the feature that finds the most PSMs below the train_fdr. This is ignored if the model is already trained.

override : bool, optional

Whether to use the learned model even if it performs worse than the best feature.

subset_max_train : int or None, optional

Use only a random subset of the PSMs for training. This is useful for very large datasets or models that scale poorly with the number of PSMs. The default, None, uses all of the PSMs.

n_jobs : int, optional

The number of jobs used to parallelize the hyperparameter grid search.

rng : int or numpy.random.Generator, optional

The seed or generator used for model training.

Attributes:
estimator : classifier object

The classifier used to re-score PSMs.

scaler : scaler object

The scaler used to normalize features.

features : list of str or None

The names of the features used to fit the model. None if the model has yet to be trained.

is_trained : bool

Indicates if the model has been trained.

train_fdr : float

The maximum false discovery rate at which to consider a target PSM as a positive example.

max_iter : int

The number of iterations to perform.

direction : str or None

The name of the feature to use as the initial direction for ranking PSMs.

override : bool

Whether to use the learned model even if it performs worse than the best feature.

subset_max_train : int or None

The maximum number of PSMs to use for training. None if all PSMs are used.

n_jobs : int

The number of jobs used to parallelize the hyperparameter grid search.

rng : numpy.random.Generator

The random number generator for model training.
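As a brief sketch of how these attributes might be read after training, the hypothetical helper below fits a PercolatorModel and collects a few of them. It assumes psms is a PsmDataset prepared elsewhere; the helper name and the chosen defaults (n_jobs=1, rng=42) are illustrative only.

```python
def train_percolator(psms, train_fdr=0.01, n_jobs=1, rng=42):
    """Fit a PercolatorModel on `psms` and report what it learned from."""
    # Imported here so the helper can be defined without mokapot installed.
    from mokapot.model import PercolatorModel

    model = PercolatorModel(train_fdr=train_fdr, n_jobs=n_jobs, rng=rng)
    model.fit(psms)  # fit() returns the model itself
    summary = {
        "is_trained": model.is_trained,
        "features": model.features,
        "n_jobs": model.n_jobs,
    }
    return model, summary
```

Passing an integer rng fixes the seed, which should make repeated runs on the same PSMs comparable.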

Methods

decision_function(psms)

Score a collection of PSMs.

fit(psms)

Fit the model using the Percolator algorithm.

predict(psms)

Alias for decision_function().

save(out_file)

Save the model to a file.

decision_function(psms)

Score a collection of PSMs.

Parameters:
psms : PsmDataset object

A collection of PSMs to score.

Returns:
numpy.ndarray

A numpy.ndarray containing the score for each PSM.

fit(psms)

Fit the model using the Percolator algorithm.

The model is trained by iteratively learning to separate decoy PSMs from high-scoring target PSMs. By default, an initial direction is chosen as the feature that best separates target from decoy PSMs. A false discovery rate threshold is used to define how high a target must score to be used as a positive example in the next training iteration.

Parameters:
psms : PsmDataset object

A collection of PSMs from which to train the model.

Returns:
self

predict(psms)

Alias for decision_function().

property rng

The random number generator for model training.

save(out_file)

Save the model to a file.

Parameters:
out_file : str

The name of the file for the saved model.

Returns:
str

The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

mokapot.model.save_model(model, out_file)

Save a mokapot.model.Model object to a file.

Parameters:
model : mokapot.model.Model

The model to save.

out_file : str

The name of the file for the saved model.

Returns:
str

The output file name.

Notes

Because classes may change between mokapot and scikit-learn versions, a saved model may not work when either is changed from the version that created the model.

mokapot.model.load_model(model_file)

Load a saved model for mokapot.

The saved model can either be a saved Model object or the output model weights from Percolator. In Percolator, these can be obtained using the --weights argument.

Parameters:
model_file : str

The name of the file from which to load the model.

Returns:
mokapot.model.Model

The loaded mokapot.model.Model object.

Warning

Unpickling data in Python is unsafe. Make sure that the model is from a source that you trust.
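A minimal round-trip sketch using the two functions documented above. The helper name and the default file name are hypothetical; the sketch assumes the model is reloaded with the same mokapot and scikit-learn versions that saved it.

```python
def save_and_reload(model, out_file="mokapot_model.pkl"):
    """Save `model`, then load it back from the same file."""
    # Imported here so the helper can be defined without mokapot installed.
    from mokapot.model import load_model, save_model

    saved_path = save_model(model, out_file)  # returns the output file name
    # Only load files you created yourself or otherwise trust.
    return load_model(saved_path)
```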