Training
learnMSA runs gradient-based training to find the best possible HMM for aligning the sequences. It automatically detects suitable hyperparameters for this training, but in some scenarios direct control over the training regime can be beneficial. Possible reasons to adjust the training parameters are:
The training fails due to memory issues or is too slow.
The input sequences are very easy to align and training should be faster.
Alignment time is not critical and the best possible accuracy is desired.
Arguments
-n / --num_model NUM_MODEL
This option controls how many models are trained in parallel. The models differ slightly in their initialization, length (number of match states), and the mini-batches seen during training. learnMSA automatically selects the best model according to the Akaike Information Criterion (AIC) after training.
Increase this option for the potential to gain accuracy at the cost of longer training times and higher memory consumption. Reduce this option when you have limited GPU memory or want to speed up training.
Default: 4
-b / --batch BATCH_SIZE
Controls the batch size used during training, i.e. how many sequences are shown to each model per training step. The optimal batch size depends on the length of the input sequences and the available GPU memory. Increase this value to speed up training. Reduce this value if you run out of GPU memory.
Default: adaptive (typically 64–512, depending on the input proteins and the model size).
--learning_rate FLOAT
The learning rate used during gradient descent.
Default: 0.05 if --use_language_model is set, otherwise 0.1.
--epochs EPOCHS [EPOCHS …]
Scheme for the number of training epochs during the first, an intermediate, and the last iteration. Provide either a single integer (used for all iterations) or 3 integers (for the first, intermediate, and last iteration).
Default: [10, 2, 10]
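For example, to halve the default learning rate of 0.1 (the value is illustrative, not a tuned recommendation):
learnMSA -i INPUT_FILE -o OUTPUT_FILE --learning_rate 0.05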
--max_iterations MAX_ITERATIONS
Maximum number of training iterations. If greater than 2, model surgery will be applied.
Default: 2
--length_init_quantile LENGTH_INIT_QUANTILE
Quantile of the input sequence lengths that defines the initial model lengths.
Default: 0.5
--surgery_quantile SURGERY_QUANTILE
learnMSA will not use sequences shorter than this quantile for training during all iterations except the last.
Default: 0.5
--min_surgery_seqs MIN_SURGERY_SEQS
Minimum number of sequences used per iteration. Overshadows the effect of --surgery_quantile.
Default: 100000
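For example, to train on roughly the longest 30% of the sequences in early iterations while guaranteeing at least 50,000 sequences per iteration (values illustrative):
learnMSA -i INPUT_FILE -o OUTPUT_FILE --surgery_quantile 0.7 --min_surgery_seqs 50000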
--len_mul LEN_MUL
Multiplicative constant for the quantile used to define the initial model length (see --length_init_quantile).
Default: 0.8
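For example, to initialize shorter models by lowering both the length quantile and the multiplier (values illustrative):
learnMSA -i INPUT_FILE -o OUTPUT_FILE --length_init_quantile 0.3 --len_mul 0.7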
--surgery_del SURGERY_DEL
Will discard match states that are expected less often than this fraction.
Default: 0.5
--surgery_ins SURGERY_INS
Will expand insertions that are expected more often than this fraction.
Default: 0.5
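For example, to enable model surgery (more than 2 iterations) with more conservative thresholds, so that match states are only discarded when expected less than 30% of the time and insertions only expanded when expected more than 70% of the time (values illustrative):
learnMSA -i INPUT_FILE -o OUTPUT_FILE --max_iterations 3 --surgery_del 0.3 --surgery_ins 0.7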
--model_criterion MODEL_CRITERION
Criterion for model selection.
Default: AIC
--indexed_data
Do not load all data into memory at once, at the cost of longer training time.
--unaligned_insertions
Insertions will be left unaligned.
--crop CROP
During training, sequences longer than the given value will be cropped randomly. This reduces training runtime and memory usage, but might produce inaccurate results if too much of the sequences is cropped. The output alignment will not be cropped. Can be set to "auto", in which case sequences longer than 3 times the average length are cropped, or to "disable".
Default: auto
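For example, to cap training sequences at a fixed length, or to disable cropping when accuracy matters more than runtime (the length is illustrative):
learnMSA -i INPUT_FILE -o OUTPUT_FILE --crop 1000
learnMSA -i INPUT_FILE -o OUTPUT_FILE --crop disable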
--auto_crop_scale AUTO_CROP_SCALE
During training, sequences longer than this factor times the average length are cropped.
Default: 2.0
--frozen_insertions
Insertions will be frozen during training.
--no_sequence_weights
Do not use sequence weights. This removes mmseqs2 as a requirement. In general not recommended.
--skip_training
Skips the training phase entirely and only decodes an alignment from the provided model. This is useful if a pre-trained model is provided via --load_model or the model is initialized from an existing MSA via --init_msa.
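A minimal sketch, assuming --load_model takes the path of a previously saved model (MODEL_FILE is a placeholder):
learnMSA -i INPUT_FILE -o OUTPUT_FILE --load_model MODEL_FILE --skip_training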
Practical tips and example commands
Basic Usage
Standard MSA in a2m format (--use_language_model is recommended but not required):
learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model
Simple alignment without language model:
learnMSA -i INPUT_FILE -o OUTPUT_FILE
Training Configuration
Quick alignment without model surgery:
Faster results can be obtained by skipping model surgery:
learnMSA -i INPUT_FILE -o OUTPUT_FILE --max_iterations 1
High-quality alignment with more models:
For maximum accuracy, train more models and use more iterations (requires more GPU memory and time):
learnMSA -i INPUT_FILE -o OUTPUT_FILE \
--use_language_model \
-n 10 \
--max_iterations 3
Custom epoch scheme:
Use different numbers of epochs for first, intermediate, and last iterations:
learnMSA -i INPUT_FILE -o OUTPUT_FILE --epochs 20 3 20
Memory and Performance Optimization
Limited GPU memory:
Reduce batch size and number of models, for example:
learnMSA -i INPUT_FILE -o OUTPUT_FILE -n 2 -b 32
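If the dataset itself is too large to hold in memory, --indexed_data can be added as well, at the cost of slower training:
learnMSA -i INPUT_FILE -o OUTPUT_FILE -n 2 -b 32 --indexed_data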