Training

learnMSA runs a gradient-based training in order to find the best possible HMM for aligning the input sequences. It automatically detects suitable hyperparameters for this training, but in some scenarios direct control over the training regime can be beneficial. Possible reasons to adjust the training parameters are:

  • The training fails due to memory issues or is too slow.

  • The input sequences are easy to align and training should be faster.

  • The time for alignment is not critical and best possible accuracy is desired.

Arguments

-n / --num_model NUM_MODEL

This option controls how many models are trained in parallel. The models differ slightly in their initialization, length (number of match states) and the mini-batches seen during training. After training, learnMSA automatically selects the best model according to the Akaike Information Criterion (AIC).

Increase this value to potentially gain accuracy at the cost of longer training times and higher memory consumption. Reduce it if GPU memory is limited or you want to speed up training.

Default: 4

-b / --batch BATCH_SIZE

Controls the batch size used during training, i.e. how many sequences are shown to each model per training step. The optimal batch size depends on the length of the input sequences and the available GPU memory. Increase this value to speed up training. Reduce this value if you run out of GPU memory.

Default: adaptive (typically 64–512, depending on the input sequences and the model size).

--learning_rate FLOAT

The learning rate used during gradient descent.

Default: 0.05 if --use_language_model is set, otherwise 0.1.
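
If training appears unstable, a smaller learning rate can be tried; this is a general gradient-descent heuristic rather than a learnMSA-specific recommendation, and the value below is purely illustrative:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --learning_rate 0.05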

--epochs EPOCHS [EPOCHS …]

Scheme for the number of training epochs during the first, an intermediate, and the last iteration. Provide either a single integer (used for all iterations) or 3 integers (for first, intermediate, and last iteration).

Default: [10, 2, 10]

--max_iterations MAX_ITERATIONS

Maximum number of training iterations. If greater than 2, model surgery will be applied.

Default: 2

--length_init_quantile LENGTH_INIT_QUANTILE

Quantile of the input sequence lengths that defines the initial model lengths.

Default: 0.5

--surgery_quantile SURGERY_QUANTILE

During all iterations except the last, learnMSA will not use sequences shorter than this length quantile for training.

Default: 0.5

--min_surgery_seqs MIN_SURGERY_SEQS

Minimum number of sequences used per iteration. Takes precedence over --surgery_quantile if the quantile would select fewer sequences.

Default: 100000
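
For example, to restrict training during the early iterations to roughly the longest 30% of the sequences while still guaranteeing at least 50,000 training sequences (the values are purely illustrative):

learnMSA -i INPUT_FILE -o OUTPUT_FILE --max_iterations 3 --surgery_quantile 0.7 --min_surgery_seqs 50000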

--len_mul LEN_MUL

Multiplicative constant applied to the sequence-length quantile (see --length_init_quantile) to define the initial model length.

Default: 0.8
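
As a rough illustration of how --length_init_quantile and --len_mul interact (assuming the initial length is simply the product of the quantile length and this constant): with the default quantile of 0.5 and a median sequence length of 200, the initial model length would be about 0.8 × 200 = 160 match states.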

--surgery_del SURGERY_DEL

During model surgery, match states that are expected less often than this fraction will be discarded.

Default: 0.5

--surgery_ins SURGERY_INS

During model surgery, insertions that are expected more often than this fraction will be expanded.

Default: 0.5
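
For example, to make model surgery more conservative (fewer match states discarded and fewer insertions expanded), lower --surgery_del and raise --surgery_ins. The values below are purely illustrative:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --max_iterations 3 --surgery_del 0.3 --surgery_ins 0.7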

--model_criterion MODEL_CRITERION

Criterion for model selection.

Default: AIC

--indexed_data

Do not load all data into memory at once. This reduces memory usage at the cost of training time.

--unaligned_insertions

Insertions will be left unaligned.

--crop CROP

During training, sequences longer than the given value will be cropped randomly. This reduces training runtime and memory usage, but might produce inaccurate results if too much sequence content is cropped. The output alignment will not be cropped. Can be set to auto, in which case sequences longer than --auto_crop_scale times the average length are cropped, or to disable to turn cropping off.

Default: auto
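
For example, to crop training sequences at 1000 residues (an illustrative value) or to turn cropping off entirely:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --crop 1000

learnMSA -i INPUT_FILE -o OUTPUT_FILE --crop disable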

--auto_crop_scale AUTO_CROP_SCALE

During training, sequences longer than this factor times the average length are cropped (applies when --crop is set to auto).

Default: 2.0

--frozen_insertions

Insertions will be frozen during training.

--no_sequence_weights

Do not use sequence weights, which removes mmseqs2 as a requirement. In general not recommended.

--skip_training

Skips the training phase entirely and only decodes an alignment from the provided model. This is useful if a pre-trained model is provided via --load_model or the model is initialized from an existing MSA via the option --init_msa.
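
For example, assuming MODEL_FILE is a path to a previously trained model that --load_model accepts (a placeholder name), an alignment can be decoded without any training:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --load_model MODEL_FILE --skip_training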

Practical tips and example commands

Basic Usage

Standard MSA in a2m format (--use_language_model is recommended but not required):

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model

Simple alignment without language model:

learnMSA -i INPUT_FILE -o OUTPUT_FILE

Training Configuration

Quick alignment without model surgery:

Faster results can be obtained by skipping model surgery:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --max_iterations 1

High-quality alignment with more models:

For maximum accuracy, train more models and use more iterations (requires more GPU memory and time):

learnMSA -i INPUT_FILE -o OUTPUT_FILE \
    --use_language_model \
    -n 10 \
    --max_iterations 3

Custom epoch scheme:

Use different numbers of epochs for first, intermediate, and last iterations:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --epochs 20 3 20

Memory and Performance Optimization

Limited GPU memory:

Reduce batch size and number of models, for example:

learnMSA -i INPUT_FILE -o OUTPUT_FILE -n 2 -b 32
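
Very large datasets:

If the data does not fit into memory, indexed data loading can be combined with cropping (the crop length is illustrative):

learnMSA -i INPUT_FILE -o OUTPUT_FILE --indexed_data --crop 800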