Training
========

learnMSA runs a gradient-based training in order to find the best possible HMM for aligning the sequences. It automatically detects suitable hyperparameters for this training. In some scenarios, direct control over the training regime can be beneficial. Possible reasons to adjust the training parameters are:

- The training fails due to memory issues or is too slow.
- The input sequences are very easy and the training should be faster.
- The time for alignment is not critical and the best possible accuracy is desired.

Arguments
---------

``-n / --num_model`` *NUM_MODEL*
    Controls how many models are trained in parallel. The models differ slightly in their initialization, length (number of match states) and mini-batches seen during training. learnMSA automatically selects the best model according to the Akaike Information Criterion (AIC) after training. Increase this option for the potential to gain accuracy at the cost of longer training times and higher memory consumption. Reduce it when you have limited GPU memory or want to speed up training.

    Default: 4

``-b / --batch`` *BATCH_SIZE*
    Controls the batch size used during training, i.e. how many sequences are shown to each model per training step. The optimal batch size depends on the length of the input sequences and the available GPU memory. Increase this value to speed up training. Reduce this value if you run out of GPU memory.

    Default: adaptive (typically 64–512, based on the input proteins and model size)

``--learning_rate`` *FLOAT*
    The learning rate used during gradient descent.

    Default: 0.05 if ``--use_language_model`` is set, otherwise 0.1

``--epochs`` *EPOCHS [EPOCHS ...]*
    Scheme for the number of training epochs during the first, an intermediate, and the last iteration. Provide either a single integer (used for all iterations) or 3 integers (for the first, intermediate, and last iteration).

    Default: [10, 2, 10]

``--max_iterations`` *MAX_ITERATIONS*
    Maximum number of training iterations. If greater than 2, model surgery will be applied.

    Default: 2

``--length_init_quantile`` *LENGTH_INIT_QUANTILE*
    Quantile of the input sequence lengths that defines the initial model lengths.

    Default: 0.5

``--surgery_quantile`` *SURGERY_QUANTILE*
    learnMSA will not use sequences shorter than this quantile for training during all iterations except the last.

    Default: 0.5

``--min_surgery_seqs`` *MIN_SURGERY_SEQS*
    Minimum number of sequences used per iteration. Overshadows the effect of ``--surgery_quantile``.

    Default: 100000

``--len_mul`` *LEN_MUL*
    Multiplicative constant for the quantile used to define the initial model length (see ``--length_init_quantile``).

    Default: 0.8

``--surgery_del`` *SURGERY_DEL*
    Discards match states that are expected less often than this fraction.

    Default: 0.5

``--surgery_ins`` *SURGERY_INS*
    Expands insertions that are expected more often than this fraction.

    Default: 0.5

``--model_criterion`` *MODEL_CRITERION*
    Criterion for model selection.

    Default: AIC

``--indexed_data``
    Do not load all data into memory at once, at the cost of longer training time.

``--unaligned_insertions``
    Insertions will be left unaligned.

``--crop`` *CROP*
    During training, sequences longer than the given value will be cropped randomly. This reduces training runtime and memory usage, but might produce inaccurate results if too much of the sequences is cropped. The output alignment will not be cropped. Can be set to ``auto``, in which case sequences longer than 3 times the average length are cropped. Can be set to ``disable``.
    Default: auto

``--auto_crop_scale`` *AUTO_CROP_SCALE*
    During training, sequences longer than this factor times the average length are cropped.

    Default: 2.0

``--frozen_insertions``
    Insertions will be frozen during training.

``--no_sequence_weights``
    Do not use sequence weights and remove mmseqs2 from the requirements. In general not recommended.

``--skip_training``
    Skips the training phase entirely and only decodes an alignment from the provided model. This is useful if a pre-trained model is provided via ``--load_model`` or the model is initialized from an existing MSA via the option ``--init_msa``.

Practical tips and example commands
-----------------------------------

Basic Usage
^^^^^^^^^^^

Standard MSA in a2m format (``--use_language_model`` is recommended but not required):

.. code-block:: bash

    learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model

Simple alignment without language model:

.. code-block:: bash

    learnMSA -i INPUT_FILE -o OUTPUT_FILE

Training Configuration
^^^^^^^^^^^^^^^^^^^^^^

**Quick alignment without model surgery:**

Faster results can be obtained by skipping model surgery:

.. code-block:: bash

    learnMSA -i INPUT_FILE -o OUTPUT_FILE --max_iterations 1

**High-quality alignment with more models:**

For maximum accuracy, train more models and use more iterations (requires more GPU memory and time):

.. code-block:: bash

    learnMSA -i INPUT_FILE -o OUTPUT_FILE \
        --use_language_model \
        -n 10 \
        --max_iterations 3

**Custom epoch scheme:**

Use different numbers of epochs for the first, intermediate, and last iterations:

.. code-block:: bash

    learnMSA -i INPUT_FILE -o OUTPUT_FILE --epochs 20 3 20

Memory and Performance Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Limited GPU memory:**

Reduce batch size and number of models, for example:

.. code-block:: bash

    learnMSA -i INPUT_FILE -o OUTPUT_FILE -n 2 -b 32
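
**Very large datasets:**

The following sketch combines options documented above to keep memory usage low for very large inputs: the data is read from an index instead of being loaded into memory at once (``--indexed_data``), and long sequences are cropped during training (``--crop``). The crop length of 512 is only an illustrative value; pick it based on your sequence lengths.

.. code-block:: bash

    # illustrative combination; --crop 512 is a placeholder value
    learnMSA -i INPUT_FILE -o OUTPUT_FILE --indexed_data --crop 512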