Protein Language Model Integration

learnMSA can leverage large protein language models to generate per-token embeddings that guide the multiple sequence alignment process. This integration can significantly improve alignment quality, especially for distantly related sequences.
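
For intuition, below is a minimal, standalone sketch of how such per-token (per-residue) embeddings can be computed with the Hugging Face transformers library. This is not learnMSA's internal code; the checkpoint name (Rostlab/prot_t5_xl_half_uniref50-enc) and the preprocessing shown follow the publicly documented protT5 encoder usage and may differ from what learnMSA does internally.

import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Public protT5 encoder checkpoint (assumption: learnMSA may use a
# different checkpoint or preprocessing internally).
model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# protT5 expects space-separated residues; map rare amino acids to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Drop the trailing special token: one 1024-dimensional vector per residue.
embeddings = outputs.last_hidden_state[0, : len(sequence)]
print(embeddings.shape)  # torch.Size([33, 1024])

Each residue thus receives a context-aware vector, which is the kind of signal the --use_language_model option feeds into the alignment step.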

Arguments

--use_language_model

Uses a large protein language model to generate per-token embeddings that guide the MSA step. Using this option is recommended unless computational resources are limited.

--plm_cache_dir PLM_CACHE_DIR

Directory where the downloaded protein language model is cached.

Default: learnMSA install directory

--language_model LANGUAGE_MODEL

Name of the protein language model to use. Possible values are protT5, esm2, and proteinBERT.

Default: protT5

Usage Examples

To use protein language model integration with default settings:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model

To use a different language model:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model --language_model esm2

To specify a custom cache directory and language model:

learnMSA -i INPUT_FILE -o OUTPUT_FILE \
    --use_language_model \
    --plm_cache_dir /path/to/cache \
    --language_model protT5