Input/output and general control

Arguments

-i / --in_file INPUT_FILE

Input fasta file containing the protein sequences to align. Any gaps present in the input sequences are ignored. learnMSA uses the alphabet ARNDCQEGHILKMFPSTWYVXUO. Special characters B, Z, J are mapped to X. The sequences must not contain any other non-standard characters.

-o / --out_file OUTPUT_FILE

Output file path for the resulting multiple sequence alignment. Use -f to change the output file type.

-f / --format FORMAT

Format of the output alignment file. Per default, learnMSA outputs alignments in a2m format. This format is closely related to fasta and usually compatible with fasta parsers. In addition to fasta, a2m uses lower case letters to indicate insertions with respect to the profile HMM and uses dots (.) to represent an insertion in other sequences at the same position. It uses upper case letters for match states and dashes (-) for deletions. The format can be set to “fasta”, which uses only standard dashes and upper case letters. Use --convert to quickly convert between different formats. The a2m and fasta options use a maximum line length of 80 characters.

Furthermore, this option be set to any valid Biopython SeqIO format, in which case learnMSA will write a fasta file and automatically converts it. This is not recommended for large alignments, as output files can be very large and the file contents can not be streamed.

Default: a2m (fasta).

--convert MSA_FILE

With this option, learnMSA does not perform any alignment, but only converts the input MSA to the format specified with -f.

-s / --silent

Suppresses all standard output messages.

-d / --cuda_visible_devices

Controls the GPU devices visible to learnMSA as a comma- separated list of device IDs. The value -1 forces learnMSA to run on CPU. Per default, learnMSA attempts to use all available GPUs. Use -d i to use a specific GPU, where i is the GPU ID starting from 0.

--work_dir WORK_DIR

Directory where any secondary files are stored.

Default: ./tmp

--save_model MODEL_FILE

If set, the trained model parameters will be saved to the specified file. The file format is meant to be read with the --load_model option only.

--load_model MODEL_FILE

If set, learnMSA will load the model parameters from the specified file and use them as initialization for training. Use the --skip_training option to directly align the input sequences without further training.

Practical tips and example commands

Standard MSA in a2m format (--use_language_model is recommended but not required):

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model

Enforce fasta output if a2m leads to compatibility issues:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model -f fasta

To control where learnMSA writes temporary files, use the --work_dir option. In particular, this is useful when aligning the same input file multiple times in parallel, to avoid conflicts between different runs:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model --work_dir ./my_temp_dir

To save a trained model for later reuse, use the --save_model option:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --use_language_model --save_model my_model

This can be useful to reproduce alignments later on or to resume a training.

To load a previously saved model, use the --load_model option. You may combine it with --skip_training to directly align the input sequences without further training:

learnMSA -i INPUT_FILE -o OUTPUT_FILE --load_model my_model --skip_training