API Reference

learnMSA: Learning and Aligning Large Protein Families with support of protein language models.

Sequence Dataset

Classes for managing sequence datasets (either unaligned or aligned).

class learnMSA.msa_hmm.SequenceDataset.SequenceDataset(filepath=None, fmt='fasta', sequences=None, indexed=False)

Bases: object

Manages a set of protein sequences.

Parameters:
  • filepath (Path | str | None)

  • fmt (str)

  • sequences (list[tuple[str, str]] | None)

  • indexed (bool)

alphabet: str = 'ARNDCQEGHILKMFPSTWYVXUO-'

close()

Return type:

None

property filepath: Path

Path to the sequence file.

property fmt: str

Format of the sequence file.

get_alphabet_no_gap()

Get the alphabet without gap characters.

Return type:

str
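As a minimal illustration of what this method is documented to return, the snippet below strips the gap character from the class-level alphabet string shown above. It is a standalone sketch, not the library's code; the constant names are hypothetical, and it assumes '-' is the only gap character in the alphabet.

```python
# Class-level alphabet as documented for SequenceDataset.
ALPHABET = "ARNDCQEGHILKMFPSTWYVXUO-"

# Assumption: '-' is the only gap character contained in the alphabet.
ALPHABET_NO_GAP = ALPHABET.replace("-", "")

print(ALPHABET_NO_GAP)  # ARNDCQEGHILKMFPSTWYVXUO
```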

get_encoded_seq(i, remove_gaps=True, gap_symbols='-.', ignore_symbols='', replace_with_x='BZJ', crop_to_length=inf, validate_alphabet=True, dtype=numpy.int16, return_crop_boundaries=False)

Returns sequence i encoded as a numpy array of integers.

Parameters:
  • i (int) – Index of the sequence to process.

  • remove_gaps (bool) – Passed to get_standardized_seq.

  • gap_symbols (str) – Passed to get_standardized_seq.

  • ignore_symbols (str) – Passed to get_standardized_seq.

  • replace_with_x (str) – Passed to get_standardized_seq.

  • crop_to_length (float) – Maximum length of the returned sequence. If the sequence is longer than this, a random crop of exactly this length is returned; otherwise the whole sequence is returned.

  • validate_alphabet (bool) – If True, check that the sequence contains only characters from the defined alphabet.

  • dtype (type) – Numpy integer type to use for the encoded sequence.

  • return_crop_boundaries (bool) – If True, also return the start and end indices of the crop within the original sequence.

Return type:

ndarray | tuple[ndarray, int, int]
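The encoding and cropping behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: it assumes each character is encoded as its index in the alphabet string, that crop boundaries are half-open (start inclusive, end exclusive), and it returns a list instead of a numpy array. The function name encode_seq and the rng argument are hypothetical.

```python
import random

ALPHABET = "ARNDCQEGHILKMFPSTWYVXUO-"


def encode_seq(seq, crop_to_length=float("inf"),
               return_crop_boundaries=False, rng=random):
    """Encode an already standardized sequence as a list of integers."""
    # Validation step: every character must belong to the alphabet.
    unknown = set(seq) - set(ALPHABET)
    if unknown:
        raise ValueError(f"Unknown symbols: {sorted(unknown)}")

    # Random crop only if the sequence exceeds the requested length.
    start, end = 0, len(seq)
    if len(seq) > crop_to_length:
        length = int(crop_to_length)
        start = rng.randint(0, len(seq) - length)  # inclusive bounds
        end = start + length

    # Assumption: the integer code is the character's alphabet index.
    encoded = [ALPHABET.index(c) for c in seq[start:end]]
    if return_crop_boundaries:
        return encoded, start, end
    return encoded


print(encode_seq("ARND"))  # [0, 1, 2, 3]
```

With return_crop_boundaries=True the sketch returns a triple, mirroring the tuple[ndarray, int, int] return type listed above.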

get_header(i)

Get the header/description for sequence i.

Parameters:
  • i (int) – Index of the sequence.

Return type:

str

get_record(i)

Get the SeqRecord object for sequence i.

Parameters:
  • i (int) – Index of the sequence.

Return type:

SeqRecord

get_standardized_seq(i, remove_gaps=True, gap_symbols='-.', ignore_symbols='', replace_with_x='')

Returns sequence i as a standardized string containing only uppercase letters from the standard amino acid alphabet, with gaps either written as the standard gap character ‘-’ or removed entirely.

Parameters:
  • i (int) – Index of the sequence to process.

  • remove_gaps (bool) – If True, all gap characters provided in gap_symbols are removed from the sequence. If False, all gap characters are replaced with the first character in gap_symbols.

  • gap_symbols (str) – String containing all characters to be treated as gap characters.

  • ignore_symbols (str) – String containing all characters to be ignored/removed from the sequence.

  • replace_with_x (str) – String containing all characters to be replaced with ‘X’ in the sequence.
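The parameter semantics above can be sketched as a small standalone function. This is an illustrative reimplementation of the documented behaviour, not the library's code; the name standardize_seq is hypothetical, and the order in which the rules are applied (ignore, then gaps, then uppercase, then replacement with ‘X’) is an assumption.

```python
def standardize_seq(seq, remove_gaps=True, gap_symbols="-.",
                    ignore_symbols="", replace_with_x=""):
    """Return a standardized copy of a raw sequence string."""
    out = []
    for ch in seq:
        if ch in ignore_symbols:
            continue  # ignored symbols are dropped entirely
        if ch in gap_symbols:
            if not remove_gaps:
                # Every gap symbol collapses to the first one listed.
                out.append(gap_symbols[0])
            continue
        ch = ch.upper()
        out.append("X" if ch in replace_with_x else ch)
    return "".join(out)


print(standardize_seq("mk.V-bL"))                        # MKVBL
print(standardize_seq("mk.V-bL", remove_gaps=False))     # MK-V-BL
print(standardize_seq("mk.V-bL", replace_with_x="BZJ"))  # MKVXL
```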

property indexed: bool

Whether the dataset is indexed.

property max_len: int

Maximum sequence length in the dataset.

property num_seq: int

Total number of sequences in the dataset.

property parsing_ok: bool

Whether the dataset was parsed successfully.

property record_dict: dict[str, SeqRecord] | _IndexedSeqFileDict

Dictionary-like object that maps sequence IDs to SeqRecord objects.

property seq_ids: list[str]

List of sequence IDs.

property seq_lens: ndarray

Lengths of the sequences in the dataset.

validate_dataset(single_seq_ok=False, empty_seq_id_ok=False, dublicate_seq_id_ok=False)

Raise an error if the dataset is not valid for processing.

Parameters:
  • single_seq_ok (bool)

  • empty_seq_id_ok (bool)

  • dublicate_seq_id_ok (bool)

Return type:

None
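The three flags are not described individually here, so the following sketch only illustrates one plausible reading of their names: each flag switches off one check. The function check_dataset, its input format (a list of (id, sequence) tuples), and the exact error type are assumptions made for illustration and may differ from what the library does.

```python
from collections import Counter


def check_dataset(records, single_seq_ok=False, empty_seq_id_ok=False,
                  dublicate_seq_id_ok=False):
    """Raise ValueError if a list of (id, sequence) tuples looks invalid."""
    if not records:
        raise ValueError("Dataset contains no sequences.")
    if len(records) == 1 and not single_seq_ok:
        raise ValueError("Dataset contains only a single sequence.")

    ids = [seq_id for seq_id, _ in records]
    if not empty_seq_id_ok and any(seq_id == "" for seq_id in ids):
        raise ValueError("Dataset contains an empty sequence ID.")
    if not dublicate_seq_id_ok:
        duplicates = [k for k, n in Counter(ids).items() if n > 1]
        if duplicates:
            raise ValueError(f"Duplicate sequence IDs: {duplicates}")


# Passes silently: two sequences with distinct, non-empty IDs.
check_dataset([("a", "ARND"), ("b", "ARN")])
```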

write(filepath, fmt='fasta', standardize_sequences=False)

Write the dataset to a file.

Parameters:
  • filepath (Path) – Path to the output file.

  • fmt (str) – Format of the output file. Can be any format supported by Biopython’s SeqIO.

  • standardize_sequences (bool) – If True, sequences are converted to uppercase and non-standard amino acids are replaced with ‘X’. Dots are replaced with dashes.

Return type:

None

class learnMSA.msa_hmm.SequenceDataset.AlignedDataset(filepath=None, fmt='fasta', aligned_sequences=None, indexed=False, single_seq_ok=False)

Bases: SequenceDataset

Manages a multiple sequence alignment.

Parameters:
  • filepath (Path | str | None)

  • fmt (str)

  • aligned_sequences (list[tuple[str, str]] | None)

  • indexed (bool)

  • single_seq_ok (bool)

SP_score(ref_data, batch=512)

Compute the SP-score of this alignment with respect to a reference alignment.

Parameters:
  • ref_data (AlignedDataset) – Reference alignment.

  • batch (int) – Number of sequences to process in each batch. Lower values reduce memory consumption but increase computation time.

Return type:

float
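For orientation, the sketch below computes a sum-of-pairs score under the common definition: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. Whether SP_score uses exactly this normalization is an assumption. The helper names are hypothetical, alignments are plain lists of equal-length strings with ‘-’ as the gap character, and the batch parameter is not modelled.

```python
from itertools import combinations


def aligned_pairs(msa, gap="-"):
    """Collect all pairs of residues that share an alignment column."""
    pairs = set()
    position = [0] * len(msa)  # next residue index per sequence
    for col in range(len(msa[0])):
        residues = []
        for s, row in enumerate(msa):
            if row[col] != gap:
                residues.append((s, position[s]))
                position[s] += 1
        # Residues are listed in sequence order, so each pair has a
        # canonical orientation that is comparable across alignments.
        pairs.update(combinations(residues, 2))
    return pairs


def sp_score(test_msa, ref_msa):
    """Fraction of reference residue pairs recovered by test_msa."""
    ref = aligned_pairs(ref_msa)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test_msa)) / len(ref)


ref = ["AR-N", "ARDN"]
print(sp_score(ref, ref))  # 1.0
```

Both alignments must contain the same sequences in the same order for the pair sets to be comparable.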

property alignment_len: int

Length of the alignment (number of columns).

property column_map: ndarray

Mapping from sequence positions to MSA column indices.

get_column_map(i)

Get the mapping from sequence positions to MSA column indices for sequence i.

Parameters:
  • i (int) – Index of the sequence.

Return type:

ndarray
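To make the idea of a column map concrete, the following sketch derives one from a single aligned row: entry k is the column that holds the k-th residue of the sequence. It is a standalone illustration, not the library's code; the function name is hypothetical, indices are assumed to be 0-based, and a plain list stands in for the ndarray.

```python
def column_map_for_row(aligned_seq, gap_symbols="-."):
    """Map each residue position of one aligned row to its MSA column."""
    return [col for col, ch in enumerate(aligned_seq)
            if ch not in gap_symbols]


# Residues A, R, N sit in columns 0, 2 and 4 of the aligned row.
print(column_map_for_row("A-R.N"))  # [0, 2, 4]
```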

property msa_matrix: ndarray

MSA matrix as a 2D numpy array of shape (num_seq, alignment_len).

validate_dataset()

Raise an error if the MSA is not valid for processing.

Return type:

None