API Reference

learnMSA: Learning and Aligning Large Protein Families with support of protein language models.

Sequence Dataset

Classes for managing sequence datasets (either unaligned or aligned).

class learnMSA.msa_hmm.SequenceDataset.SequenceDataset(filepath=None, fmt='fasta', sequences=None, indexed=False)

Bases: object

Manages a set of protein sequences.

Parameters:

filepath (Path | str | None)
fmt (str)
sequences (list[tuple[str, str]] | None)
indexed (bool)

alphabet: str = 'ARNDCQEGHILKMFPSTWYVXUO-'

close()

Return type: None

property filepath: Path: Path to the sequence file.

property fmt: str: Format of the sequence file.

get_alphabet_no_gap()

Get the alphabet without gap characters.

Return type: str

get_encoded_seq(i, remove_gaps=True, gap_symbols='-.', ignore_symbols='', replace_with_x='BZJ', crop_to_length=inf, validate_alphabet=True, dtype=<class 'numpy.int16'>, return_crop_boundaries=False)

Returns sequence i encoded as a numpy array of integers.

Parameters:

i (int) – Index of the sequence to process.
remove_gaps (bool) – Passed to get_standardized_seq.
gap_symbols (str) – Passed to get_standardized_seq.
ignore_symbols (str) – Passed to get_standardized_seq.
replace_with_x (str) – Passed to get_standardized_seq.
crop_to_length (float) – Maximum length of the returned sequence. If the sequence is longer than this length, a random crop of this length is returned; otherwise the whole sequence is returned.
validate_alphabet (bool) – If True, check that the sequence contains only characters from the defined alphabet.
dtype (type) – Numpy integer type to use for the encoded sequence.
return_crop_boundaries (bool) – If True, also return the start and end indices of the crop within the original sequence.

Return type:

ndarray | tuple[ndarray, int, int]
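The encoding and cropping behaviour described above can be sketched in plain Python. This is a minimal illustration, not learnMSA's implementation: the helper name `encode`, the `rng` parameter, the use of alphabet positions as integer codes, and the half-open `(start, end)` crop boundaries are all assumptions made for the example. It returns a list rather than a numpy array to stay dependency-free.

```python
import random

# Class attribute quoted from the reference above.
ALPHABET = "ARNDCQEGHILKMFPSTWYVXUO-"


def encode(seq, crop_to_length=float("inf"), rng=random):
    """Hypothetical helper: map characters to alphabet indices, with random crop."""
    start, end = 0, len(seq)
    if len(seq) > crop_to_length:
        crop = int(crop_to_length)
        # randint is inclusive on both ends, so every valid window is reachable.
        start = rng.randint(0, len(seq) - crop)
        end = start + crop  # assumed half-open interval [start, end)
    index = {ch: k for k, ch in enumerate(ALPHABET)}
    # A KeyError here plays the role of alphabet validation in this sketch.
    encoded = [index[ch] for ch in seq[start:end]]
    return encoded, start, end
```

With the default `crop_to_length`, the whole sequence is encoded and the boundaries are `0` and `len(seq)`. Passing a seeded `random.Random` instance as `rng` makes the crop reproducible.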

get_header(i)

Get the header/description for sequence i.

Parameters: i (int)
Return type: str

get_record(i)

Get the SeqRecord object for sequence i.

Parameters: i (int)
Return type: SeqRecord

get_standardized_seq(i, remove_gaps=True, gap_symbols='-.', ignore_symbols='', replace_with_x='')

Returns a standardized string for sequence i. The result contains only uppercase letters from the standard amino acid alphabet and either the standard gap character ‘-’ or no gap characters at all.

Parameters:

i (int) – Index of the sequence to process.
remove_gaps (bool) – If True, all gap characters provided in gap_symbols are removed from the sequence. If False, all gap characters are replaced with the first character in gap_symbols.
gap_symbols (str) – String containing all characters to be treated as gap characters.
ignore_symbols (str) – String containing all characters to be ignored/removed from the sequence.
replace_with_x (str) – String containing all characters to be replaced with ‘X’ in the sequence.
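The parameter semantics above can be illustrated with a small stand-alone function. This is a sketch under assumptions, not the library's code: the name `standardize` is hypothetical, it works on a raw string instead of a dataset index, and the order of the steps (ignore, gap handling, uppercase, replace with X) is a guess.

```python
def standardize(seq, remove_gaps=True, gap_symbols="-.",
                ignore_symbols="", replace_with_x=""):
    """Hypothetical re-implementation of the documented standardization rules."""
    out = []
    for ch in seq:
        if ch in ignore_symbols:
            continue  # ignored symbols are dropped entirely
        if ch in gap_symbols:
            if not remove_gaps:
                # Normalize every gap symbol to the first one listed.
                out.append(gap_symbols[0])
            continue
        ch = ch.upper()
        out.append("X" if ch in replace_with_x else ch)
    return "".join(out)
```

For example, `standardize("ac.D-e")` yields `"ACDE"`, while `standardize("ac.D-e", remove_gaps=False)` yields `"AC-D-E"`.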

property indexed: bool: Whether the dataset is indexed.

property max_len: int: Maximum sequence length in the dataset.

property num_seq: int: Total number of sequences in the dataset.

property parsing_ok: bool: Whether the dataset was parsed successfully.

property record_dict: dict[str, SeqRecord] | _IndexedSeqFileDict: Dictionary-like object that maps sequence IDs to SeqRecord objects.

property seq_ids: list[str]: List of sequence IDs.

property seq_lens: ndarray: Lengths of the sequences in the dataset.

validate_dataset(single_seq_ok=False, empty_seq_id_ok=False, dublicate_seq_id_ok=False)

Raise an error if the dataset is not valid for processing.

Parameters:

single_seq_ok (bool)
empty_seq_id_ok (bool)
dublicate_seq_id_ok (bool)

Return type:

None
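The reference does not spell out which checks are performed, so the following is only a guess based on the flag names. The function `validate`, the `(seq_id, sequence)` record layout, and the choice of `ValueError` are assumptions for illustration. The spelling `dublicate_seq_id_ok` follows the documented signature.

```python
def validate(records, single_seq_ok=False, empty_seq_id_ok=False,
             dublicate_seq_id_ok=False):
    """Hypothetical validator over a list of (seq_id, sequence) tuples."""
    if not records:
        raise ValueError("dataset contains no sequences")
    if len(records) == 1 and not single_seq_ok:
        raise ValueError("dataset contains only a single sequence")
    ids = [seq_id for seq_id, _ in records]
    if not empty_seq_id_ok and any(not seq_id for seq_id in ids):
        raise ValueError("dataset contains an empty sequence ID")
    if not dublicate_seq_id_ok and len(set(ids)) != len(ids):
        raise ValueError("dataset contains duplicate sequence IDs")
```

Each flag relaxes exactly one check in this sketch, so a caller can opt in to, say, single-sequence input without also accepting duplicate IDs.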

write(filepath, fmt='fasta', standardize_sequences=False)

Write the dataset to a file.

Parameters:

filepath (Path) – Path to the output file.
fmt (str) – Format of the output file. Can be any format supported by Biopython’s SeqIO.
standardize_sequences (bool) – If True, sequences are converted to uppercase and non-standard amino acids are replaced with ‘X’. Dots are replaced with dashes.

Return type:

None

class learnMSA.msa_hmm.SequenceDataset.AlignedDataset(filepath=None, fmt='fasta', aligned_sequences=None, indexed=False, single_seq_ok=False)

Bases: SequenceDataset

Manages a multiple sequence alignment.

Parameters:

filepath (Path | str | None)
fmt (str)
aligned_sequences (list[tuple[str, str]] | None)
indexed (bool)
single_seq_ok (bool)

SP_score(ref_data, batch=512)

Compute the SP-score of this alignment with respect to a reference alignment.

Parameters:

ref_data (AlignedDataset) – Reference alignment.
batch (int) – Number of sequences to process in each batch. Lower values reduce memory consumption but increase computation time.

Return type:

float
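As a rough illustration of what a sum-of-pairs comparison measures, the sketch below uses one common definition: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. Whether learnMSA uses exactly this definition (for example, how it treats reference columns or batching) is not stated here, and the helper names `aligned_pairs` and `sp_score` are hypothetical. Both alignments are assumed to contain the same sequences in the same order, with ‘-’ as the only gap symbol.

```python
from itertools import combinations


def aligned_pairs(rows, gap="-"):
    """Collect residue pairs ((seq_a, pos_a), (seq_b, pos_b)) that share a column."""
    pairs = set()
    positions = [0] * len(rows)  # next ungapped position per sequence
    for col in range(len(rows[0])):
        residues = []
        for s, row in enumerate(rows):
            if row[col] != gap:
                residues.append((s, positions[s]))
                positions[s] += 1
        # Every two residues in the same column form one aligned pair.
        pairs.update(combinations(residues, 2))
    return pairs


def sp_score(test_rows, ref_rows):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(ref_rows)
    test = aligned_pairs(test_rows)
    return len(ref & test) / len(ref) if ref else 0.0
```

This naive version holds all pairs in memory at once, which is the cost the `batch` parameter above is meant to control in the real method.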

property alignment_len: int: Length of the alignment (number of columns).

property column_map: ndarray: Mapping from sequence positions to MSA-column index.

get_column_map(i)

Get the mapping from sequence positions to MSA-column index for a specific sequence.

Parameters: i (int)
Return type: ndarray
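The idea of a column map can be shown on a single aligned row: entry k is the index of the MSA column that holds the k-th residue of the sequence. The function below is a stand-alone sketch with a hypothetical name and a plain list as its result. It does not reproduce how the library stores or returns the map.

```python
def column_map_for_row(aligned_row, gap_symbols="-."):
    """Hypothetical helper: MSA column index of each non-gap residue."""
    return [col for col, ch in enumerate(aligned_row) if ch not in gap_symbols]
```

For the row `"A-C.D"`, the three residues sit in columns 0, 2 and 4, so the map is `[0, 2, 4]`.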

property msa_matrix: ndarray: MSA matrix as a 2D numpy array of shape (num_seq, alignment_len).

validate_dataset()

Raise an error if the MSA is not valid for processing.

Return type: None