API Reference
learnMSA: Learning and Aligning Large Protein Families with support of protein language models.
Sequence Dataset
Classes for managing sequence datasets (either unaligned or aligned).
- class learnMSA.msa_hmm.SequenceDataset.SequenceDataset(filepath=None, fmt='fasta', sequences=None, indexed=False)
Bases: object
Manages a set of protein sequences.
- Parameters:
filepath (Path | str | None)
fmt (str)
sequences (list[tuple[str, str]] | None)
indexed (bool)
- alphabet: str = 'ARNDCQEGHILKMFPSTWYVXUO-'
- close()
- Return type:
None
- property filepath: Path
Path to the sequence file.
- property fmt: str
Format of the sequence file.
- get_alphabet_no_gap()
Get the alphabet without gap characters.
- Return type:
str
- get_encoded_seq(i, remove_gaps=True, gap_symbols='-.', ignore_symbols='', replace_with_x='BZJ', crop_to_length=inf, validate_alphabet=True, dtype=<class 'numpy.int16'>, return_crop_boundaries=False)
Returns sequence i encoded as a numpy array of integers.
- Parameters:
i (int) – Index of the sequence to process.
remove_gaps (bool) – Passed to get_standardized_seq.
gap_symbols (str) – Passed to get_standardized_seq.
ignore_symbols (str) – Passed to get_standardized_seq.
replace_with_x (str) – Passed to get_standardized_seq.
crop_to_length (float) – If the sequence is longer than this length, a random crop of this length is returned; otherwise the whole sequence is returned.
validate_alphabet (bool) – If True, check that the sequence contains only characters from the defined alphabet.
dtype (type) – Numpy integer type to use for the encoded sequence.
return_crop_boundaries (bool) – If True, also return the start and end indices of the crop within the original sequence.
- Return type:
ndarray | tuple[ndarray, int, int]
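The encoding and cropping behaviour described above can be illustrated with a small standalone sketch. It is not learnMSA's implementation: the helper name encode_seq is hypothetical, it returns a plain Python list rather than a numpy array, and it assumes that each symbol is encoded as its position in the class alphabet and that the returned end index is exclusive. Neither assumption is confirmed by this reference.

```python
import random

# Class attribute `alphabet` as listed in this reference.
ALPHABET = "ARNDCQEGHILKMFPSTWYVXUO-"


def encode_seq(seq, crop_to_length=float("inf"), validate_alphabet=True):
    """Hypothetical helper: encode an already standardized sequence.

    Assumption: a symbol's integer code is its index in ALPHABET.
    Returns (codes, start, end) with `end` exclusive.
    """
    if validate_alphabet:
        unknown = set(seq) - set(ALPHABET)
        if unknown:
            raise ValueError(f"symbols outside the alphabet: {sorted(unknown)}")
    start, end = 0, len(seq)
    if len(seq) > crop_to_length:
        # Pick a random window of exactly crop_to_length symbols.
        width = int(crop_to_length)
        start = random.randint(0, len(seq) - width)  # both bounds inclusive
        end = start + width
    codes = [ALPHABET.index(ch) for ch in seq[start:end]]
    return codes, start, end
```

For example, encode_seq("ARNDCQ", crop_to_length=4) yields four codes together with the boundaries of the randomly chosen window, while a sequence of four or fewer symbols is returned whole.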
- get_header(i)
Get the header/description for sequence i.
- Parameters:
i (int)
- Return type:
str
- get_record(i)
Get the SeqRecord object for sequence i.
- Parameters:
i (int)
- Return type:
SeqRecord
- get_standardized_seq(i, remove_gaps=True, gap_symbols='-.', ignore_symbols='', replace_with_x='')
Returns sequence i as a standardized string containing only uppercase letters from the standard amino acid alphabet, with gaps either removed entirely or normalized to a single gap character (‘-’ with the default gap_symbols).
- Parameters:
i (int) – Index of the sequence to process.
remove_gaps (bool) – If True, all gap characters provided in gap_symbols are removed from the sequence. If False, all gap characters are replaced with the first character in gap_symbols.
gap_symbols (str) – String containing all characters to be treated as gap characters.
ignore_symbols (str) – String containing all characters to be ignored/removed from the sequence.
replace_with_x (str) – String containing all characters to be replaced with ‘X’ in the sequence.
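To make the interplay of these parameters concrete, the following standalone sketch mimics the documented behaviour on a plain string. It is not the library's code: the function name standardize_seq is hypothetical, and the order of the steps (ignore, then gaps, then uppercase, then replacement with ‘X’) is an assumption.

```python
def standardize_seq(seq, remove_gaps=True, gap_symbols="-.",
                    ignore_symbols="", replace_with_x=""):
    """Hypothetical re-implementation of the documented parameter semantics."""
    out = []
    for ch in seq:
        if ch in ignore_symbols:
            continue  # ignored symbols are dropped entirely
        if ch in gap_symbols:
            if not remove_gaps:
                # Normalize every gap symbol to the first one in gap_symbols.
                out.append(gap_symbols[0])
            continue
        ch = ch.upper()
        if ch in replace_with_x:
            ch = "X"  # e.g. ambiguity codes such as B, Z, J
        out.append(ch)
    return "".join(out)
```

With the defaults, standardize_seq("ac.D-e") gives "ACDE"; with remove_gaps=False the same input gives "AC-D-E".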
- property indexed: bool
Whether the dataset is indexed.
- property max_len: int
Maximum sequence length in the dataset.
- property num_seq: int
Total number of sequences in the dataset.
- property parsing_ok: bool
Whether the dataset was parsed successfully.
- property record_dict: dict[str, SeqRecord] | _IndexedSeqFileDict
Dictionary(-like) object that takes sequence IDs as keys and maps them to SeqRecord objects.
- property seq_ids: list[str]
List of sequence IDs.
- property seq_lens: ndarray
Lengths of the sequences in the dataset.
- validate_dataset(single_seq_ok=False, empty_seq_id_ok=False, dublicate_seq_id_ok=False)
Raise an error if the dataset is not valid for processing.
- Parameters:
single_seq_ok (bool)
empty_seq_id_ok (bool)
dublicate_seq_id_ok (bool)
- Return type:
None
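The reference does not describe the three flags, so the sketch below only illustrates what their names suggest: each flag relaxes one check on the sequence IDs. The function validate_ids, the exception type, and the exact checks are assumptions, not learnMSA's implementation. The spelling dublicate_seq_id_ok follows the documented signature.

```python
def validate_ids(seq_ids, single_seq_ok=False, empty_seq_id_ok=False,
                 dublicate_seq_id_ok=False):
    """Hypothetical checks inferred from the flag names only."""
    if len(seq_ids) == 0:
        raise ValueError("dataset contains no sequences")
    if len(seq_ids) == 1 and not single_seq_ok:
        raise ValueError("dataset contains only a single sequence")
    if not empty_seq_id_ok and any(seq_id == "" for seq_id in seq_ids):
        raise ValueError("dataset contains an empty sequence ID")
    if not dublicate_seq_id_ok and len(set(seq_ids)) != len(seq_ids):
        raise ValueError("dataset contains duplicate sequence IDs")
```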
- write(filepath, fmt='fasta', standardize_sequences=False)
Write the dataset to a file.
- Parameters:
filepath (Path) – Path to the output file.
fmt (str) – Format of the output file. Can be any format supported by Biopython’s SeqIO.
standardize_sequences (bool) – If True, sequences are converted to uppercase and non-standard amino acids are replaced with ‘X’. Dots are replaced with dashes.
- Return type:
None
- class learnMSA.msa_hmm.SequenceDataset.AlignedDataset(filepath=None, fmt='fasta', aligned_sequences=None, indexed=False, single_seq_ok=False)
Bases: SequenceDataset
Manages a multiple sequence alignment.
- Parameters:
filepath (Path | str | None)
fmt (str)
aligned_sequences (list[tuple[str, str]] | None)
indexed (bool)
single_seq_ok (bool)
- SP_score(ref_data, batch=512)
Compute the SP-score of this alignment with respect to a reference alignment.
- Parameters:
ref_data (AlignedDataset) – Reference alignment.
batch (int) – Number of sequences to process in each batch. Lower values reduce memory consumption but increase computation time.
- Return type:
float
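As a point of orientation, the sum-of-pairs (SP) score is commonly defined as the fraction of residue pairs aligned in the reference that are also aligned in the evaluated alignment. The sketch below computes that textbook definition on gapped strings. It is an assumption that SP_score follows the same definition; the function names are hypothetical and the batching controlled by the batch parameter is not modelled.

```python
from itertools import combinations


def aligned_pairs(rows, gap="-"):
    """Collect all residue pairs that share a column.

    A residue is identified as (sequence index, position in the ungapped
    sequence), so the pairs are comparable across different alignments
    of the same sequences.
    """
    pairs = set()
    positions = [0] * len(rows)  # next residue position per sequence
    for col in range(len(rows[0])):
        present = []
        for s, row in enumerate(rows):
            if row[col] != gap:
                present.append((s, positions[s]))
                positions[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test_rows, ref_rows):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(ref_rows)
    if not ref:
        return 0.0  # no aligned pairs in the reference
    return len(ref & aligned_pairs(test_rows)) / len(ref)
```

For the reference rows "AC-D" and "ACGD" and the test rows "ACD-" and "ACGD", two of the three reference pairs are recovered, so the score is 2/3.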
- property alignment_len: int
Length of the alignment (number of columns).
- property column_map: ndarray
Mapping from sequence positions to MSA-column index.
- get_column_map(i)
Get the mapping from sequence positions to MSA-column index for a specific sequence.
- Parameters:
i (int)
- Return type:
ndarray
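A column map of this kind can be pictured as the list of column indices at which a sequence has a residue rather than a gap. The following minimal sketch derives it from one gapped row. The helper name column_map_for_row is hypothetical, a plain list stands in for the ndarray, and the assumption that ‘-’ is the only gap symbol applies to this sketch only.

```python
def column_map_for_row(row, gap="-"):
    """Hypothetical helper: map residue position k to its MSA column index."""
    return [col for col, ch in enumerate(row) if ch != gap]


# The k-th residue of the ungapped sequence "ACD" sits in column cmap[k].
cmap = column_map_for_row("A-C-D")
```

Here cmap is [0, 2, 4], meaning the residues A, C and D occupy columns 0, 2 and 4 of the alignment.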
- property msa_matrix: ndarray
MSA matrix as a 2D numpy array of shape (num_seq, alignment_len).
- validate_dataset()
Raise an error if the MSA is not valid for processing.
- Return type:
None