treeflow.evolution.seqio module

class treeflow.evolution.seqio.AlignmentFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum to represent the file format of a multiple sequence alignment

FASTA = 'fasta'
NEXUS = 'nexus'
NEXML = 'nexml'
PHYLIP = 'phylip'
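As a minimal sketch of how an enum like this behaves, the stand-in class below mirrors the members documented above instead of importing from treeflow, so it only illustrates standard Python Enum semantics (lookup by value, rejection of unknown values), not anything specific to the library.

```python
from enum import Enum


class AlignmentFormat(Enum):
    """Stand-in mirroring the documented members (not imported from treeflow)."""

    FASTA = "fasta"
    NEXUS = "nexus"
    NEXML = "nexml"
    PHYLIP = "phylip"


# Look a member up by its string value, e.g. one read from a config file.
fmt = AlignmentFormat("nexus")
print(fmt.name, fmt.value)

# Unknown values raise ValueError, so an unsupported format fails early.
try:
    AlignmentFormat("clustal")
    rejected = False
except ValueError:
    rejected = True
print("clustal rejected:", rejected)
```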
class treeflow.evolution.seqio.AlignmentType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum to represent the character data type of a sequence alignment

NUCLEOTIDE = 'nucleotide'
PROTEIN = 'protein'
class treeflow.evolution.seqio.Alignment(fasta_file: str | bytes | PathLike | None = None, sequence_mapping: Mapping[str, Collection[str]] | None = None, format: AlignmentFormat = AlignmentFormat.FASTA, data_type: AlignmentType = AlignmentType.NUCLEOTIDE)

Bases: object

Class to represent a multiple sequence alignment

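Either fasta_file or sequence_mapping must be provided.

Parameters:
  • fasta_file (Optional[PathLikeType]) – Filename of the alignment file to read (optional). The filename is passed to open, so it can be a string, path or buffer (default: None)

  • sequence_mapping (Optional[Mapping[str, Collection[str]]]) – Mapping from names to sequences (optional) (default: None)

  • format (AlignmentFormat) – File format of the alignment file (default: AlignmentFormat.FASTA)

  • data_type (AlignmentType) – Character data type of the sequences (default: AlignmentType.NUCLEOTIDE)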
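To make the shape of a sequence mapping concrete, here is a library-free sketch that parses FASTA text into a name-to-sequence dictionary. The helper read_fasta_mapping is hypothetical and is not treeflow's own parser; it assumes each header line is the full taxon name.

```python
from typing import Dict, Iterable


def read_fasta_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: parse FASTA lines into {name: sequence}."""
    mapping: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # assumption: whole header is the name
            mapping[name] = ""
        elif name is not None:
            mapping[name] += line  # sequences may span several lines
    return mapping


fasta_text = """>human
ACGT
AC
>chimp
ACGTAA
"""
sequence_mapping = read_fasta_mapping(fasta_text.splitlines())
print(sequence_mapping)
```

A dictionary of this form matches the Mapping[str, Collection[str]] type of the sequence mapping parameter, since a Python string is a collection of single-character strings.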

get_encoded_sequence_array(taxon_names: Iterable[str]) → ndarray

Build a one-hot encoded NumPy array for the alignment according to the provided taxon ordering. Currently only supports nucleotide sequences and uses ACGT ordering.

Parameters:

taxon_names (Iterable[str]) – Order of taxa to use in the encoded array

Returns:

One-hot encoded sequence NumPy array with shape [(n_sequences, 4)]

Return type:

np.ndarray
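The encoding described above can be sketched without any dependencies. This is an illustration of one-hot encoding in ACGT order, not the library's implementation: it uses nested lists with a leading taxon axis (an assumption about layout), and it encodes any character outside ACGT, such as a gap, as all ones (also an assumption).

```python
from typing import Iterable, List, Mapping

NUCLEOTIDES = ["A", "C", "G", "T"]  # column order of the one-hot axis


def one_hot_site(character: str) -> List[float]:
    """Encode one character; non-ACGT characters become all ones (assumption)."""
    upper = character.upper()
    if upper in NUCLEOTIDES:
        return [1.0 if upper == base else 0.0 for base in NUCLEOTIDES]
    return [1.0] * len(NUCLEOTIDES)


def encode_alignment(
    sequence_mapping: Mapping[str, str], taxon_names: Iterable[str]
) -> List[List[List[float]]]:
    """Rows follow taxon_names, not the mapping's own key order."""
    return [
        [one_hot_site(character) for character in sequence_mapping[name]]
        for name in taxon_names
    ]


sequences = {"b": "AC-", "a": "GTT"}
encoded = encode_alignment(sequences, ["a", "b"])
print(encoded[0][0])  # first site of taxon "a" (G)
print(encoded[1][2])  # gap in taxon "b"
```

Passing the taxon ordering explicitly is what lets the rows of the encoded data line up with an externally defined order, for example the leaf order of a tree.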

get_codon_partitioned_sequence_array(taxon_names: Iterable[str]) → ndarray

Build a one-hot encoded NumPy array for the alignment according to the provided taxon ordering, and partitioned into codon positions. Currently only supports nucleotide sequences and uses ACGT ordering. The codon positions are the first axis of the array. If the number of sites is not a multiple of 3, the sequences are padded with gaps.

Parameters:

taxon_names (Iterable[str]) – Order of taxa to use in the encoded array

Returns:

One-hot encoded codon-partitioned sequence NumPy array with shape [(3, n_codons, 4)]

Return type:

np.ndarray
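The padding and partitioning step can be illustrated on a raw sequence string, before any one-hot encoding. The function partition_codon_positions is a hypothetical helper, and the use of "-" as the gap character is an assumption.

```python
from typing import List


def partition_codon_positions(sequence: str, gap: str = "-") -> List[str]:
    """Split a sequence into its first, second and third codon positions.

    The sequence is padded with gaps to a multiple of 3 first, so all
    three partitions have the same length (the number of codons).
    """
    remainder = len(sequence) % 3
    if remainder:
        sequence = sequence + gap * (3 - remainder)
    # Every third character, starting at offsets 0, 1 and 2.
    return [sequence[offset::3] for offset in range(3)]


complete = partition_codon_positions("ACGTAC")
padded = partition_codon_positions("ACGTACG")
print(complete)
print(padded)
```

With seven sites, two gap characters are appended, giving three codons; the padding shows up at the end of the second and third codon positions.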

get_encoded_sequence_tensor(taxon_names: Iterable[str], dtype: DType = tf.float64) → Tensor

Build a one-hot encoded TensorFlow Tensor constant for the alignment according to the provided taxon ordering. Currently only supports nucleotide sequences and uses ACGT ordering.

Parameters:
  • taxon_names (Iterable[str]) – Order of taxa to use in the encoded tensor

  • dtype (tf.DType) – TensorFlow data type for the returned tensor (defaults to package default)

Returns:

One-hot encoded sequence TensorFlow tensor with shape [(n_sequences, 4)] and data type dtype

Return type:

tf.Tensor

get_codon_partitioned_sequence_tensor(taxon_names: Iterable[str], dtype: DType = tf.float64) → Tensor

Build a one-hot encoded TensorFlow Tensor constant for the alignment according to the provided taxon ordering, and partitioned into codon positions. Currently only supports nucleotide sequences and uses ACGT ordering. The codon positions are the first axis of the Tensor. If the number of sites is not a multiple of 3, the sequences are padded with gaps.

Parameters:
  • taxon_names (Iterable[str]) – Order of taxa to use in the encoded tensor

  • dtype (tf.DType) – TensorFlow data type for the returned tensor (defaults to package default)

Returns:

One-hot encoded codon-partitioned sequence TensorFlow tensor with shape [(3, n_codons, 4)] and data type dtype

Return type:

tf.Tensor

get_compressed_alignment() → WeightedAlignment

Compress an alignment by keeping one copy of each unique site pattern (the mapping from taxa to characters at a site) and weighting each pattern by the number of times it occurs.

Returns:

The compressed alignment

Return type:

WeightedAlignment
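Site-pattern compression can be sketched with the standard library alone. The function compress_alignment below is a hypothetical stand-in, not the library's code: it assumes all sequences have equal length and returns a plain (pattern mapping, weights) pair instead of a WeightedAlignment.

```python
from collections import Counter
from typing import Dict, List, Mapping, Tuple


def compress_alignment(
    sequence_mapping: Mapping[str, str]
) -> Tuple[Dict[str, str], List[float]]:
    """Collapse identical alignment columns into weighted unique patterns."""
    names = list(sequence_mapping)
    # Each column is a tuple holding one character per taxon.
    columns = zip(*(sequence_mapping[name] for name in names))
    counts = Counter(columns)  # keeps first-occurrence order
    patterns = list(counts)
    pattern_mapping = {
        name: "".join(pattern[index] for pattern in patterns)
        for index, name in enumerate(names)
    }
    weights = [float(counts[pattern]) for pattern in patterns]
    return pattern_mapping, weights


sequences = {"a": "AACAG", "b": "AATAG", "c": "AACAT"}
pattern_mapping, weights = compress_alignment(sequences)
print(pattern_mapping)
print(weights)
```

Here five sites collapse to three patterns, and the weights sum to the original number of sites, so no information needed for a per-site sum is lost.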

property taxon_count

The number of taxa included in the alignment

property pattern_count

The number of sites in the alignment

class treeflow.evolution.seqio.WeightedAlignment(pattern_mapping: Mapping[str, Collection[str]], weights: Iterable[float], data_type: AlignmentType = AlignmentType.NUCLEOTIDE)

Bases: Alignment

Class to represent a multiple sequence alignment with numeric weights associated with the sites

Parameters:
  • pattern_mapping (Mapping[str, Collection[str]]) – Mapping from names to sequences of site patterns

  • weights (Iterable[float]) – Weights associated with positions in the sequences

  • data_type (AlignmentType) – Character data type of the sequences (default: AlignmentType.NUCLEOTIDE)
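As a rough illustration of what the weights are for, the sketch below scales a per-pattern quantity by its weight to recover a total over all original sites. The data and the helper weighted_total are made up for the example and do not use the treeflow API.

```python
from typing import Iterable, List

# Illustrative data: three unique patterns standing in for five sites.
pattern_mapping = {"a": "ACG", "b": "ATG", "c": "ACT"}
weights = [3.0, 1.0, 1.0]


def weighted_total(per_pattern_values: Iterable[float], weights: Iterable[float]) -> float:
    """Sum a per-pattern quantity, counting each pattern weight times."""
    return sum(value * weight for value, weight in zip(per_pattern_values, weights))


# Per-pattern statistic: 1.0 if every taxon has the same character, else 0.0.
is_constant: List[float] = [
    1.0 if len({sequence[index] for sequence in pattern_mapping.values()}) == 1 else 0.0
    for index in range(len(weights))
]

constant_sites = weighted_total(is_constant, weights)
total_sites = sum(weights)
print(constant_sites, "of", total_sites, "sites are constant")
```

The same pattern applies to any quantity that is additive over sites, such as a per-site log-likelihood, which is why evaluating only unique patterns can save work.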

get_codon_partitioned_sequence_array(taxon_names: Iterable[str]) → ndarray

Build a one-hot encoded NumPy array for the alignment according to the provided taxon ordering, and partitioned into codon positions. Currently only supports nucleotide sequences and uses ACGT ordering. The codon positions are the first axis of the array. If the number of sites is not a multiple of 3, the sequences are padded with gaps.

Parameters:

taxon_names (Iterable[str]) – Order of taxa to use in the encoded array

Returns:

One-hot encoded codon-partitioned sequence NumPy array with shape [(3, n_codons, 4)]

Return type:

np.ndarray

get_codon_partitioned_sequence_tensor(taxon_names: Iterable[str], dtype: DType = tf.float64) → Tensor

Build a one-hot encoded TensorFlow Tensor constant for the alignment according to the provided taxon ordering, and partitioned into codon positions. Currently only supports nucleotide sequences and uses ACGT ordering. The codon positions are the first axis of the Tensor. If the number of sites is not a multiple of 3, the sequences are padded with gaps.

Parameters:
  • taxon_names (Iterable[str]) – Order of taxa to use in the encoded tensor

  • dtype (tf.DType) – TensorFlow data type for the returned tensor (defaults to package default)

Returns:

One-hot encoded codon-partitioned sequence TensorFlow tensor with shape [(3, n_codons, 4)] and data type dtype

Return type:

tf.Tensor

get_compressed_alignment() → WeightedAlignment

Compress an alignment by keeping one copy of each unique site pattern (the mapping from taxa to characters at a site) and weighting each pattern by the number of times it occurs.

Returns:

The compressed alignment

Return type:

WeightedAlignment

get_encoded_sequence_array(taxon_names: Iterable[str]) → ndarray

Build a one-hot encoded NumPy array for the alignment according to the provided taxon ordering. Currently only supports nucleotide sequences and uses ACGT ordering.

Parameters:

taxon_names (Iterable[str]) – Order of taxa to use in the encoded array

Returns:

One-hot encoded sequence NumPy array with shape [(n_sequences, 4)]

Return type:

np.ndarray

get_encoded_sequence_tensor(taxon_names: Iterable[str], dtype: DType = tf.float64) → Tensor

Build a one-hot encoded TensorFlow Tensor constant for the alignment according to the provided taxon ordering. Currently only supports nucleotide sequences and uses ACGT ordering.

Parameters:
  • taxon_names (Iterable[str]) – Order of taxa to use in the encoded tensor

  • dtype (tf.DType) – TensorFlow data type for the returned tensor (defaults to package default)

Returns:

One-hot encoded sequence TensorFlow tensor with shape [(n_sequences, 4)] and data type dtype

Return type:

tf.Tensor

get_weights_array() → ndarray

Get the site weights as a NumPy array

Returns:

Site weights array

Return type:

np.ndarray

property pattern_count

The number of sites in the alignment

property taxon_count

The number of taxa included in the alignment

get_weights_tensor(dtype=tf.float64) → Tensor

Get the site weights as a TensorFlow Tensor

Parameters:

dtype (tf.DType) – TensorFlow data type for the returned tensor (defaults to package default)

Returns:

Site weights constant Tensor

Return type:

tf.Tensor