Cogent3 refactor

A rewrite of the cogent3 core data types

Over the past year, I’ve been involved in a major rewrite of cogent3’s core internals, specifically reimagining how we represent multiple sequence alignments 🧬

Previously, cogent3 had two classes for representing alignments. One was based on a NumPy array representation of an alignment, and the other was a dictionary of Aligned objects, one for each sequence in the alignment. The former was faster, more memory-efficient, and coped with larger datasets, while the latter maintained the mapping between the coordinates of the alignment and those of the original sequences, thereby permitting researchers to incorporate information from genomic annotations into their analyses. The objective of the refactor was simple: to combine the best features of both classes into a single, more flexible, and more performant class.
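To make the coordinate mapping concrete, here is a minimal sketch of the idea in plain Python. It is not cogent3's implementation: the class name AlignedSeqSketch and its methods are hypothetical, and it simply records which alignment column each residue occupies so that positions can be translated in both directions.

```python
from bisect import bisect_right

GAP = "-"


class AlignedSeqSketch:
    """Hypothetical illustration only, not a cogent3 class.

    Stores the ungapped sequence plus the alignment column of each residue.
    """

    def __init__(self, gapped: str):
        self.length = len(gapped)
        # the original (ungapped) sequence
        self.seq = gapped.replace(GAP, "")
        # alignment columns that hold a residue, in increasing order
        self.residue_columns = [i for i, c in enumerate(gapped) if c != GAP]

    def align_to_seq(self, column: int):
        """Return the sequence index at an alignment column, or None for a gap."""
        if not 0 <= column < self.length:
            raise IndexError(column)
        i = bisect_right(self.residue_columns, column) - 1
        if i >= 0 and self.residue_columns[i] == column:
            return i
        return None

    def seq_to_align(self, index: int) -> int:
        """Return the alignment column holding the residue at a sequence index."""
        return self.residue_columns[index]


aligned = AlignedSeqSketch("AC--GT-A")
print(aligned.seq)              # ungapped sequence
print(aligned.align_to_seq(4))  # sequence index under alignment column 4
print(aligned.align_to_seq(2))  # column 2 is a gap
```

With a mapping like this, an annotation defined on the original sequence (say, an exon spanning sequence positions 2 to 4) can be located in alignment coordinates, which is the capability the dictionary-based class preserved.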

In the mid-2024 release, the Sequence, SequenceCollection, MolType, GeneticCode, and alphabet classes were all rewritten from scratch with an eye to simplifying the code while improving its flexibility and performance.

In the final release of 2024, we extended this work to alignments. The updated Alignment class keeps the best features of the old design, while laying the groundwork for significant performance improvements 🚀

The “new-style” objects enhance performance by providing access to the underlying data in various formats (i.e., NumPy arrays, bytes, or strings). You can create “new-style” objects by passing new_type=True to the top-level functions (make_seq, load_seq, make_unaligned_seqs, get_moltype, get_code). They are not yet the default and are not fully integrated into the existing code. Their API can also differ from that of the classes they replace.
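As a rough illustration of why exposing one stored representation in several formats can help, the sketch below keeps a sequence as compact alphabet indices and converts on request. Everything here is an assumption made for illustration: SeqDataSketch, its method names, and the "TCAG" alphabet order are hypothetical, and the standard-library array type stands in for a NumPy array.

```python
from array import array


class SeqDataSketch:
    """Hypothetical illustration only, not a cogent3 class.

    Stores a sequence once, as unsigned 8-bit alphabet indices.
    """

    def __init__(self, seq: str, alphabet: str = "TCAG"):
        self.alphabet = alphabet
        lookup = {char: i for i, char in enumerate(alphabet)}
        # one byte per position; raises KeyError on characters outside the alphabet
        self._indices = array("B", (lookup[char] for char in seq))

    def as_indices(self) -> array:
        """Return the stored indices directly (no conversion)."""
        return self._indices

    def as_str(self) -> str:
        """Decode the indices back to a string."""
        return "".join(self.alphabet[i] for i in self._indices)

    def as_bytes(self) -> bytes:
        """Decode the indices to ASCII bytes."""
        return self.as_str().encode("ascii")


data = SeqDataSketch("ACGGT")
print(list(data.as_indices()))
print(data.as_str())
print(data.as_bytes())
```

Numerical code can work on the indices directly without paying for a string conversion, while code that needs text or bytes can ask for exactly that form.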

Curious about cogent3? Check out our documentation at cogent3.org.