diverse-seq: an application for alignment-free selecting and clustering biological sequences
The two core algorithms of phylogenetics, multiple sequence alignment and phylogeny estimation, are both compute intensive. As the size of DNA sequence datasets continues to increase, there is a need for a tool that can effectively lessen the computational burden associated with this widely used analysis.
diverse-seq implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. The computational performance is linear with respect to the number of sequences, and the selection can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can then be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
The diverse-seq algorithms are not limited to homologous sequences. As such, they can improve the performance of other workflows. For instance, machine learning projects that involve non-homologous sequences can benefit, as representative sampling can mitigate biases from imbalanced groups.
diverse-seq is a BSD-3 licensed Python package that provides both a command-line interface and cogent3 plugins. The latter simplify integration into users' own analyses.
It is available via the Python Package Index and GitHub.
Statement of need
Accurately selecting a representative subset of biological sequences can improve the statistical accuracy and computational performance of data sampling workflows. In many cases, the reliability of such analyses is contingent on the sample capturing the full diversity of the original collection (e.g. when estimating large phylogenies; Parks et al., 2018; Zhu et al., 2019). Additionally, the computation time of algorithms reliant on numerical optimisation, such as phylogenetic estimation, can be markedly reduced by having a good initial estimate.
Existing solutions to the data sampling problem require input data in formats that can themselves be computationally costly to acquire. For instance, tree-based sequence selection procedures can be efficient, but they rely on a phylogenetic tree or a pairwise genetic distance matrix, both of which require alignment of homologous sequences (e.g. Balaban et al., 2019; Widmann et al., 2006). The combined time for sequence alignment and tree estimation presents a barrier to their use.
The diverse-seq sequence selection algorithms are linear in time for the number of sequences and more flexible than published approaches. While the algorithms do not require sequences to be homologous, when applied to homologous sequences, the set selected is comparable to what would be expected based on genetic distance. The diverse-seq clustering algorithm is linear in time for the combined sequence length. For homologous sequences, the estimated trees approximate those estimated from an alignment by IQ-TREE2 (Minh et al., 2020).
Competing Interest Statement
The authors have declared no competing interest.
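Illustrative sketch
To make the idea of an entropy measure of k-mer frequencies concrete, the following is a minimal sketch in plain Python. It is not the diverse-seq implementation or API. It assumes the measure is the Jensen-Shannon divergence (the entropy of the mean k-mer distribution minus the mean of the per-sequence entropies), which is one common entropy-based choice; the text above does not specify the exact statistic. The function names `kmer_freqs`, `shannon_entropy` and `jsd` are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def shannon_entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(seqs: list[str], k: int = 3) -> float:
    """Jensen-Shannon divergence of the k-mer distributions of seqs.

    Assumed measure for illustration only: the entropy of the mean
    distribution minus the mean of the individual entropies.
    """
    all_freqs = [kmer_freqs(s, k) for s in seqs]
    n = len(all_freqs)
    mean: Counter = Counter()
    for freqs in all_freqs:
        for kmer, p in freqs.items():
            mean[kmer] += p / n
    mean_entropy = sum(shannon_entropy(f) for f in all_freqs) / n
    return shannon_entropy(mean) - mean_entropy
```

Under this assumption, a set of sequences with identical k-mer composition scores zero, and the score grows as the compositions diverge, so a selection procedure could favour the subset with the larger value. No alignment is needed because only k-mer counts are compared.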