publications
publications in reverse chronological order.
2024
- Under Review. diverse-seq: an application for alignment-free selecting and clustering biological sequences. Gavin Huttley, Katherine Caley, and Robert McArthur. bioRxiv, 2024
The algorithms required for phylogenetics (multiple sequence alignment and phylogeny estimation) are both compute intensive. As the size of DNA sequence datasets continues to increase, there is a need for a tool that can effectively lessen the computational burden associated with this widely used analysis.
diverse-seq implements computationally efficient alignment-free algorithms that enable efficient prototyping for phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that are representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. The computational performance is linear with respect to the number of sequences and can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can be further refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
The diverse-seq algorithms are not limited to homologous sequences. As such, they can improve the performance of other workflows. For instance, machine learning projects that involve non-homologous sequences can benefit, as representative sampling can mitigate biases from imbalanced groups.
diverse-seq is a BSD-3 licensed Python package that provides both a command-line interface and cogent3 plugins. The latter simplifies integration by users into their own analyses. It is available via the Python Package Index and GitHub.
Statement of need
Accurately selecting a representative subset of biological sequences can improve the statistical accuracy and computational performance of data sampling workflows. In many cases, the reliability of such analyses is contingent on the sample capturing the full diversity of the original collection (e.g. estimating large phylogenies; Parks et al., 2018; Zhu et al., 2019). Additionally, the computation time of algorithms reliant on numerical optimisation, such as phylogenetic estimation, can be markedly reduced by having a good initial estimate.
Existing solutions to the data sampling problem require input data in formats that can themselves be computationally costly to acquire. For instance, tree-based sequence selection procedures can be efficient, but they rely on a phylogenetic tree or a pairwise genetic distance matrix, both of which require alignment of homologous sequences (e.g. Balaban et al., 2019; Widmann et al., 2006). The combined time for sequence alignment and tree estimation presents a barrier to their use.
The diverse-seq sequence selection algorithms are linear in time for the number of sequences and more flexible than published approaches. While the algorithms do not require sequences to be homologous, when applied to homologous sequences, the set selected is comparable to what would be expected based on genetic distance. The diverse-seq clustering algorithm is linear in time for the combined sequence length. For homologous sequences, the estimated trees are approximations of those estimated from an alignment by IQ-TREE2 (Minh et al., 2020).
Competing Interest Statement
The authors have declared no competing interest.
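The abstract above mentions selecting representatives with an entropy measure of k-mer frequencies. As a rough illustration only, the sketch below uses Jensen-Shannon divergence (JSD) over k-mer frequency vectors with a simple greedy selection. The choice of JSD, the function names and the greedy strategy are assumptions made for this example; this is not the diverse-seq implementation or its API, and it does not reproduce the linear-time behaviour described above.

```python
from collections import Counter
from itertools import product
from math import log2


def kmer_freqs(seq, k=2, alphabet="ACGT"):
    """Return k-mer frequencies of seq, ordered over all possible k-mers."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # only k-mers made entirely of alphabet characters are counted
    total = sum(counts[km] for km in kmers)
    if total == 0:
        raise ValueError("sequence has no valid k-mers")
    return [counts[km] / total for km in kmers]


def entropy(freqs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in freqs if p > 0)


def jsd(freq_vectors):
    """JSD of a set: entropy of the mean minus the mean of the entropies."""
    n = len(freq_vectors)
    mean = [sum(col) / n for col in zip(*freq_vectors)]
    return entropy(mean) - sum(entropy(f) for f in freq_vectors) / n


def select_diverse(seqs, n, k=2):
    """Greedily pick up to n names whose k-mer profiles maximise JSD.

    Hypothetical helper: seeds with the alphabetically first name, then
    repeatedly adds the candidate that most increases the set's JSD.
    """
    freqs = {name: kmer_freqs(s, k) for name, s in seqs.items()}
    remaining = sorted(freqs)
    selected = [remaining.pop(0)]
    while remaining and len(selected) < n:
        best = max(
            remaining,
            key=lambda name: jsd([freqs[s] for s in selected] + [freqs[name]]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    example = {"a": "AAAAAAAA", "b": "AAAAAAAT", "c": "CGCGCGCG"}
    print(select_diverse(example, 2))
```

Because no alignment is needed, the input sequences in such a sketch do not have to be homologous, which is the property the abstract highlights.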
- Under Review. GraphBin-Tk: assembly graph-based metagenomic binning toolkit. Vijini Mallawaarachchi, Anuradha Wickramarachchi, Robert McArthur, and 3 more authors. The Journal of Open Source Software, 2024
The study of genetic material directly obtained from natural environments, termed metagenomics, offers valuable insights into microbial communities and their impact on human health and environmental dynamics (Edwards et al., 2013; Pargin et al., 2023). Once the genetic material is extracted, sequenced to obtain reads and assembled to obtain contigs, a process known as metagenomic binning is used to cluster contigs into bins that represent different taxonomic groups which results in draft microbial genomes or metagenome-assembled genomes (MAGs) (V. Mallawaarachchi et al., 2024). Several automated metagenomic binning tools incorporating novel computational methods have been introduced (Alneberg et al., 2014; Chandrasiri et al., 2022; D. D. Kang et al., 2019; Pan et al., 2023; Wu et al., 2015; Xue et al., 2022, 2024) which have led to the discovery and characterisation of many novel micro-organisms (Brooks et al., 2017; L. Kang et al., 2024). Conventional metagenomic binning tools make use of features such as nucleotide composition and abundance information of contigs, yet find it challenging to bin sequences of closely related species and sequences that have noisy features. Binning tools, such as MetaCoAG (V. Mallawaarachchi & Lin, 2022a, 2022b) that use metagenome assembly graphs (a structure containing the connectivity information of contigs) are gaining popularity due to their improved binning results over conventional binning methods. Moreover, assembly graph-based bin refinement tools such as GraphBin (V. Mallawaarachchi et al., 2020) and GraphBin2 (V. G. Mallawaarachchi et al., 2020, 2021) have been introduced to refine binning results from existing binning tools. Yet, these tools exist as individual software and running them individually can be complex, time-consuming and less accessible.
Here we present GraphBin-Tk, an assembly graph-based metagenomic binning tool that combines the capabilities of MetaCoAG, GraphBin and GraphBin2, along with additional pre-processing and post-processing functionality into one comprehensive toolkit. GraphBin-Tk is hosted at https://github.com/metagentools/gbintk.
2021
- thesis. Quantifying the disequilibrium of DNA sequence evolution. Katherine Caley. 2021
Most models of sequence divergence assume that nucleotide composition does not change through time. This assumption requires a state of mutation equilibrium which is almost impossible if the processes affecting mutagenesis change through time. Considerable empirical evidence strongly suggests that this may be incorrect. This honours thesis addresses this possibility through developing the following statistical measures: a test for the existence of mutation disequilibrium, a test of its equivalence between processes and a measurement of the magnitude of mutation disequilibrium. I used careful construction of edge cases with simulated data to establish the consistency of the statistics with theoretical expectations. I applied the statistics to empirical data from cases with striking prior evidence for recent perturbations affecting: an entire genome (loss of DNA methylation in Drosophila melanogaster); or, a small genomic segment (Fxy in Mus musculus). Using paired experimental designs, I find strong systematic evidence for departure from mutation equilibrium. I further show the statistical measure of magnitude is also elevated in these cases. Applying the methods to Human evolution, I conservatively estimate more than 50% of our genome is in mutation disequilibrium. I then discuss the implication of these results for research domains that use models of sequence divergence. The development of my statistical tools represents a significant contribution to our efforts to understand mutation disequilibrium and how it impacts on the evolution of DNA sequence.