Katherine Caley

Email me at katherine [dot] caley [at] anu [dot] edu [dot] au

Hi, I’m Katherine.

I use statistical and mathematical techniques to extract meaning from biological data, with an interest in microbial communities and mutagenesis – asking how life evolves and adapts at its most fundamental levels.

I’m passionate about making science better for everyone: advocating for open-source, reproducible research, and designing efficient algorithms to reduce the environmental impact of computational science.

I am a core developer and maintainer of cogent3, a Python library with a wide range of tools for the analysis of biological sequence data.

I am actively seeking PhD positions for fall 2025. If my interests align with yours, please feel free to reach out to me!

news

Nov 27, 2024	Check out the blog post on using cogent3 and Plotly to integrate genomic data analyses and visualisations!
Nov 11, 2024	diverse-seq has been submitted to bioRxiv!

selected publications

A deep-learning framework to predict cancer treatment response from histopathology images through imputed transcriptomics

Danh-Tai Hoang, Gal Dinstag, Eldad D Shulman, and 8 more authors

Nature Cancer, 2024

Under Review

diverse-seq: an application for alignment-free selecting and clustering biological sequences

Gavin Huttley, Katherine Caley, and Robert McArthur

bioRxiv, 2024


The algorithms required for phylogenetics, multiple sequence alignment and phylogeny estimation, are both compute intensive. As the size of DNA sequence datasets continues to increase, there is a need for a tool that can effectively lessen the computational burden associated with this widely used analysis.

diverse-seq implements computationally efficient alignment-free algorithms that enable efficient prototyping for phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that are representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. The computational performance is linear with respect to the number of sequences, and the selection can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.

The diverse-seq algorithms are not limited to homologous sequences. As such, they can improve the performance of other workflows. For instance, machine learning projects that involve non-homologous sequences can benefit, as representative sampling can mitigate biases from imbalanced groups.

diverse-seq is a BSD-3 licensed Python package that provides both a command-line interface and cogent3 plugins. The latter simplify integration by users into their own analyses. It is available via the Python Package Index and GitHub.

Statement of need

Accurately selecting a representative subset of biological sequences can improve the statistical accuracy and computational performance of data sampling workflows. In many cases, the reliability of such analyses is contingent on the sample capturing the full diversity of the original collection (e.g. estimating large phylogenies; Parks et al., 2018; Zhu et al., 2019). Additionally, the computation time of algorithms reliant on numerical optimisation, such as phylogenetic estimation, can be markedly reduced by having a good initial estimate.

Existing tools for the data sampling problem require input data in formats that can themselves be computationally costly to acquire. For instance, tree-based sequence selection procedures can be efficient, but they rely on a phylogenetic tree or a pairwise genetic distance matrix, both of which require alignment of homologous sequences (e.g. Balaban et al., 2019; Widmann et al., 2006). The combined time for sequence alignment and tree estimation presents a barrier to their use.

The diverse-seq sequence selection algorithms are linear in time for the number of sequences and more flexible than published approaches. While the algorithms do not require sequences to be homologous, when applied to homologous sequences, the set selected is comparable to what would be expected based on genetic distance. The diverse-seq clustering algorithm is linear in time for the combined sequence length. For homologous sequences, the estimated trees approximate those estimated from an alignment by IQ-TREE2 (Minh et al., 2020).

Competing Interest Statement

The authors have declared no competing interest.
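
As a rough illustration of what "an entropy measure of k-mer frequencies" can look like in practice, the sketch below scores a set of sequences by the Jensen-Shannon divergence of their k-mer frequency profiles and greedily picks a divergent subset. This is not the diverse-seq implementation or its API: the function names (`kmer_freqs`, `jsd`, `select_diverse`), the choice of Jensen-Shannon divergence, and the greedy strategy are all assumptions made for illustration, and this naive version is not linear in the number of sequences.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(profiles: list[dict[str, float]]) -> float:
    """Jensen-Shannon divergence of equally weighted profiles:
    entropy of the mean profile minus the mean of the entropies."""
    n = len(profiles)
    mean: Counter = Counter()
    for prof in profiles:
        for kmer, p in prof.items():
            mean[kmer] += p / n
    return entropy(mean) - sum(entropy(p) for p in profiles) / n


def select_diverse(seqs: dict[str, str], n: int, k: int = 3) -> list[str]:
    """Hypothetical greedy selection: start from the first name
    (alphabetically), then repeatedly add whichever remaining
    sequence maximises the JSD of the chosen set."""
    profiles = {name: kmer_freqs(s, k) for name, s in seqs.items()}
    names = sorted(profiles)
    chosen = [names[0]]
    while len(chosen) < min(n, len(names)):
        remaining = [x for x in names if x not in chosen]
        best = max(
            remaining,
            key=lambda x: jsd([profiles[c] for c in chosen + [x]]),
        )
        chosen.append(best)
    return chosen


# Toy data (made up): "b" is nearly identical to "a", "c" is very different.
toy = {"a": "ACGTACGTACGT", "b": "ACGTACGTACGA", "c": "TTTTTTGGGGGG"}
picked = select_diverse(toy, n=2, k=2)
```

With the toy input, the second sequence picked is the one whose k-mer profile shares nothing with the first, which is the intuition behind using a divergence of k-mer frequencies as a stand-in for genetic distance without computing an alignment.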