Phylotrack: C++ and Python libraries for in silico phylogenetic tracking

Read original: arXiv:2405.09389 - Published 7/18/2024 by Emily Dolson, Santiago Rodriguez-Papa, Matthew Andres Moreno

Overview

This paper describes a project called Phylotrack that provides tools for tracking and analyzing the evolution of digital populations in computer simulations.
The Phylotrack project includes two main components: a C++ library called Phylotracklib and a Python wrapper called Phylotrackpy.
These tools enable researchers to perfectly record the ancestry and evolutionary history of simulated populations, which can provide insights into the dynamics of evolution.

Plain English Explanation

The paper discusses in silico evolution, which is the process of simulating evolution within digital populations of computational agents. These digital populations undergo evolution just like real-world biological populations, with heredity, variation, and differential reproductive success driving the evolutionary process.

By running these computer simulations, researchers can study evolutionary dynamics in ways that would be impossible to do in a real-world laboratory or field setting. One key advantage is complete observability - the ability to perfectly record all parent-child relationships across the simulation history, yielding complete phylogenies (ancestry trees). This information reveals when traits were gained or lost and can help researchers understand the underlying evolutionary dynamics at play.
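To make the idea of complete observability concrete, here is a minimal sketch in plain Python. It is not the Phylotrack API: the `Node` class and the `lineage` and `trait_changes` helpers are hypothetical names invented for illustration. Each simulated organism stores a pointer to its parent, so any lineage can be walked back to its root and the exact birth event at which a trait changed can be read off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One organism in the ancestry record (hypothetical structure)."""
    ident: int
    trait: str
    parent: Optional["Node"] = None


def lineage(node):
    """Walk parent pointers from `node` back to the root ancestor."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path[::-1]  # root first


def trait_changes(node):
    """List (parent_id, child_id, old_trait, new_trait) along a lineage."""
    path = lineage(node)
    return [
        (a.ident, b.ident, a.trait, b.trait)
        for a, b in zip(path, path[1:])
        if a.trait != b.trait
    ]


# Toy history: trait "A" is replaced by "B" at the birth of organism 2.
root = Node(0, "A")
child = Node(1, "A", parent=root)
grandchild = Node(2, "B", parent=child)
great = Node(3, "B", parent=grandchild)

print([n.ident for n in lineage(great)])  # ancestry of organism 3, root first
print(trait_changes(great))               # where the trait changed
```

Because every birth is recorded as it happens, nothing has to be inferred after the fact, which is the property that sets in silico phylogenies apart from reconstructed biological ones.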

The Phylotrack project provides software libraries to help researchers track and analyze these phylogenies in their digital evolution systems. The C++ library, Phylotracklib, and the Python wrapper, Phylotrackpy, give researchers tools to attach phylogenetic tracking to their simulations and measure various metrics related to the evolutionary history. The design of these libraries prioritizes efficiency, allowing for fast generational turnover even in large simulated populations.

Technical Explanation

The Phylotrack project consists of two main components: Phylotracklib and Phylotrackpy.

Phylotracklib is a header-only C++ library developed under the umbrella of the Empirical project. It provides a public-facing API that lets researchers attach phylogenetic tracking to their digital evolution systems. The design and C++ implementation prioritize efficiency, enabling fast generational turnover for agent populations numbering in the tens of thousands.

Phylotrackpy is a Python wrapper around Phylotracklib, created with Pybind11. Like the C++ library, it offers both an API for integrating phylogenetic tracking into a digital evolution system (in this case a Python-based one) and a stand-alone interface for measuring a variety of popular phylogenetic topology metrics.
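As an illustration of what a phylogenetic topology metric looks like, the sketch below computes mean pairwise distance, a standard tree metric, on a toy tree. This is plain Python rather than Phylotrackpy code: the `parents` map and the function names are assumptions made for the example, and this summary does not list which metrics the libraries actually provide.

```python
from itertools import combinations

# child -> parent map for a toy tree; None marks the root.
parents = {"root": None, "a": "root", "b": "root",
           "c": "a", "d": "a", "e": "b"}


def ancestors(node):
    """Return the path from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def mrca(x, y):
    """Most recent common ancestor of x and y."""
    seen = set(ancestors(x))
    for node in ancestors(y):
        if node in seen:
            return node
    raise ValueError("nodes share no ancestor")


def distance(x, y):
    """Number of edges on the path between x and y."""
    m = mrca(x, y)
    return ancestors(x).index(m) + ancestors(y).index(m)


def mean_pairwise_distance(tips):
    """Average edge distance over all unordered pairs of tips."""
    pairs = list(combinations(tips, 2))
    return sum(distance(x, y) for x, y in pairs) / len(pairs)


tips = ["c", "d", "e"]
print(mrca("c", "d"))
print(mean_pairwise_distance(tips))
```

Metrics of this kind summarize the shape of a tree in a single number, which makes it possible to compare evolutionary histories across simulation runs.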

Both components of the Phylotrack project offer features for reducing the memory footprint of phylogenetic information, such as phylogeny pruning and abstraction. This allows researchers to efficiently store and analyze the complete evolutionary history of their simulated populations, which can yield insights into the underlying dynamics and potentially inform the study of real-world evolutionary processes.
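The sketch below shows one way pruning and abstraction can reduce memory use. It is an assumption-laden illustration in plain Python, not Phylotrack's implementation: the `Taxon` and `Tracker` classes and their methods are hypothetical. Abstraction is modeled by grouping organisms with identical genotypes into a single taxon node, and pruning by dropping any taxon that has no living organisms and no stored descendants.

```python
class Taxon:
    """A group of organisms sharing one genotype (hypothetical structure)."""

    def __init__(self, genotype, parent=None):
        self.genotype = genotype
        self.parent = parent
        self.num_orgs = 0       # living organisms in this taxon
        self.num_offspring = 0  # child taxa still stored
        if parent is not None:
            parent.num_offspring += 1


class Tracker:
    """Toy phylogeny store with abstraction and pruning (illustrative only)."""

    def __init__(self):
        self.stored = set()

    def add_org(self, genotype, parent=None):
        """Record a birth; return the taxon the new organism belongs to."""
        if parent is not None and parent.genotype == genotype:
            taxon = parent  # abstraction: same genotype, no new node
        else:
            taxon = Taxon(genotype, parent)
            self.stored.add(taxon)
        taxon.num_orgs += 1
        return taxon

    def remove_org(self, taxon):
        """Record a death, then prune lineages with no living descendants."""
        taxon.num_orgs -= 1
        while (taxon is not None and taxon.num_orgs == 0
               and taxon.num_offspring == 0):
            self.stored.discard(taxon)
            parent = taxon.parent
            if parent is not None:
                parent.num_offspring -= 1
            taxon = parent


tracker = Tracker()
root = tracker.add_org("AAA")
same = tracker.add_org("AAA", root)    # same genotype: shares root's taxon
mutant = tracker.add_org("AAB", root)  # new genotype: new taxon
deep = tracker.add_org("ABB", mutant)
other = tracker.add_org("AAC", root)

tracker.remove_org(mutant)  # "AAB" is kept: "ABB" still descends from it
tracker.remove_org(deep)    # prunes "ABB", then the now-empty "AAB"

print(sorted(t.genotype for t in tracker.stored))
```

With both techniques, the stored tree grows with the number of distinct surviving lineages rather than with the total number of organisms that ever lived.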

Critical Analysis

The paper does not mention any significant caveats or limitations of the Phylotrack project. However, it is worth considering how far the insights gained from digital evolution simulations generalize to real-world biological systems. Perfect observation of the complete evolutionary history is a unique advantage of in silico evolution, but the fidelity of the simulations to actual evolutionary processes remains an open concern.

Additionally, the paper does not address the potential challenges of scaling the Phylotrack tools to handle very large-scale simulations or complex digital ecosystems. As the size and complexity of the simulated populations increase, the computational and storage requirements for tracking the phylogenies may become a limiting factor.

Finally, the paper does not discuss the potential applications of the Phylotrack tools beyond the field of digital evolution research. Exploring the use of these tools in domains like phylogeny-informed interaction estimation or inferring the phylogeny of large language models could broaden the impact and usefulness of the Phylotrack project.

Conclusion

The Phylotrack project provides researchers with powerful tools for tracking and analyzing the evolutionary history of digital populations in computer simulations. By enabling the perfect observation of phylogenies, the project facilitates a deeper understanding of the underlying dynamics of evolution. This complements traditional in vitro and in vivo research, allowing for experiments that would be impossible to conduct in the real world.

While the current focus of the Phylotrack project is on digital evolution research, the tools developed could potentially find applications in other domains, such as phylogeny-informed interaction estimation or inferring the phylogeny of large language models. As the field of in silico evolution continues to advance, the Phylotrack project stands as an important contribution to the tools and methodologies available to researchers studying the dynamics of evolutionary processes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Phylotrack: C++ and Python libraries for in silico phylogenetic tracking

Emily Dolson, Santiago Rodriguez-Papa, Matthew Andres Moreno

In silico evolution instantiates the processes of heredity, variation, and differential reproductive success (the three ingredients for evolution by natural selection) within digital populations of computational agents. Consequently, these populations undergo evolution, and can be used as virtual model systems for studying evolutionary dynamics. This experimental paradigm -- used across biological modeling, artificial life, and evolutionary computation -- complements research done using in vitro and in vivo systems by enabling experiments that would be impossible in the lab or field. One key benefit is complete, exact observability. For example, it is possible to perfectly record all parent-child relationships across simulation history, yielding complete phylogenies (ancestry trees). This information reveals when traits were gained or lost, and also facilitates inference of underlying evolutionary dynamics. The Phylotrack project provides libraries for tracking and analyzing phylogenies in in silico evolution. The project is composed of 1) Phylotracklib: a header-only C++ library, developed under the umbrella of the Empirical project, and 2) Phylotrackpy: a Python wrapper around Phylotracklib, created with Pybind11. Both components supply a public-facing API to attach phylogenetic tracking to digital evolution systems, as well as a stand-alone interface for measuring a variety of popular phylogenetic topology metrics. Underlying design and C++ implementation prioritizes efficiency, allowing for fast generational turnover for agent populations numbering in the tens of thousands. Several explicit features (e.g., phylogeny pruning and abstraction, etc.) are provided for reducing the memory footprint of phylogenetic information.

7/18/2024

A Guide to Tracking Phylogenies in Parallel and Distributed Agent-based Evolution Models

Matthew Andres Moreno, Anika Ranjan, Emily Dolson, Luis Zaman

Computer simulations are an important tool for studying the mechanics of biological evolution. In particular, in silico work with agent-based models provides an opportunity to collect high-quality records of ancestry relationships among simulated agents. Such phylogenies can provide insight into evolutionary dynamics within these simulations. Existing work generally tracks lineages directly, yielding an exact phylogenetic record of evolutionary history. However, direct tracking can be inefficient for large-scale, many-processor evolutionary simulations. An alternate approach to extracting phylogenetic information from simulation that scales more favorably is post hoc estimation, akin to how bioinformaticians build phylogenies by assessing genetic similarities between organisms. Recently introduced "hereditary stratigraphy" algorithms provide means for efficient inference of phylogenetic history from non-coding annotations on simulated organisms' genomes. A number of options exist in configuring hereditary stratigraphy methodology, but no work has yet tested how they impact reconstruction quality. To address this question, we surveyed reconstruction accuracy under alternate configurations across a matrix of evolutionary conditions varying in selection pressure, spatial structure, and ecological dynamics. We synthesize results from these experiments to suggest a prescriptive system of best practices for work with hereditary stratigraphy, ultimately guiding researchers in choosing appropriate instrumentation for large-scale simulation studies.

5/17/2024

Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution

Mridul Khurana, Arka Daw, M. Maruf, Josef C. Uyeda, Wasila Dahdul, Caleb Charpentier, Yasin Bakış, Henry L. Bart Jr., Paula M. Mabee, Hilmar Lapp, James P. Balhoff, Wei-Lun Chao, Charles Stewart, Tanya Berger-Wolf, Anuj Karpatne

A central problem in biology is to understand how organisms evolve and adapt to their environment by acquiring variations in the observable characteristics or traits of species across the tree of life. With the growing availability of large-scale image repositories in biology and recent advances in generative modeling, there is an opportunity to accelerate the discovery of evolutionary traits automatically from images. Toward this goal, we introduce Phylo-Diffusion, a novel framework for conditioning diffusion models with phylogenetic knowledge represented in the form of HIERarchical Embeddings (HIER-Embeds). We also propose two new experiments for perturbing the embedding space of Phylo-Diffusion: trait masking and trait swapping, inspired by counterpart experiments of gene knockout and gene editing/swapping. Our work represents a novel methodological advance in generative modeling to structure the embedding space of diffusion models using tree-based knowledge. Our work also opens a new chapter of research in evolutionary biology by using generative models to visualize evolutionary changes directly from images. We empirically demonstrate the usefulness of Phylo-Diffusion in capturing meaningful trait variations for fishes and birds, revealing novel insights about the biological mechanisms of their evolution.

8/2/2024

Phylo2Vec: a vector representation for binary trees

Matthew J Penn, Neil Scheidwasser, Mark P Khurana, David A Duchêne, Christl A Donnelly, Samir Bhatt

Binary phylogenetic trees inferred from biological data are central to understanding the shared history among evolutionary units. However, inferring the placement of latent nodes in a tree is NP-hard and thus computationally expensive. State-of-the-art methods rely on carefully designed heuristics for tree search. These methods use different data structures for easy manipulation (e.g., classes in object-oriented programming languages) and readable representation of trees (e.g., Newick-format strings). Here, we present Phylo2Vec, a parsimonious encoding for phylogenetic trees that serves as a unified approach for both manipulating and representing phylogenetic trees. Phylo2Vec maps any binary tree with n leaves to a unique integer vector of length n-1. The advantages of Phylo2Vec are fourfold: (i) fast tree sampling, (ii) compressed tree representation compared to a Newick string, (iii) quick and unambiguous verification if two binary trees are identical topologically, and (iv) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill-climbing-based optimisation scheme can efficiently traverse the vastness of tree space from a random to an optimal tree.

5/13/2024