Research Ideas and Outcomes : Research Article
|
Corresponding author: Alexis Criscuolo (alexis.criscuolo@pasteur.fr)
Received: 14 May 2019 | Published: 10 Jun 2019
© 2019 Alexis Criscuolo
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Criscuolo A (2019) A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Research Ideas and Outcomes 5: e36178. https://doi.org/10.3897/rio.5.e36178
|
This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
phylogenetics, alignment-free, genome, evolutionary distance, branch support, MinHash, Balanced Minimum Evolution, elementary quartets
Evolutionary relationships between species are commonly represented by a phylogenetic tree inferred from multiple sequence alignments of orthologous genes. Tree reconstructions are generally performed from multiple gene datasets (e.g. core-gene set) because they enhance the overall phylogenetic signal by reducing the random error caused by a small number of characters (e.g.
To infer a phylogenetic tree that represents the evolutionary relationships of a set of genomes, an alternative approach is to estimate a pairwise distance between each pair of unaligned genomes, and to next build a phylogenetic tree with a fast distance-based reconstruction method. Such bioinformatics procedures are becoming popular because they allow dealing with thousands of assembled genomes, depend on few assumptions regarding their evolutionary process, and quickly lead to a phylogenetic tree with minimal manual intervention (
The first step is the estimation of a dissimilarity value between each pair of unaligned genome nucleotide sequences. Many approaches were proposed to compute such values for a phylogenetic purpose, based on k-mer comparisons (e.g.
The second step is the correction of each computed pairwise dissimilarity value into a numerical quantity that is proportional to the evolutionary distance between the corresponding genomes. An evolutionary distance is the number of substitution events per character that have occurred along the path separating two leaves within the 'true' phylogenetic tree representing the evolutionary relationships among genomes. This step is important because using pairwise dissimilarities that are non-linear with respect to evolutionary distances is expected to lead to incorrect phylogenetic tree topologies (
The third step is the reconstruction of the phylogenetic tree from the estimated evolutionary distances. Many algorithms exist for this purpose (see e.g.
The final step is the estimation of a confidence value at each branch of the inferred tree. Several strategies were proposed to this aim (
This paper reports a new bioinformatics procedure that is based on well-argued choices for each of the four previously described steps. By analyzing simulated genome sequences, this procedure is shown to efficiently estimate the pairwise evolutionary distances between each pair of genomes, therefore allowing the reconstruction of accurate phylogenetic trees. This expected accuracy is illustrated by the analysis of 187 real genome datasets, representative of different genera within the bacterial, archaeal and eukaryotic phyla. All these analyses show that this novel bioinformatics procedure, implemented in the script JolyTree (gitlab.pasteur.fr/GIPhy/JolyTree), is an efficient approach to infer a phylogenetic tree from hundreds of genome assemblies in a few minutes.
To estimate a pairwise dissimilarity between each pair of genomes, the Mash method (
\( (1) ~~~~~~~~~~~~~~~~~~~~ m_{ij} = k^{-1} \Big( \log_e (1+J_{ksij}) - \log_e (2 \, J_{ksij}) \Big) \)
where \(J_{ksij}\) is an estimate of the Jaccard index between the two k-mer sets induced by \(i\) and \(j\) based on hashed k-mer subsets of size \(s\), called MinHash sketches (for more details, see
\((2) ~~~~~~~~~~~~~~~~~~~~ k = \lceil \log_4 \big( g \, (1-q) \big) - \log_4 \; q \rceil\).
Of note, when considering two genomes \(i\) and \(j\) of respective sizes \(g_i\) and \(g_j\), one can select \(g = \max ( g_i , g_j )\).
As the Mash dissimilarity \(m_{ij}\) approximates the expected p-distance \(p_{ij}\) that can be observed between the genome sequences \(i\) and \(j\), there exists several ways to correct the expected proportion of nucleotide differences into an estimated number of substitution events per character. Such corrections could be formalized by the following formula:
\((3) ~~~~~~~~~~~~~~~~~~~~ d_{ij} = -b_1 \log_e (1 - p_{ij} / b_2 ) \)
where \(b_1\) and \(b_2\) are defined according to explicit models of nucleotide sequence evolution. If equal nucleotide frequencies are observed and the rate of substitution is considered identical for all pairs of nucleotides, the correction formula \((3)\) is determined by \(b_1 = b_2 = 0.75\) (
The program FastME was chosen to perform the phylogenetic tree inference because it allows inferring accurate phylogenetic trees with very fast running times (
Finally, the program REQ (gitlab.pasteur.fr/GIPhy/REQ) was chosen for assessing the branch confidence values of the inferred tree. This tool estimates the rate of elementary quartets (REQ) for each branch of a given phylogenetic tree from the associated distances \(d_{ij}\), as described by
The procedure described above was implemented in Bash (www.gnu.org/software/bash), therefore running on UNIX, Linux and most OS X operating systems. This implementation, named JolyTree, is freely available at gitlab.pasteur.fr/GIPhy/JolyTree. It directly reads a set of \(n\) assembled genomes in FASTA format from a specified directory. JolyTree allows setting the value of \(q\) (= 0.00001 by default) to estimate the k-mer size with formula \((2)\) from the size \(g\) of the largest genome. A unique sketch size \(s\) is defined as 25% of the average of the \(n\) genome lengths. As the computation of each pairwise Mash dissimilarity could be performed independently of the other ones, JolyTree allows this costly \(O(n^2)\) step to be executed on multiple threads. By default, all Mash dissimilarity values are automatically corrected by formula \((3)\) when at least one of them is larger than 0.1 (see below); however, this cutoff can be modified with dedicated option. The distance noising procedure is repeated 100 times by default with \(\epsilon\) varying from 0.1 (moderate noising) to 0.7 (important noising). All arithmetic operations are performed by gawk v. 4.1.4 (ftp.gnu.org/gnu/gawk). The results presented below were obtained using Mash v. 2.1 (github.com/marbl/Mash), FastME v. 2.1.5.1 (gite.lirmm.fr/atgc/FastME), and REQ v. 1.2 (gitlab.pasteur.fr/GIPhy/REQ).
Several analyses from simulated and real datasets were performed to show that JolyTree allows accurate phylogenetic trees to be quickly inferred from genome sequences. The following results illustrate the accuracy and treelikeness of the F81-corrected distance estimates, the usefulness of the data noising strategy for inferring trees, and the fast running times observed when analyzing large genome datasets. Some phylogenetic analyses of real-case genome datasets are also presented and discussed.
In order to observe the ability of JolyTree to estimate the evolutionary distance between a pair of genomes, a large number of sequence pairs was simulated. Given an evolutionary distance \(d\) varying from 0.05 to 0.60 (step = 0.05), the program SeqGen v. 1.3.4 (
Box plots representing the mean, lower and upper quartiles, and 25th and 975th permilles of \(N\) = 500 estimates \(\hat{d}\) of the evolutionary distances \(d\) = 0.05, 0.10, ..., 0.60. Each linear regression through the origin with slope value b is represented with dashed lines.
As expected (e.g.
Plots associated to the data represented in Fig.
As JolyTree is expected to accurately estimate the evolutionary distance \(d\) between every pair of genomes that are not too distant (e.g. \(d\) < 0.5), this bioinformatics procedure is recommended for quickly inferring phylogenetic trees from genomes belonging to the same genus. It was used to reconstruct a phylogenetic tree from the \(n\) = 96 Listeria genome contig sets generated by
Model tree used for genome data simulation by
To measure the overall phylogenetic signal induced by the estimated distances, the Arboricity coefficient (arb;
To observe the ability of JolyTree to quickly infer accurate species trees from real data, genome assemblies from the RefSeq collection (
This representative collection of genome sequence sets shows that the GC content is very heterogeneous across genera (Suppl. material
It should be stressed that the data noising strategy had a moderate usefulness on these datasets, as no better tree (according to the BME criterion) than the one inferred by FastME alone was obtained for 125 datasets. This could be explained by the overall good treelikeness of the estimated evolutionary distances (Suppl. material
Distribution of the normalized Balanced Minimum Evolution (BME) tree length differences observed between phylogenetic trees inferred by FastME with and without the data noising strategy implemented by JolyTree. This distribution represents 187 observed values \(\Delta(T_0, T) = (\ell(T_0)-\ell(T))/\ell(T_0)\), where \(T_0\) is the tree inferred by FastME alone, \(T\) the one inferred by JolyTree via the data noising strategy, and \(\ell(T)\) the BME tree length estimate of \(T\). A tree \(T\) is considered as more accurate that \(T_0\) when \(\Delta(T_0, T)\) > 0.
Thanks to its ability to run on multiple threads, JolyTree is quite fast. On an Intel Xeon E5-1660 v4 (16Gb RAM) running under Linux Debian 4.9.110-3+deb9u6, 90% of the 187 genome datasets were analyzed in less than 15 and 5 minutes each, on 6 and 12 threads, respectively. The main variable having a negative impact on the overall running times is the number \( n\) of genomes, e.g. with quite comparable genome sizes (e.g. \(\overline{g}\) = 6.5 Mb), observed running times varied from 69 and 46 seconds with \(n\) = 30 (Duganella) to 30 and 19 minutes with \(n\) = 291 (Rhizobium) on 6 and 12 threads, respectively. Average genome size \(\overline{g}\) has less impact on the overall running times, as it only slows the Mash sketching step, e.g. with \(n\) = 34, observed running times varied from 62 and 16 seconds with \(\overline{g}\) = 2.8 Mb (Caldicellulosiruptor) to 4 and 3 minutes with \(\overline{g}\) = 34.1 Mb (Aspergillus) on 6 and 12 threads, respectively. JolyTree therefore represents a useful tool to quickly infer phylogenetic trees from large sets of genome sequences on standard computers.
Finally, to illustrate the accuracy of the phylogenetic trees inferred by JolyTree, a bibliographical survey was performed for each of the 187 genera to find recently published phylogenetic trees for comparison. Considering only genome datasets made up by at least four species, as well as only robust phylogenomics analyses based on core genome data for comparison, this survey led to six genera: Aggregatibacter (
The Aggregatibacter tree inferred by JolyTree (Fig.
Phylogenetic tree inferred by JolyTree from the RefSeq genomes belonging to the genus Aggregatibacter. For each type strain (in bold), the clade determined by the isolates expected to belong to the same species (e.g. estimated pairwise distances < 0.05) is labeled by the species name and colored accordingly. Leaf names were automatically generated. Scale bar corresponds to an estimated evolutionary distance of 0.025. The inset summarizes a model tree of the Aggregatibacter species based on the phylogenetic analysis of
The Borreliella tree inferred by JolyTree is represented in Fig.
Phylogenetic tree inferred by JolyTree from the RefSeq genomes belonging to the genus Borreliella. For each type strain (in bold), the clade determined by the isolates expected to belong to the same species (e.g. estimated pairwise distances < 0.05) is labeled by the species name and colored accordingly. Leaf names were automatically generated. Scale bar corresponds to an estimated evolutionary distance of 0.025. The inset summarizes a model tree of the Borreliella species based on the phylogenetic analysis of
Concerning the genus Elizabethkingia, two recently published conflicting phylogenetic trees exist (inset in Fig.
Phylogenetic tree inferred by JolyTree from the RefSeq genomes belonging to the genus Elizabethkingia. For each type strain (in bold), the clade determined by the isolates expected to belong to the same species (e.g. estimated pairwise distances < 0.05) is labeled by the species name and colored accordingly. Leaf names were automatically generated. Scale bar corresponds to an estimated evolutionary distance of 0.025. The inset summarizes two model trees of the Elizabethkingia species based on the phylogenetic analyses of
The Lactococcus tree inferred by JolyTree (Fig.
Phylogenetic tree inferred by JolyTree from the RefSeq genomes belonging to the genus Lactococcus. For each type strain (in bold), the clade determined by the isolates expected to belong to the same species (e.g. estimated pairwise distances < 0.05) is labeled by the species name and colored accordingly. Leaf names were automatically generated. Scale bar corresponds to an estimated evolutionary distance of 0.025. The inset summarizes a model tree of the Lactococcus species based on the phylogenetic analysis of
The Providencia tree inferred by JolyTree is represented in Fig.
Phylogenetic tree inferred by JolyTree from the RefSeq genomes belonging to the genus Providencia. For each type strain (in bold), the clade determined by the isolates expected to belong to the same species (e.g. estimated pairwise distances < 0.05) is labeled by the species name and colored accordingly. Leaf names were automatically generated. Scale bar corresponds to an estimated evolutionary distance of 0.025. The inset summarizes a model tree of the Providencia species based on the phylogenetic analysis of
The phylogenetic relationships among Ralstonia species was deeply studied by
Phylogenetic tree inferred by JolyTree from the RefSeq genomes belonging to the genus Ralstonia. For each type strain (in bold), the clade determined by the isolates expected to belong to the same species (e.g. estimated pairwise distances < 0.05) is labeled by the species name and colored accordingly. Leaf names were automatically generated. Scale bar corresponds to an estimated evolutionary distance of 0.025. The inset summarizes a model tree of the Ralstonia species based on the phylogenetic analysis of
These detailed phylogenetic analyses of the six genera Aggregatibacter, Borreliella, Elizabethkingia, Lactococcus, Providencia and Ralstonia show that JolyTree is a useful tool to quickly infer species trees from whole genome assemblies that are comparable with trees reconstructed from the concatenation of large sets of multiple homologous sequence alignments. They also show that these phylogenetic trees are practical representations to detect new species within a bacterial genus.
This paper describes a novel bioinformatics procedure, implemented by the Bash script JolyTree (gitlab.pasteur.fr/GIPhy/JolyTree), to perform a complete phylogenetic analysis from unaligned genome sequence sets. Such procedure is quite fast because it takes advantage of the ability of the pairwise distance estimate step to be run on multiple threads. Simulation and real case analyses have shown that JolyTree leads to accurate trees. Some incorrect branching can be observed within the inferred phylogenetic trees, but they are often assessed by weak branch supports. Therefore, JolyTree represents a useful approach for performing phylogenetic analyses with fast running times from hundreds of genome assemblies. Of note, this bioinformatics procedure was used previously for inferring Corynebacterium (
More generally, this paper confirms the usefulness of the MinHash dissimilarity and its ability to efficiently approximate the pairwise p-distance (and the related ANI similarity) between two genome contig sets. However, this paper highlights the importance of transforming a p-distance value into an evolutionary distance estimate in the context of phylogenetic inference, especially when p-distance values are large (e.g. > 0.1). The use of the F81-transformation represents a very simple but efficient way to quickly obtain efficient pairwise evolutionary distance estimates. Of note, such transformation can be easily adaptated with future implementations of the pairwise Mash dissimilarities (e.g.
This manuscript is dedicated to the memory of Nicolas Joly, who helped to improve the implementation of the script JolyTree for running on multiple threads. The author thanks Laetitia Fabre, Julien Guglielmini and Rachel Legendre for useful comments on the manuscript. The author is also grateful to Brian D. Ondov for reviewing this manuscript. The author is obliged to Sylvain Brisse and to the Hub de Bioinformatique et Biostatistique, Institut Pasteur, Paris (France), for support.
AC conceived the presented idea, implemented the script, conceived the simulations, performed the computations, and wrote the manuscript.
No conflict of interest to declare.
For each genus, the table reports the number of genomes (no. taxa), the mean genome sequence size (mean genome size) and the associated standard deviation (SD genome size), the mean percentage of observed GC content (mean %GC) and the associated standard deviation (SD %GC), the largest estimated Mash dissimilarity (max Mash dissimilarity) and F81-corrected Mash distance (max F81 distance), the treelikeness coefficients (arboricity and mean delta), the proportion of external negative branches (% negative branch), the mean rate of elementary quartets (mean branch support) and the associated standard deviation (SD branch support), and the inferred phylogenetic tree in NEWICK format.
Each raw contains the taxon name used during the JolyTree analysis (column 1), followed by the corresponding entry from the RefSeq assembly report (ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt).