Pangenome Compare: Evaluating Tools for Pangenome Graph Construction

I. Scalability and Performance Benchmarking of Pangenome Graph Tools

Pangenome graphs are increasingly recognized as a richer representation of genomic diversity than linear reference genomes. As the volume of sequenced genomes grows, the efficiency and scalability of pangenome graph construction tools become critical. This article compares five prominent methods (Bifrost, mdbg, Minigraph, Minigraph-Cactus, and pggb), each built on a distinct graph data structure. We assess their performance across datasets of increasing size to understand how well they scale and how they represent genetic diversity.

Our investigation involved constructing pangenomes with these five tools on datasets of 2, 10, and 104 human haplotypes, referred to below as H2, H10, and H104 (refer to Table 3 in the original article). The primary objective of this comparison is to evaluate the computational efficiency of each algorithm and to analyze the characteristics of the resulting pangenome graphs. This assessment is crucial for determining the suitability of each method for the vast genomic datasets anticipated in the near future, potentially encompassing thousands or even millions of human genomes.

We benchmarked each tool on several key performance indicators: running time, peak memory usage, and the disk space required to store the output graph and associated annotations. Additionally, we compared the structural complexity of the generated graphs by examining the number of nodes, edges, and connected components. The results are summarized in Table 1 (from the original article).
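As an illustration of how running time and peak memory can be collected for a command-line tool, here is a minimal sketch using only the Python standard library on a Unix-like system. It is not the benchmarking harness used in the original article; the function name `benchmark` and the placeholder command are assumptions made for this example.

```python
import resource
import subprocess
import sys
import time


def benchmark(cmd):
    """Run `cmd` to completion and return (wall_seconds, peak_rss_of_children)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    # ru_maxrss is the largest resident set size among all child processes
    # waited for so far, so use a fresh Python process per benchmarked tool.
    # The unit is kilobytes on Linux and bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, peak


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real tool invocation.
    wall, peak = benchmark([sys.executable, "-c", "print('building graph')"])
    print(f"wall time: {wall:.2f} s, peak RSS: {peak}")
```

Disk usage of the output can be read afterwards with `os.path.getsize` on the graph file and any annotation files the tool writes.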

In terms of computational speed, mdbg demonstrated remarkable efficiency, being two orders of magnitude faster than the other tools across all datasets. It completed pangenome construction in approximately two minutes for the H2 dataset and half an hour for the H104 dataset. Bifrost emerged as the second fastest tool for the H104 dataset, completing in 18 hours, while Minigraph was the second quickest for the H2 dataset, taking about 8 minutes. Minigraph-Cactus required significantly more time than Minigraph, roughly an order of magnitude greater. Notably, pggb and Minigraph-Cactus encountered limitations with the largest dataset (H104); pggb did not finish execution within two weeks, and Minigraph-Cactus returned an error, indicating scalability challenges for these tools with very large datasets.

Regarding memory consumption, mdbg consistently exhibited the lowest memory footprint, utilizing less than half the memory of other tools (31 GB for H104). Minigraph followed with 61 GB of memory usage on the H104 dataset. For the smaller H2 dataset, all tools utilized memory ranging from 8 to 66 GB.

All tools demonstrated reasonable disk space usage for storing the resulting pangenome graphs, staying below 12 GB for H10 and 38 GB for H104 datasets. Interestingly, Minigraph-Cactus and pggb, which retain all input variations and allow for direct haplotype reconstruction from the graph, were among the most disk-space efficient. For instance, Minigraph-Cactus used only 3.6 GB for H2 and 7 GB for H10. While Bifrost and Minigraph operate primarily in memory, pggb, Minigraph-Cactus, and mdbg utilize disk space for intermediate files, with space requirements comparable to, or up to three times the size of, the input data for Minigraph-Cactus.

II. Topological Differences in Pangenome Graphs: A Comparative View

Graph topology metrics, such as the number of nodes, edges, and connected components, are crucial for understanding the granularity of variation representation and the overall complexity and navigability of pangenome graphs. This section examines these topological aspects.

Across the H2 dataset, the number of nodes in the generated graphs varied significantly, ranging from 17,000 to 11 million depending on the tool. In all cases, the node count was considerably smaller (at least three orders of magnitude) than the total number of bases in the input haplotypes. This observation underscores the effectiveness of pangenome graphs in compressing the linear segments of haplotypes. Tools like Minigraph and mdbg, which discard certain types of variations, resulted in graphs with approximately 10^4 to 10^5 nodes across all datasets. Conversely, tools that preserve comprehensive variation, such as Bifrost, Minigraph-Cactus, and pggb, yielded graphs with node counts in the range of 10^6 to 10^7. Notably, for all tools, the transition from the H10 to the H104 dataset led to a roughly 5x increase in node count, indicating sublinear growth in graph complexity with increasing haplotype numbers.
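To make the sublinear-growth claim concrete, the short calculation below estimates the exponent b in a hypothetical power law, nodes ∝ haplotypes^b, from the two data points mentioned above. The power-law form is an assumption introduced for illustration only, and the input is the rounded "roughly 5x" figure rather than exact node counts.

```python
import math


def growth_exponent(hap_small, hap_large, node_ratio):
    """Exponent b in nodes ~ haplotypes**b implied by two data points."""
    return math.log(node_ratio) / math.log(hap_large / hap_small)


# H10 -> H104 is a 10.4x increase in haplotypes for a ~5x increase in nodes.
b = growth_exponent(10, 104, 5.0)
print(f"implied exponent: {b:.2f}")  # about 0.69, i.e. below 1 (sublinear)
```

An exponent of 1 would mean that node count grows in proportion to the number of haplotypes; a value below 1 means that each additional haplotype contributes fewer new nodes than the previous one.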

The number of connected components also exhibited substantial variation across methods and datasets, ranging from 2 to 1402. The number of large components (containing more than 1% of total base pairs) ranged from 1 to 30. Ideally, if chromosomes were perfectly segregated in the pangenome graph, we would expect to see 24 connected components (one for each nuclear chromosome, excluding mitochondria). Minigraph closely approached this ideal, producing 24 large connected components, mirroring the chromosome count in the CHM13 v2.0 reference genome (25 including mitochondria). Bifrost and Minigraph-Cactus yielded graphs with fewer than 25 connected components, while mdbg and pggb produced more than 25.
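The component counts above can be reproduced in principle from any graph stored in GFA format. The sketch below is one way to do it, assuming a GFA 1 file whose `S` lines carry the sequence inline (not `*` with an `LN` tag) and whose `L` lines connect segments. It uses a union-find structure, and the helper names are hypothetical.

```python
from collections import defaultdict


def component_sizes(gfa_lines):
    """Return the base-pair size of each connected component, largest first."""
    length = {}  # segment name -> sequence length
    parent = {}  # union-find forest over segment names
    links = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            length[fields[1]] = len(fields[2])
            parent[fields[1]] = fields[1]
        elif fields[0] == "L":
            links.append((fields[1], fields[3]))

    # Orientation is irrelevant for connectivity, so links are undirected here.
    for a, b in links:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    sizes = defaultdict(int)
    for name, n in length.items():
        sizes[find(name)] += n
    return sorted(sizes.values(), reverse=True)


def large_components(sizes, fraction=0.01):
    """Keep components holding more than `fraction` of all base pairs."""
    total = sum(sizes)
    return [s for s in sizes if s > fraction * total]
```

With the default `fraction=0.01`, `large_components` applies the same "more than 1% of total base pairs" criterion used in the text.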

In Bifrost de Bruijn graphs (dBGs), the majority of sequences (>99.99%) are consolidated into a single giant component. This is due to the inherent nature of dBGs where chromosomes are linked through shared k-mers. In contrast, mdbg graphs constructed from the H2 dataset exhibited 24 large components, potentially because of the absence of extended, highly similar inter-chromosomal regions in this smaller dataset. Minigraph did not incorporate any mitochondrial sequences from the input haplotypes into its graph, while Minigraph-Cactus graphs did include them.

While chromosome-by-chromosome analysis is a common practice in pangenomics, our comparison intentionally used entire genomes as input. This approach served two purposes: (i) to stress-test the scalability of the tools and (ii) to enable the detection of inter-chromosomal rearrangements like inversions, translocations, and transposable elements, even though some inter-chromosomal events might be alignment artifacts. The effects of this whole-genome approach are evident in the pggb and Minigraph-Cactus H10 variation graphs shown in Figure 1 (from the original article). The pggb graph showed 19 chromosomes linked into a single large component, comprising 25 million nodes and 83% of total base pairs. The remaining components, numbering 859, accounted for only 4.7% of the total bases, representing smaller sequences from the input haplotypes. In the Minigraph-Cactus graph, all chromosomes except chromosome 18 were linked into a single giant component. Chromosome 18 formed a separate component, while the sex chromosomes (X and Y) were connected together in another distinct component.

Fig. 1

Pangenome construction and visualization workflow. A: Overall process using 5 tools on 3 datasets. B: Minigraph’s 104-haplotype variation graph. C: HLA (MHC) region focus from B. D: DRB1-5 locus of HLA from C. E: pggb’s 10-haplotype variation graph. F: Minigraph-Cactus’s 10-haplotype graph. G: mdbg’s 104-haplotype pangenome. H: Bifrost dBG for 10 haplotypes. Minigraph graphs are directly rendered; others simplified with gfatools and rendered with Bandage. VG: Variation Graph.

III. Interpretation of Variation in Pangenome Graphs: Focus on HLA Loci

A critical aspect of comparing pangenome tools is evaluating how each one facilitates the detection and interpretation of genetic variations within the constructed graphs. Prior research suggests that for tasks like identifying genomic diversity and mapping reads to complex regions, building graphs on specific loci rather than entire genomes can be advantageous. In this section, we assess how pangenomes built from whole haplotypes represent biologically significant loci, specifically focusing on two regions within the Human Leukocyte Antigen (HLA) complex.

III.A. Extraction of HLA-E and Complex HLA Regions

We extracted regions corresponding to two HLA loci from the complete pangenome graphs generated by each tool. The HLA complex is of significant medical interest due to its association with numerous disease-related variants. The first locus we examined is the HLA-E gene, a part of the nonclassical class I region genes, spanning 4.8 kbp and known for its relative conservation across populations. It has been linked to COVID-19 severity, showing association with hospitalization and ICU admission. The second locus is a more complex HLA region, approximately 58 kbp long, encompassing the HLA-A gene (part of the highly polymorphic classical class I region) and genes like HLA-U, HLA-K, HLA-H, and HCG4B.

To extract these regions, we employed a custom script tailored to each pangenome graph type. The script aimed to generate subgraphs representing these loci and their variations. Where possible, we used precise genomic coordinates for extraction; otherwise, we resorted to sequence-to-graph alignment. For variation graphs and mDBGs, nodes near an aligned region generally correspond to variations of the locus. However, this is not always the case for standard dBGs. Accurately and completely extracting loci representations remains a challenge for dBGs.
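The coordinate-based case can be sketched as follows. This is not the custom script from the original article. It is a minimal illustration that assumes the graph embeds a reference path given as an ordered list of `(node_id, node_length)` pairs, with 0-based, half-open coordinates along that path.

```python
def nodes_overlapping(path, start, end):
    """Return node ids on `path` overlapping the half-open interval [start, end).

    `path` is a list of (node_id, node_length) pairs in path order, and the
    coordinates are 0-based offsets along the sequence spelled by the path.
    """
    hits = []
    offset = 0
    for node_id, node_len in path:
        node_end = offset + node_len
        if offset < end and node_end > start:
            hits.append(node_id)
        if node_end >= end:
            break  # every later node starts at or after `end`
        offset = node_end
    return hits


# Toy reference path: four nodes of lengths 100, 1, 50 and 200.
ref_path = [("n1", 100), ("n2", 1), ("n3", 50), ("n4", 200)]
print(nodes_overlapping(ref_path, 100, 151))  # ['n2', 'n3']
```

This returns only the nodes on the reference path itself. To recover the variation at the locus, the selected nodes would then need to be extended to their graph neighbors (for example, the alternative branches of each bubble), which is the step that becomes unreliable in standard dBGs.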

III.B. HLA-E: A Low Complexity Region Analysis

Figure 2 (from the original article) illustrates the representation of the HLA-E locus by each tool across the H2, H10, and H104 datasets. As anticipated, Minigraph did not detect any variation in this locus, likely because the SNPs characterizing the region are too small for its variation detection parameters. In contrast, pggb identified 2 SNPs in H2 and 3 in H10. Bifrost detected the same SNPs as pggb in H2 and H10, showing identical variation representation and haplotype paths. mdbg captured the heterozygosity of a larger region encompassing HLA-E as sample size increased. Given that mdbg graphs are built in minimizer space, nodes represent extended genomic segments (hundreds of thousands of base pairs). In H10 and H104, the minimizer-space representations of haplotypes were consistent, but variations in flanking graph regions led to additional nodes being extracted in this locus. In the H2 dataset, Minigraph-Cactus detected 3 variations, reflecting the different dataset composition (CHM13 reference and one HG006 haplotype) as described in the original article.

Fig. 2

HLA-E locus representations by five methods across three pangenome sizes. Red nodes contain locus sequence. Node/edge counts are for the entire subgraph. Minigraph (H2, H10, H104) and mdbg (H2) show only a portion of one node highlighted due to the 4.8 kbp region residing within a single large node.

Figure 2 also highlights the increasing complexity of pangenome graphs with dataset size. The Bifrost H104 subgraph exhibited the most variation, demonstrating dBGs’ capacity to exhaustively represent variations in large graphs. pggb, while not scalable to H104 for whole genome graphs, offered the most straightforward subgraph extraction and comprehensive variant representation for H2 and H10 datasets.

III.C. HLA Complex Locus: A High Complexity Region Analysis

Figure 3 (from the original article) compares the representations of the complex HLA region, mirroring Figure 2 for the simpler HLA-E locus. This region is harder to interpret than HLA-E because its variations are more numerous and more complex. Direct comparisons across tools also become more intricate. Base-level variations like SNPs are not readily visible in Figure 3 for methods that retain them (pggb, Minigraph-Cactus, and Bifrost) due to the sheer size of these graphs.

Fig. 3

Complex HLA region representations by five methods across three pangenome sizes. Details are as in Figure 2 caption.

Notable differences emerge in how each tool represents variation in this complex locus, particularly evident in the H2 dataset. Minigraph depicted H2 as a single sequence with a large structural variant (SV) of approximately 52 kbp. In contrast, pggb separated H2 into two paths differing by roughly 54 kbp in length. Bifrost showed a detailed “bubble” structure, containing numerous variations within each path. Extracting the complete locus from mdbg proved challenging, as many subgraph nodes were not selected by our extraction procedure. Minigraph-Cactus augmented Minigraph’s SV graph with base-level divergences between haplotypes.

These representational differences become more pronounced in the H10 dataset. pggb tended to separate haplotypes into distinct paths, Bifrost consistently rendered a compacted representation, Minigraph primarily displayed the overall structural variation landscape while missing smaller differences, and Minigraph-Cactus, similar to H2, layered small variations on top of the Minigraph structure.

IV. Uncovering Key Characteristics of Graphical Pangenome Tools

Pangenome graph construction tools are designed to facilitate genome comparisons and enable downstream applications. For effective utilization, pangenome graphs need to be stored in a format that is both accessible and efficient. Our comparison identified eight crucial features for evaluating pangenome graph construction tools. Six are discussed in this section: (i) stability, (ii) editability, (iii) accessibility by downstream applications, (iv) haplotype compression performance, (v) ease of visualization, and (vi) quality of metadata and annotation. The remaining two, scalability and interpretability, were covered in the preceding sections. Table 2 (from the original article) summarizes the relative strengths of each tool based on these features.

IV.A. Editability and Dynamic Updates

In a rapidly evolving genomic landscape, the ability to dynamically update pangenome graphs is essential. As more high-quality genome assemblies become available, the capacity to add new haplotypes or replace existing ones with improved versions without rebuilding the entire graph from scratch offers significant computational and resource efficiency. However, many current pangenome representations rely on static data structures that are not inherently updatable.

Some tools offer limited editing capabilities. Minigraph allows for the addition of new haplotypes to an existing graph. Bifrost provides C++ APIs for adding or removing sequences, k-mers, and colors from the ccdBG. pggb, leveraging odgi, supports specific operations for deleting and modifying nodes and edges, and for adding and modifying paths within the graph. Minigraph-Cactus, being compatible with odgi, inherits the same editing capabilities as pggb. The current mdbg implementation uses a dynamic hash table but lacks a publicly exposed interface for updates.

IV.B. Stability and Reproducibility

Counterintuitively, some pangenome graph construction tools may produce varying outputs even when run multiple times with identical input haplotypes. This instability can arise from sensitivity to input sequence order or non-deterministic elements in the construction algorithm. To ensure research reproducibility, pangenome tools should ideally generate consistent outputs from the same input set, regardless of run instance or input order.

We assessed tool stability through two tests: (i) running each tool three times on the same H10 dataset and (ii) running each tool twice with shuffled input sequences (altering the haplotype order in H10).
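One simple way to check whether two runs produced the same graph is to compare an order-independent fingerprint of the node sequences. The sketch below is an assumption about how such a check could be done, not the procedure used in the original article. It hashes the sorted multiset of GFA segment sequences, treating a sequence and its reverse complement as equivalent, and it ignores node names and edges. Equal fingerprints are therefore a necessary but not sufficient condition for identical graphs.

```python
import hashlib

_COMP = str.maketrans("ACGTN", "TGCAN")


def canonical(seq):
    """Return the lexicographically smaller of a sequence and its reverse complement."""
    rc = seq.translate(_COMP)[::-1]
    return min(seq, rc)


def graph_fingerprint(gfa_lines):
    """Hash the multiset of segment sequences, ignoring names and line order."""
    seqs = sorted(
        canonical(line.split("\t")[2].strip().upper())
        for line in gfa_lines
        if line.startswith("S\t")
    )
    digest = hashlib.sha256()
    for seq in seqs:
        digest.update(seq.encode())
        digest.update(b"\n")
    return digest.hexdigest()
```

A stricter comparison would also canonicalize the links, which requires mapping node names between runs and is closer to a graph isomorphism test.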

Bifrost and mdbg demonstrated perfect stability, generating identical pangenomes across all tests. This stability is inherent to the deterministic nature of de Bruijn graph construction. Minigraph produced identical graphs with identical inputs but showed slight variations when input order was permuted. This order-sensitivity is due to Minigraph’s incremental graph construction approach, where each new haplotype is aligned and integrated into the existing graph structure. Minigraph-Cactus also exhibited slight graph variations with identical inputs. pggb showed minor graph variations while maintaining the same haplotype sequences in the paths. Thus, the overall representation of input genomes remained consistent, but the variation graph topology varied slightly. The initial phases of the pggb pipeline (all-vs-all alignment and graph imputation) were consistent across runs with identical input, but differences emerged in the final smoothing steps, leading to topological variations.

IV.C. Accessibility for Downstream Applications

The practical utility of pangenome representations hinges on their accessibility for downstream analyses. De Bruijn graphs, while comprehensive, can be challenging to analyze directly due to their high complexity and redundancy from k-mer overlaps. While dBGs like Bifrost support presence/absence queries on nodes, they lack tools for more complex analyses, such as incorporating haplotype information at the k-mer level, as needed for the HLA loci analysis discussed earlier. Variation graphs with paths, like those from pggb and Minigraph-Cactus, offer greater analytical flexibility, especially when used with visualization toolkits like odgi. Minigraph, which focuses on structural variants and lacks path information, requires manual mapping of haplotypes back to the graph for haplotype-level analysis.
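The presence/absence queries mentioned above can be illustrated with a toy colored k-mer index. This is a conceptual sketch only: it does not use Bifrost's API, it stores plain (non-canonical) k-mers in a dictionary without compacting them into unitigs, and the function names are hypothetical.

```python
from collections import defaultdict


def build_colored_kmers(haplotypes, k):
    """Map every k-mer to the set of haplotype names (colors) containing it."""
    colors = defaultdict(set)
    for name, seq in haplotypes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return colors


def query(colors, seq, k):
    """Return, per haplotype, the fraction of the query's k-mers it contains."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = defaultdict(int)
    for kmer in kmers:
        for name in colors.get(kmer, ()):
            counts[name] += 1
    return {name: n / len(kmers) for name, n in counts.items()}


haps = {"hapA": "ACGTACGT", "hapB": "ACGTTCGT"}
index = build_colored_kmers(haps, k=4)
print(query(index, "ACGTA", k=4))  # {'hapA': 1.0, 'hapB': 0.5}
```

The index records which haplotypes contain each k-mer but not where or in what order, which is why haplotype-level analyses of a locus need information beyond the colors.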

The choice of pangenome tool should align with the intended downstream applications. pggb and Minigraph-Cactus graphs have demonstrated superior performance over linear references for tasks like short-read mapping, genotyping, and RNA sequencing mapping. However, these are complex pipelines requiring careful parameter tuning and potentially more challenging installation and execution compared to single integrated tools. Minigraph can be a viable option for focusing on structural variation and rapid pangenome graph generation for visualization and interpretation of complex loci. dBG-based approaches like Bifrost, while retaining base-level information comparable to variation graph methods, are currently limited by a lack of readily available analysis tools, hindering their broader adoption.

IV.D. Haplotype Compression Efficiency

Pangenome graph construction can also be viewed as a method for compressing, storing, and retrieving input haplotypes. With the exponential growth of genomic data outpacing storage capacity, pangenomes offer a potential solution for space-efficient genome representation. The disk space usage reported in Table 1 (from the original article) consistently demonstrates that pangenome graphs require less storage than the sum of the sizes of individual input haplotypes.

For lossless haplotype retrieval, the pangenome representation must preserve all variations from the original sequences as paths within the graph. pggb and Minigraph-Cactus fall into this category, while Bifrost, mdbg, and Minigraph are lossy, either not storing paths or not capturing all types of variations.
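Lossless retrieval from a path-containing graph amounts to concatenating node sequences along the stored path, reverse-complementing the nodes traversed in the reverse orientation. The sketch below assumes blunt-ended nodes (no overlap between consecutive nodes, as in variation graphs) and uses a hypothetical in-memory representation in place of a parsed GFA file.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def spell_path(segments, steps):
    """Concatenate node sequences along a path of (node_id, orientation) steps."""
    parts = []
    for node_id, orientation in steps:
        seq = segments[node_id]
        parts.append(seq if orientation == "+" else reverse_complement(seq))
    return "".join(parts)


# Toy graph with one SNP bubble: node 2 (T) versus node 3 (G).
segments = {"1": "ACG", "2": "T", "3": "G", "4": "TTA"}
hap1 = [("1", "+"), ("2", "+"), ("4", "+")]
hap2 = [("1", "+"), ("3", "+"), ("4", "+")]
print(spell_path(segments, hap1))  # ACGTTTA
print(spell_path(segments, hap2))  # ACGGTTA
```

Without the stored paths, as in Minigraph output, the graph alone does not say which branch of each bubble a given haplotype takes, so the input sequence cannot be recovered this way.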

The GBZ tool enables lossless compression of path-containing graph pangenomes in GFA format, achieving higher compression rates than standard gzip. Using GBZ, pangenomes generated by pggb and Minigraph-Cactus can be losslessly compressed with space savings of 3.5–5x.

IV.E. Ease of Visualization

Visualizing large graphs with hundreds of thousands of nodes or more presents a significant challenge in pangenomics. The H104 pangenomes, in particular, are difficult to visualize effectively. Among visualization tools evaluated by the Human Pangenome Reference Consortium, only Bandage could handle the Minigraph or mdbg H104 graphs, which, per the node counts reported in Section II, contain on the order of 10^5 nodes. To improve visualization of pggb, Minigraph-Cactus, and Bifrost H10 graphs, we simplified them by collapsing isolated subgraphs representing SNPs or indels up to 10 kbp using gfatools.

IV.F. Quality of Metadata and Annotation Support

Enhancing pangenome graphs with metadata from other omics data sources could significantly increase their biological relevance. As biobanks expand rapidly, vast datasets on regulatory regions, transcriptomics, CNVs, and other medically relevant traits are becoming available. Pangenome data structures capable of leveraging such information are highly desirable. Some of the tools compared here offer basic annotation functionalities. Bifrost allows linking data to graph vertices through C++ APIs. pggb and Minigraph-Cactus, via odgi, support annotation through path insertion or BED record integration. Minigraph and mdbg currently lack annotation features.

To effectively enrich pangenome graphs with metadata (e.g., genes, regulatory regions, known variants), compatibility with linear reference-based methods and data formats is crucial. Projecting data from a graph back to a reference genome can facilitate downstream analyses using linear coordinates. A straightforward approach to achieve this compatibility is to embed a reference genome within the pangenome graph, enabling its retrieval. Variation graphs from pggb or Minigraph-Cactus, due to their directed acyclic nature and haplotype paths, inherently store the coordinates needed for such projection. Haplotype paths are particularly valuable as they eliminate the need for additional mapping to the graph when extracting or injecting information using odgi. Minigraph, lacking haplotype paths, necessitates sequence mapping to the graph to recover haplotype information. De Bruijn graphs, using associated color data, can track k-mer membership to a reference sequence, but full haplotype reconstruction requires storing k-mer positions as well.

Table 2. Relative strengths of five pangenome graph construction tools.

Feature                           Bifrost   mdbg     pggb       Minigraph   Minigraph-Cactus
(1) Construction Efficacy         Medium    High     Low        Medium      Low
(2) Variant Retention             High      Low      High       Low         High
(3) Scalability                   Medium    High     Low        Medium      Low
(4) Editability                   Medium    Low      Medium     Medium      Medium
(5) Stability                     High      High     Medium     Medium      Medium
(6) Downstream Tooling            Low       Low      High       Medium      High
(7) Haplotype Compression         Lossy     Lossy    Lossless   Lossy       Lossless
(8) Graph Visualization           Low       Medium   Low        Medium      Low
(9) Locus Zoom & Interpret        Medium    Medium   Medium     Medium      Medium
(10) Annotation Functionality     Low       Low      Medium     Low         Medium
(11) Reference Compatibility      Low       Low      High       Low         High

V. Pangenome Compare: Concluding Remarks

This comparison highlights the diverse capabilities and trade-offs associated with different pangenome graph construction tools. Each tool (Bifrost, mdbg, Minigraph, Minigraph-Cactus, and pggb) offers a unique approach to representing and analyzing genomic diversity. The choice of the most appropriate tool depends heavily on the specific research question, dataset size, and desired downstream applications.

For applications requiring rapid construction and minimal computational resources, particularly with large datasets, mdbg stands out due to its exceptional speed and low memory footprint. Bifrost offers a balance of speed and comprehensive variation representation, making it suitable for scenarios where detailed variant information is crucial. Minigraph excels in efficiently capturing structural variations and providing a relatively simple graph structure for visualization and interpretation. pggb and Minigraph-Cactus, while computationally more demanding and less scalable to extremely large datasets in their current implementations, offer lossless haplotype representation, enhanced annotation capabilities, and strong compatibility with linear reference-based workflows, making them powerful choices for comprehensive pangenome analysis and integration with existing genomic resources. Ultimately, this comparison provides a framework for researchers to make informed decisions when selecting a pangenome graph construction tool for a specific research goal.
