Comparing Pangenome Graph Construction Tools: Evaluating Scalability and Performance for Genomic Diversity Analysis

The advent of pangenomics has changed how genomic diversity is studied, moving beyond a single linear reference genome to represent many genomes of a species at once. Constructing pangenomes, however, relies on sophisticated computational tools. This article provides a comparative analysis of five prominent pangenome graph construction tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus, and pggb. We describe how each tool, built on a distinct graph data structure, addresses the challenges of scalability and of representing genetic variation. Our evaluation examines their computational performance and the characteristics of the pangenome graphs they generate, using datasets of increasing size to simulate the growing scale of human pangenomics data. Although vg is a well-known tool for pangenome analysis, we excluded it from this study because of its limitations in handling structural variation and its reference bias, both of which matter for wider adoption of pangenome graphs.

Scalability and Performance Benchmarking of Pangenome Graph Construction Tools

To assess the performance of the five tools, we ran experiments on three datasets comprising 2, 10, and 104 human haplotypes, referred to below as H2, H10, and H104 (detailed in Table 3 of the original article). Our benchmarking focused on the computational efficiency of the construction algorithms and on the properties of the resulting pangenome graphs. The primary objective was to determine how well each method scales with rapidly growing genomic datasets, anticipating future scenarios involving thousands or even millions of human genomes. Scalability is therefore a key criterion when selecting a tool for large-scale genomic research.

We evaluated each tool on three performance indicators: running time, peak memory usage, and the disk space required to store the output data structures (graphs and annotations). To gauge graph complexity, we also compared the number of nodes, edges, and connected components generated by each tool. Table 1 in the original article summarizes these findings as a quantitative comparison of the five tools.
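The graph metrics listed above can be computed directly from a graph file. The following is a minimal sketch, assuming GFA 1 input with tab-separated S (segment) and L (link) lines; the function name gfa_metrics is illustrative and is not part of any of the benchmarked tools.

```python
def gfa_metrics(gfa_lines):
    """Count nodes, edges and connected components from GFA 1 text lines.

    Assumption: segments are 'S' lines (name in column 2) and edges are
    'L' lines (source in column 2, target in column 4).
    """
    parent = {}   # union-find forest over segment names
    links = []

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            parent.setdefault(fields[1], fields[1])
        elif fields[0] == "L":
            links.append((fields[1], fields[3]))

    for a, b in links:
        # Tolerate links that mention segments declared later or not at all.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    components = len({find(x) for x in parent})
    return {"nodes": len(parent), "edges": len(links), "components": components}


# Toy graph: segments 1-2-3 are linked, segment 4 is isolated.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tGG",
    "S\t3\tTTA",
    "S\t4\tCCC",
    "L\t1\t+\t2\t+\t0M",
    "L\t2\t+\t3\t-\t0M",
]
print(gfa_metrics(toy_gfa))
```

Orientation is ignored here because connectivity does not depend on strand. For graphs with millions of nodes, the lines would be streamed from a file handle rather than held in a list.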

Regarding running time, mdbg was the fastest tool on every dataset, outperforming the others by up to two orders of magnitude. It completed in approximately two minutes on the H2 dataset and half an hour on the H104 dataset. Bifrost was the second fastest on H104 (18 hours), while Minigraph was the second fastest on H2 (8 minutes). Minigraph-Cactus running times were approximately ten times longer than those of Minigraph. Notably, pggb and Minigraph-Cactus failed to produce graphs for the H104 dataset: pggb did not complete within two weeks, and Minigraph-Cactus terminated with an error.

In terms of memory consumption, mdbg consistently required less than half the memory compared to other tools, using only 31 GB for the H104 dataset. Minigraph followed, utilizing 61 GB for the same dataset. For the smaller H2 dataset, memory usage ranged from 8 to 66 GB across all tools.

All tools showed reasonable disk usage for storing the resulting graphs, remaining below 12 GB for H10 and 38 GB for H104. Minigraph-Cactus and pggb are the only tools that retain all variation and allow the input haplotypes to be reconstructed directly from the graph, yet their graphs remained compact. For example, Minigraph-Cactus used 3.6 GB for H2 and 7 GB for H10. Bifrost and Minigraph operate entirely in memory, whereas pggb, Minigraph-Cactus, and mdbg write intermediate files to disk. These files require space comparable to the input size, and up to three times the input size for Minigraph-Cactus.

Topological Differences in Pangenome Graphs Generated by Different Tools

Graph metrics such as the number of nodes, edges, and connected components indicate how finely variation is represented and how complex and navigable the pangenome graph is. Understanding these topological differences is essential when choosing the most appropriate tool for a specific research question.

For the H2 dataset, the number of graph nodes varied widely, from 17,000 to 11 million depending on the tool. In all cases, the number of nodes was significantly smaller (at least three orders of magnitude) than the total number of bases in the input haplotypes. This highlights the effectiveness of pangenome graphs in compressing linear haplotype sequences. Minigraph and mdbg, which discard some variation, produced graphs with approximately 10^4–10^5 nodes across all datasets. Conversely, the tools that retain comprehensive variation (Bifrost, Minigraph-Cactus, and pggb) generated graphs with 10^6–10^7 nodes. Interestingly, the node count grew sublinearly with dataset size: moving from H10 to H104, roughly a tenfold increase in haplotypes, resulted in only a fivefold increase in the number of nodes.

The number of connected components also varied widely, from 2 to 1402, depending on the method and dataset. The number of large components (containing >1% of total base pairs) ranged from 1 to 30. Ideally, a pangenome graph representing complete chromosomes should exhibit 24 connected components (one per nuclear chromosome, excluding mitochondria). Minigraph closely approached this ideal, producing 24 large connected components, mirroring the chromosome count in the CHM13 v2.0 reference. Bifrost and Minigraph-Cactus generated graphs with fewer than 25 connected components, while mdbg and pggb resulted in more than 25. In Bifrost’s de Bruijn graph (dBG), the majority of sequences (>99.99%) resided within a single giant component due to chromosome joining via shared k-mers. This joining was not observed in mdbg for the H2 dataset, which presented 24 large components, possibly due to the absence of sufficiently long and similar inter-chromosomal regions in this smaller dataset. Minigraph did not incorporate mitochondrial sequences from the input haplotypes, whereas Minigraph-Cactus graphs did include them.
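The notion of a "large" component used above (more than 1% of total base pairs) can be made concrete with a short sketch. It assumes segment lengths and links have already been parsed from the graph; large_components and its arguments are hypothetical names chosen for illustration.

```python
def large_components(segment_lengths, links, min_fraction=0.01):
    """Return the base-pair size of every component above min_fraction.

    segment_lengths: dict mapping segment name -> length in bp
    links: iterable of (segment_a, segment_b) pairs
    Note: in de Bruijn graphs adjacent nodes overlap by k-1 bases, so
    summing node lengths overestimates the bases per component.
    """
    parent = {name: name for name in segment_lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    # Accumulate base pairs per component root.
    bp_per_component = {}
    for name, length in segment_lengths.items():
        root = find(name)
        bp_per_component[root] = bp_per_component.get(root, 0) + length

    total_bp = sum(segment_lengths.values())
    threshold = min_fraction * total_bp
    return sorted((bp for bp in bp_per_component.values() if bp > threshold),
                  reverse=True)


# Toy example: {a, b} holds 900 bp, c holds 95 bp, d holds 5 bp of 1000 bp.
sizes = large_components({"a": 600, "b": 300, "c": 95, "d": 5}, [("a", "b")])
print(len(sizes), sizes)
```

Under this definition, a graph in which chromosomes are joined through shared sequence would report fewer large components than the 24 expected for complete chromosomes.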

While chromosome-by-chromosome analysis is common in pangenomics, our study intentionally used whole genomes as input to evaluate tool scalability and to potentially identify inter-chromosomal events such as inversions and translocations. The effects of this whole-genome approach are evident in the pggb and Minigraph-Cactus H10 variation graphs (Fig. 1 of the original article). The pggb graph exhibited a giant component linking 19 chromosomes, containing 25 million nodes and 83% of total base pairs. Another 859 small components, representing only 4.7% of total bases, were attributed to smaller input haplotype sequences. In the Minigraph-Cactus graph, all autosomes except chromosome 18 were linked in a single giant component. Chromosome 18 formed a separate component, and the sex chromosomes (X and Y) were connected in another component.

Fig. 1

Pangenome graph construction and visualization workflow. A: Overall workflow using five pangenome graph construction tools on three datasets. B: Complete 104-haplotype variation graph built by Minigraph. C: Focus on the HLA (MHC) region of chromosome 6 from panel B. D: DRB1-5 locus of HLA from panel C. E: Complete 10-haplotype variation graph built with pggb. F: 10-haplotype variation graph built with Minigraph-Cactus. G: 104-haplotype pangenome built with mdbg. H: 10-haplotype Bifrost dBG. All graphs except the Minigraph graph were simplified using gfatools; all were rendered with Bandage. VG: variation graph.

Table 1. Performance comparison of pangenome graph construction tools.

Comparison of time, memory, disk space, nodes, edges, and connected components for Bifrost, mdbg, pggb, Minigraph, and Minigraph-Cactus across the three haplotype counts. Minigraph-Cactus times include Minigraph graph construction. pggb did not complete on the H104 dataset within two weeks. Minigraph-Cactus failed with an error on the H104 dataset.

Interpreting Variation within Pangenome Graphs: HLA Loci Case Study

The effectiveness of a pangenome graph construction tool is also reflected in its ability to detect and represent variation among the input haplotypes. Previous research suggests focusing on specific loci rather than whole genomes for identifying genomic diversity and mapping reads to complex regions. Here, we assess how pangenomes constructed from complete haplotypes represent biologically significant loci, using two HLA (Human Leukocyte Antigen) loci as examples.

HLA-E and Complex HLA Region Extraction from Pangenome Graphs

We extracted regions corresponding to two HLA loci from the complete pangenome graphs. The HLA complex is medically crucial due to its association with numerous disease variants. The first locus, HLA-E, a relatively conserved 4.8 kbp region within nonclassical class I genes, is linked to COVID-19 severity. The second, a more complex 58 kbp HLA region encompassing the HLA-A gene (highly polymorphic class I) and genes like HLA-U, HLA-K, HLA-H, and HCG4B, presents a greater challenge for variation representation. We utilized a custom script to extract subgraphs for these loci from each pangenome graph, employing method-specific approaches. For variation graphs and mDBGs, nearby nodes in the aligned region reflect locus variations. However, accurate locus representation in standard dBGs remains a challenge.
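The custom extraction script is not reproduced in this article. As a rough illustration of the general idea only, the sketch below collects every node within a fixed number of hops of a set of seed nodes, where the seeds stand for nodes hit by aligning the locus sequence to the graph. The function name, the adjacency representation, and the hop limit are all assumptions and do not describe the authors' actual method.

```python
from collections import deque


def neighborhood(adjacency, seeds, max_hops):
    """Breadth-first collection of nodes within max_hops of any seed.

    adjacency: dict mapping node -> iterable of neighbouring nodes
               (treated as undirected: list each edge in both directions)
    seeds:     nodes assumed to carry the aligned locus sequence
    """
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Toy graph: a chain 1-2-3-4-5 with node 6 forming a bubble between 2 and 3.
adjacency = {
    1: [2],
    2: [1, 3, 6],
    3: [2, 4, 6],
    4: [3, 5],
    5: [4],
    6: [2, 3],
}
print(sorted(neighborhood(adjacency, seeds=[3], max_hops=1)))
```

A hop-based neighbourhood is a natural fit for variation graphs and mDBGs, where nearby nodes reflect locus variation. In a standard dBG, dense branching can pull in unrelated sequence, which is consistent with the difficulty noted above.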

HLA-E: Representing a Low Complexity Region

Figure 2 in the original article illustrates the HLA-E locus representation across the H2, H10, and H104 datasets for each tool. As anticipated, Minigraph detected no variation, because its algorithm targets structural variants and the SNPs in this region are too small for it to record. pggb identified 2 SNPs in H2 and 3 in H10, and Bifrost mirrored pggb's SNP detection and haplotype path representation in both datasets. mdbg captured heterozygosity in a broader region around HLA-E as sample size increased. Although mdbg graphs are built in minimizer space, with each node representing a large genomic segment, differences in the flanking regions were captured as extra nodes. On H2, Minigraph-Cactus detected 3 variations, likely due to dataset differences (including the CHM13 reference and one HG006 haplotype).

Fig. 2

HLA-E locus representations produced by the five tools across increasing human pangenome sizes. Red nodes contain locus sequence. Node and edge counts below each graph are for the entire subgraph. For Minigraph (H2, H10, H104) and mdbg (H2), only a portion of one node is highlighted because the 4.8 kbp region lies within a single long node.

Figure 2 also demonstrates pangenome complexity growth with genome number. Bifrost’s H104 subgraph exhibited the most variation, highlighting dBGs’ exhaustive variation representation in large graphs. pggb offered straightforward subgraph extraction and also exhaustively represented variants in H2 and H10, but did not scale to H104.

Complex HLA Locus: A High Complexity Region

Figure 3 in the original article presents the complex HLA locus representations, contrasting with the simpler HLA-E locus. Interpreting variation in this region is more challenging due to increased variation complexity and structural differences. Comparing tools is also more difficult. Base-level variations (SNPs) are not visually discernible in methods retaining them (pggb, Minigraph-Cactus, Bifrost) due to graph size.

Fig. 3

Complex HLA region representations produced by the five tools across increasing human pangenome sizes. Details as in the Figure 2 caption.

Notable differences in variation representation emerge in the H2 dataset. Minigraph represents H2 as a single sequence with a large (~52 kbp) structural variant. pggb separates it into two paths differing by ~54 kbp in length. Bifrost shows a detailed bubble containing numerous variations within each path. mdbg subgraph extraction proved challenging, with many nodes unselected. Minigraph-Cactus adds base-level divergences to Minigraph’s structural variant graph.

These representation differences are amplified in the H10 dataset. pggb tends to separate haplotypes into distinct paths. Bifrost maintains a consistently compacted representation. Minigraph accurately displays the general structure but misses smaller differences. Minigraph-Cactus, similar to H2, adds minor variations atop the Minigraph structure.

Key Features of Graphical Pangenome Tools for Genomic Analysis

Effective pangenome tools should facilitate comparisons between input genomes and be readily usable in downstream applications. We identified eight features for evaluating them. Scalability and interpretability were discussed above; the remaining six are (i) stability, (ii) editability, (iii) accessibility for downstream applications, (iv) haplotype compression performance, (v) ease of visualization, and (vi) quality of metadata and annotation. Table 2 in the original article summarizes the relative strengths of the tools across these features, providing a concise overview for users choosing among them.

Editability and Dynamic Updates

With the continuous generation of high-quality assemblies, the ability to update pangenomes by adding or replacing haplotypes efficiently is crucial. Updating existing data structures is computationally and energetically more efficient than rebuilding from scratch. However, many succinct pangenome representations are static. Some methods offer limited editing capabilities. Minigraph allows adding haplotypes to existing graphs. Bifrost provides C++ APIs for adding/removing sequences, k-mers, and colors. pggb and Minigraph-Cactus, using odgi, support operations to delete/modify nodes/edges and add/modify paths. While mdbg uses a dynamic hash table, it lacks an update interface.

Stability and Reproducibility

Ideally, a pangenome graph construction tool should produce the same output for the same input haplotypes, regardless of input order or repeated runs. Instability can arise from permuting the input sequences or from non-deterministic algorithms, and it hinders reproducibility. We tested stability by running each tool three times on the H10 dataset and twice on H10 with the input sequences shuffled.

Bifrost and mdbg consistently produced identical pangenomes, owing to the inherent stability of de Bruijn graphs. Minigraph generated identical graphs from identical inputs but slightly different graphs from permuted input, reflecting its order-sensitive construction algorithm. Minigraph-Cactus showed slight variations even with identical input. pggb produced slightly different graphs while preserving the haplotype sequences in its paths. Most topological variation in pggb graphs arose from the smoothing step, while the initial alignment and graph induction phases remained consistent.
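One simple way to compare two runs is to check whether they contain the same node sequences, ignoring node names and strand. The sketch below is a minimal example of such a check, assuming GFA 1 input; it is not the procedure used in the study, and the helper names are illustrative. Equal fingerprints are a necessary condition for identical graphs but not a sufficient one, since edges and paths are not compared.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(seq):
    """Return the lexicographically smaller of a sequence and its reverse complement."""
    seq = seq.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, reverse_complement)


def segment_fingerprint(gfa_lines):
    """Multiset of canonical segment sequences, independent of names and order."""
    return Counter(
        canonical(line.rstrip("\n").split("\t")[2])
        for line in gfa_lines
        if line.startswith("S\t")
    )


def same_segments(gfa_a, gfa_b):
    return segment_fingerprint(gfa_a) == segment_fingerprint(gfa_b)


# Two toy runs with different node names, order and strand, but the same content.
run_1 = ["S\t1\tAAC", "S\t2\tGGT", "L\t1\t+\t2\t+\t0M"]
run_2 = ["S\tx\tACC", "S\ty\tGTT", "L\ty\t-\tx\t-\t0M"]
run_3 = ["S\t1\tAAC", "S\t2\tGGA"]
print(same_segments(run_1, run_2), same_segments(run_1, run_3))
```

A check of this kind can pass for graphs whose node content is stable even when node identifiers change between runs.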

Accessibility for Downstream Applications

Pangenome representations should be easily accessible for downstream analyses. De Bruijn graphs support presence/absence queries but are challenging to analyze because of their complexity and redundancy. Variation graphs with paths (pggb and Minigraph-Cactus, manipulated with odgi) offer greater flexibility for analysis. Minigraph graphs lack path information, so haplotypes must be mapped back to the graph separately. The best choice of tool depends on the intended application. pggb and Minigraph-Cactus graphs excel in short-read mapping, genotyping, and RNA sequencing mapping, outperforming linear references. However, their complex, multi-tool pipelines can be harder to install and run than integrated tools. Minigraph is suitable when the focus is on structural variation and when a pangenome graph is needed quickly for visualization. dBG-based approaches such as Bifrost retain base-level information but lack analysis tools, which limits their broader application.

Haplotype Compression and Storage Efficiency

Pangenome graph construction can also be viewed as a method for haplotype storage and compression. With assembly data growing faster than storage capacity, pangenomes offer potential storage savings. Table 1 shows that every tool produced graphs requiring less disk space than the summed size of the input haplotypes.

For lossless haplotype retrieval, a pangenome representation must store every input haplotype as a path in the graph, covering all variations. pggb and Minigraph-Cactus achieve this, while Bifrost, Minigraph, and mdbg are lossy because they do not store paths or do not retain all variations.
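To illustrate what lossless retrieval means in practice, the sketch below spells each haplotype back out of a graph by walking its path. It assumes GFA 1 input in which haplotypes are stored as P lines and segments do not overlap; graphs that record haplotypes in another form, such as W (walk) lines, would need an analogous parser. The function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def spell_paths(gfa_lines):
    """Reconstruct the sequence of every P-line path in a GFA 1 graph.

    Assumption: consecutive segments on a path do not overlap, so the
    haplotype is the plain concatenation of oriented segment sequences.
    """
    segments, paths = {}, {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            paths[fields[1]] = fields[2].split(",")

    haplotypes = {}
    for name, steps in paths.items():
        parts = []
        for step in steps:
            segment, orientation = step[:-1], step[-1]
            seq = segments[segment]
            parts.append(seq if orientation == "+" else reverse_complement(seq))
        haplotypes[name] = "".join(parts)
    return haplotypes


# Toy graph with one SNP bubble (segment 2 = T, segment 3 = G).
toy_gfa = [
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tG",
    "S\t4\tTTA",
    "P\thapA\t1+,2+,4+\t*",
    "P\thapB\t1+,3+,4+\t*",
]
print(spell_paths(toy_gfa))
```

A graph without path records offers no equivalent of this walk, which is why the path-free representations are classified as lossy above.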

The GBZ tool further enhances storage efficiency by losslessly compressing GFA-formatted pangenome graphs with paths, achieving 3.5–5× space reduction for pggb and Minigraph-Cactus pangenomes.

Ease of Visualization and Interpretation

Visualizing large graphs (>100,000 nodes) is a significant challenge. For H104 pangenomes, only Bandage effectively visualized Minigraph and mdbg graphs (containing millions of nodes). For pggb, Minigraph-Cactus, and Bifrost H10 graphs, node and edge counts were reduced by collapsing smaller subgraphs (using gfatools).

Metadata and Annotation Integration

Enhancing pangenome graphs with omics data (regulatory regions, transcriptomics, CNVs) increases their biological relevance. Some tools offer basic annotation functionality. Bifrost allows data to be linked to graph vertices via its C++ APIs. pggb and Minigraph-Cactus graphs support annotation through odgi, by inserting paths or BED records. Minigraph and mdbg lack annotation features. Compatibility with linear reference-based methods is also desirable for metadata integration, because projecting graph data onto a reference genome enables downstream analyses in linear coordinates. Storing a reference genome within the pangenome graph is one potential solution. Variation graphs from pggb and Minigraph-Cactus, due to their directed acyclic nature and haplotype paths, inherently store coordinates for this task. Haplotype paths are crucial here because they avoid an extra step of mapping sequences back to the graph, and odgi provides tools for extracting and injecting information along them. Minigraph lacks haplotype paths, so sequences must be mapped to the graph to retrieve haplotype information. dBGs can use color data to record which k-mers belong to a reference, but they require additional storage of k-mer positions for full haplotype reconstruction.
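The idea that a stored reference path provides linear coordinates can be sketched in a few lines. The example assumes a path given as a list of oriented steps such as "1+" and a table of segment lengths; path_offsets is a hypothetical helper and does not correspond to an odgi command.

```python
def path_offsets(segment_lengths, path_steps):
    """Map each segment on a path to its 0-based start offset along that path.

    segment_lengths: dict mapping segment name -> length in bp
    path_steps:      list of oriented steps, e.g. ["1+", "2+", "4+"]
    Only the first visit of a segment is recorded; segments that are not
    on the path have no reference coordinate and are absent from the result.
    """
    offsets = {}
    position = 0
    for step in path_steps:
        segment = step[:-1]          # strip the trailing '+' or '-'
        offsets.setdefault(segment, position)
        position += segment_lengths[segment]
    return offsets


# Toy graph: segment 3 is an alternative allele that is off the reference path.
lengths = {"1": 3, "2": 1, "3": 1, "4": 3}
reference_path = ["1+", "2+", "4+"]
print(path_offsets(lengths, reference_path))
```

An annotation attached to segment "4" could then be reported at reference offset 4, while an annotation on segment "3" would need to be anchored through its neighbouring reference segments.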

Table 2. Feature comparison of pangenome graph construction tools.

Relative strengths of the five pangenome graph construction tools across key features: algorithm efficacy, variant retention, scalability, editability, stability, tool ecosystem, haplotype retention/efficiency, graph visualization/interpretation, region zoom/variant interpretation, annotation capabilities, and reference genome integration.
