3
Understanding Pathogenesis Through Pathogenomics and Bioinformatics
Gregorio Iraola and John D. Boyce

Introduction
Pathogenomics is the use of genome sequences and functional genomic techniques to inform pathogenesis research. The underlying theory of pathogenomics is that the genomes of pathogens change over time through point mutations, genomic deletions, DNA acquisitions and rearrangements, and that individual bacterial cells with different genetic makeups may have altered survival under different environmental and host conditions. Natural selection leads to amplification of the individuals that are most fit under a given condition and gives rise to populations with different virulence and/or host specificity characteristics (Figure 3.1). Over very long time scales, this can lead to the evolution of different species. Pathogenomics techniques aim to identify, understand, and explain how these genetic differences have arisen and how they impact the virulence phenotypes of the pathogen.

Bioinformatic analyses of pathogen genomes can give significant insight into the selective pressures, both in the environment and during disease, that have shaped evolution of the species and can identify genes crucial for pathogenesis. Detailed comparative analyses of related pathogens can identify specific genes that underlie their different virulence phenotypes. Additionally, analysis of gene expression and protein production of bacterial pathogens under different growth conditions, including natural and model infections, can link particular genes with specific functions and host interactions. Together, these studies can illuminate pathogen proteins crucial for successful infection and identify targets for therapeutic and/or prophylactic intervention. While pathogenomics can give important insights into pathogenesis even when only a limited number of genome sequences is available, it becomes increasingly powerful and flexible as more genomic information accumulates. Thus, the recent explosion in genome sequences has seen a concomitant rise in the use of pathogenomics techniques.

As with all bioinformatics-based methods, pathogenomics is generally used for the generation of new hypotheses. The predictions made from pathogenomics investigations allow new hypotheses to be proposed but, wherever possible, these need to be tested experimentally. Thus, pathogenomics generally does not replace classical experimental techniques but augments them. Importantly, the difficulty in completing pathogenomics analyses is generally independent of the genetic tractability of the species or strain being studied. These techniques have brought major improvements in the understanding of pathogenesis for those pathogens where classical molecular biology techniques have proved very difficult, such as Chlamydia trachomatis, Treponema pallidum, and Mycobacterium species (Nunes et al. 2013; Bottai et al. 2014; Jaiswal et al. 2020). Such techniques may even be applicable to species that currently cannot be cultured.

How Mutations Generate Bacterial Diversity
Bacterial genomes change over time as a result of mutational mechanisms and evolutionary forces such as natural selection (Figure 3.1). As discussed further in Chapter 2 (e.g. Figure 2.2), the primary mechanisms of genetic change are gene mutation, gene acquisition and loss, and genome rearrangements. Gene mutations generally occur via small-scale changes such as point mutations and small insertions and deletions. Point mutations occur primarily because of the low-level inaccuracy inherent in DNA replication and the chemical instability of DNA.
While replication fidelity is very high, low rates of incorrect base incorporation occur due to tautomeric DNA forms and wobble base pairing (Szymanski et al. 2017). These result in incorrect nucleotides being incorporated during replication. DNA bases also show a low level of chemical instability and are reactive with certain chemicals. Loss of a base to give an apurinic site occurs 100–1000 times per cell each day (Thompson and Cortez 2020). In addition, numerous chemicals can modify the structure of DNA by addition of amine or methyl groups; once modified, the hydrogen bonding profile of the base is altered, and this can lead to incorrect base pairing during future replication cycles.

All bacteria express multiple, partially redundant DNA repair systems. These systems include nucleotide excision repair, base excision repair, mismatch repair, recombination repair, and photoreactivation, so most replication errors are rapidly repaired. However, the systems are not perfect, and some mutations will not be corrected before a second round of replication, after which they will usually be unrepairable. When single-base mutations occur within the coding region of a gene they can result in missense mutations, which lead to amino acid substitutions that may affect protein function depending on their precise position and the particular amino acid change, or nonsense mutations, which lead to early termination of protein translation and almost always inactivate protein function. Small-scale insertion/deletion mutations (indels) can also occur during replication, especially in regions of simple sequence repeats or when intercalating chemicals are present. Such mutations within coding sequences can give rise to frameshift mutations that almost always inactivate protein function.

Small-scale point mutations and indels generate genotypic diversity in populations. Nonsense and frameshift mutations generally lead to loss of protein function. Such inactive genes are called pseudogenes; accumulation of pseudogenes (genome reduction) is a very common evolutionary step for pathogens (e.g. pathogens that have transitioned from free-living organisms to obligate intracellular pathogens), and genome reduction may be a common feature of the evolution of pathogenicity (Murray et al. 2021b).

An important class of frameshift mutations are those associated with phase variation, where the presence of regions of simple sequence repeats gives rise to a very high rate of frameshifting at these sites (van der Woude and Bäumler 2004). Some bacterial species have evolved to have these simple sequence repeats within the coding sequences or regulatory regions of genes that encode virulence, antigenic or regulatory proteins. Frameshift mutations at these sites generally result from slipped-strand mispairing during replication and result in expansion or contraction of the number of simple sequence repeats. If the repeats are within a coding sequence, and are not a multiple of three bases, expansion or contraction of the repeats leads to a frameshift mutation that usually switches the encoded protein from active to inactive or from inactive to active. Thus, the gene can cycle between producing an active or an inactive protein. The high rate of frameshift mutations at these sites means that any population of cells of a species in which phase variation occurs generally includes cells with in-frame and out-of-frame versions of these genes; some cells will therefore express an active version of the protein while others will not. This gives rise to genetic and phenotypic diversity in the population, allowing natural selection of the fittest individual cells in any particular host or environmental niche (van der Woude and Bäumler 2004). Some bacteria, such as Neisseria spp., may have more than 50 phase-variable genes and therefore 2^50 different possible protein expression combinations.
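The ON/OFF switching described above can be made concrete with a short script. The sketch below is purely illustrative (the flanking sequences, tract position and tract lengths are invented): it measures the length of a homopolymeric G tract within a coding sequence and reports whether a given tract length leaves the downstream codons in frame.

```python
# Minimal sketch of phase variation via a simple sequence repeat (here a poly-G tract).
# The flanking sequences and tract lengths are hypothetical, chosen only for illustration.

def tract_length(seq: str, start: int, base: str = "G") -> int:
    """Length of the homopolymeric tract beginning at position 'start'."""
    length = 0
    while start + length < len(seq) and seq[start + length] == base:
        length += 1
    return length

def phase_state(reference_tract: int, observed_tract: int) -> str:
    """A change in tract length that is not a multiple of 3 shifts the reading frame (gene OFF)."""
    return "in frame (ON)" if (observed_tract - reference_tract) % 3 == 0 else "frameshifted (OFF)"

flank5, flank3 = "ATGAAACCC", "TTTAAAGAT"            # hypothetical coding sequence around the tract
variants = {
    "reference (9 G)":   flank5 + "G" * 9 + flank3,
    "contraction (8 G)": flank5 + "G" * 8 + flank3,   # slipped-strand loss of one repeat unit
    "expansion (12 G)":  flank5 + "G" * 12 + flank3,  # gain of three repeat units keeps the frame
}

reference_len = 9
for name, seq in variants.items():
    observed = tract_length(seq, start=len(flank5))
    print(f"{name}: tract of {observed} G -> {phase_state(reference_len, observed)}")
```

Changes of one or two repeat units knock the gene out of frame, whereas changes that are a multiple of three preserve it, which is why repeat tracts of this kind behave as reversible ON/OFF switches.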
Gene acquisition via horizontal gene transfer is also a major driver of bacterial evolution and has played a crucial role in the evolution of virulence and antimicrobial resistance. Bacteria can acquire DNA via three primary mechanisms: transformation, transduction and conjugation (Sun 2018; see also Chapter 2). Once DNA has been transferred via one of these mechanisms, its fate depends on the DNA sequence and its ability to replicate independently and/or the likelihood of recombination into the recipient cell genome. Plasmids are independently replicating DNA molecules, so transfer of these molecules often results in long-term acquisition as a DNA molecule separate from the host chromosome, although successful acquisition depends on whether the replication machinery of the plasmid is compatible with the new host. Many plasmids are known to encode genes crucial for the virulence of the host strain (Tivendale et al. 2009; Zhong 2017). Many mobile DNA elements, such as insertion sequences, transposons and lysogenic bacteriophages, encode proteins that direct their own integration into the host chromosome. Therefore, uptake of these DNA sequences also usually results in long-term acquisition, with the sequences becoming part of the host chromosome. Numerous virulence genes, including many encoding toxins, are carried on bacteriophages (Penadés et al. 2015), and antibiotic resistance genes are commonly carried on mobile elements. DNA that lacks genes encoding replicative and/or integrative systems can still become integrated into the host genome via homologous recombination under some circumstances, although this likely occurs infrequently and rates will be related to the similarity between the incoming DNA sequence and the host genome. Naturally competent species often show a strong bias for taking up and incorporating DNA from related species and strains (Mell and Redfield 2014).

In addition, bacteria have evolved a system whereby some incoming DNA is digested and integrated into a specific position on the genome. Clustered regularly interspaced short palindromic repeats (CRISPR) are found in the majority of bacteria. Incoming DNA is digested and short fragments are integrated into the CRISPR regions as spacers between the repeat sequences. Once integrated into the genome, these sequences function as identifiers of incoming DNA and direct the action of the CRISPR-associated (Cas) nucleases to degrade related incoming DNA, in what is probably primarily a defense mechanism against invading DNA (Rath et al. 2015). Thus, these repeats form a rudimentary "immune system" by allowing bacteria to respond to and inactivate incoming DNA.
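Because a CRISPR array is simply an alternation of identical direct repeats and variable spacers, spacer extraction is trivial once the repeat is known. The toy sketch below uses invented repeat and spacer sequences and is meant only to illustrate the architecture described above; real arrays are identified with dedicated tools (e.g. CRISPRCasFinder).

```python
# Toy illustration of CRISPR array structure: direct repeats alternating with spacers.
# Both sequences are invented for demonstration; real repeats and spacers are each
# typically a few tens of bases long.

repeat = "GTTTTAGAGCTATGCT"  # hypothetical direct repeat

# Hypothetical array: repeat - spacer - repeat - spacer - repeat
array = (
    repeat + "ACGTACGTTTGCAAGGTCCAT" +
    repeat + "TTGACCGGATCAATCGGAACT" +
    repeat
)

# Splitting on the repeat leaves the spacer sequences (empty strings mark the flanking repeats).
spacers = [s for s in array.split(repeat) if s]

for i, spacer in enumerate(spacers, start=1):
    print(f"spacer {i}: {spacer} ({len(spacer)} bp)")
```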
Genome Sequencing Technologies
Accurate genome sequencing underpins all pathogenomic investigations. Early bacterial genome sequences (approximately pre-2005) were completed with Sanger dideoxy sequencing, which is seen as the first-generation sequencing standard. This sequencing by synthesis method relies on the use of fluorescently labeled dideoxynucleotides (ddNTPs) in addition to normal deoxynucleotides (dNTPs) during DNA synthesis. As the ddNTPs lack the 3′OH group, which is essential for forming the phosphodiester bond with the next incoming base, insertion of a ddNTP always terminates synthesis. Thus, DNA synthesis in the presence of ddNTPs generates fragments of different lengths, with each individual synthesized molecule terminated following random incorporation of a fluorescently labeled ddNTP. Separation of the fragments by size (through an acrylamide gel or capillary) allows the nucleotide sequence of the synthesized strand to be determined by reading the fluorescent colors of all the fragments from smallest to largest.

Sanger dideoxy sequencing is an exceptionally accurate sequencing method (< 1 error/100 000 bases) but has a number of features that make it less than ideal for rapid and cheap whole-genome sequencing, because it is difficult to parallelize (Shendure and Ji 2008). First, the ability to determine sequences using this method relies on being able to separate all DNA fragments that differ by only a single base. This becomes increasingly difficult for larger fragments and means that Sanger sequencing has a maximum read length of around 1000 bp. The fragment separation step is also very difficult to automate, with standard machines able to run only 96 samples at a time. Second, as all sequencing reactions have to start from a known sequence (the initiating primer binding site), Sanger sequencing of whole genomes usually requires shotgun cloning of genomic fragments into one or more plasmid libraries, so that sequencing can begin from a known plasmid sequence. Generation of such libraries can be time consuming, and invariably some genomic fragments cannot be cloned. Given that the length of bacterial genomes is in the order of 1–10 Mbp, many thousands of Sanger sequencing reactions must be completed to cover all genomic DNA fragments. These individual sequences must then be assembled in the correct order to recreate the original genome sequence.
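The scale of the task can be estimated with the standard coverage relation C = N × L / G (average coverage equals the number of reads times read length divided by genome size). The short sketch below uses illustrative numbers only (a hypothetical 5 Mbp genome, 800 bp Sanger reads, 150 bp Illumina reads) to show why tens of thousands of Sanger reactions were needed per bacterial genome and why much higher parallelization is required once reads become shorter.

```python
# Illustrative coverage arithmetic: C = N * L / G, so N = C * G / L.
# The genome size, read lengths and coverage targets are example values, not from this chapter.
import math

def reads_required(genome_size_bp: int, read_length_bp: int, target_coverage: float) -> int:
    """Number of reads needed to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)

genome = 5_000_000  # a typical ~5 Mbp bacterial genome (illustrative)
print("Sanger, 800 bp reads, 8x coverage:   ",
      reads_required(genome, 800, 8), "reads")
print("Illumina, 150 bp reads, 50x coverage:",
      reads_required(genome, 150, 50), "reads")
```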
The early years of the twenty-first century saw an explosion in new sequencing technologies. These second-generation sequencing methods were developed to overcome the main problems with Sanger sequencing, namely the gel separation of labeled fragments and the use of shotgun plasmid fragment libraries, so that the time and cost of whole-genome sequencing could be reduced. These new sequencing methods included 454 (454 Genome Sequencer), SOLiD (Applied Biosystems), Ion Torrent (Ion Torrent Systems Inc.) and Solexa/Illumina (Illumina Genome Analyzer; Slatko et al. 2018). All these methods used sequencing by synthesis but varied considerably in their specific mechanisms. Common elements included shearing of the genomic DNA and ligation of common oligonucleotides (linkers/adaptors) to the ends of all fragments, to give common sequences for fragment amplification and binding of the initiating sequencing primer. All methods also included a step for clonal amplification of fragments to allow for detection of each base addition.

The most important of these methods for genome sequencing has been the Illumina method. In this method, genomic DNA is sheared into approximately 300–600 bp fragments and common linkers/adapters are ligated onto the ends of each fragment. The fragments are then bound to a solid surface (the flow cell), which is covered in a lawn of primers complementary to the adapters ligated to the genome fragments. The presence of primers on the flow cell that are complementary to both ends of the fragments allows for the clonal amplification of each fragment in the local vicinity via a method called bridge PCR or bridge amplification (Slatko et al. 2018). The sequencing steps use fluorescently labeled nucleotides that are chemically capped so that only one nucleotide can be added at a time. Following each nucleotide addition, the flow cell is imaged, and the color and position of every spot recorded. The nucleotides are then chemically uncapped and the cycle of nucleotide addition, imaging and uncapping is repeated until the sequence is determined. Initially, read length was only around 30 nt, but improvements in chemistry have increased read length to around 300 nt. This sequencing method allows for massive parallelization, with the number of fragments that can be simultaneously sequenced primarily determined by the resolution of the imaging system; current systems can resolve billions of individual fragments on a single flow cell. While this and the other second-generation sequencing methods have massive throughput and low cost, they generate shorter read lengths and have a higher error rate (approximately 1 error/200 bases) than Sanger sequencing (Stoler and Nekrutenko 2021). However, as the Illumina error profile is mostly random, the majority of errors can be identified and removed by having high read depth. Nevertheless, the short read length makes assembly problematic across regions of sequence repeats.

The most recently developed sequencing technologies aim to overcome the short read length and assembly problems associated with second-generation methods. Two third-generation methods, Single Molecule Real-Time sequencing (SMRT; Pacific Biosciences) and Oxford Nanopore Technologies (ONT) sequencing, both produce individual sequencing reads greater than 10 kb, which significantly improves genome assembly (Slatko et al. 2018). SMRT is a sequencing by synthesis method that uses a single tethered polymerase molecule per well and fluorescently labeled natural nucleotides. In this method, the fluorescent label is attached to the phosphate end of the dNTP, so it is lost after each base addition, allowing the polymerase to rapidly extend a natural nucleotide sequence; the short burst of fluorescence is captured in real time during each base addition. An additional benefit of this system is that methylated bases can be identified by virtue of the slight increase in the time taken to add a base opposite a methylated base; thus, SMRT sequencing can be used to determine the methylome during genome sequencing. The ONT system is currently the only commonly used method that does not rely on sequencing by synthesis. Instead, this system passes the DNA strand to be sequenced through a protein pore in a synthetic membrane across which there is a constant applied voltage. Each base occludes current flow differently, allowing the sequence to be determined. Both SMRT and ONT systems currently have significantly higher error rates than the second-generation methods (> 1 error/50 bases), although error rates are continually improving. Given the very long reads but high error rates of third-generation methods, current genome sequencing projects often combine both technologies and use mixed-model assembly methods. Thus, an assembled complete genome sequence may be constructed from sequences generated by one of the third-generation methods and then error-corrected using Illumina sequences; however, methods for high-accuracy assembly of third-generation sequencing data alone are continually improving (Salmela et al. 2016).
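To see why high read depth compensates for a mostly random error profile, consider a deliberately simplified model in which each read covering a position is wrong independently with probability e and the consensus base is taken by majority vote. The sketch below computes an upper bound on the chance that the consensus is wrong (it counts any position where at least half the reads are erroneous as a failure, which overstates the true risk, since the erroneous reads would also have to agree on the same wrong base). This is a back-of-the-envelope illustration, not how real variant callers or polishing tools work.

```python
# Back-of-the-envelope model of consensus accuracy versus read depth.
# Simplifying assumption: each read base is wrong independently with probability e, and the
# consensus fails only if at least half of the reads at a position are erroneous (upper bound).
from math import comb

def consensus_error_upper_bound(depth: int, per_base_error: float) -> float:
    e = per_base_error
    k_min = (depth // 2) + (depth % 2)  # smallest number of erroneous reads that is >= depth/2
    return sum(comb(depth, k) * e**k * (1 - e)**(depth - k) for k in range(k_min, depth + 1))

for depth in (1, 5, 10, 20, 30):
    p = consensus_error_upper_bound(depth, per_base_error=0.005)  # ~1 error per 200 bases
    print(f"depth {depth:2d}: consensus error <= {p:.2e}")
```

The bound falls off extremely quickly with depth, which is why oversequencing corrects random errors, but, as noted above, it cannot resolve repeated sequences.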
Genome Assembly
As noted above, sequencing of bacterial isolates using the Illumina technology generates millions of short DNA fragments (100–300 bp) that need to be assembled correctly prior to performing any detailed pathogenomic investigations (Figure 3.1). Genome assembly refers to the process of taking this large number of short DNA sequences (typically termed reads) and putting them back together in the correct order to create a representation of the original DNA molecule (bacterial chromosome, plasmids, etc.) from which they originated. This process relies on computational programs (genome assemblers) that take short-read data as input and try to reconstruct longer, contiguous sequences called contigs.

Genome Assembly Using Short-Read Data
Popular assemblers such as Velvet (Zerbino and Birney 2008) or SPAdes (Bankevich et al. 2012) use De Bruijn graphs to perform the assembly. These are directed graphs whose vertices are substrings of a defined length (known as k-mers) generated from the sequencing reads. An edge connects two vertices if they occur as consecutive k-mers in a read, that is, if they overlap by all but one base. To obtain contigs, the assembler traverses the graph to identify unambiguous paths, in which each vertex is connected to only a single following vertex; a contig is extended along such a path and is terminated when this condition is no longer satisfied, for example at a branch point (a minimal illustration is sketched at the end of this subsection).

One fundamental limitation of reconstructing genomes using short-read data is the impossibility of recovering the complete sequence as it naturally occurs in the DNA molecule. This is mainly caused by the presence of genomic repeats, defined as DNA segments repeated in nearly identical form across the genome. As the short reads produced from these regions will be nearly identical, genome assemblers cannot resolve a single path in the De Bruijn graph, resulting in the termination of the contig. In prokaryotic genomes, repeated regions have several biological origins, including transposable elements, prophages, plasmids, highly conserved gene clusters (e.g. the ribosomal RNA operons) and multi-copy gene families, such as the PE/PPE gene family in Mycobacterium tuberculosis, which accounts for up to 10% of its genome (Brennan 2017). These repeated regions cannot be resolved simply by oversequencing, and they lead to significant ambiguity in the reconstruction of complex prokaryotic genomes such as those from highly recombinogenic species. For example, using sequence reads as short as 30 nt allows accurate reconstruction of only 75% of the Escherichia coli genome in contigs longer than 10 kb, with 96% of genes covered in a single contig. This incomplete reconstruction compromises the high-resolution analysis of relevant aspects of bacterial biology, such as horizontal gene transfer and the evolution of antimicrobial resistance genes.
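To make the De Bruijn graph idea concrete, the toy sketch below builds a graph of k-mers from short, error-free reads simulated from an invented sequence containing a duplicated segment, and then walks unambiguous paths to emit contigs. It is purely illustrative: real assemblers such as Velvet and SPAdes additionally handle sequencing errors, reverse complements, read-pair information and much more.

```python
# Toy De Bruijn graph assembler: vertices are k-mers, edges join consecutive k-mers in a read.
# Contigs are maximal unambiguous (non-branching) paths. Purely illustrative: no sequencing
# errors, no reverse complements, and the "genome" below is invented.
from collections import defaultdict

def build_graph(reads, k):
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            a, b = read[i:i + k], read[i + 1:i + 1 + k]
            succ[a].add(b)
            pred[b].add(a)
            succ.setdefault(b, set())
            pred.setdefault(a, set())
    return succ, pred

def contigs(succ, pred):
    out = []
    for node in succ:
        # Start a contig at any node that is not in the interior of a simple path.
        if len(pred[node]) != 1 or len(succ[next(iter(pred[node]))]) != 1:
            contig, current = node, node
            # Extend while the path is unambiguous (one way out, one way in).
            while len(succ[current]) == 1:
                nxt = next(iter(succ[current]))
                if len(pred[nxt]) != 1:
                    break
                contig += nxt[-1]
                current = nxt
            out.append(contig)
    return out

# Invented "genome" containing the 8 bp segment GAATTCCA twice; dense, error-free 12 bp reads.
genome = "ATGCGT" + "GAATTCCA" + "TTGCAG" + "GAATTCCA" + "CATAGC"
reads = [genome[i:i + 12] for i in range(len(genome) - 11)]

for c in sorted(contigs(*build_graph(reads, k=5)), key=len, reverse=True):
    print(len(c), c)
```

The duplicated segment collapses into a single short contig and terminates the contigs on either side of it, mirroring how genomic repeats fragment real short-read assemblies.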
Genome Assembly Using Long-Read Data
The use of the longer reads produced by ONT and SMRT technologies (typically ranging from 1 kb to > 1 Mb) allows the major limitation of short-read sequencing, incomplete assembly, to be overcome. However, due to the high error rates of early long-read sequencing chemistries, hybrid assembly has emerged as a promising strategy to generate fully resolved and accurate bacterial genome assemblies. In hybrid assemblies, both short-read and long-read sequencing datasets are used: the error-prone long reads provide information about the structure of the genome, while the accurate short reads facilitate detailed assembly at the single-nucleotide level, allowing error correction of the long reads. The hybrid assembly tool Unicycler has been shown to outperform other hybrid assemblers in generating fully closed genomes and is today one of the most widely used tools for this purpose (Wick et al. 2017).

The constant improvement of long-read sequencing chemistries and base-calling algorithms has significantly reduced the error rate of these technologies. This has opened the possibility of generating long-read-only assemblies with higher accuracy. A comparison of several state-of-the-art long-read assemblers concluded that Flye (Kolmogorov et al. 2019), Miniasm/Minipolish (Li 2016) and Raven (Vaser and Šikić 2021) are the best methods for assembling prokaryotic genomes (Wick and Holt 2019). Flye is particularly useful for assembling plasmids and/or when overall sequencing depth is low, Miniasm/Minipolish is more reliable for producing fully circularized contigs, and Raven allows reliable chromosome assembly and is tolerant of low-identity reads.
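In practice these assemblers are run from the command line. The sketch below shows how a hybrid Unicycler run and a long-read-only Flye run might be scripted from Python; the file names are placeholders, and the flags reflect the tools' documented interfaces at the time of writing, so they should be checked against the help output of the installed versions.

```python
# Minimal sketch of scripted assembly runs. File names are placeholders; verify the options
# against the documentation of your installed Unicycler and Flye versions.
import subprocess

# Hybrid assembly with Unicycler: paired-end short reads plus long reads.
subprocess.run(
    ["unicycler",
     "-1", "short_reads_R1.fastq.gz",   # Illumina forward reads (placeholder name)
     "-2", "short_reads_R2.fastq.gz",   # Illumina reverse reads (placeholder name)
     "-l", "long_reads.fastq.gz",       # ONT or SMRT long reads (placeholder name)
     "-o", "unicycler_assembly"],       # output directory
    check=True,
)

# Long-read-only assembly with Flye (here assuming raw ONT reads).
subprocess.run(
    ["flye",
     "--nano-raw", "long_reads.fastq.gz",
     "--out-dir", "flye_assembly",
     "--threads", "8"],
    check=True,
)
```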
Gene Prediction and Annotation
Gene prediction refers to the process of identifying the segments of a genome that encode genes (Figure 3.1). This typically includes protein-coding genes and RNA genes but may also include prediction of additional functional elements such as promoters and other regulatory regions. In prokaryotic genomes, genes that encode a single protein typically occur as a single contiguous open reading frame (ORF) spanning from hundreds to a few thousand bp (around 1000 bp on average). As translational start and stop codons are well characterized, prokaryotic gene prediction based on these features is relatively straightforward using automated tools that achieve high levels of accuracy. Complementary sequence signals derived from initially detected ORFs are also used for prediction, including compositional statistics (k-mer counts, GC content, etc.) and sequence and frame length. A combination of all of these features is used by Prodigal, one of the most popular and straightforward software tools developed so far for prokaryotic gene prediction (Hyatt et al. 2010).

Following the identification of ORFs, genome annotation is the process of giving biological meaning to the key features of a genome, in particular the genes and their products. Currently, genome annotation is mostly performed automatically by comparing each predicted gene to sequences from several databases. Protein or RNA sequence databases are used to identify sequence similarities with previously known genes from reference genomes. One example is the National Center for Biotechnology Information (NCBI) Reference Sequence database, which collects > 200 000 curated prokaryotic genomes encompassing annotations for > 150 million protein-coding genes (Li et al. 2021). Protein or RNA domain databases, like Pfam (El-Gebali et al. 2019), Rfam (Kalvari et al. 2018) or the NCBI Conserved Domain Database (Yang et al. 2020), comprise protein and RNA domains conserved during evolution. These provide conserved functional site annotations of protein and RNA sequences. Evolutionarily conserved domains help transfer the functional annotation from a known domain model to an unknown sequence and group sequences into superfamilies, families and subfamilies according to their molecular functions.

Metabolic databases are used for the reconstruction of metabolic pathways or biosynthetic gene clusters. The Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa et al. 2017) aims to assign functional meanings to genes and genomes at both the molecular and higher levels. Molecular-level functions are maintained in the KEGG Orthology database, where each definition is built on a specific set of orthologous genes. Higher-level functions are represented by networks of molecular-level interactions, reactions and relations that are placed on the KEGG Pathway maps (Kanehisa et al. 2017). The Clusters of Orthologous Genes (COGs) database has been a popular tool for functional annotation of bacteria and archaea for the past 25 years. It consists of a minimalist set of around 5000 COGs that describe the most widespread bacterial and archaeal genes (Galperin et al. 2021). The eggNOG (Evolutionary Genealogy of Genes: Non-Supervised Orthologous Groups) database has recently become one of the most popular resources for functional annotation of microbial genomes. This database is a public resource built on a set of thousands of genes that are analyzed together to determine the orthologous relationships between them. Compared with similar databases, eggNOG provides: (i) comprehensive functional annotation for the inferred orthologs, (ii) predictions across thousands of bacterial and archaeal genomes, and (iii) hierarchical resolution of orthologs based on phylogenetic analysis (Huerta-Cepas et al. 2019).

Currently, all steps, from gene prediction on raw genome assemblies to functional annotation against the above-mentioned databases, are automated by software tools that can be run locally on the command line or through remote applications. There are a handful of online annotation web servers, such as the NCBI Prokaryotic Genome Annotation Pipeline (https://www.ncbi.nlm.nih.gov/genome/annotation_prok), the Rapid Annotation using Subsystem Technology (RAST) server (https://rast.nmpdr.org), MicrobeAnnotator (https://github.com/cruizperez/MicrobeAnnotator) or DFAST (https://dfast.ddbj.nig.ac.jp), that provide fully automated, easy-to-use pipelines for comprehensive functional annotation without the need for high-level bioinformatic expertise. The most popular and widely used tool for prokaryote genome annotation on the command line is Prokka (Seemann 2014), which has become the standard software for this purpose due to its efficiency; it can annotate a typical bacterial genome in less than 10 minutes on a standard desktop computer. More recently, a newly developed, command-line driven version of DFAST (Tanizawa et al. 2018) also enables rapid and straightforward submission of annotated genomes to public repositories such as NCBI.

Currently, annotation relies almost entirely on comparison of novel sequences with sequences of known function; thus, novel functions cannot be predicted from sequence alone. This is an important impediment to complete genome annotation. The recent development of the AI software AlphaFold, which can predict 3-D protein structure from amino acid sequence with greatly improved accuracy, may allow for the development of more comprehensive annotation strategies, although this is not yet widely implemented (Perrakis and Sixma 2021).
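Command-line annotation is similarly simple to script. The sketch below shows how a Prokka run might be launched and its GFF3 output summarized by feature type; the file names and prefix are placeholders, and the options should be verified against the installed Prokka version.

```python
# Minimal sketch: annotate an assembly with Prokka and summarize the resulting GFF3.
# File names and the output prefix are placeholders; check flags against your Prokka version.
import subprocess
from collections import Counter

subprocess.run(
    ["prokka",
     "--outdir", "prokka_annotation",   # output directory
     "--prefix", "isolate1",            # base name for output files (placeholder)
     "--cpus", "4",
     "assembly.fasta"],                 # contigs from the assembly step (placeholder)
    check=True,
)

# Count annotated feature types (CDS, tRNA, rRNA, etc.) from the GFF3 output.
feature_types = Counter()
with open("prokka_annotation/isolate1.gff") as gff:
    for line in gff:
        if line.startswith("##FASTA"):   # Prokka appends the sequences after this marker
            break
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            feature_types[fields[2]] += 1

for ftype, count in feature_types.most_common():
    print(f"{ftype}\t{count}")
```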
An important caveat of genome annotation, when the underlying assemblies are derived from short-read data alone, is that the draft genomes may contain dozens of disconnected contigs, and therefore many genes will be fragmented at the contig boundaries. This makes it difficult for annotation software to annotate these genes correctly, resulting in multiple gene fragments placed on different contigs being annotated with the same descriptions. This may lead to overestimation of the total gene count and of individual gene copy numbers. Some tools, such as GenAPI (Gabrielaite and Marvig 2020), have been developed to mitigate this, but there is no simple way for annotation software to overcome the limitation of fragmented assemblies. The best solution is to improve genome contiguity by using hybrid assemblies, followed by reannotation.
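A quick way to gauge how much fragmentation affects an annotation is to flag coding sequences whose coordinates touch a contig end. The heuristic sketch below uses placeholder file names and an arbitrary 100 bp margin: it reads contig lengths from the assembly FASTA and CDS coordinates from a GFF3 file, then lists features that may be gene fragments created by contig breaks.

```python
# Heuristic sketch: flag CDS features lying within `margin` bp of a contig end, which are
# likely to be fragments of genes broken at contig boundaries. File names are placeholders.
margin = 100

# Contig lengths from the assembly FASTA.
contig_len = {}
name = None
with open("assembly.fasta") as fasta:
    for line in fasta:
        if line.startswith(">"):
            name = line[1:].split()[0]
            contig_len[name] = 0
        elif name:
            contig_len[name] += len(line.strip())

# CDS coordinates from the annotation GFF3 (columns: seqid, source, type, start, end, ...).
candidates = []
with open("annotation.gff") as gff:
    for line in gff:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        contig, start, end = fields[0], int(fields[3]), int(fields[4])
        if start <= margin or end >= contig_len.get(contig, 0) - margin:
            candidates.append((contig, start, end, fields[8].split(";")[0]))

print(f"{len(candidates)} CDS features within {margin} bp of a contig boundary:")
for contig, start, end, ident in candidates:
    print(f"  {contig}:{start}-{end}\t{ident}")
```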
Defining Prokaryotic Species from Genomes