Functional Impact of the Genome
Genomic DNA contains the “informational blueprint” of an organism, which has now been elucidated in the horse by sequencing of the equine genome (Wade et al., 2009). Understanding the relationship between the primary DNA sequence and biological function, however, requires the identification and characterization of functional sequence elements together with their regulation and interactions.
Functional elements of the encoded genome
Genomic DNA contains many different functional elements. Three broad functional classes are: (1) protein-coding genes, (2) non-coding RNA genes, and (3) regulatory sequences associated with transcription (The ENCODE Project Consortium, 2004). Annotation of these elements is critical to interpreting the information contained within the genome sequence and applying it to the study of equine biology.
Protein-coding genes are often the best-characterized functional elements in a genome. The first level of structural annotation is a gene’s location or genomic interval. Important features within this interval include the transcriptional start and stop sites, exon/intron boundaries, translational start and stop sites, conserved domain sequences, polyadenylation signals, and the 5′ and 3′ untranslated regions. Any change to these elements or their specific combination has the potential to alter the structure and function of the encoded gene product or expression characteristics. The gold standard for the annotation of protein-coding gene structure is the generation and alignment of full-length cDNA sequence (Brent, 2008). Alternatively, sequence homology and in silico sequence analyses can be used to accurately predict and annotate gene structures. After sequencing and assembly of the equine genome (Wade et al., 2009), these strategies were used with the Ensembl Automatic Gene Annotation System (Curwen et al., 2004) and NCBI Gnomon eukaryotic gene prediction tool (http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.shtml) to predict equine gene structure. Both approaches are computational pipelines that rely on heuristic sequence alignment algorithms (BLAST) and ab initio gene modeling. The resulting predicted gene sets represented an important development in horse genomics, describing for the first time the “whole” set of equine protein-coding genes. At the time of these analyses, however, there were very limited equine expressed sequence data available in public databases to facilitate gene structure predictions (only 35,702 ESTs and 1,236 mRNA transcripts, UCSC Genome Browser). As a result, a substantial majority of the predictions (20,436 genes from Ensembl; 17,610 genes from NCBI) were based on the projection of gene structure from other mammalian species. 
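The structural features listed above (genomic interval, exon/intron boundaries, translational start and stop sites, and untranslated regions) can be pictured as a simple data structure. The following is a minimal sketch, not the representation used by Ensembl or NCBI Gnomon; the class name `GeneModel`, its fields, and the toy coordinates are all hypothetical, and a forward-strand gene with 1-based inclusive coordinates is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); forward strand assumed


@dataclass
class GeneModel:
    """Hypothetical container for one transcript's structural annotation."""
    name: str
    chrom: str
    exons: List[Interval]  # sorted, non-overlapping exon intervals
    cds_start: int         # first base of the start codon
    cds_end: int           # last base of the stop codon

    def introns(self) -> List[Interval]:
        """Introns are the gaps between consecutive exons."""
        return [(prev_end + 1, next_start - 1)
                for (_, prev_end), (next_start, _) in zip(self.exons, self.exons[1:])]

    def utr_lengths(self) -> Tuple[int, int]:
        """Return (5' UTR, 3' UTR) lengths counted in exonic bases."""
        five = sum(max(0, min(e, self.cds_start - 1) - s + 1) for s, e in self.exons)
        three = sum(max(0, e - max(s, self.cds_end + 1) + 1) for s, e in self.exons)
        return five, three

    def coding_length(self) -> int:
        """Exonic bases from the translational start through the stop codon."""
        return sum(max(0, min(e, self.cds_end) - max(s, self.cds_start) + 1)
                   for s, e in self.exons)


# Toy three-exon gene with invented coordinates (not a real equine locus).
toy = GeneModel("toy_gene", "chr1",
                exons=[(100, 199), (300, 399), (500, 650)],
                cds_start=150, cds_end=601)

toy_introns = toy.introns()
toy_utrs = toy.utr_lengths()
toy_coding = toy.coding_length()
```

Shifting any single boundary in such a model (for example, an exon end or the translational start) changes the derived intron, UTR, or coding lengths, which is one way to see why errors in structural annotation propagate into the predicted gene product.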
Most recently, messenger RNA sequencing (RNA-seq) experiments have expanded available equine transcriptome data several hundred fold and enabled many refinements to the in silico predictions (Coleman et al., 2010b). Revised consensus models of equine protein-coding gene structure have been generated from these efforts (http://macleod.uky.edu/equinebrowser/) and will continue to improve.
Protein-coding genes account for only a small portion of the entire equine genome sequence (1–2%). Recent findings suggest that a majority of the remaining genome may also be transcribed into a diverse population of different RNA species. These RNAs are collectively referred to as non-coding (i.e., they do not encode a protein), though they clearly have a number of important functions. Several of the non-coding RNA types are well known (rRNA, tRNA, snoRNA, and miRNA), with functions that include involvement in the translation of messenger RNA into protein. However, the population of identified non-coding RNAs has grown substantially in recent years to comprise more than 30 different classes of transcripts. Most are believed to have regulatory functions, though their specific roles remain an area of very active investigation. Annotation of these elements will be required for a full understanding of transcriptional and translational dynamics and of how information stored in the genome is accessed, regulated, and put to functional use. Commonly used approaches for locating and annotating non-coding transcripts include the alignment of experimentally derived sequence, either from a tiling microarray or direct sequencing method, and computational sequence analysis to identify conserved sequence motifs associated with a specific class of non-coding RNA (Adelson & Raison, 2010). In the horse, Zhou et al. (2009) used a predictive and comparative approach to identify more than 400 novel equine microRNAs.
In addition to the transcribed regions, the genome contains a number of sequence elements that participate in the regulation of gene expression. These regulatory elements are categorized primarily as “cis-acting” elements (regulatory sequences in close proximity to the gene or genes they affect) and long-range elements (regulatory elements that exert influence at a distance). Transcriptional regulatory elements in the genome are recognized by, and in most cases interact with, “trans-acting” factors (transcription factors, accessory proteins, RNAs, enzymes, and metabolites) to regulate, influence, or modify transcriptional parameters. Names commonly applied to specific regulatory sequence elements include promoters, enhancers, repressors, insulators, locus control regions, and positive or negative molecular response elements. Although these regions are not expressed directly, their annotation is an important consideration in functional genomics.
Genetic interactions
Primary annotation of a segment of DNA within the genome often represents only the first level of functional assessment as it relates to phenotype and the biology of an organism. Higher-level understanding includes how individual DNA segments (or their encoded products) functionally interact. These interactions can be divided into two broad types. First, there are the interactions of different functional elements (regulatory sequences, protein-coding genes, non-coding RNAs) to generate a specific set of transcripts. Second, there are functional interactions involving two or more genes or their encoded products (epistasis). Systematic analysis of these genetic interactions helps provide a more complete understanding of pathways and networks involved in different biological processes.
Epigenetics/epigenomics
The genome also carries information that is not directly encoded by the nucleotide sequence, but which can clearly influence gene expression and other functional parameters. This information is generally referred to as epigenetic or epigenomic because it is carried outside of the primary genome sequence. Two main types of epigenetic modification that can influence gene expression are DNA methylation and chromatin structure. DNA methylation is the addition of a methyl group by a DNA methyltransferase to convert cytosine nucleotides to 5-methylcytosine. This methylation most commonly occurs at cytosines within CpG dinucleotides, which are clustered in regions called CpG islands, with high levels of methylation correlated with low levels of transcriptional activity. A subset of epigenomic parameters, including some specific patterns of methylation, can be transferred from parent to offspring. This process is referred to as genomic imprinting. Factors such as uterine environment can influence imprinting, as suggested by studies in mule and hinny pregnancies that result in different levels of equine chorionic gonadotrophin (Allen et al., 1993; Antczak et al., 2011). Epigenetic effects involving chromatin structure are often the result of histone modifications. The close association of histone structure and the DNA strand influences physical access of the transcriptional machinery to the nucleotides. DNA sequence stabilized in close association with histones can be sequestered from transcription while DNA in loose association is physically more accessible, leading to differential patterns of expression. Patterns of histone modification can be inherited, though specific mechanisms are not fully known. Indeed, many aspects of epigenetic modifications and their impact on functional genomics and phenotype remain a very active area of research (Beisel & Paro, 2011).
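To make the idea of a CpG-rich region concrete, the sketch below computes two quantities often used when scanning sequence for candidate CpG islands: GC content and the ratio of observed to expected CpG dinucleotides. The function name `cpg_stats` and the toy sequences are hypothetical. Any cutoffs applied to these values (a commonly quoted heuristic is GC content above 50% and an observed/expected ratio above 0.6 over a window of at least 200 bp) are an assumption added here and do not come from this chapter.

```python
def cpg_stats(seq: str):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence.

    Observed/expected CpG = (number of CpG * length) / (number of C * number of G).
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so a plain count is safe
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


# Invented toy sequences, far shorter than any realistic scanning window.
rich_gc, rich_ratio = cpg_stats("CGCGCGATAT")
poor_gc, poor_ratio = cpg_stats("ATATATAT")
```

In practice such statistics are computed in sliding windows along a chromosome, and methylation status itself has to be measured experimentally; sequence composition only indicates where CpG sites are concentrated.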
The Transcriptome
The transcriptome is the total sum of all RNAs transcribed from the genome. Until recently, the transcriptome was considered mainly as a “bridge,” serving as a mechanism to transfer information between genomic DNA and proteins (Costa et al., 2010). High-throughput sequencing and other technological advancements have dramatically expanded our understanding of transcriptome complexity (Lindberg & Lundeberg, 2010), which includes not just the messenger, ribosomal, and transfer RNAs, but also an expanding list of additional transcript classes (Table 8.1).
RNA class | Abbreviation |
Protein Synthesis | |
Messenger RNA | mRNA |
Ribosomal RNA | rRNA |
Signal Recognition Particle RNA | SRP-RNA |
Transfer RNA | tRNA |
Transfer Messenger RNA | tmRNA |
Promoter Associated Short RNA | PASR |
Transcription Start Site RNA | TSSa-RNA |
Transcription Initiation RNA | tiRNA |
Termini Associated Short RNA | TASR |
Transcriptional Modification and DNA Synthesis | |
Small Nuclear RNA | snRNA |
Small Nucleolar RNA | snoRNA |
SmY RNA | SmY |
Small Cajal Body-Specific RNA | scaRNA |
Guide RNA | gRNA |
Y RNA | – |
Telomerase RNA | – |
Regulatory | |
Antisense RNA | aRNA |
Natural Antisense Transcripts | NAT |
Natural Antisense Transcripts Small Interfering RNA | natsiRNA |
Long Non-coding RNA | lncRNA |
Micro RNA | miRNA |
snoRNA-derived RNA | sdRNA |
Piwi-interacting RNA | piRNA |
Small Interfering RNA | siRNA |
Transacting Small Interfering RNA | tasiRNA |
Repeat Associated Small Interfering RNA | rasiRNA |
Long-Interspersed Non-coding RNA | lincRNA |
Promoter Upstream Transcripts | PROMPT |
Cryptic Unstable Transcripts | CUT |
Catalytic (Ribozymes) | |
Ribonuclease P | RNase P |
Group I Self-Catalytic Intron | – |
Group II Self-Catalytic Intron | – |
Mammalian CPEB3 Ribozyme | CPEB3 Ribozyme |
Glucosamine-6-phosphate Activated Ribozyme | glmS Ribozyme |
Beta-globin Co-transcriptional Cleavage Ribozyme | CotC Ribozyme |
Profiling the equine mRNA transcriptome
The mRNA transcriptome refers to all the RNAs transcribed from the genome that code for proteins. Quantitative and qualitative assessment of gene expression on a transcriptome level enables broad analyses of gene expression and its collective impact on cellular activity in normal and pathological conditions (Hoheisel, 2006). The original “one gene → one transcript → one protein” model has been replaced by knowledge that a single protein-coding gene can in fact produce multiple distinct transcripts, explaining the apparent discrepancy in the number of different proteins compared to the number of protein-coding gene loci. Results from the ENCODE projects have shown that, on average, there are 5.4 transcripts generated from every protein-coding gene locus (Lindberg & Lundeberg, 2010). A number of genome-wide techniques have been developed to assess the mRNA transcriptome. These include the generation of expressed sequence tags (ESTs; Nagaraj et al., 2007), serial analysis of gene expression (SAGE; Velculescu et al., 1995), microarrays (Schena et al., 1995), and the analysis of mRNA by next-generation sequencing (RNA-seq; Wang et al., 2009).
To date, microarrays have served as a primary experimental method for analyzing gene expression in the horse on a transcriptome level. Objectives of these studies have been the identification of transcripts with differential patterns of expression associated with specific equine physiological or pathological traits (reviewed by Chowdhary & Raudsepp, 2008; Ramery et al., 2009). The earliest reported application of microarrays for equine gene expression analysis was conducted using a non-equine array, relying on sequence conservation across species to detect gene-specific hybridization. Mousel et al. (2002) profiled equine PBMC gene expression using a human cDNA microarray. Human microarrays were subsequently used to profile steady-state mRNA levels in equine testicular tissue (Ing et al., 2004), equine superficial digital flexor tendon (Nomura et al., 2007), chronic equine respiratory disease (Ramery et al., 2008), and in equine brain, liver, and articular chondrocytes (Graham et al., 2009). Mouse-specific microarrays have been used to profile gene expression in equine blood and muscle cells (Barrey et al., 2005; Barrey et al., 2006; Mucher et al., 2006). Budak et al. (2009) used a Bovine GeneChip to analyze laminar tissues in the hoof during the developmental phase of carbohydrate-overload-induced laminitis.
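The notion of a transcript with a differential pattern of expression can be illustrated with a log2 fold change between two groups of samples. This is a deliberately simplified sketch and is not the analysis used in any of the studies cited above: the gene labels, intensity values, pseudocount, and the cutoff of one log2 unit are all hypothetical, and real microarray analyses add normalization and statistical testing across replicates.

```python
import math
from statistics import mean


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of mean signal in treated versus control samples.

    The pseudocount (an assumption of this sketch) avoids division by zero
    and dampens ratios for very low signals.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Invented probe intensities: (treated replicates, control replicates).
signals = {
    "GENE_A": ([7, 7, 7], [1, 1, 1]),
    "GENE_B": ([3, 3, 3], [3, 3, 3]),
    "GENE_C": ([1, 1, 1], [7, 7, 7]),
}

fold_changes = {gene: log2_fold_change(t, c) for gene, (t, c) in signals.items()}

# Flag probes whose signal changes at least two-fold in either direction.
flagged = sorted(gene for gene, lfc in fold_changes.items() if abs(lfc) >= 1.0)
```

A positive value indicates higher signal in the treated group and a negative value indicates lower signal; a fold change on its own says nothing about statistical significance.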
Concurrently, efforts were under way in multiple laboratories to develop equine-specific microarray platforms constructed with probes representing a subset of the equine mRNA transcriptome. The first published report was by Gu and Bertone (2004) and included probes for 3,098 expressed equine sequences on an Affymetrix platform. Performance was evaluated using lipopolysaccharide-stimulated synoviocytes, and the array was subsequently used to investigate the pathology of equine musculoskeletal diseases and treatment strategies, and stem cell differentiation (Smith et al., 2006; Zachos et al., 2006; Santangelo et al., 2007; Yuan et al., 2008; Murray et al., 2010). The second equine microarray, a spotted cDNA array, had probes representing 1,000 expressed equine sequences and was used to examine the gene expression profiles of pro-inflammatory conditions (Vandenplas et al., 2005a) with particular emphasis on the in vitro effects of bacterial cell wall toxins on leukocyte gene expression (Vandenplas et al., 2005b). This array was subsequently expanded to represent 3,076 expressed equine sequences and used to study temporal aspects of equine laminitis (Noschka et al., 2009). The third equine-specific array had 9,322 spotted cDNA sequences representing 5,307 different genes (Figure 8.2; MacLeod, 2005). The cDNA sequences were isolated from an articular cartilage library with sequence identity determined using BLAST to define gene homology and DAVID to discern gene ontology.
This array has a fairly broad representation of expressed equine genes (MacLeod et al., 2003; Coleman et al., 2007; Mienaltowski et al., 2008b) and has been used for a number of transcriptional profiling experiments, including articular cartilage maturation and repair (Mienaltowski & MacLeod, 2006; Mienaltowski et al., 2008; Mienaltowski et al., 2009; Mienaltowski et al., 2010), optimal culture conditions for articular chondrocytes (Miura & MacLeod, 2006), identification of both stable and highly differentiated patterns of gene expression (Zhu et al., 2007; Tremblay et al., 2009; Vanderman et al., 2011), and muscle exercise physiology (McGivney et al., 2007; McGivney et al., 2009). Data produced by this array have also been used to develop statistical methods for the analysis of microarray data (Huang et al., 2008a, 2008b). Another equine-specific microarray developed on the Affymetrix platform with 12,320 probe sets has been used to investigate gene expression in articular cartilage from young horses (Nixon et al., 2008). This array was also used to compare articular cartilage to other equine tissues (Glaser et al., 2009). Finally, Barrey et al. (2009) constructed an array that included 334 probe sets for nuclear transcripts and 50 probe sets representing mitochondrial genome features. This array was used to study gene expression in skeletal muscle biopsies from horses with polysaccharide storage myopathy.
With completion of the equine reference genome sequence (Wade et al., 2009) and subsequent in silico gene structure predictions (Ensembl and NCBI), several groups initiated efforts to construct improved equine-specific microarrays, which for the first time approached an assessment of all protein-coding genes in the equine genome. Three arrays have been generated using the Agilent Custom Array platform. Miller et al. (2009) used available equine EST and UniGene sequences combined with structural predictions from NCBI to produce an array with 14,357 probe sequences. Array performance was assessed by comparing expression patterns between invasive and noninvasive trophoblast cells in the equine chorionic girdle and membrane. Klein et al. (2010) used the Ensembl gene structure predictions and expressed sequences generated by 454 pyrosequencing to construct an array with 43,803 probe sequences. The array was used for transcriptional profiling of equine endometrium during maternal recognition of pregnancy. Independently, Agilent developed a commercially available array using the Ensembl gene structure predictions. This array included 43,553 probe sequences and was also used to compare transcriptional profiles of equine endometrium during early pregnancy (Merkl et al., 2010). A fourth whole-genome array was developed using 70mer oligonucleotide probes designed from the equine reference genome sequence and mapped to gene ontology entries (Bright et al., 2009). This array has been used to generate expression profiling data in studies of laminitis (Wang et al., 2009; Wang et al., 2010) and recurrent airway obstruction (Kachroo et al., 2010).
Not all transcriptional profiling studies in the horse have been microarray-based. Suppression subtractive hybridization has been used to profile differentially expressed genes in wound repair (Lefebvre-Lavoie et al., 2005) and equine neonatal growth cartilage (Johannessen et al., 2007). Cappelli et al. (2005) used cDNA-amplified fragment length polymorphism techniques to study transcript profiles in different equine tissues. Illumina digital gene expression has been used to profile gene expression of leukocytes from horses with osteochondrosis (Serteyn et al., 2010).
Most recently, RNA-sequencing methods have been used to refine equine gene structure models and evaluate expression patterns across eight equine tissue samples (Coleman et al., 2010a; Coleman et al., 2010b; Vanderman et al., 2011). RNA sequencing or transcriptome shotgun sequencing (RNA-seq; Wang et al., 2009b) has been used to investigate the transcriptomes of several species including human, mouse, yeast, and Arabidopsis (Cloonan et al., 2008; Lister et al., 2008; Morin et al., 2008; Mortazavi et al., 2008; Nagalakshmi et al., 2008; Pan et al., 2008; Rosenkranz et al., 2008; Sultan et al., 2008; Wang et al., 2008; Wilhelm et al., 2008). It has been a transformative technology, generating quantitative and qualitative data concurrently on a transcriptome scale (Figure 8.3). A key benefit of RNA-seq is the unprecedented view it provides of alternative transcripts from the same gene, particularly in the ability to map splicing variants (Lindberg & Lundeberg, 2010). An intensive study of splicing patterns demonstrated that more than 90% of multi-exon human genes are subject to alternative splicing and that more than half of all splicing events occur in a tissue-restricted pattern (Wang et al., 2008). Significant efforts have been made to develop methods capable of identifying RNA splicing events. Computational tools now exist that use RNA-seq tags to detect splice junctions, notably MapSplice (Wang et al., 2010b), TopHat (Trapnell et al., 2009), MMES (Wang et al., 2010c), SplitSeek (Ameur et al., 2010), and SpliceMap (Au et al., 2010).
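The general idea of reading splice junctions from RNA-seq data can be sketched as follows. This is not the algorithm of MapSplice, TopHat, or any other tool named above, which must first discover spliced alignments. The sketch assumes reads have already been aligned and are described using SAM-format conventions: a 1-based leftmost reference position and a CIGAR string in which the `N` operation marks a skipped reference region, typically an intron. The function name and the example reads are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_alignment(chrom, pos, cigar):
    """List (chrom, intron_start, intron_end) for each 'N' gap in one alignment.

    pos is the 1-based leftmost aligned reference position; the returned
    intron coordinates are 1-based and inclusive.
    """
    ref = pos
    found = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            found.append((chrom, ref, ref + length - 1))
        if op in "MDN=X":  # only these operations consume reference bases
            ref += length
    return found


# Invented alignments: two reads span the same intron, one is unspliced.
reads = [
    ("chr1", 100, "50M200N50M"),
    ("chr1", 120, "30M200N70M"),
    ("chr1", 400, "100M"),
]

junction_support = Counter(
    junction
    for chrom, pos, cigar in reads
    for junction in junctions_from_alignment(chrom, pos, cigar)
)
```

Counting the number of reads that support each junction, as `junction_support` does, gives a rough measure of how often a given splice event is used in a sample; comparing such counts between tissues is one way tissue-restricted splicing can be examined.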