Another common practice among those who sequence and assemble genomes is to use animals with high homozygosity. This minimizes the considerable differences that exist, even within a species, between different versions of the same chromosome. Such intra-species variation presents extra challenges to the computational genome assembly process. Animals with high homozygosity can be found by selecting those with high levels of inbreeding. Twilight was chosen because, when tested with a set of DNA markers, she showed the highest level of homozygosity among the horses screened for this purpose.
The equine genome was sequenced using the whole-genome shotgun method. This process involves shearing the DNA into fragments and inserting these into clone libraries for replication and sequencing. Typically only the outermost 500–800 base pairs at each end of these “insert” fragments are read, and the resulting sequences are called paired-end reads. Because the two reads of a pair lie a known distance apart, fragments of different sizes create bridges of different lengths that can join assembled “contigs” together and span gaps in the assembly. Contigs that are joined unambiguously across gaps of known size into larger sequences are termed scaffolds. In the horse genome, insert fragments of 5,000 bases and 10,000 bases (both plasmid clones) and 40,000 bases (fosmid clones) were used for the sequencing. At the final stage, end sequences from Bacterial Artificial Chromosome (BAC) clones of around 150,000 bases, derived from a related male horse, were added to the assembly.
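The way paired-end reads fix the size of a gap between two contigs can be illustrated with a little arithmetic. The following is a minimal sketch, not the method used by the assembly team: the function names, the coordinate convention, and the toy numbers are all assumptions made for illustration. It supposes that one read of each pair lands on the left contig and its mate on the right contig, and that the insert size of the clone is known.

```python
from statistics import median

def gap_from_pair(insert_size, left_contig_len, left_read_start, right_read_end):
    """Estimate the gap implied by one read pair that bridges two contigs.

    insert_size      -- known length of the cloned fragment (bases)
    left_contig_len  -- length of the left contig (bases)
    left_read_start  -- position on the left contig where the insert begins
    right_read_end   -- position on the right contig where the insert ends
    """
    bases_in_left = left_contig_len - left_read_start  # insert bases inside the left contig
    bases_in_right = right_read_end                     # insert bases inside the right contig
    return insert_size - bases_in_left - bases_in_right

def estimate_gap(pairs, left_contig_len):
    """Combine several bridging pairs; the median damps outlying pairs."""
    return median(
        gap_from_pair(size, left_contig_len, start, end)
        for size, start, end in pairs
    )

# Hypothetical fosmid pairs: (insert_size, left_read_start, right_read_end)
toy_pairs = [
    (40_000, 85_000, 5_000),
    (40_000, 90_000, 10_500),
    (40_000, 95_000, 14_000),
]
print(estimate_gap(toy_pairs, left_contig_len=100_000))  # 20000
```

A negative result would indicate that the two contigs overlap rather than being separated by a gap. Real insert sizes vary around their nominal value, which is one reason to combine many pairs rather than trust a single one.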
Features of the Equine Genome Assembly
A high-quality draft assembly, designated EquCab2.0, was constructed with sequence coverage of 6.8-fold. Added contiguity was generated by the inclusion of BAC end sequences from a related male Thoroughbred horse, for which a BAC map had been produced by researchers in Germany (Leeb et al., 2006). More than 50% of the genome is contained in contiguous sequences longer than 112 kilobases (kb) and in scaffolds longer than 46 megabases (Mb). More than 95% of the euchromatic sequence could be anchored to the 64 (2N) equine chromosomes. Many features of the equine genome were similar to those of other mammals, but there were a number of notable differences. The genomes of human, cow, dog, and mouse were used for comparison based on their interesting population structures (Lander et al., 2001; Waterston et al., 2002; Lindblad-Toh et al., 2005; Liu et al., 2009).
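The statement that more than 50% of the genome lies in contiguous sequences longer than a given length corresponds to the statistic commonly called N50. The following minimal sketch shows how such a figure can be computed from a list of contig or scaffold lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the EquCab2.0 data.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    The N50 is the largest length L such that sequences of length L or
    longer together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths in kb (600 kb in total)
toy_contigs = [200, 150, 120, 80, 50]
print(n50(toy_contigs))  # 150
```

In this toy example the two longest contigs (200 kb and 150 kb) already hold 350 of the 600 kb, so the N50 is 150 kb.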
The estimated euchromatic genome size of Equus caballus, based on the total length of the scaffolds, lies between those of the dog (2.4 Gb) and human (2.9 Gb) genomes (Lander et al., 2001; Lindblad-Toh et al., 2005). Segmental duplications in the assembly, determined using standard methods (Mikkelsen et al., 2007), comprise less than 1% of the equine genome, and the majority of these (with ∼80% mapped to chromosomes) are intra-chromosomal duplications such as those seen in the dog and mouse genomes. Although the size of the genome based on the scaffold structure appears to be around 2.5 Gb, the assembly has many unplaced sequence reads, which suggests that the true genome size may be up to 2.7 Gb. The unassembled sequences are highly repetitive in nature.
Comparison with Genetic Maps
The task of the genome assembly team is to construct contiguous sequences and scaffolds from sequencing reads. The general equine genetic researcher is interested not only in the amalgamation of sequence reads into contigs and of contigs into scaffolds, but also in the chromosomal location of these scaffolds on the equine karyotype. This is accomplished by using the existing genetic maps for the horse (Guérin et al., 2003; Penedo et al., 2005; Swinburne et al., 2006; Raudsepp et al., 2008) to place markers with known chromosomal locations onto the equine assembly. This serves two purposes: (1) to ensure that the assembly is accurate and (2) to establish the order and orientation of sequences on the chromosomes. To order and orient the contigs and scaffolds correctly, at least two markers with known physical locations must be available for each scaffold: one marker can place a scaffold on a chromosome, but a second is needed to show which way round it lies. Where mapped markers are unavailable and the scaffold is of sufficient size (typically > 2 Mb), the scaffold is ordered and oriented by a different method, fluorescence in situ hybridization (FISH) mapping. In this process, fluorescently labeled genomic sequences are hybridized to equine karyotypes and an expert cytogeneticist makes the chromosomal assignment by observing the fluorescence (Lear et al., 1998). Ordering and orienting the scaffolds is the final step in the creation of the genome assembly. Once the genome is assembled, it becomes available to researchers for analysis.
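The need for two markers per scaffold can be made concrete with a short sketch. The code below is an illustration under simplifying assumptions, not the procedure used for the horse assembly: each marker is given as a hypothetical pair of (offset within the scaffold, position on the genetic map), all scaffolds are taken to lie on the same chromosome, and map positions are assumed to be error-free. The function names and the example data are invented for illustration.

```python
def orient_scaffold(markers):
    """Return '+' or '-' for a scaffold, or None if it cannot be oriented.

    markers -- list of (scaffold_offset, map_position) tuples.
    The outermost two markers on the scaffold decide the orientation.
    """
    if len(markers) < 2:
        return None  # a single marker places a scaffold but cannot orient it
    ordered = sorted(markers)  # sort by offset within the scaffold
    first_pos, last_pos = ordered[0][1], ordered[-1][1]
    if first_pos == last_pos:
        return None  # markers are not resolved on the map
    return "+" if last_pos > first_pos else "-"

def place_scaffolds(scaffolds):
    """Order scaffolds along one chromosome by their lowest map position."""
    placed = []
    for name, markers in scaffolds.items():
        if not markers:
            continue  # unanchored: would need FISH or another method
        anchor = min(pos for _, pos in markers)
        placed.append((anchor, name, orient_scaffold(markers)))
    placed.sort(key=lambda item: item[0])
    return [(name, orientation) for _, name, orientation in placed]

# Hypothetical scaffolds: offsets in bases, map positions in arbitrary units
toy_scaffolds = {
    "scaffold_A": [(100_000, 52.0), (4_000_000, 40.5)],
    "scaffold_B": [(250_000, 10.0), (3_000_000, 22.0)],
    "scaffold_C": [(500_000, 30.0)],
}
print(place_scaffolds(toy_scaffolds))
# [('scaffold_B', '+'), ('scaffold_C', None), ('scaffold_A', '-')]
```

Here scaffold_C can be positioned between the other two, but with only one marker its orientation stays undetermined, which is the situation in which a method such as FISH would be required.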
Repetitive Elements
Repetitive elements occur at high frequency in all mammalian genomes. The typical mammalian genome has more than 40% of its DNA sequence derived from small mobile DNA elements known as transposons. These elements are commonly termed repeats. Such elements have been active in genomes throughout evolutionary history. As a result of this activity, some transposons are shared among animal species, whereas others arose uniquely after speciation. Using standard mammalian repeat libraries that can identify common elements across mammals, 39% of the equine genome assembly was annotated as comprising repetitive transposon-derived sequences. When customized libraries designed to include horse-specific repeats were applied, 46% of the assembled sequence was identified as repetitive, a quantity comparable with that seen in the human genome. The predominant repeat classes in the equine genome were long interspersed nuclear elements (LINEs), dominated by the L1 and L2 types (19% of bases), and short interspersed nuclear elements (SINEs), such as the recent Equine Repeat Elements 1/2 (ERE1/2) and the ancestral Mammalian Interspersed Repeats (MIRs) (7% of bases). Both of these element classes are common in many mammalian genomes. Examination with the horse-specific libraries found that novel equine repeats accounted for a large fraction of the observed consensus transposon element sequences in the horse, but that only 48 of these consensus elements were present in significant numbers (more than 100 copies).
Chimeric repeats were also identified as an important source of repetitive elements in the horse, but they were difficult to classify unambiguously. They appear to stem from the random placement of new repeat sequences within existing repeats. The different repeat classes seemed to occur within the chimeras in proportion to their relative overall frequency in the genome.
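Figures such as the percentage of the assembly annotated as repetitive are fractions of bases covered by repeat annotations. Because repeats can be nested within one another, annotated intervals may overlap, and each base should be counted only once. The following is a minimal sketch of that calculation. It assumes half-open (start, end) intervals on a single sequence, and the function name and toy coordinates are illustrative rather than drawn from any repeat-annotation tool.

```python
def repeat_fraction(intervals, sequence_length):
    """Fraction of a sequence covered by repeat annotations.

    intervals       -- iterable of half-open (start, end) coordinates
    sequence_length -- total length of the annotated sequence
    Overlapping or nested intervals are merged so no base is counted twice.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps or abuts the current interval: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / sequence_length

# Hypothetical annotations on a 1,000-base sequence
toy_repeats = [(0, 100), (50, 150), (300, 400)]
print(repeat_fraction(toy_repeats, 1_000))  # 0.25
```

Summing the raw interval lengths in this example would give 300 bases, whereas merging the overlapping pair gives the correct 250.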
Synteny with Humans
Comparative genomics is a term used to describe the use of information gained from the genome of one species to inform our understanding of another. Conserved linkage describes the preservation of gene order across species, and it suggests a common evolutionary origin of the sequence. If we compare horse and human chromosomes, we observe surprisingly strong conserved linkage between these species, given that humans and horses seem phenotypically (in appearance) to be quite dissimilar. In fact, it might be surprising to learn that horses are a little closer to humans on an evolutionary scale than they are to cows.
Syntenic blocks are regions where a portion of any chromosome from the first species is present in the second species without sequence from another chromosome intervening. Syntenic segments describe similar regions but do not allow for directional rearrangements between sections of the sequence, so a single block may contain several segments. More than 2.76 Gb (out of 2.9 Gb) of the human genome sequence is covered by horse syntenic blocks that are at least 100 kb long. This implies that relatively little rearrangement has occurred since the two species diverged from their common ancestor. By contrast, the synteny observed between human and mouse is weak (Waterston et al., 2002). At the resolution of 100 kb-sized sequence windows, we find only 86 syntenic blocks and 425 syntenic segments in the alignment of the horse and human genomes. While the human syntenic segments defined relative to horse and relative to dog have similar average sizes, 50% of the human genome sequence resides in blocks of 26 Mb or larger on the horse genome, whereas the equivalent figure on the dog genome is a smaller 20 Mb. This larger block size between human and horse is strongly influenced by 12 large human-horse segments that exceed the maximum size observed in the dog-human comparison. In fact, 17 horse chromosomes comprise material from a single corresponding human chromosome (53% of horse chromosomes and 29% of dog chromosomes correspond to a single counterpart human chromosome) (Figure 6.2).
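The distinction between syntenic blocks and syntenic segments can be illustrated with a simplified sketch. It assumes that one chromosome has been cut into consecutive windows and that each window has already been assigned a chromosome and strand in the other species. A segment is then taken to be a run of windows with the same chromosome and the same strand, and a block a run with the same chromosome regardless of strand. This is a deliberate simplification: it ignores the collinearity checks and size thresholds a real analysis would apply, and the function name and example data are hypothetical.

```python
from itertools import groupby

def count_runs(windows, key):
    """Count maximal runs of consecutive windows that share the same key."""
    return sum(1 for _ in groupby(windows, key=key))

# Hypothetical windows along one chromosome of species 1:
# (matching chromosome in species 2, strand of the match)
toy_windows = [
    ("chr1", "+"), ("chr1", "+"),
    ("chr1", "-"),                 # inversion: new segment, same block
    ("chr1", "+"),
    ("chr7", "+"), ("chr7", "+"),  # different chromosome: new block
    ("chr1", "-"),
]

blocks = count_runs(toy_windows, key=lambda w: w[0])  # chromosome only
segments = count_runs(toy_windows, key=lambda w: w)   # chromosome and strand
print(blocks, segments)  # 3 5
```

Under this simplification there can never be fewer segments than blocks, which is in line with the horse-human comparison reporting many more segments than blocks.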