Chapter 2 Comparative Medical Genetics
All the genetic information needed for the creation, maintenance, and reproduction of an organism is called the genome. For most organisms, this information is encoded in DNA (deoxyribonucleic acid) or, for some viruses, in RNA. A first step in the gigantic endeavor to understand this genetic information is to determine the complete nucleotide sequence of a genome. Such genome projects have been or will be undertaken for many different organisms. The progress made with the Human Genome Project around the turn of the century has not only produced an extraordinary resource for genetic research in human medicine, but it has also created the means for sequencing additional genomes. Following the completion of the high-density human genome sequence, these tools and sequencing capacities have been used for a variety of species, starting with model organisms. The mouse, as one of the most relevant models for genetic research, was the second mammal to be sequenced; genome sequences from rat, fruit fly, and zebrafish soon followed. The next group of genomes included those of domestic animals, such as the dog, cow, chicken, and pig, which were chosen because they also serve as model organisms and are of special interest as either companion or food animals. Genomes of other animals, including cat and horse, were chosen to help with the annotation of the human and other mammalian genomes (comparative annotation). They were sequenced at lower genome coverage and are expected to provide important information about genome evolution. Alignment and comparison of the available animal genomes to the human genome will help identify evolutionarily conserved regions, which most likely represent important functional elements. This is a critical step for the annotation of the human and animal genomes and the understanding of genomic function.
Completed genome sequences for several domestic animals are now available (Table 2-1) and semiannual updates on the status of current sequencing projects are listed on the National Institutes of Health (NIH) website (www.genome.gov/10002154). Many aspects of the canine genome and its impact on comparative and medical genetics are covered in The Dog and Its Genome (Ostrander et al., 2006). The knowledge about the genomes of companion animals will have an enormous impact on veterinary medicine by facilitating the identification of genes underlying breed characteristics, including behavior, coat color, body type, and disease predispositions, and the detection of disease-causing mutations. This knowledge will lead to great advances in genetic screening for desirable and disease-causing traits as well as breed-specific vaccine and drug development (custom drug design). It will also change livestock breeding and production through identification of productivity and disease-resistance genes.
The nuclear genome is composed of a species-specific number of linear DNA molecules, which are packaged into chromosomes. The number of chromosomes varies greatly among eukaryotes (for haploid chromosome numbers, see Table 2-1) but appears to be unrelated to genome size and its biological features. During cell division, DNA is duplicated and then condensed into the more compact forms of chromosomes. The varying sizes, location of centromeres, and the characteristic banding patterns revealed by staining techniques allow for the identification of individual chromosomes. For each organism, the arrangement of chromosomes by pairs (homologous chromosomes), according to standard classifications, is referred to as the karyotype (see the example in Fig. 2-1) and can also be depicted as a drawing called an ideogram. The two types of genome maps (i.e., physical and genetic maps) are important tools for the sequencing and assembly of whole genome sequences.
Once established, they are great resources for locating and sequencing genes, such as those involved in diseases. A physical map depicts the position of a specific DNA segment in a genome; for example, its location on a specific chromosome. A genetic map describes the order and distance between specific DNA sequences in terms of the rate of DNA recombination between homologous chromosomes during meiosis, and it is determined from breeding experiments and pedigree analyses. Integrated maps use DNA segments as markers that are mapped to both maps and display information from both. Different techniques have been applied to construct physical maps as new techniques became available.
a. Fluorescent In Situ Hybridization (FISH)
FISH enables the assignment of a DNA molecule directly to a chromosome. Hybridization of several DNA fragments simultaneously reveals not only their individual location but also their relative order to each other. To perform a traditional FISH experiment, cells are harvested in the metaphase stage of mitosis and their chromosomes are fixed onto a glass slide. Individual chromosomes can be distinguished by their distinct banding patterns and other cytological features. A specific DNA molecule (also referred to as a probe) is labeled with a fluorescent dye and hybridized to the denatured chromosomes. The single-stranded DNA probe anneals to its complementary strand in the chromosome in a sequence-specific manner, and the physical location of the probe is microscopically visible as a bright fluorescent signal. With the development of fluorescent labels that have specific emission spectra, multiple DNA probes can be hybridized simultaneously to a single chromosome preparation, allowing their ordering on a chromosome (Fig. 2-2).
Another useful application of multicolor FISH is called chromosome painting: multiple probes distributed throughout the length of one chromosome are labeled with the same color dye at a density such that the entire chromosome is covered by fluorescence. As chromosome-specific probe sets are hybridized with different colors, each chromosome reveals its unique color, which is particularly useful to examine chromosomal abnormalities like deletions, duplications, and translocations of chromosomal segments. Because FISH allows only low-resolution mapping (probes > 1 Mb apart), other techniques need to be applied for finer, high-resolution mapping.
b. Restriction Enzyme Mapping
Restriction endonucleases are enzymes isolated from various strains of bacteria that recognize and cleave specific double-stranded DNA sequences, called restriction sites, with the majority of sites consisting of only four to seven nucleotides (see the example in Fig. 2-3). A DNA segment, digested by a specific restriction enzyme, is cut into smaller DNA fragments of different sizes depending on the number and location of the recognition sites present within the DNA sequence. The differently sized fragments can be separated by agarose or polyacrylamide gel electrophoresis. A simple way to create a restriction map of a smaller genome is to first cut the DNA in two separate reactions, each with a different restriction enzyme, and then in an additional reaction with both enzymes simultaneously, and to compare the resulting fragment size patterns. The single digests reveal the number of restriction sites for each enzyme, and the double digest reveals the positions of the sites relative to each other (Fig. 2-4). However, with an increasing size of the DNA segment to be mapped, the number, sizes, and order of resulting fragments become too complex. Analysis then requires cloning smaller fragments or other mapping techniques.
c. Sequence Tagged Site (STS) Mapping
STSs are short nonrepetitive DNA segments that are located at unique sites in the genome and can be easily amplified by the polymerase chain reaction (PCR). Common sources of STSs are expressed sequence tags (ESTs), microsatellites (discussed later), and known genomic sequences that have been deposited in databanks. ESTs are short sequences obtained by converting mRNA into complementary DNA (cDNA). They are unique and valuable sequences, because they represent parts of expressed genes of the cells or tissue used for the mRNA extraction. To construct a genome map using STSs, different DNA resources, sometimes called mapping reagents, can be used. The most common resources are radiation hybrid panels or clone libraries, both of which can be constructed from either the whole genome or a single chromosome.
i. Radiation Hybrid (RH) Mapping
Radiation cell hybrids are typically constructed using cells from two different species. Cells from the organism whose genome is to be mapped (donor) are irradiated with a lethal dose and then usually fused with rodent (recipient) cells. The irradiated chromosomes break at random sites and, after cell fusion with the recipient cells, the donor chromosome fragments are incorporated into the recipient chromosomes. Consequently, each hybrid cell line derived from a single cell contains different parts of the donor's chromosomes, which were incorporated at random. Radiation hybrid mapping is based on this artificially induced random breaking of the genomic DNA into smaller fragments. The original order of these fragments relative to each other is determined by identifying which specific DNA sequences are found in the same clones, which means that they segregate together because of their close physical proximity in the genome. For detailed mapping, fewer than 100 hybrid cell lines are necessary.
For example, irradiated canine cells were fused with recipient hamster cells, and 88 cell lines were selected (Hitte et al., 2005). To map the canine genome, DNA from each cell line is tested for the presence or absence of unique canine markers, like STSs. If two markers are originally located close together on a chromosome, a break between the markers is unlikely, and, therefore, they will mostly be found together in the same cell line. In contrast, if they are farther apart or even on different chromosomes, the separation of the two markers into different cell lines is likely. Hence, the actual distance between two markers on a chromosome is proportional to the probability of the markers being separated and found in different cell lines. Analysis of hundreds to thousands of markers allows for the determination of the order and distance between markers. Higher resolution RH maps can be achieved by increasing the intensity of the initial radiation of the donor cells, leading to increased chromosomal breaks and smaller average fragment sizes. The probability of separation between closely located markers increases, thereby permitting the ordering of more markers.
ii. Clone Library
A clone library consists of DNA fragments, representing the total DNA from a specific chromosome or whole genome, inserted into some type of vector that can be grown in bacteria, yeast, or mammalian cells. To construct a library, the source DNA is cut into random fragments, usually by a restriction enzyme that has a 4 bp recognition site and therefore cuts the DNA frequently. However, the digestion of the DNA is purposely prevented from going to completion, leaving randomly larger uncut fragments that partially overlap. These fragments are then cloned into vectors, for example, plasmids, which incorporate the DNA and allow for easy amplification and isolation in bacteria. Different types of vectors accommodate DNA fragments of different sizes, ranging from hundreds to thousands of base pairs (bp).
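The presence/absence scoring used in radiation hybrid mapping can be illustrated with a small sketch. The marker names and retention patterns below are invented for illustration, and the measure used (the fraction of cell lines in which exactly one of two markers is present) is a deliberate simplification; real RH analyses use statistical models that also account for how often fragments are retained at all.

```python
from itertools import combinations

def discordance(pattern_a, pattern_b):
    """Fraction of cell lines in which exactly one of the two markers is present."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("both markers must be scored in the same cell lines")
    separated = sum(a != b for a, b in zip(pattern_a, pattern_b))
    return separated / len(pattern_a)

def closest_pair(retention):
    """Return the marker pair that is separated least often across the panel."""
    return min(
        combinations(sorted(retention), 2),
        key=lambda pair: discordance(retention[pair[0]], retention[pair[1]]),
    )

# Hypothetical panel of 8 hybrid cell lines: 1 = marker detected, 0 = not detected.
retention = {
    "STS_A": [1, 1, 0, 0, 1, 0, 1, 0],
    "STS_B": [1, 1, 0, 0, 1, 0, 0, 0],
    "STS_C": [0, 1, 1, 0, 0, 1, 0, 1],
}

for first, second in combinations(sorted(retention), 2):
    print(first, second, discordance(retention[first], retention[second]))
print("closest pair:", closest_pair(retention))
```

In this toy panel, STS_A and STS_B are separated in only one of eight cell lines, so they would be placed closer together than either is to STS_C.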
As with the radiation hybrids, the individual clones are analyzed for the presence or absence of STSs, which allows the ordering of these markers depending on their common presence in the same clones. Again, the resolution of the STS map can be raised by decreasing the size of the DNA fragments used for construction of the library. The STS markers are also used to identify overlapping clones to build contigs (a number of overlapping clones representing a region of a particular sequence). Because sequences obtained from each clone can be precisely anchored to the physical map, clone libraries are critical in the assembly of whole genome sequences. Breeding experiments or pedigree analyses can be used to genetically map genes or molecular markers. The basis for genetic mapping is that the distance between two markers on a chromosome is directly correlated with the probability of recombination between them during meiosis. Because each diploid cell has two copies of each locus (two alleles), it is by chance that half the time the alleles of two different loci on different chromosomes are inherited together (Fig. 2-5a). In other words, in 50% of offspring (or meioses), the same alleles of the two loci are found together, although they are located on different chromosomes. However, if the two loci are located on the same chromosome, it is less likely that their alleles will be separated and, therefore, should segregate together in > 50% of the offspring. If they are found separated in some of the individuals, then they are said to have recombined. The frequency of recombination is correlated with the distance between the two loci. If they are closely located, then recombination between the two markers will happen less often and will be <50%, and approaching 0% for very closely located markers (Fig. 2-5b). Markers are said to be linked if recombination between them is <50%. 
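The relationship between recombination and linkage can be made concrete with a short sketch. The offspring counts are hypothetical, and comparing the observed fraction with 0.5 is only a rough illustration; it is not a statistical test of linkage.

```python
def recombination_fraction(recombinant, nonrecombinant):
    """Observed fraction of offspring (meioses) in which two loci recombined."""
    total = recombinant + nonrecombinant
    if total == 0:
        raise ValueError("at least one offspring must be scored")
    return recombinant / total

def appears_linked(fraction):
    """Markers are said to be linked if recombination between them is below 50%."""
    return fraction < 0.5

# Hypothetical counts: 12 recombinant and 88 nonrecombinant offspring.
close_pair = recombination_fraction(12, 88)
# Hypothetical counts for two loci on different chromosomes.
unlinked_pair = recombination_fraction(50, 50)

print(close_pair, appears_linked(close_pair))
print(unlinked_pair, appears_linked(unlinked_pair))
```

With a small number of offspring, an observed fraction below 0.5 can arise by chance, which is why real linkage studies rely on likelihood calculations over many meioses.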
To be able to follow the inheritance of different alleles of a genetic marker in a pedigree, the marker needs to be polymorphic for a DNA variation (discussed later). Hundreds to thousands of these genetic markers are then analyzed in a number of families. Likelihood calculations for linkage based on the percentage of recombination between any two markers permit the ordering of the markers relative to each other into linkage groups and ultimately into a genetic map. The distance between markers on a genetic map is based on the recombination rate and expressed in centiMorgans (cM; 1 cM = 1% recombination). The resolution of a genetic map depends on the number of individuals as well as how informative the markers tested are.
a. Genetic Markers
Although more than 99% of the DNA sequence is identical between individuals of any mammalian species, much variation remains. These sequence differences are known as polymorphisms, contribute to breed and individual differences, and have been useful for many practical applications, including genome mapping, screening for genetic diseases, and forensic applications such as DNA fingerprinting. Most variations are located outside of genes and generally do not affect any gene function. However, some of these polymorphisms may contribute toward physical characteristics or disease susceptibility. In contrast, polymorphisms within a regulatory or coding sequence of a gene can have deleterious effects on gene function. These polymorphisms have a lower frequency within a population and are referred to as mutations. Mutant alleles of genes are often associated with a genetic disease or disease predisposition and are referred to as disease genes or alleles.
i. Restriction Fragment Length Polymorphisms (RFLPs)
One of the first widely used techniques to detect DNA variations in a population was the analysis of RFLPs.
Polymorphisms between individual DNAs can either destroy existing or create new endonuclease recognition sites and, thereby, lead to different fragment size patterns following restriction enzyme digestion. To test for a specific RFLP, a DNA region is amplified by PCR and subsequently digested with a particular restriction enzyme. The resulting DNA fragments can be separated by gel electrophoresis and visualized by staining with ethidium bromide. A difference in number or size of fragments between individuals tested indicates a polymorphism within the restriction site of the enzyme used. Before the advent of automated PCR, RFLP analysis methods included the extraction and digestion of genomic DNA of each individual tested, separation by gel electrophoresis, transfer of the DNA to a nylon membrane, and subsequent hybridization with a radioactively labeled DNA probe that bound to a known region in the genome. If a variation within a restriction site of the enzyme used was located within or close to the region of a locus binding to the probe, the labeled bands would differ either in size or number between individuals. Although extraordinarily laborious and not very informative, these RFLPs were used as markers to construct the first human genetic linkage map.
ii. Minisatellites or Variable Number of Tandem Repeats (VNTRs)
Minisatellites or VNTRs, succeeding the RFLPs, are noncoding DNA sequences <20 kb long, containing a variable number of 15- to 100-bp long repeat units (Fig. 2-6a), and are distributed throughout the whole genome. If genomic DNA is cut by a restriction enzyme that has no recognition site within the repeat unit but cuts the remaining DNA fairly frequently, a large number of different-sized fragments can be identified. Because the numbers of the repeat units at most of the minisatellite loci vary among individuals, the resulting pattern of differently sized fragments is unique to each individual and is, therefore, called a DNA fingerprint (Fig. 2-6b).
Although minisatellites are more informative than RFLPs, their analysis is still time consuming.
iii. Microsatellites or Simple Tandem Repeats (STRs)
With the advent of PCR, microsatellites soon replaced minisatellites as well as RFLPs as genetic markers. Microsatellites or STRs are composed of simple sequence repeats of 2 to 7 nucleotides. The number of repeat units may differ greatly among individuals, resulting in alleles of varying lengths. PCR primers flanking the repeat are located in genome-wide unique sites and, therefore, allow for unique and easy amplification of one specific marker. Although initially radioactively labeled primers were used, fluorescent labels, automated DNA sequencers, and analysis software now allow for fast and inexpensive analyses. The abundance of STRs throughout the genome and ease of analysis greatly improved genetic maps in humans and animals with ever-increasing resolution.
iv. Single Nucleotide Polymorphisms (SNPs)
The most frequent, evenly distributed genome sequence variations (e.g., >4.5 million in humans) are SNPs, where a single nucleotide (A, T, G, or C) at a locus differs between individuals in a biallelic fashion. A small fraction of SNPs gave rise to the RFLPs described previously. However, current technologies allow for automated analysis of tens of thousands of SNPs per sample simultaneously, making SNPs the preferred tool for genome-wide analysis in the search for mutations responsible for diseases. The commercially available high-density oligonucleotide microarrays or DNA chips contain thousands of different oligonucleotides representing different sequence variants. Hybridization of labeled sample DNA to the chip and subsequent analysis with a fluorescent scanner will result in a typical hybridization pattern. Because the representative genomic location of each oligonucleotide on the chip is known, the assessment of the pattern permits genotyping of several thousand SNPs per sample.
SNP maps and chips, developed for humans and some domestic animals, are most useful to find the sequence variations that affect gene function associated with health, production, and disease.
Because some markers can be analyzed on both physical and genetic maps, they serve as anchors to compare and combine data from both maps. The resulting integrated map lists the order of the markers and gives their distances in both genetic and physical scales.
Comparative genomics, utilizing information about different genomes, is particularly important in the understanding of genomes of various organisms. If two organisms have a recent common ancestor, their genomes will be related. Comparative maps display similarities between two organisms by aligning genes and their order on a chromosome of one species and then comparing it to the location and order found in another species. This knowledge is useful for mapping, identifying and isolating genes, and gaining more information about principles of evolution. Comparison of the actual genome sequences of different species allows the detection of highly conserved regions within or around genes that, besides representing exonic sequences, most likely serve as important regulatory elements in gene expression and function.
A major objective of genetic research is the identification of DNA mutations that are involved in disease or genetic predisposition. A small sequence change located within a gene can alter or eliminate gene or protein function. These mutations either arise during imperfect DNA replication or are caused by mutagens and are distinguished by the type of change in the nucleotide sequence. A replacement of a single nucleotide with another base is called a point mutation, which can be either a silent (the amino acid remains unchanged), a missense (changes the amino acid), or a nonsense point mutation (producing a stop codon). Insertions or deletions refer to varied numbers of nucleotides that are added or deleted, respectively.
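The three kinds of point mutation named above can be told apart by translating the codon before and after the change. The following is a minimal sketch that assumes the standard genetic code and a single codon read in frame from the coding strand; the function name and example codons are illustrative only.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_point_mutation(codon, position, new_base):
    """Classify a single-base replacement in one codon as silent, missense, or nonsense.

    position is 0, 1, or 2 within the codon. Changes that turn a stop codon into
    an amino acid codon are not treated separately in this sketch.
    """
    codon = codon.upper()
    new_base = new_base.upper()
    if len(codon) != 3 or new_base not in BASES or position not in (0, 1, 2):
        raise ValueError("expected a 3-base codon, a position 0-2, and a base A/C/G/T")
    mutant = codon[:position] + new_base + codon[position + 1:]
    if mutant == codon:
        raise ValueError("the new base is identical to the original base")
    original_aa = CODON_TABLE[codon]
    mutant_aa = CODON_TABLE[mutant]
    if mutant_aa == original_aa:
        return "silent"
    if mutant_aa == "*":
        return "nonsense"
    return "missense"

# GAA encodes glutamic acid.
print(classify_point_mutation("GAA", 2, "G"))  # GAG still encodes glutamic acid
print(classify_point_mutation("GAA", 1, "T"))  # GTA encodes valine
print(classify_point_mutation("GAA", 0, "T"))  # TAA is a stop codon
```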
Nonsense point mutations and deletions or insertions unequal to an exact multiple of 3 bp can result in an early stop codon and consequently in a shortened, unstable, or malfunctioning protein. Protein function can also be impaired by the change or addition/deletion of amino acids because of a mutation within the coding region (missense). Additionally, mutations within noncoding sequences that are necessary for correct gene regulation and function can also lead to a change in expression or nonfunctional proteins. In single gene disorders, a specific mutation is severe enough to cause disease by itself and often shows a simple (Mendelian) inheritance pattern. If the inheritance is said to be dominant, one mutant allele is sufficient for the development of the disease in an affected individual. Because the second allele is a normal (wild-type) allele, the affected individual is considered to be heterozygous. If both alleles have to be mutated to cause clinical disease, then the inheritance pattern is said to be recessive and the affected animal is homozygous for the mutant allele. If the mutation is located on the X chromosome, the affected male is considered to be hemizygous. Complex or polygenic disorders are caused by sequence variations in a few or in numerous genes and are more difficult to evaluate. The influences of environmental factors are being recognized and explain some of the variation in disease presentations of simple and complex inherited traits.
To identify mutant alleles, various methods have been applied. If the phenotype or metabolic basis of the disease to be studied is well characterized or previous research has been done in humans or in other animal species with a similar disease, there might be potential genes (known as candidate genes) that can be suspected to be involved based on the previous findings or known function.
Candidate genes can be evaluated for their involvement by testing for linkage or association (discussed later) or direct sequencing of coding regions, exon/intron boundaries, and promoter regions from unaffected and affected animals. For example, symptoms seen in human patients with phosphofructokinase (PFK) deficiency closely resembled those in other glycogen storage diseases and extensive biochemical analyses revealed the deficiency of the key regulatory glycolytic enzyme muscle-type phosphofructokinase (PFK) (Tarui et al., 1965). The gene was then cloned. Based on this information the canine PFK gene was sequenced in English springer spaniel dogs affected with PFK deficiency and a nonsense mutation identified (Smith et al., 1996), which is different from published mutations responsible for PFK deficiency in humans (reviewed in Nakajima et al., 2002). Protein-based functional assays are another common way to determine if a candidate gene is involved in the development of a disease. This approach led to the diagnosis of PFK deficiency in English springer spaniel dogs experiencing hemolysis and myopathy (Giger et al., 1985). If there is no candidate gene, a linkage approach involving a whole genome scan utilizing the molecular tools described earlier is an option to identify a chromosomal region or gene linked to the disease. This approach requires medical and pedigree information and a source to isolate DNA from a fairly large number of affected and nonaffected animals. Animal breeding data should make it possible to acquire the necessary data (pedigrees) and samples from three-generation pedigrees for linkage studies. If more than one breed is affected with the same disease, the different genetic background found in different breeds may further assist in narrowing the DNA region of interest. Generally, association studies require an equal number of affected and unaffected (control) animals from a population.
I. INTRODUCTION
A. Genome Sequences
B. Mapping the Genome
1. Physical Mapping
2. Genetic Linkage Mapping
3. Integrated Maps
4. Comparative Maps
C. Disease Gene Mapping
1. Candidate Gene Approach
2. Genetic Analysis