Chapter 8
Limitations of Pathology and Animal Models
Natasha Neef
Vertex Pharmaceuticals, Boston, MA, USA
A clear understanding of the limitations of pathology and of the animal models that produce pathology end points is a key skill for toxicologists and study personnel. In routine regulatory toxicology studies, the anatomic and/or clinical pathology end points are frequently the most significant elements of the study data, and both the data and the pathologist’s interpretation of them can profoundly influence the subsequent development and use of the test article. Moreover, inappropriate generation and/or interpretation of pathology data can put humans at risk, waste animals, time and money and/or confound compound development. The purpose of this chapter is to provide an overview of the limitations of pathology and animal models, in order to allow toxicologists and study personnel to critically evaluate their pathology data and utilise pathology end points judiciously. The material presented reflects the personal opinions of the author.
8.1 Limitations of In Vivo Animal Models
8.1.1 Traditional Laboratory Species Used as General Toxicology Models
Routine general toxicology studies, as required by regulatory agencies, utilise young, healthy animals from outbred laboratory species. The design of these studies typically follows a standard format (Adams and Crabbs, 2013; Greaves et al., 2004), where the pathology portion uses a wide range of haematology, clinical chemistry and urinalysis end points and samples essentially all tissues for microscopic examination. Many of the limitations of these models are self-evident; the principal one is that animal physiology differs from that of humans, and therefore some toxicities observed in animal models may not be relevant for humans and some human toxicities may not be detected in animal studies (for a general review that includes summaries of concordance of human and animal toxicology data, see Greaves et al., 2004; Olson et al., 2000). Beyond this general principle, however, there are a number of more or less obvious shortcomings of these models that toxicologists should bear in mind when assessing pathology findings (or absence thereof).
8.1.2 The Test Article May Not have Sufficient Pharmacological Activity in Routine Toxicology Species
In order for toxicology studies of substances such as pharmaceuticals (which act pharmacologically to achieve their desired effects) to correctly model human risk, the pharmacological activity of the substance in question at the doses used in the toxicology species must be at least broadly comparable with that in humans. This is most often a problem with biologic drugs, where monoclonal antibodies or other proteins can have very limited cross-species pharmacological activity. Similar target binding profiles between species are not sufficient to ensure this, since activity is sometimes also mediated by other parts of the molecule (such as the Fc portion of monoclonal antibodies), where binding and/or activity may differ between species, or the target itself may be distributed differently in the test species compared with humans. A good example of the limitations of animal models in this respect is the tragic outcome of a first-in-human clinical trial of a monoclonal antibody, TGN1412 – a humanised CD28 agonist antibody intended to act as a selective stimulant of regulatory T-cell expansion for the treatment of autoimmune diseases (Suntharalingam et al., 2006). In this case, despite comparable binding of TGN1412 to human and monkey CD28, TGN1412 did not demonstrate pharmacological activity in the cynomolgus monkey that was used as the primary toxicology species (Horvath et al., 2012). The monkey toxicology study thus failed to reproduce the cytokine storm that occurred in the human-trial subjects, leaving some with permanent disabilities. It later emerged that the reason for the lack of pharmacological activity in the cynomolgus monkey was that the main cell type responsible for eliciting the cytokine storm in humans (CD4+ effector memory T-cells) does not express the CD28 receptor in the cynomolgus monkey (Eastwood et al., 2010). 
Thus, even comparable target binding of the antibody in both humans and monkeys was insufficient to ensure pharmacological activity in the monkey, since the CD28 target was distributed on different T-cell subsets and performed a different function in humans.
Typically, biologic drugs with little or no activity in at least one toxicology species are very difficult to develop, and alternatives need to be found wherever possible. One approach is to create a surrogate test article that is active in a toxicology species. This is usually preferable to ‘humanised’ animal models (usually mice) that have been genetically modified to respond to the test article, since such models are usually poorly characterised and potentially still subject to other species differences, which may under- or even overestimate any adverse effects of the test article.
8.1.3 The Model May Not Identify Hazards Related to Causation or Exacerbation of Pathology that is Unique to Humans or Undetectable in Animals
There are a number of human diseases and other pathological conditions that are not meaningfully reproducible in standard toxicology species and for which routine studies are unlikely to be predictive. The most important example is exacerbation or precipitation of acute events related to human atherosclerotic cardiovascular disease. Nonsteroidal anti-inflammatory drugs (NSAIDs), particularly the relatively new selective COX2 inhibitors, are good examples of drugs that were extensively characterised in preclinical toxicology studies, but for which the cardiovascular disease liability for humans (always a theoretical possibility based on the mechanism of action of these drugs: Hawkey, 1999; Graham, 2006) was widely recognised only post-approval using human data (Cannon and Cannon, 2012). Some of the precipitating causes of these human cardiovascular events were subsequently modelled in mice (Yu et al., 2012), but in the absence of the predisposing atherosclerotic vascular compromise, the human risk could not be evaluated directly using animal toxicology models.
Other significant examples include behavioural changes such as dysphoria and suicidal ideation that are likely human-specific but in any case are not appreciable in laboratory species in the context of general toxicology studies. These types of finding in humans led to the withdrawal of rimonabant, the cannabinoid receptor-1 antagonist, from the European market 2 years after its approval (Christensen et al., 2007).
The archetypal significant toxicity that is not generally predicted by preclinical toxicology studies is idiosyncratic drug-induced liver injury (DILI), which is a common cause of post-approval drug withdrawal, since it typically affects relatively few individuals and so is often not detectable in human clinical trials. It is similarly not generally observed in preclinical toxicology studies, despite the use of high doses in these studies (Greaves et al., 2004). Human-specific genetic factors likely play a role in individual susceptibility, and there is no widely accepted animal model (FDA, 2009; Daly and Day, 2012).
8.1.4 The Model May Not Identify Hazards with Low Incidence/Low Severity
Toxicology studies, particularly non-rodent studies, necessarily use limited numbers of animals – almost always fewer than the total number of humans potentially exposed to the test article in question. Whilst the use of high doses in these studies is intended to exaggerate toxicities in part to compensate for low animal numbers, it is still possible that even over multiple studies, toxicity issues can remain undetected or occur so sporadically that they are not attributed to the test article. Where the ‘true’ incidence of a particular finding is less than about 10–20% in rodents or 30–50% in large animals, it is perfectly possible that, by chance, a finding to which the toxicology species is susceptible would not occur at the highest dose in a standard study, and that if it occurred at a lower dose, it might be dismissed as not test article-related based on the lack of a dose response. This is particularly likely in non-rodent studies using small numbers of animals, and a common mistake in interpreting such data is to assume that the absence from non-rodents of a finding that occurred at a low incidence in rats (despite similar systemic exposures) means that the finding is ‘rodent-specific’. In these cases, it is wise to bear in mind that with low animal numbers, absence of evidence is not necessarily evidence of absence.
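The arithmetic behind this point can be sketched as follows. If each animal is assumed to be affected independently with the same ‘true’ incidence p, the chance that none of the n animals in a dose group shows the finding is (1 − p)^n. The group sizes used here (10 for a rodent group, 4 for a non-rodent group) are illustrative assumptions and are not taken from any particular study design.

```python
def prob_finding_missed(true_incidence: float, group_size: int) -> float:
    """Probability that no animal in a group shows the finding.

    Assumes every animal is affected independently with the same
    probability (a simple binomial model).
    """
    if not 0.0 <= true_incidence <= 1.0:
        raise ValueError("true_incidence must be between 0 and 1")
    if group_size < 0:
        raise ValueError("group_size must not be negative")
    return (1.0 - true_incidence) ** group_size


# Hypothetical group sizes, chosen only for illustration.
scenarios = [
    ("rodent group, n=10", 0.10, 10),
    ("rodent group, n=10", 0.20, 10),
    ("non-rodent group, n=4", 0.30, 4),
    ("non-rodent group, n=4", 0.50, 4),
]

for label, incidence, n in scenarios:
    missed = prob_finding_missed(incidence, n)
    print(f"{label}: true incidence {incidence:.0%} -> "
          f"P(no affected animals) = {missed:.2f}")
```

Under these assumptions, a finding with a true incidence of 10% would be absent from a group of 10 animals roughly one time in three, and a finding with a true incidence of 30% would be absent from a group of four animals about one time in four.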
8.1.5 Potential for Misinterpretation of Reversibility/Recovery for Low-Incidence Findings
A related problem in toxicology studies is overinterpretation of recovery animal data, particularly for non-rodents, which typically utilise group sizes of just two animals/sex/dose for the recovery arm. This can also occur with low-incidence findings in rodents, where only five animals/sex is typical for recovery groups. The absence from recovery animals of a pathology finding that occurred in only a minority of animals sacrificed at the end of the dosing period may be simply because it was not present in any of the recovery animals in the first place, but toxicology reports sometimes interpret such data as unequivocally indicating ‘reversibility’. Reversibility of clinical pathology findings in both rodents and non-rodents is also sometimes claimed in toxicology studies, even when a glance at the data shows that the individual animals used for recovery were not affected at the end of the dosing period. In the case of anatomic pathology lesions, the presence of a clinical pathology biomarker for the lesion in question is useful in supporting an interpretation of reversibility, as is an understanding of the nature of the lesion and its likely reversibility (Perry et al., 2013).
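A rough sense of how often ‘apparent reversibility’ could arise by chance can be obtained from a simple counting argument. The sketch below assumes that a known number of animals in a dose group developed the lesion during dosing and that animals were allocated to the recovery arm at random, without reference to the lesion; the probability that no recovery animal was ever affected is then a hypergeometric term. The overall group sizes (6 non-rodents, 15 rodents) and the numbers affected are hypothetical values chosen for illustration.

```python
from math import comb


def prob_recovery_animals_never_affected(
    group_size: int, n_affected: int, n_recovery: int
) -> float:
    """Probability that none of the recovery animals had the lesion.

    Assumes n_affected of the group_size animals developed the lesion
    and that the n_recovery animals were picked at random from the group.
    """
    if not 0 <= n_affected <= group_size:
        raise ValueError("n_affected must lie between 0 and group_size")
    if not 0 <= n_recovery <= group_size:
        raise ValueError("n_recovery must lie between 0 and group_size")
    # Ways to pick the recovery animals from unaffected animals only,
    # divided by all ways to pick the recovery animals.
    return comb(group_size - n_affected, n_recovery) / comb(group_size, n_recovery)


# Hypothetical non-rodent group: 6 animals, 2 affected, 2 kept for recovery.
print(f"{prob_recovery_animals_never_affected(6, 2, 2):.2f}")

# Hypothetical rodent group: 15 animals, 3 affected, 5 kept for recovery.
print(f"{prob_recovery_animals_never_affected(15, 3, 5):.2f}")
```

In the hypothetical non-rodent case, there is a 40% chance that neither recovery animal ever had the lesion, so its absence at the end of the recovery period would say little about reversibility.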
8.1.6 Potential for Over- or Underestimation of the Relationship to Test Article of Findings that have High Spontaneous Incidence in Laboratory Species, but are Relatively Rare in Humans
The toxicologist should be aware of common spontaneous background changes in the different laboratory species, since chance variation in their incidence amongst the dose groups can mimic or mask a test-article effect. A common situation is the misdesignation of a finding as test article-related when there is an increase in severity or incidence over that in the control group that is actually just due to chance, or when a finding is known to be a common background change but by chance does not appear at all in the control group in that particular study. These are difficult judgements to make, because genuine test article-related increases in findings that are indistinguishable from spontaneous findings can occur. However, an awareness by the toxicologist of the most common types of spontaneous pathology lesions (Chapter 3) can alert him or her to cases where further investigation – such as review of historical control data (to understand what constitutes a normal incidence of the finding in question) or obtaining a third-party expert opinion – may be appropriate to ensure the best possible interpretation of the data.
Reviews of spontaneous pathology in all the common laboratory-animal species are published fairly frequently (e.g. Lowenstine, 2003; Chamanza et al., 2010; McInnes, 2012). Typical examples include cardiomyopathy in rats and focal renal tubular degeneration/regeneration in most species; there is plenty of scope in these cases for random differences in incidence/severity between groups to either mask or falsely suggest a potentially serious test-article effect. This can be a particular problem in non-rodent studies where the number of animals is small.
A useful comparison for these situations is carcinogenicity studies, in which statistical analysis with correction for multiple testing and comparison with historical control data is performed routinely to prevent overinterpretation of differences in tumour incidence between treated and control animals (FDA, 2001). This means that, particularly for common lesions, p-values well below the traditional 0.05 cutoff are required for meaningful statistical significance. Since statistical analysis is not usually performed for non-neoplastic lesions in general toxicology studies, adjustment of this kind does not take place automatically, and overinterpretation of data is thus more likely.
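The reason that p-values well below 0.05 are needed can be illustrated with a simple calculation. The sketch below assumes a number of independent comparisons, each tested at the same level, and uses the Bonferroni correction purely as a familiar example of a multiplicity adjustment; it is not the specific procedure set out in the FDA guidance, and the numbers of comparisons are hypothetical.

```python
def familywise_false_positive_rate(alpha: float, n_comparisons: int) -> float:
    """Chance of at least one false-positive result across independent
    comparisons, each tested at significance level alpha."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return 1.0 - (1.0 - alpha) ** n_comparisons


def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison cutoff that keeps the overall false-positive
    rate at or below alpha (Bonferroni correction)."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# Hypothetical numbers of lesion types compared between groups.
for m in (1, 10, 40):
    overall = familywise_false_positive_rate(0.05, m)
    cutoff = bonferroni_threshold(0.05, m)
    print(f"{m:>2} comparisons: P(at least one false positive) = {overall:.2f}, "
          f"Bonferroni cutoff = {cutoff:.5f}")
```

With 10 such comparisons at the 0.05 level, the chance of at least one spurious ‘significant’ difference is about 40%, and with 40 comparisons it is over 85%. The many lesion types compared informally between groups in a general toxicology study are subject to the same problem.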
Another limitation to consider is that comparison of incidences of putative test article-related findings with historical control data for general toxicology studies is sometimes unreliable, since the very fact that these findings are non-neoplastic and can be ‘normal’ means that many pathologists do not diagnose them routinely, especially in control animals, where they cannot be test article-related (McInnes and Scudamore, 2014). In these circumstances, historical control data can give a falsely low impression of the true background incidence of the finding. For toxicologists dealing with potential overinterpretation of pathology findings that are not explainable using historical control data, the best course (if this has not been done already) is to request comparison of the sections in question with a reasonable number of control animals from other recent studies, with review by both the study pathologist and the peer reviewer. If this is not possible, review of the slides by a third-party pathologist with a very large amount of experience reading studies of that particular duration in that species may be justified to ensure the most appropriate interpretation of the data.
8.1.7 Exclusive Use of Young, Healthy Animals Kept in Ideal Conditions Gives Limited Predictivity for Aged/Diseased Human Populations
Minimising interanimal variability in order to improve the sensitivity of preclinical toxicology studies is one of the main reasons for using adolescent or young-adult animals that are free of intercurrent disease or ageing changes, and for standardising their environment, diet and other experimental conditions. These healthy animals generally have a large reserve functional capacity within major organ systems (immune, cardiovascular, renal, hepatic etc.), and so in many situations will be less likely to manifest evidence of organ malfunction in the presence of degenerative and/or functional changes mediated by a test article. In contrast, the general human population outside a clinical trial setting will contain many individuals whose susceptibility would be significantly greater than that of the test animal population, and an intended patient population will often contain a disproportionate number of individuals compromised in one or more respects. Again, the higher doses used in toxicology studies relative to anticipated human exposure will compensate to some extent for the high reserve functional capacity of the animal subjects, but in some situations – such as when overall tolerability issues prevent use of good exposure multiples in toxicology species – the use of healthy animals kept in ideal conditions is potentially a significant limitation of general toxicity studies.
Conversely, some toxicities can be exaggerated in young, growing animals, leading to an overestimation of the risk in an adult human patient population. A common example is compounds that affect bone deposition or remodelling, which can produce quite profound changes in young growing animals with active bone growth plates but have little or no effect in older human adults (Gunson et al., 2013). Typically, the more rapidly growing the test animal, the greater its sensitivity, which can lead to the seemingly counterintuitive situation where toxicities are more prominent in shorter-term studies than in longer-term ones conducted at the same doses, simply because at study termination, the animals are younger and more rapidly growing in the shorter-term studies.
Another limitation imposed by the use of healthy animals is that the pharmacology of the test article may become dose-limiting when it is intended to counteract abnormal physiology in the intended patient population. Examples include hypoglycaemic and hypotensive therapies that cause life-threatening hypoglycaemia or hypotension at low systemic exposure multiples in normoglycaemic or normotensive animals. This precludes observation of any other toxic features of the molecules that could occur in patients, who can tolerate higher doses.
Finally, a widely recognised limitation that can be difficult for toxicologists, clinicians and regulators to evaluate based on the data they receive in the study reports is the issue of the relative sexual immaturity of the young males in non-rodent studies confounding assessment of male reproductive-organ toxicity. Immature testes that are in a quiescent, prepubertal state will in most cases be less susceptible to reproductive toxicants, whilst peripubertal testes and epididymides frequently demonstrate evidence of aborted spermatogenesis and frank degeneration that are indistinguishable from potential test article-related toxicities (Creasy, 2003). Testicular changes related to peripubertal status may thus mask or falsely suggest test-article effects. Unfortunately, it is common for pathologists not to record immaturity or peripubertal status in these non-rodent studies, and thus the toxicologist or regulator may be none the wiser as to whether a study has adequately evaluated testicular safety or might be confounded by peripubertal changes within the testes. The age of the individual animals in the study is a useful guide to likely sexual maturity (minimum 5 years in the cynomolgus monkey (Smedley et al., 2002) and 10–12 months in the dog (Lanning et al., 2002; Creasy, 2003)), but individual animal ages are usually not documented in study reports. A simpler and more reliable way for the toxicologist to estimate sexual maturity in a toxicology study is to use testicular weight (as a surrogate for volume), since individual animal testes weights are normally readily available in the study report. In the cynomolgus monkey, a combined testicular weight of approximately ≥20 g suggests sexual maturity (Ku et al., 2010); values in the dog are more variable, but testes weighing >20 g are certainly likely to be mature (Olar et al., 1983; Goedken et al., 2008).
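As a rough screening aid, the cynomolgus monkey guide value quoted above can be applied to the individual organ weights listed in a study report. The sketch below is illustrative only: the animal identifiers and weights are hypothetical, the function name is an assumption, and the approximately 20 g combined weight is treated as a simple cutoff, although the text presents it only as an approximate indicator.

```python
# Approximate guide value for combined testes weight in the cynomolgus
# monkey, as quoted in the text (Ku et al., 2010). Treated here as a
# simple cutoff for screening purposes only.
CYNO_MATURE_COMBINED_TESTES_G = 20.0


def likely_mature_cynomolgus(
    left_testis_g: float,
    right_testis_g: float,
    threshold_g: float = CYNO_MATURE_COMBINED_TESTES_G,
) -> bool:
    """Return True if the combined testes weight meets the guide value."""
    return (left_testis_g + right_testis_g) >= threshold_g


# Hypothetical animal IDs and testes weights (left, right) in grams.
testes_weights_g = {
    "M101": (3.1, 2.9),
    "M102": (14.2, 13.5),
    "M103": (9.0, 9.6),
}

for animal_id, (left, right) in testes_weights_g.items():
    status = "likely mature" if likely_mature_cynomolgus(left, right) else "maturity doubtful"
    print(f"{animal_id}: combined weight {left + right:.1f} g -> {status}")
```

A study in which most males fall into the ‘maturity doubtful’ category would prompt the toxicologist to ask whether testicular safety had been adequately evaluated.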
8.2 Efficacy/Disease Models as Toxicology Models
Use of in vivo animal-disease models (which are traditionally used to evaluate the efficacy of candidate therapies) as toxicology models is becoming more widespread in the pharmaceutical industry and, under certain special circumstances, is being considered actively by regulators as a means of evaluating potential drug toxicities encountered in the clinic that would not be identified in traditional preclinical toxicology studies (FDA, 2014). The range of potential models is very diverse, and includes genetically modified rodents, surgical models such as ureter ligation to produce renal insufficiency (Chevalier et al., 2009) and meniscectomy to produce arthritis of the knee (Bendele and White, 1987), and disease states produced by known toxicants such as streptozotocin as a model for diabetes mellitus (Like and Rossini, 1976), or adjuvant-induced arthritis (Bendele et al., 1999). Such disease models have been used successfully to investigate toxicities emerging in the clinic: for example, a genetically modified mouse model of Alzheimer’s disease (APP23 transgenic mice) was used to investigate the pathogenesis of the fatal meningoencephalitis that occurred in humans with deposition of β-amyloid in the cerebral vasculature who had been immunised with amyloid-β in a clinical trial (Pfeifer et al., 2002).
The potential advantages of collecting safety data from animal-disease models include minimising the use of animals by collecting both safety and efficacy end points from the same set of study animals and obtaining toxicology information on particular therapeutic targets/candidate molecules at a very early stage in development. Mice that are genetically modified to lack the pharmacological target of the drug can also sometimes help to distinguish toxicities related to the intended pharmacology of the test article. Other possibilities include using supplementary toxicology models to overcome the problem of exaggerated pharmacology confounding the interpretation of data collected in routine toxicology studies. Examples include hypoglycaemic therapies such as synthetic insulins whose toxicology cannot be explored in normal animals due to life-threatening hypoglycaemia at low systemic exposure multiples (FDA, 2000). Even where reasonable drug exposures can be achieved in normal animals, pathologies due to exaggerated pharmacology can emerge, and special animal models are required to demonstrate a lack of such findings where the model resembles the target patient population. One example is glucokinase-agonist drugs, which were intended as hypoglycaemic agents for the treatment of diabetes mellitus; these produce vascular and neurological pathology findings in monkeys and rats, respectively, related to hypoglycaemia that would occur rarely or not at all in the intended patient population (Pettersen et al., 2014; Tirmenstein et al., 2015).
That said, because of the limitations of these disease models (described below) they are not generally used as toxicology models, even where their use might permit higher exposures than would be tolerated by normal animals. Instead, they are more often used in directed studies to provide experimental evidence where this is needed to support a hypothesis that a particular toxicity will not be relevant for the intended use of the test article in humans (Morgan et al., 2013).
Animal models of disease could also theoretically be used to determine whether the toxic effects of a test article might be accentuated in special patient populations. They might be used for drugs producing minor perturbations in organ function that have few or no adverse effects in standard toxicology studies but which might represent a significant risk for patients with pre-existing compromise of the organ system in question (most commonly immune, cardiovascular or renal). In practice, however, disease models are seldom used for this type of safety testing, and it is more common to evaluate test-article effects on immune, cardiovascular or renal function of potential relevance for special patient populations using directed specialised studies in healthy animals, since (for reasons outlined below) these are likely to be more sensitive and to give more reproducible results. The well-established in vivo safety pharmacology test systems in normal animals that are routinely used for the sensitive detection of functional changes in renal, gastrointestinal, respiratory or cardiovascular systems following single doses can be adapted, if necessary, for multiple doses, and this approach is suggested by regulators for the assessment of risk in special patient populations (ICH, 2001). Similarly, the immunotoxicity testing protocols recommended by regulators generally use healthy animals for the detection of small changes of potential relevance to immunosuppressed patient populations (ICH, 2006), rather than looking for changes in an animal model of immunosuppression.
8.3 Limitations of Efficacy/Disease Models as Toxicology Models
As already mentioned, there are many compelling theoretical reasons (scientific and nonscientific) why animal-disease models might be useful for toxicology data collection. Various examples of the successful elucidation of toxicity risk using these models are published in the literature (as reviewed from a toxicologic-pathology perspective by Morgan et al., 2013). However, published work typically does not reflect situations in which the hypothesis was not confirmed in the animal disease model or where other potential safety signals of uncertain relationship to the test article or uncertain relevance for humans appeared in these models. Thus, in practice, there are a variety of issues that need to be carefully considered by the toxicologist before embarking on the collection of safety end points from such studies. Neglecting to do so may result in generation of data that are misleading or uninterpretable but nonetheless must be submitted to regulatory authorities.
8.3.1 Lack of Validation as Safety/Toxicology Models
If they are validated at all, animal models of disease are usually validated as models for showing the efficacy of therapies intended to improve the disease state. This is very different from showing that a test article does not make the disease worse, or from looking for toxic effects in organs that are not directly affected by the disease (but which may be indirectly affected by the disease state, which complicates interpretation of any pathology). At the very least, some ad hoc validation with positive and negative controls (compounds currently used in the human disease being modelled that do or do not, respectively, have particular toxic liabilities in that particular population) is advisable. Without this, such a model will not provide a credible assurance of safety, nor should any identified adverse effects necessarily be considered translatable to a human population. Even then, validation in any universally meaningful sense would require the use of many positive and negative controls, utilising different pharmacological mechanisms of action, which, even if such controls were available, would take many years and require prohibitive numbers of animals.
A good example of the difficulties of validating a disease model for the determination of toxic effects related to enhancement of a disease is the well-established 4-hydroxybutyl(butyl)nitrosamine model of bladder cancer in rodents, which was recently recommended by the US Food and Drug Administration (FDA) for the assessment of the potential for a pharmaceutical to promote bladder cancer in humans with existing preneoplastic bladder pathology (FDA, 2013). Various agents have been identified that either promote or suppress bladder carcinogenesis in this model, including some genotoxic agents that are believed to play similar roles in human bladder carcinogenesis (Wanibuchi et al., 1996). Unfortunately, agents that have not been associated with human bladder cancer (despite extensive monitoring of exposed populations, in some cases) have been found to promote bladder cancer in this model. These include ascorbic acid (vitamin C) (Fukushima et al., 1984) and the antidiabetic drug rosiglitazone (Lubet et al., 2008). Furthermore, agents known to increase the risk of human bladder cancer, such as cyclophosphamide, have not acted as promoters in this model (Babaya et al., 1987). Hence, despite this being a well-established and well-characterised cancer model, its utility as a predictor of human safety with respect to the promotion of bladder cancer by nongenotoxic agents is doubtful. It is also useful to note that the collection of the data needed to reach this conclusion has taken many years and large numbers of animals; this kind of work would be very difficult to undertake de novo in order to validate another animal model of disease as a reliable predictor of human safety.
In practice, the difficulties inherent in fully understanding animal models of disease and how safety data obtained from them may translate to humans mean that once baseline safety data have been obtained in routine toxicology models, further safety data pertaining to specialised disease situations or special human patient populations are more reliably obtained from human trials.
8.3.2 Disease Models Rarely Have All the Elements of the Equivalent Human Disease
It is well accepted that animal models of disease seldom have all the elements of the equivalent human disease. This limits their applicability to safety screening – in many respects, much more than it does for efficacy screening, where a single predefined and validated end point can be selected, regardless of the lack of translatability of the other features of the disease in that particular model.
Type 2 diabetes is a good example in which there are a variety of models that have at least some features of the human disease, but none has all the features (including, of course, the accelerated atherosclerosis noted in human diabetes populations), and diabetes therapies that are effective in humans do not necessarily show efficacy in every model (reviewed by Calcutt et al., 2009; King, 2012). For this reason, none of these models would be likely to provide a reliable indication of all potential toxicities in the human diabetic population, and each may demonstrate irrelevant toxicities that are unique to that particular disease model in that particular species.
Of course, type 2 diabetes is a complex disease and its underlying aetiology is not well understood, so the shortcomings of the various animal models are perhaps not surprising. However, even diseases with a very well understood pathogenesis are frequently not well modelled in animals. Cystic fibrosis is a monogenic disease in which a variety of mutations in the CFTR chloride channel gene produce a very consistent pattern of pathology in humans. However, the mice, pigs and ferrets with natural or engineered mutations in the corresponding CFTR channel that can be used as in vivo models do not fully recapitulate the disease as it is manifest in humans (Keiser and Engelhardt, 2011).
8.3.3 Limited Sensitivity Produced by Increased Interanimal Variability amongst Diseased Animals and/or Low Animal Numbers
Typically, the pathology produced in animal models of disease will be quite severe, frequently progressing over durations relevant for toxicity studies, with a fair amount of interanimal variability (almost certainly more at baseline than would be expected in a similarly sized group of healthy animals in a standard toxicology study). This is particularly true for surgical models of disease. Since efficacy effects need to be readily demonstrable in most or all animals at doses/exposures comparable to those to be used in humans, lower sensitivity introduced by greater variability is acceptable in efficacy models. However, a relatively small effect occurring in one or a few animals may be significant in a toxicology study; such an effect may be lost in the background variability of an animal disease model or, alternatively, chance findings related to the model and not the test article may be interpreted as test article-related.
An additional complication is that progression of disease in the model over the course of a study needs to be considered. This will add to variability even if the animals were originally randomised based on a disease metric. Disease progression (and hence other toxicity end points) may be influenced by the test article, and (of course) recovery from toxicity end points in a clinically deteriorating animal will be very difficult to assess. It should also be noted that the disease process and/or the strain of animal (often inbred) may affect test-article pharmacokinetics and metabolism such that toxicokinetic data need to be evaluated before disease-model data can be compared with data obtained from traditional toxicology studies. The usefulness of pooled samples for TK analysis from a group of individual animals with disease of differing severity also needs to be considered.