Chapter 19
Assessing Teaching Effectiveness
Susan M. Rhind and Catriona E. Bell
Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
The accuracy of faculty evaluation decisions hinges on the integrity of the process and the reliability and validity of the evidence that is collected. (Berk, 2005)
Defining Teaching Effectiveness
Teaching effectiveness is a term that abounds in the literature, yet a concise definition is remarkably difficult to find. Many articles launch into descriptions of the pros and cons of different strategies for measuring teaching effectiveness, yet fail to define the construct being measured. The Oxford English Dictionary defines effectiveness as “The degree to which something is successful in producing a desired result”; the question in the context of teaching is: what is the “desired result”?
For the purposes of this chapter, we define teaching effectiveness in veterinary education as “teaching that succeeds in enhancing student learning.”
Furthermore, we restrict this definition to the individual teacher, rather than assessment at overall course or curriculum level, by which point it becomes very difficult to disentangle the myriad factors that contribute to overall achievement and the student experience. Curriculum-level assessment of teaching effectiveness tends instead to be covered in the context of overall outcomes assessment, for instance for accreditation purposes.
The terms “assessment” and “evaluation” of teaching effectiveness are often used interchangeably. In the context of students, Cook (2010) proposes the distinction that assessment focuses on the learner, whereas evaluation focuses on programs. Although the parallel is not exact, for the purposes of this chapter we use “assessment” for the specific methods used to gather data about teaching, and “evaluation” for the more global process whereby this assessment information is used to make an overall judgment.
Why Assess Teaching Effectiveness?
As with student assessment, assessing teaching effectiveness at the individual level can be considered as either formative or summative (Brown and Ward-Griffin, 1994; Berk, 2013). Formative assessment contributes to the ongoing improvement and development of an individual’s teaching over time. Summative assessment may be for the purposes of annual review, or may be used to inform promotion and tenure decisions (Berk, 2013). In addition to this “teacher-centered” rationale, a robust system of assessment of teaching effectiveness is a prerequisite to ongoing institutional enhancement of student learning and the student experience.
What to Assess?
Ramsden (2003) describes six principles of “effective teaching” in higher education. In Figure 19.1 we align these with the three stages of “preactive,” “interactive,” and “postactive” described by O’Neill (1988), who, through a systematic review, identified a “top 20” of factors influencing teaching effectiveness, each falling into one of these three stages. O’Neill reviews the evidence for the factors in each category, with examples such as the following:
- Preactive stage: planning, preparation, and clear learning objectives.
- Interactive stage: teacher enthusiasm and creation of a supportive learning environment, management of classroom dynamics.
- Postactive stage: feedback to students. In addition, although not included in O’Neill’s review, we would also include reflection on feedback from students, which in turn can inform the preactive and interactive stages in the future.
An additional framework specific to the context of clinical teaching is that used in the Stanford Faculty Development Program (SFDP; Litzelman et al., 1998, 1999). The SFDP “has sought to study and define the components of effective clinical teaching” (Litzelman et al., 1998, p. 688) and has been widely used, validated, and accepted within healthcare professions education. The framework is based around seven categories relating to observable teaching behaviors, and these can be used to help define clear objectives for an evaluation of teaching effectiveness within an individual school. Each of these elements also maps to the model illustrated in Figure 19.1. The seven categories are:
- Establishing a positive learning climate (interactive).
- Control of the teaching session (interactive).
- Communicating goals (preactive and interactive).
- Promoting understanding and retention (interactive and postactive).
- Evaluation of learner’s achievement of desired goals (postactive).
- Providing feedback to learners (interactive and postactive).
- Promoting self-directed learning (preactive, interactive, and postactive).
All three models discussed emphasize that the effective teacher is much more than an individual who can “perform” well in the classroom or other teaching setting. Hence, any process to assess teaching effectiveness should aim to capture evidence mapped to each of these broad domains in order to gain a holistic overview of teaching effectiveness.
How to Assess Teaching Effectiveness
As with all forms of assessment, it is important to use reliable, valid, and feasible methods (Snell et al., 2000). This becomes even more important where the assessment contributes to career decisions for faculty (Beckman et al., 2004; Berk, 2013). A poor assessment tool will at best lead to unhelpful, and at worst inaccurate, results. Berk (2013, p. 19) comments that most home-grown rating scales for assessing teaching effectiveness in higher education “do not meet even the most basic criteria for psychometric quality required by professional and legal standards… the serious concern is that decisions about the careers of faculty are being made with these instruments.”
Similarly, if we adopt general principles of outcomes-based education and assessment, the purpose of the teaching effectiveness assessment exercise, or the decision that will result from it, should be defined from the outset; this should then determine the types of evidence that are gathered, rather than the process working in the opposite direction, as shown in Figure 19.2.
Who Does the Evaluation?
Up to 15 potential sources of evidence relating to teaching effectiveness have been described (Berk, 2005, 2013). These can essentially be grouped into three categories, as illustrated in Figure 19.3:
- Evaluation by the target audience, i.e., the students (or residents).
- Evaluation from the teacher perspective, e.g., self-evaluation, evaluation by peers or other colleagues.
- Evaluation of the product of the teaching, e.g., evaluation of students’ abilities at certain stages of the curriculum by faculty, evaluation of graduates by employers.
It is recommended that multiple sources of evidence, and a minimum of three (Berk, 2013), be used to inform a comprehensive evaluation, with information triangulated from multiple summative and formative sources (Brown and Ward-Griffin, 1994; Blackmore, 2005; Siddiqui, Jones-Dwyer, and Carr, 2007; Berk, 2013). Principles of 360° multisource feedback, in which feedback is elicited from a variety of perspectives, should also be adopted (Boerboom et al., 2011b; Berk, 2013). The model illustrated in Figure 19.3 provides a global overview of the broad categories of evidence from four different domains; ideally, an overall teaching portfolio would address items from each of the four domains.
These domains can be considered in the context of the well-known model for evaluating educational outcomes by Kirkpatrick (1994). This model has been adapted for use by researchers conducting systematic reviews in order to classify various levels of evidence. We present a simple version of the model in Table 19.1 adapted to the context of assessing teaching effectiveness.
Table 19.1 An adaptation of Kirkpatrick’s hierarchy to the context of assessing teaching effectiveness
Level | Description | Examples
Level 1 | Reaction | Student evaluations of teaching effectiveness (e.g., surveys); peer evaluation
Level 2 | Learning | Learning outcome measures, e.g., summative assessments, North American Veterinary Licensing Examination (NAVLE)
Level 3 | Behavior | Transfer of learning to the workplace, e.g., employer evaluation
Level 4 | Results/impact | Impact on society (e.g., clients) and the profession
Source: Based on Kirkpatrick, 1994.
As noted by Steinert et al. (2006), the usefulness of this model lies in treating it not necessarily as a hierarchy, but as a prompt to look for different kinds of evidence that together provide a more comprehensive overview of the area in question. Similarly, Yardley and Dornan (2012, p. 105) emphasize that “the purpose to which evidence is put influences its trustworthiness and the best way of synthesising it”; although their context is the appraisal of interventions in medical education, the conclusion is equally appropriate for our consideration of evidence as it relates to teaching effectiveness.
When considering measures of teaching effectiveness, as one moves from Level 1 to Level 4 in Table 19.1 the measures tend to become less objective, and it becomes correspondingly harder to be definitive that any observed change (for example, in behavior) can be clearly linked to one factor, which, for the purposes of this chapter, is the teacher.
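To make this classification concrete, the following is a minimal, purely illustrative sketch in Python (the data model, field names, and example evidence items are our own assumptions, not drawn from Berk or Kirkpatrick). It tags portfolio evidence by source and Kirkpatrick level, and flags a portfolio that falls short of Berk’s (2013) recommended minimum of three sources:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Evidence:
    """One item of evidence in a teaching portfolio (hypothetical model)."""
    description: str
    source: str             # e.g., "students", "peers", "self", "employers"
    kirkpatrick_level: int  # 1 = reaction, 2 = learning, 3 = behavior, 4 = results

def summarize_portfolio(items):
    """Report source coverage and the spread of evidence across levels."""
    sources = {item.source for item in items}
    levels = Counter(item.kirkpatrick_level for item in items)
    print(f"Distinct sources: {len(sources)} ({', '.join(sorted(sources))})")
    if len(sources) < 3:
        print("Warning: fewer than the recommended minimum of three sources.")
    for level in sorted(levels):
        print(f"Kirkpatrick level {level}: {levels[level]} item(s)")

# Hypothetical portfolio for one faculty member
portfolio = [
    Evidence("End-of-course SETE survey", "students", 1),
    Evidence("Peer observation of a clinical teaching session", "peers", 1),
    Evidence("Cohort results on the unit's summative assessment", "students", 2),
]
summarize_portfolio(portfolio)
```

Run on the hypothetical portfolio above, the sketch would print a warning, since only two distinct sources (students and peers) are represented.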
Evaluation: Target Audience Perspective
Student Evaluations of Teaching Effectiveness
Student evaluations of teaching effectiveness (SETE) are the most obvious source of data to inform an assessment of teaching effectiveness. Despite receiving a “mixed press” in the literature, they are undoubtedly a cornerstone of the evaluation armory (Surgenor, 2011) and are the first item that should be addressed in any assessment of teaching effectiveness.
The major argument against putting too much weight on SETE relates to the potential for them to be viewed as a “popularity or personality contest,” which may have no direct link to student achievement. Various studies have demonstrated the potential impact of factors such as giving students chocolate before they complete evaluations, or faculty personality (Felton, Mitchell, and Stinson, 2004; Youmans and Jee, 2007; Surgenor, 2011), as well as links with examination satisfaction, difficulty, and results (Schiekirka and Raupach, 2015). Many publications emphasize that SETE should not be used as the sole measure in an evaluation. For instance, Sproule (2002, p. 287) notes that “the exclusive use of the student evaluation of teaching data in the determination of instructor performance is tantamount to the promotion and practice of pseudoscience.”
Despite these reservations, there is no doubt that SETE are an essential component of any evaluation of teaching effectiveness (Berk, 2005; Marsh, 2007; Surgenor, 2011). Berk (2005, p. 50) comments: “Student ratings are a necessary source of evidence of teaching effectiveness for both formative and summative decisions, but not a sufficient source for the latter. Considering all of the polemics over its value, it is still an essential component of any faculty evaluation system.”
Berk (2013) emphasizes that many “home-grown” rating scales are flawed and lack both reliability and validity. Reviewing these scales, he describes the situation as “ugly,” and recommends psychometrician input into both the design of such instruments and the analysis of the data they generate. In reality, size and resources may make this a challenge for many veterinary schools, but as a minimum faculty must be aware of the limitations of in-house evaluations that have had no input from relevant experts. A checklist for SETE is presented in Table 19.2.
Table 19.2 Checklist for student evaluations
Suggested criteria | Y/N |
If items are developed in house, have they been reviewed by a psychometrician? | |
If “no” to above, have you considered using validated items from other surveys? | |
Are the evaluations released/administered at the same time in the course (e.g., relative to assessments)? | |
Are identical instructions given to students for each evaluation? | |
Are students given the same (short) window to complete the evaluations? | |
Have faculty had input into these decisions? | |
Is it possible to have one administrator coordinating the whole process for consistency? |
Source: Adapted from Berk, 2013.
An additional important point relates to response rates, which it is recommended should be at least 60% (Richardson, 2005). Achieving this is facilitated by the robust and consistent administrative support detailed in Table 19.2.
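By way of illustration only, the sketch below shows two of the most basic screening checks one might run on in-house SETE data: internal-consistency reliability (Cronbach’s alpha, computed from its standard formula) and the 60% response-rate guideline noted above. The data are invented, and such checks are no substitute for proper psychometrician input:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items matrix of ratings.

    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance of totals)
    """
    k = len(scores[0])  # number of items

    def variance(xs):   # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented ratings: 6 respondents x 4 items, each on a 1-5 scale
ratings = [
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [3, 4, 3, 3],
]
enrolled, returned = 120, 80
rate = returned / enrolled
print(f"Response rate: {rate:.0%} "
      f"({'meets' if rate >= 0.60 else 'below'} the 60% guideline)")
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```

For the hypothetical figures above, the response rate of 67% meets the guideline; a low alpha (commonly taken as below about 0.7) would suggest that the items are not measuring a single coherent construct.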
The Place for SETE
What is clear is that while using SETE as the exclusive measure of teaching effectiveness is not advisable, it would be a rare institution indeed that did not have a robust system of student evaluations forming a significant component of its overall quality assurance processes. Indeed, accreditation and overall quality assurance mechanisms would be impossible without a significant component of student evaluation being built in. A linked point relates to what happens to the data once gathered: it has been shown, for example, that student evaluations combined with individual consultations with faculty members are more powerful in changing teaching behaviors than student evaluations alone (Wilkerson and Irby, 1998).
In addition to standard post-course surveys as a method for SETE, face-to-face student meetings and focus groups can also be valuable in gathering more qualitative data to explore themes that may emerge from survey data analysis.
Teaching Awards
It is generally considered that teaching awards provide relatively low-level evidence of effective teaching (Berk, 2005), and they can be seen as a popularity contest rather than recognition of effective teaching per se. A review by Huggett et al. (2012) encompassing not only health professions education but also professional and higher education concluded that limited evidence exists on the design and utility of teaching awards. The review also highlighted potential negative consequences for recipients (e.g., reactions from peers) as well as positive ones (e.g., personal satisfaction and prestige). The consequences are very dependent on institutional culture, and while it would be inappropriate to weight teaching awards too heavily, this type of evidence can certainly be built into an overall portfolio and, if supported by other evidence, can be an indicator of committed and excellent teaching.
Evaluation: Teacher Perspective
Peer Evaluation Methods
Peer evaluation can be used to assess and improve teaching, and has been incorporated into faculty development programs since the 1980s (Irby, 1983). A number of related terms exist in the literature, including peer observation of teaching, peer review of teaching, reflective partnerships, and peer coaching for teaching improvement. All of these describe activities in which colleagues, usually in a reciprocal fashion, observe another individual’s teaching and give feedback that identifies existing strengths and areas for improvement, based on their observations and judgments. However, within these models there can be considerable variation in the status, background, and number of observers, and in what they observe or judge. For example, some models involve one or more observers at a time, from a range of potential backgrounds such as department colleague, subject expert, or educationalist (Siddiqui, Jones-Dwyer, and Carr, 2007), undertaking a range of activities that may include reviewing course documentation and materials, direct or video-based observation of teaching in the classroom or clinic, and peer discussion and reflection on sources of evidence such as student evaluations or video recordings of teaching sessions (Irby, 1983; Brown and Ward-Griffin, 1994; Wilkerson and Irby, 1998; Yon, Burnap, and Kohut, 2002; Siddiqui, Jones-Dwyer, and Carr, 2007; Boerboom et al., 2011b; Ruesseler et al., 2014).
Utility of Peer Evaluation Methods
The appropriateness of adopting peer evaluation methods to augment student evaluations and other measures of teaching effectiveness has at times been controversial, but is supported by the argument that colleagues are better placed than students to comment on curriculum content, instructional methods, and the appropriateness of standards for excellence in a subject (Irby, 1983; Brown and Ward-Griffin, 1994). The utility of peer evaluation for summative versus formative decisions has been debated; a review of the nursing literature strongly advocated that peer evaluation be used only for formative purposes (Brown and Ward-Griffin, 1994). The literature largely concurs that peer evaluation is particularly valuable as an opportunity to reflect on and improve teaching practice, rather than for informing summative decisions such as promotion and tenure.
Benefits of and Barriers to Implementing Peer Evaluation Methods
Peer evaluation methods can benefit all faculty members involved in the activity. Reviewees may show enhanced reflective skills, improvements in teaching practices, and even positive impacts on promotion outcomes, for instance for medical faculty who emphasized teaching as a main activity (Irby, 1983), while peer reviewers themselves may be reenthused and motivated to reinvigorate their own teaching practices (Siddiqui, Jones-Dwyer, and Carr, 2007).
There are a number of potential barriers to implementing peer evaluation of teaching, which include a perceived lack of objectivity or reliability; lack of consensus on appropriate criteria for evaluating teaching; lack of time, money, or energy; scheduling conflicts; and the need for a trusting but nonfriendship-based relationship between colleagues (Irby, 1983; Brown and Ward-Griffin, 1994; Blackmore, 2005; Siddiqui, Jones-Dwyer, and Carr, 2007). It is important to select the observer carefully, and mutual expectations should be discussed and agreed in advance (Siddiqui, Jones-Dwyer, and Carr, 2007). Peer evaluations should be based around clearly defined criteria (Irby, 1983) and categorized behavioral objectives (Brown and Ward-Griffin, 1994), and undertaken by trained peer evaluators where possible (Yon, Burnap, and Kohut, 2002; Blackmore, 2005; Warman, 2015). Their reliability can be improved if evaluations are based on direct classroom observations of teaching rather than simply reviewing curriculum documentation and course materials (Irby, 1983). In addition, observers should never intervene during a teaching observation, and confidentiality should be maintained at all times by those involved (Siddiqui, Jones-Dwyer, and Carr, 2007).
Planning a Peer Evaluation
Brown and Ward-Griffin (1994, p. 304) summarize five key criteria that contribute to a successful peer evaluation process:
- Overall approach and criteria developed by local faculty with administrative support – must be relevant to local context.
- Peer evaluation only one component of overall faculty evaluation process, with main purpose to be formative – improvement of teaching and learning.
- Ensure system is equitable and fair – need to be cognizant of importance of mutual trust and support.
- Need trained observers undertaking multiple observations and including constructive feedback in their evaluations.
- Should only use summatively (e.g., promotion and tenure) if information is “carefully gathered, promptly reported and judiciously interpreted.”
Referring back to O’Neill’s (1988) levels in Figure 19.1, it would also seem sensible for any peer evaluation of teaching effectiveness to adopt similar principles, comprising three key elements aligned with the preactive, interactive, and postactive dimensions:
- An initial meeting to clarify the purpose and scope.
- An in-class observation.
- A feedback/debriefing session.
This approach is also endorsed in models such as the Integrated Assessment of Teaching (IAT) described by Osborne (1998).
Several studies of peer evaluation methods and tools have been published in the medical and healthcare literature, and these are summarized succinctly by Berk (2013). In contrast, limited evidence exists in the veterinary medical education literature, although one study did describe and evaluate the use of peer reflection meetings in combination with a validated evaluation instrument, the Maastricht Clinical Teaching Questionnaire (MCTQ), which combines self-reflection and student ratings and addresses the domains of climate, modeling, coaching, articulation, and exploration (Boerboom et al., 2011b, 2012). This study demonstrated that student ratings of the MCTQ domains were valid indicators of teaching performance for veterinary clinical faculty (Boerboom et al., 2012), and that the addition of peer reflection meetings led to deeper reflection among faculty and aided the translation of student feedback into “concrete alternatives for teaching” (Boerboom et al., 2011b, p. e620). This concurs with other healthcare studies showing that peer evaluation can promote changes in reflective practice and the development of teaching methods (Siddiqui, Jones-Dwyer, and Carr, 2007).
The Place for Peer Evaluation
We consider that peer evaluation is an essential component of any program to assess teaching effectiveness. An additional benefit of peer evaluation is that it can also result in positive role modeling by faculty for students in terms of seeking, receiving, and acting on feedback (Brown and Ward-Griffin, 1994). However, ultimately in order to be successful a peer evaluation program needs strong support from the dean and heads of department (Irby, 1983), in addition to “faculty involvement, short but objective methods, trained observers, constructive feedback for faculty development, as well as open communication and trust” (Brown and Ward-Griffin, 1994, p. 299).
Self-Evaluation Methods
Self-evaluation is a less commonly used measure of teaching effectiveness, and while concerns have been expressed that such methods may not be completely reliable (Hartman and Nelson, 1992; Boerboom et al., 2009), some have been shown, in relatively robust study designs, to be effective and valid methodologies (Skeff, Stratos, and Bergen, 1992; Hewson, Copeland, and Fishleder, 2001; Cole et al., 2004).
Traditional post-only self-evaluation (Srinivasan et al., 2007) and pre/post self-evaluation (Steinert, Naismith, and Mann, 2012) of teaching competencies, conducted before and after completion of a faculty development program, may be used. However, retrospective pre/post self-evaluation methods have been shown to be more valid measures of self-reported changes in teaching competencies (Skeff, Stratos, and Bergen, 1992; Hewson, Copeland, and Fishleder, 2001). Descriptors from a validated rating tool for self-assessment of competencies may range from “1 (I do not do this) to 5 (I’m highly competent at doing this),” while the corresponding tool for retrospective assessment of improvements in teaching may include descriptors such as “1 (no change [includes I would not want to do this or I was already highly effective at doing this]) to 5 (great deal of change or now I regularly do this)” (Hewson, Copeland, and Fishleder, 2001, p. 155).
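As a brief, hypothetical sketch of how such paired ratings might be summarized (the item names and scores below are invented for illustration; only the scale anchors follow Hewson, Copeland, and Fishleder, 2001):

```python
# Hypothetical self-evaluation data for one faculty member:
# competence rated 1 ("I do not do this") to 5 ("I'm highly competent at doing this");
# retrospective change rated 1 ("no change") to 5 ("great deal of change").
# The item names below are invented for illustration.
items = {
    "Communicating session goals":   {"competence": 4, "change": 2},
    "Giving constructive feedback":  {"competence": 3, "change": 4},
    "Promoting self-directed study": {"competence": 2, "change": 5},
    "Encouraging learner questions": {"competence": 2, "change": 1},
}

mean_change = sum(v["change"] for v in items.values()) / len(items)
print(f"Mean self-reported change: {mean_change:.1f} / 5")

# Flag items where low current competence coincides with little reported
# change, as candidates for targeted faculty development.
for name, v in items.items():
    if v["competence"] <= 2 and v["change"] <= 2:
        print(f"Consider development support for: {name}")
```

Pairing the two scales in this way allows items with both low competence and little reported change to be flagged as candidates for targeted development, which is one way the formative purpose of self-evaluation can be made actionable.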
Review of video-based and audio-based recordings of teaching sessions can also be incorporated into self-evaluation methodologies, with faculty initially reviewing recordings of their own teaching, before moving on to incorporate peer evaluation and discussion of the same materials with their colleagues (Elliot, Skeff, and Stratos, 1999). These and other self-assessment methodologies that include reflective elements can be a useful component of an assessment of teaching effectiveness, and can particularly be used for formative purposes, such as promoting the enhancement of teaching competencies (Cole et al., 2004; Boerboom et al., 2011b).
Objective Structured Teaching Exercises
A tool that has been described for use by peers or (more commonly) faculty development staff, and that is based on the well-known objective structured clinical examination (OSCE) for the objective assessment of clinical skills, is the objective structured teaching exercise (OSTE; Morrison et al., 2003; Stone et al., 2003; Julian et al., 2012).
OSTE checklists typically look for evidence aligned to the preactive, interactive, and postactive phases illustrated in Figure 19.1. Boillat et al. (2012) have published 12 tips on using the OSTE, which include clarifying the goal and target audience, identifying which teaching skills to focus on, and training the “standardized learners” who will be taught during the OSTE. Integrating the OSTE into the local context is also emphasized, and almost all of the tips would be appropriate for any evaluation method.
However, while the OSTE may be worth considering, given the many opportunities that exist for fully contextualized, “real-life” observations of teaching to take place in veterinary schools, it is arguable whether it really adds significantly to the armory for assessing teaching effectiveness.
Evaluation: The Product of Teaching Perspective
Learning Outcome Measures
While in an ideal world a tool linking teacher effectiveness to the highest-level changes in Kirkpatrick’s (1994) model would be desirable, this is clearly fraught with difficulty. Level 2 is perhaps the highest level at which we can expect to gather evidence that links to individual teacher effectiveness. Even at this level, as Jones (1989, p. 552) reflects, “it is doubtful whether any tertiary institution would have the resources and expertise – or the will – to produce a fair system of summative teacher evaluation based upon student learning outcomes.” Views on this aspect are very polarized; for example, Emery, Kramer, and Tian (2003, p. 45) state: “An evaluation of teaching effectiveness must be based on outcomes. Anything else is rubbish.” We would argue for a middle ground: utilizing such measures, where appropriate and feasible, to contribute to an overall portfolio of evidence. Berk (2005) urges caution in their use, and we would concur, since cause and effect become increasingly difficult to prove the further from the teaching encounter the evaluation occurs.