
Chapter 15
Written Assessment


Jared Danielson1 and Kent Hecker2


1College of Veterinary Medicine, Iowa State University, USA


2Faculty of Veterinary Medicine, Cumming School of Medicine, University of Calgary, Canada


Introduction


Traditionally, written assessments are thought of as pencil-and-paper tests. However, many such tests today are completed using a computer, with little or no “writing” (and lots of “clicking”). Therefore, it is helpful to think of written assessments as those that can be completed without access to anything other than the assessment itself, including the medium used to administer it (paper, computer, etc.). Written assessments are generally used to measure knowledge/skills in the cognitive domain. The purposes of this chapter are to help you write effective assessment items in several common formats; design effective written tests; and evaluate the quality of your tests and items.


Creating Written Assessment Items


Individual test questions are referred to as items, which come in two varieties: selected response and constructed response. As implied by the names, selected-response items require examinees to select among two or more potential answers, and constructed-response items require them to construct an answer. Common selected-response formats include true–false, multiple choice, and matching, with more than one variety of each. Common constructed-response formats include short answer and essay.


Selected-Response versus Constructed-Response Formats


When planning any written item, whether it requires a selected or a constructed response, the test designer must consider what stimulus (item stem) the students will respond to and how they should respond (the correct answer). With selected-response items, the stimulus is the item stem, and the response is the choice of an option. With constructed-response formats, the stimulus is an item stem (question) and the response is something else, like an essay or short answer (see Box 15.2).


The advantage of selected-response formats is that they are easy to grade, because the judgment regarding what constitutes a correct answer is made before students take the test. Selected-response items also tend to have good psychometric properties, because there is no variation in answer format, and thus no variability in judging answers across respondents. The advantage of constructed-response formats is that they can present students with more realistic tasks than a selection of options. For instance, in the “real world” a veterinarian is more likely to produce a list of differential diagnoses than to choose from a list that someone else has created. All things considered, the advantages of selected-response items in terms of practicality and psychometric quality make it worth the effort of carefully designing high-quality selected-response items that provide the most authentic task possible. Constructed-response items are best reserved for testing that cannot be done with selected-response items.


Selected-Response Items


There are many varieties of selected-response items. The following discussion provides basic rules of selected-response item design, and shows how to create two common types of items: one-correct-answer multiple choice and extended matching.


Multiple-Choice (One-Correct-Answer) Questions


Figure 15.1 illustrates what a basic multiple-choice item looks like, and labels its components: stem, options, answer, and distractors. Because veterinary educators have taken many multiple-choice tests in their academic careers, and because on the surface multiple-choice questions all look basically the same, it is easy to assume that such items are easy to write. However, multiple-choice questions, especially those that measure outcomes like principles or problem-solving, are actually quite difficult to write well. This section explains basic rules of writing effective multiple-choice items, and highlights some of the most common item flaws.


Figure 15.1 The anatomy of a multiple-choice item.


Good Multiple-Choice Questions Measure Stated Learning Objectives from the Course


It is easy to inadvertently measure superfluous skills, such as knowledge of test-taking strategies, that do not address your learning objectives. The following strategies will help you reduce this threat to the validity of your questions:



  • Avoid using obviously wrong “freebie” distractors, just for fun or because you ran out of plausible options. (If you want to use a “fun” distractor, make it an additional option, ensuring that you still have two to four distractors in which you have some confidence.)
  • Avoid writing items that purposely disadvantage “test-wise” students, those who know a lot about how multiple-choice tests are constructed. Strategies used by test-wise students are discussed further in the next section.
  • Avoid including questions regarding material that is trivial or obscure, which is sometimes done to produce the illusion of difficulty and/or a normal distribution of scores. Your goal is to provide a valid measure of student ability, not to produce a particular distribution of scores (see Part Four, Chapter 14: Concepts in Assessment).
  • Think carefully about the appropriate difficulty of your distractors. Most of the time distractors are not “completely” wrong, and frequently the answer is not correct to the exclusion of all other potentially correct answers; rather, answers fall on a continuum from very wrong to very right. As seen in Figure 15.2, your task is to come up with an answer that is clearly correct to prepared students, and distractors that are clearly wrong to prepared students, but attractively plausible to the unprepared. Figure 15.3 illustrates this rule with a practical example. The correct answer is the technique that most commonly produces the error. The “kind of right” distractors are unnecessarily tricky because they are also right, just less specific than the correct answer. The “wrong, but attractively plausible to the unprepared” distractors are clearly wrong, but describe common errors that create problems when making blood smears, and therefore are plausible to those who did not prepare. The “very wrong” options are obviously “throw-away” distractors: students may find them entertaining, but even the least-prepared student would not choose them. It can be very difficult to come up with three or four compelling distractors for any given item. When this happens, take comfort in the knowledge that three options (the answer and two good distractors) are typically sufficient for appropriate statistical properties, and that few items, even professionally written and edited ones, have more than two “functional” distractors, meaning distractors chosen by more than a handful, maybe 5%, of respondents (Downing, 2006). A sketch of how such a distractor-functionality check might be run on class response data follows the figures below.

Figure 15.2 Continuum of “rightness” to “wrongness” of multiple-choice options.


Figure 15.3 Continuum of “rightness” to “wrongness” with an illustrative example.
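The “functional distractor” rule of thumb is easy to check empirically once a class has taken an item. The sketch below is ours, not the chapter’s; the function name, response data, and exact 5% cut-off are illustrative assumptions.

```python
# A minimal distractor-functionality check: tally how often each option was
# chosen and flag distractors falling below a ~5% threshold (Downing, 2006).
from collections import Counter

def distractor_report(responses, answer, threshold=0.05):
    """responses: chosen option letters, e.g. ['a', 'c', ...]; answer: keyed letter."""
    counts = Counter(responses)
    n = len(responses)
    report = {}
    for option, count in sorted(counts.items()):
        share = count / n
        if option == answer:
            label = "keyed answer"
        elif share >= threshold:
            label = "functional distractor"
        else:
            label = "non-functional distractor"
        report[option] = (share, label)
    return report

# Hypothetical class of 118 students answering the item in Figure 15.3
responses = ["a"] * 78 + ["b"] * 22 + ["c"] * 14 + ["d"] * 4
for option, (share, label) in distractor_report(responses, answer="a").items():
    print(f"{option}: {share:.0%} ({label})")
```

In this invented data set, options b and c would count as functional distractors, while option d, chosen by only a handful of students, would be a candidate for revision or replacement.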


Effective Multiple-Choice Items Measure a Single Objective with Precision by Providing a Clear and Appropriate Stem, and Options That Address the Question Posed in the Stem


Generally such items include most of the relevant detail in the stem, which tends to be longer than the options. The options are short, clear, and address the specific question posed by the stem. The item should be answerable by covering the options and reading the stem. For this reason, questions that begin with phrases such as “Which of the following is true about …?” are generally not recommended.


Effective Multiple-Choice Items Distribute the Location of the Correct Answer Evenly among the Various Answer Positions


Test-wise students are accustomed to the correct answer occupying the extreme positions less frequently than the center positions. A quick way to audit a finished answer key for positional balance is sketched below.
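One simple way to enforce this is to tally the answer key after assembling the test. The snippet below is a minimal illustration; the key shown is invented.

```python
# Tally where the keyed answers fall to check for positional bias.
# On a four-option test, each position should be keyed roughly n/4 times.
from collections import Counter

answer_key = ["b", "c", "a", "d", "c", "b", "a", "c", "d", "b"]  # hypothetical key
counts = Counter(answer_key)
expected = len(answer_key) / 4
for position in "abcd":
    print(f"Option {position}: keyed {counts[position]} times (expected ~{expected:.1f})")
```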


Specific Rules for Constructing Multiple-Choice Items


This section illustrates some common item flaws in medical sciences education. The list is not comprehensive; selected-response item enthusiasts are encouraged to explore the resources in Box 15.5. Flaws 1–4 allow students to rule out distractors based on how the question was constructed rather than what they know about the content. Flaws 5–8 are unnecessarily tricky, or introduce other difficulties that frustrate students and compromise the validity of the measure. In the following examples the correct answer, where applicable, is marked “(correct)”, and flawed options are marked with an asterisk.


1. Avoid Providing Grammatical Cues Regarding Correct/Incorrect Options


Grammatical cues are quite common in selected-response item writing. In Example 1, options a, b, and d are all adjectives and align with the stem grammatically; option c, a noun, does not. In cases like this, the item writer probably initially wrote the item with all nouns or all adjectives in the options, but later made some changes (such as revising how the stem was constructed) and did not revise all of the options correctly. Your students will know that you would not be careless enough to misalign the correct answer grammatically, so they will automatically know that the misaligned answer is not right. Similarly, in Example 2, the answer cannot be “plum,” or any other fruit beginning with a consonant, because the stem ends in “an.” These kinds of errors can usually be avoided with careful proofreading.


Examples:



  1. The most important factor to consider when creating a test item is whether or not it is

    a. valid (correct)
    b. reliable
    c. practicality*
    d. readable.

  2. The fruit that, when eaten once a day will keep the doctor away, is an

    a. apple (correct)
    b. apricot
    c. orange
    d. plum.*

2. Avoid Questions That Provide a Comprehensive Subset of Options


Case and Swanson (2002) subdivide this principle into two rules: ensure that no subset of options is collectively exhaustive, and ensure that all distractors are homogeneous (e.g., fall into the same category). In Example 3, the first three options comprise a relatively exhaustive subset (comparisons between alprazolam and amitriptyline). Option d is conceptually quite different, exploring one potential side effect of alprazolam. Test-wise students will guess that the option that is unlike the comprehensive subset is wrong. Such items give the impression (usually correctly) that the item author created some options from a theme regarding the target objective (in this case, comparison between the efficacy and side effects of alprazolam and amitriptyline), ran out of ideas, and made up another option just to include a prescribed number of distractors. This item might also be considered unnecessarily tricky, depending on how an expert would judge efficacy and the importance/likelihood of various side effects.


Example:



  3. You are presented with Max, a 2-year-old male, neutered Labrador Retriever who, over the past six months, has been demonstrating increased anxiety when left alone by his owners. Max appears healthy on physical exam and has no history of injury or illness. Max’s owners are hoping for a medication that will alleviate his symptoms while he participates in behavior modification therapy. For a patient like Max, alprazolam

    a. is more efficacious with fewer potential serious side effects than amitriptyline.
    b. is just as efficacious, with a comparable number of potential serious side effects to amitriptyline.
    c. is less efficacious, with more potential serious side effects than amitriptyline.
    d. can be administered confidently because he does not have a history of heart disease.*

3. Beware of Convergence


Convergence occurs when the elements that make up the correct answer are over-represented across the distractors in items whose options contain multiple elements. When reviewing items for faculty or students, I (JD) exploit this flaw more than any other to correctly answer veterinary-related questions about which I know nothing. The flaw arises when you write the correct answer first and then modify it to create the other options, recycling elements from the correct option. In Example 4, all of the correct elements (the body, the counting area, and the feathered edge) are found in at least two of the distractors, and come together nicely in option d; a sketch of this test-wise counting strategy follows the example.



Example:


  4. A diagnostic blood smear contains which essential elements?

    a. the body, the counting area, and the border
    b. the body, the base, and the feathered edge
    c. the droplet, the counting area, and the feathered edge
    d. the body, the counting area, and the feathered edge (correct)
    e. the droplet, the base, and the counting area.
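To make the convergence cue concrete, the following sketch (ours, purely illustrative) automates the test-wise strategy just described: count how often each element appears across all options, score each option by the combined frequency of its elements, and guess the highest-scoring option. Applied to Example 4, it finds the keyed answer without any knowledge of blood smears.

```python
# Exploiting convergence: the keyed answer's elements tend to be the ones
# most over-represented across the full set of options.
from collections import Counter

options = {
    "a": ["body", "counting area", "border"],
    "b": ["body", "base", "feathered edge"],
    "c": ["droplet", "counting area", "feathered edge"],
    "d": ["body", "counting area", "feathered edge"],
    "e": ["droplet", "base", "counting area"],
}

element_counts = Counter(el for elements in options.values() for el in elements)
scores = {letter: sum(element_counts[el] for el in elements)
          for letter, elements in options.items()}
print(scores)                         # option d scores highest (10)
print(max(scores, key=scores.get))    # 'd', the keyed answer
```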

4. Make Options Similar in Length and Specificity


Your students can often tell when you have taken more care to craft the correct answer than the distractors, because the correct answer is longer, more detailed, or more nuanced than the distractors. Therefore, make all options similar in length and similarly detailed/nuanced. In Example 5, the correct answer demonstrates a level of care/specificity that is not found in the other options.


Example:



  5. Dr. Gutierrez has just diagnosed Buttercup, a 10-year-old German Shepherd, with hemangiosarcoma. Dr. Gutierrez knows that Buttercup’s owner, Ms. Greene, has had some difficulty understanding explanations and following discharge instructions in the past. Given that information, to what element of the SPIKES protocol should Dr. Gutierrez pay particular attention in sharing this information with Ms. Greene?

    a. Step 1: Setting up the interview.
    b. Step 2: Assessing the patient’s perception.
    c. Step 4: Giving knowledge and information (particularly taking care to give information in small chunks, pausing frequently to assess understanding). (correct)
    d. Step 5: Addressing emotions with empathetic responses.

5. List Options in a Logical Order/Consistently


In the case of numerical options, you risk introducing unnecessary and irrelevant complexity when answer options within the same question vary in format or range, and when the options are not presented in sequential order. In Example 6, the answers are confusing because they are not in numerical order. In Example 7, not only are the options in a confusing order, but one is given in the form of a range, and they vary in how they employ decimals. (The arithmetic behind the keyed answers is worked in the short sketch after Example 7.) Keep in mind that if you use electronic testing with this sort of question, you will want to disable features that automatically scramble option order.


Examples:



  6. How many mg of enrofloxacin are needed for a 20 kg dog if the dose is 2.5 mg/kg?

    a. 60 mg
    b. 30 mg
    c. 40 mg
    d. 50 mg (correct).

  7. How much does a 58 lb dog weigh in kg?

    a. 26.3 (correct)
    b. 27
    c. 20–30
    d. 0.26
    e. 2.63.
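For reference, the arithmetic behind the keyed answers in Examples 6 and 7, using the standard conversion of roughly 2.2046 lb per kg:

```python
# Worked arithmetic for Examples 6 and 7.
dose_mg = 2.5 * 20        # 2.5 mg/kg x 20 kg = 50 mg (Example 6, option d)
weight_kg = 58 / 2.2046   # 58 lb / 2.2046 lb-per-kg = ~26.3 kg (Example 7, option a)
print(f"{dose_mg:.0f} mg, {weight_kg:.1f} kg")
```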

6. Avoid Using None of the Above


Questions that employ none of the above as an option are problematic, because students can usually imagine some better correct answer than the best one on the list, and are left to decide whether you wish them to choose the best option listed, or the better option you might be thinking of. If you use none of the above, be sure that all the alternatives are clearly, unambiguously, and completely incorrect. In Example 8, option a is a recommended approach, but there are other approaches, not listed, that many would consider as good or preferable, so it is not clear whether a or e is the best answer. (For the curious, a sketch of the keyed approach, the intraclass correlation, follows the example.)



Example:


  8. You are conducting a study to determine which of two approaches to teaching surgery produces superior learning gains. As part of the study, two blinded raters score each student’s performance. Which statistical test would you use to compare the raters’ scores in order to estimate reliability?

    a. intraclass correlation (correct)
    b. linear regression
    c. analysis of variance
    d. t-test
    e. none of the above.*
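For readers curious about the keyed approach in Example 8, the sketch below (ours, illustrative only; the ratings are invented) computes a single-measure, two-way random-effects intraclass correlation, ICC(2,1), one common formulation for two blinded raters scoring the same students. In practice you would use a statistics package; the manual computation is shown only to make the quantity concrete.

```python
# ICC(2,1): single-measure, two-way random-effects intraclass correlation.
import numpy as np

def icc_2_1(scores):
    """scores: (n_targets, k_raters) array of ratings."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)    # one mean per student (target)
    col_means = x.mean(axis=0)    # one mean per rater
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-target mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-rater mean square
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # error mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical surgery-performance scores from two blinded raters
ratings = [[8, 7], [6, 6], [9, 8], [5, 6], [7, 7], [4, 5]]
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")
```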

7. Avoid Using All of the Above


Using all of the above is problematic for two reasons. First, students only have to rule out one of the other options to know that all of the above is incorrect. Second, if all of the above is the intended correct answer, all of the other options should be clearly right, and equally right. If some of the other options are more right than others, students are forced to determine whether you intend them to choose the one best answer, or all plausible answers. In Example 9, all three strategies could address the problem, although options a and c seem most promising; prepared students will spend an inordinate amount of time trying to guess what the item writer was thinking.



Example:


  9. You are practicing creating blood smears that will make the faculty and staff in the pathology lab happy, but you keep producing smears that cover the entire slide and do not show a feathered edge. What would be the best strategy for solving this problem?

    a. Use a smaller drop of blood for making your smear.
    b. Make sure you are not moving your spreader slide too slowly.
    c. Make sure you are not holding your slide lower than a 30° angle.
    d. All of the above.*