


Assessment



Assessment is generally used:

  1. to provide information, to both teacher and student, about a student's subject understanding so as to guide future study.
  2. to certificate students for possible entry into further courses or employment selection.
  3. to elicit weaknesses in instruction.

Research into assessment is dominated by multiple choice objective testing; other common forms of testing include communication grids, branched true/false tests and ``mind maps''.

Definitions of some words that are commonly used in the field of assessment (and in the following text) are given at the end of this section.


Multiple Choice Scoring Schemes

Friel and Johnstone 1978a
Friel, S. and Johnstone, A. H., Scoring Systems Which Allow For Partial Knowledge. Journal of Chemical Education, 55(11), 717-719, 1978.

Provides a short introduction to differential weighting and confidence testing. Then presents a comparison of four scoring systems: one consistent with (Willey 1960), two variants of it, and the standard form. The authors conclude that a scoring system which gives credit for partial knowledge may not alter students' rank ordering and may be less discriminating, but a student's overall mark may more accurately represent their subject understanding.

Bao and Redish 2001
Bao, L. and Redish, E. F., Concentration Analysis: A Quantitative Assessment of Student States. American Journal of Physics, 69(7), S45-S55, 2001.

In this paper the authors present a measure to discern how students' multiple choice test responses are distributed. They suggest that concentration analysis can be used in the design and development of a research based multiple choice test.
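As an illustration of such a measure, the sketch below computes a concentration factor in the form usually quoted from this paper, C = (sqrt(m) / (sqrt(m) - 1)) * (sqrt(sum of n_i^2) / N - 1 / sqrt(m)), where m is the number of options, n_i the number of students choosing option i and N the number of students. The formula is recalled from the paper rather than taken from the annotation above, and the function name is hypothetical, so it should be checked against the original before use.

```python
import math


def concentration_factor(counts):
    """Concentration factor for one multiple choice item.

    counts[i] is the number of students who picked option i.
    The result is 0 when responses are spread evenly over the options
    and 1 when every student picks the same option.
    Assumption: formula as commonly quoted from Bao and Redish 2001.
    """
    m = len(counts)        # number of options
    n_total = sum(counts)  # number of students
    if m < 2 or n_total == 0:
        raise ValueError("need at least two options and one response")
    root_m = math.sqrt(m)
    length = math.sqrt(sum(n * n for n in counts)) / n_total
    return (root_m / (root_m - 1)) * (length - 1 / root_m)
```

An even spread such as `[25, 25, 25, 25]` gives a value of about 0, while `[100, 0, 0, 0, 0]` gives a value of about 1.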

Johnstone 1987
Johnstone, A. H., Can the Slipper Fit?-Grade-Related Criteria for School Science. School Science Review, 68(245), 737-744, 1987.

Presents a discussion of norm referenced and criterion referenced modes of assessment.

Dressel and Schmid 1953
Dressel, P. L. and Schmid, J., Some Modification of Multiple Choice Items. Educational and Psychological Measurement, 13, 574-595, 1953.

In this study the authors trialled four variants of the standard multiple choice test. These were:

Free-Choice Test
Here the students were required to mark as many choices as they thought necessary to ensure that they had not omitted the correct answer.
Degree of Certainty Test
Here the student had to indicate, on a scale of 1-4, how certain he was that his one choice was correct.
Multiple-Answer Test
Here any item might have more than one correct answer. The student had to mark the choices he thought were correct. He would get credit for correct answers and be penalised for any incorrectly marked answers.
Two-Answer Test
Here two of the five responses were correct. A student's total score consisted of all their correct responses.

The authors report some evidence that students were required to examine the test item more critically when one of the modified versions was used.


Willey 1960
Willey, C. F., The Three-Decision Multiple-Choice Test: A Method of Increasing the Sensitivity of the Multiple-Choice Item. Psychological Reports, 7, 475-477, 1960.

In this paper the author presents a special five-option multiple choice question where three options must be selected: the option which is thought to be definitely correct and the two options which are thought to be definitely incorrect. The marking system was given in a table that is not reproduced here.

The author suggests that this method discriminates between the conscientious and the superficial, or impulsive, examinee.

Arnold and Arnold 1970
Arnold, J. C. and Arnold, P. L., On Scoring Multiple Choice Exams Allowing for Partial Knowledge. Journal of Experimental Education, 39(1), 8-13, 1970.

Using elementary game theory the authors present a multiple choice examination scoring procedure which gives credit for partial knowledge and controls the expected gain/penalty due to guessing. The authors present a comparison of their scoring system with four alternative procedures. In general, the choice of scoring system had little effect on the higher and lower scoring students. However, the relative positions of the middle grades differed considerably from one test to the next. This was attributed to the greater influence of guessing factors on the ``middle'' scores.

Aitkin 1967
Aitkin, L. R., Effect on Test Score Variance of Differential Weighting of Item Responses. Psychological Reports, 21(10), 585-590, 1967.

Presents, mathematically, the effects of differential response weightings on total test variance of multiple choice objective tests.

Rippey 1970
Rippey, R. M., Rationale for Confidence-Scored Multiple-Choice Tests. Psychological Reports, 27(5), 91-98, 1970.

Presents a scoring scheme which requires an examinee to indicate their confidence when answering multiple choice questions. Also advocates the inclusion of intrinsic items in multiple choice tests, because these items suggest that ``not all questions worth asking have single, impeccably defined answers''. (Intrinsic items require a distribution of belief over the options on a multiple choice test and do not have a unique answer.)

Hasan et al. 1999
Hasan, S., Bagayoko, D. and Kelley, E. L., Misconceptions and the Certainty of Response Index (CRI). Physics Education, 34(5), 294-299, 1999.

The Certainty of Response Index (CRI) provides a measure of the degree of certainty with which a student answers a multiple choice question. Here the student indicates, on a scale of 0-5, how certain he is that his answer is correct, that is, that it was selected using well established knowledge, concepts or laws. The authors recommend, and have used, this method to differentiate between students' misconceptions and lack of knowledge.
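A minimal sketch of how a CRI might be combined with correctness to separate a misconception from a lack of knowledge is given below. The midpoint threshold of 2.5, the four category labels and the function name are assumptions made for illustration; the annotation above does not state the cut-off the authors used.

```python
def classify_response(is_correct, cri, threshold=2.5):
    """Interpret one answer from its correctness and its CRI (0-5).

    Assumption: a CRI above `threshold` counts as "confident".
    A confident wrong answer is read as a misconception, an
    unconfident wrong answer as a lack of knowledge.
    """
    if not 0 <= cri <= 5:
        raise ValueError("CRI must lie between 0 and 5")
    confident = cri > threshold
    if is_correct:
        return "knowledge" if confident else "lucky guess"
    return "misconception" if confident else "lack of knowledge"
```

For example, a wrong answer given with a CRI of 4 is classed as a misconception, whereas the same wrong answer with a CRI of 1 is classed as a lack of knowledge.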

Friel and Johnstone 1988
Friel, S. and Johnstone, A. H., Making Test Scores Yield More Information. Education in Chemistry, 28(3), 46-49, 1988.

In this paper the authors recommend the use of caution indices (Sato 1975), in particular the modified caution index (Harnish and Linn 1981), rather than simply using facility and discrimination values to assess student and question performance. That is, by employing caution indices it is possible to identify anomalous response patterns both for a particular question and for a particular student.

Handy and Johnstone 1973a
Handy, J. and Johnstone, A. H., Reproducibility in Objective Testing. Education in Chemistry, 10(2), 47-48, 1973.

Provides evidence that using common questions, rather than pre-tests, provides a more accurate indicator of performance. That is, common questions can be used as internal standards to compare the respective difficulty levels of two exams, or the respective abilities of two groups of students. Also presents a simple generalised formula to penalise a student's score for guessing.

$S = C - \frac{W}{n-1}$ (5.1)

Where S is the corrected score, C is the number of correct responses chosen, W is the number of incorrect responses chosen, and n is the number of possible responses in each question. If all questions have been answered by all students there is no difference between the ranking of corrected and uncorrected scores.
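Equation 5.1 can be sketched in a few lines of code. The function name and the example marks are hypothetical; only the formula itself comes from the text above.

```python
def corrected_score(correct, wrong, n_options):
    """Guessing-corrected score S = C - W / (n - 1) (equation 5.1).

    correct   -- C, number of correct responses chosen
    wrong     -- W, number of incorrect responses chosen
    n_options -- n, number of possible responses in each question
    Omitted questions count towards neither C nor W.
    """
    if n_options < 2:
        raise ValueError("each question needs at least two options")
    return correct - wrong / (n_options - 1)


# Hypothetical 40-question test with 4 options per question.
all_answered = corrected_score(28, 12, 4)  # 28 right, 12 wrong
some_omitted = corrected_score(28, 6, 4)   # 28 right, 6 wrong, 6 omitted
```

When every question is answered, W equals the number of questions minus C, so S rises in step with C and the ranking cannot change; the correction only affects rank when students omit different numbers of questions.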





Factors Affecting Test Outcomes

Handy and Johnstone 1973b
Handy, J. and Johnstone, A. H., How Students Reason in Objective Tests. Education in Chemistry, 10(3), 99-100, 1973.

Here the authors conclude that answers to multiple choice questions are mostly selected validly, with minimal blind guessing and that failure to answer comprehension questions chiefly arises through deficiencies in knowledge.

Friel and Johnstone 1978b
Friel, S. and Johnstone, A. H., A Review of the Theory of Objective Testing. School Science Review, 59(209), 733-738, 1978.

In this general review of multiple choice testing the following points are discussed:

  1. The effects of guessing.
  2. The effects of changing the initial response.
  3. The effect of item (question) order alteration.
  4. The optimum number of choices.
  5. The position response set: the number and order of a set of choices.
  6. The assessment of partial knowledge.
    • Differential weighting
    • Confidence testing

Friel and Johnstone 1979
Friel, S. and Johnstone, A. H., Does Position Matter? Education in Chemistry, 56(6), 175, 1979.

In this investigation the authors conclude that the position of the most plausible distractor, relative to the correct answer, significantly alters the difficulty of a multiple choice question. In particular the difficulty is decreased when the distractor is placed immediately before the correct answer.

Johnstone et al. 1983
Johnstone, A. H., MacGuire, P. R. P., Friel, S. and Morrison, E. W., Criterion-Reference Testing in Science - Thoughts, Worries and Suggestions. School Science Review, (6), 628-633, 1983.

In this paper the authors discuss problems associated with multiple choice tests as instruments for criterion-reference testing. Several other criterion-referenced tests are then presented, with their advantages and disadvantages, as alternatives to the standard multiple choice exam.

Cassels and Johnstone 1984
Cassels, J. R. T. and Johnstone, A. H., The Effect of Language on Student Performance on Multiple Choice Tests in Chemistry. Journal of Chemical Education, 61(7), 613-615, 1984.

In this study, matched questions (the same questions, used in alternative tests) were used to assess the influence of language on multiple choice outcomes. The following results were reported:

Key words:
Substitution of simpler words brought about improved performance (e.g. ``choking'' for ``pungent'').
Terms of quantity:
Phrases such as ``most abundant'' appear easier to understand than phrases such as ``least abundant''.
Negative forms:
In general, the removal of negative questioning (e.g. ``Which statement is true'', rather than, ``Which statement is not true'') appears to improve performance.
Large numbers of words and arrangement of clauses:
Long complex sentences proved to be more difficult than short questions written in short sentences.
Minor changes in parts of speech:
The choice of active or passive voice has little effect.

The authors then present, with references, theoretical and experimental research which may explain why the above results were obtained: in particular the thinking processes necessary to solve a question (i.e. number of thinking stages, ``chunking'' ability and capacity of working memory, see 9).

Tamir 1990
Tamir, P., Justifying the Selection of Answers in Multiple Choice Items. International Journal of Science Education, 12(5), 563-573, 1990.

Notes that, for a given multiple choice question, one third of all students choosing the correct option did so for the wrong reason. Advocates the use of ``best answer'' multiple choice items, in conjunction with a requirement that students provide a written justification as to why they chose a particular option. This enables the identification of ``misconceptions, missing links and inadequate reasoning among students who correctly answered the best answer''.

Johnstone 1981
Johnstone, A. H., Diagnostic Testing in Science. In Lewy, A. and Nevo, D. (Eds.), Evaluation Roles in Education, London: Gordon and Breach, 1981.

Here Johnstone discusses the role of objective testing as a diagnostic tool for the identification of problems/weaknesses in teaching and learning of science concepts. The advantages, distorting factors and strategies for the effective use of diagnostic testing (in this instance multiple choice and communication grids) are covered.

Marcus 1963
Marcus, A., The Effect of Correct Response Location on the Difficulty Level of Multiple-Choice Questions. Journal of Applied Psychology, 47(1), 48-51, 1963.

In this study the author agrees with (Cronbach 1950) that the position of the correct item, in a multiple choice question, does not affect the difficulty level. The author does suggest, however, that the unequal attractiveness of distractors, and the sequential effects from item to item, may adversely influence the response to an individual question.

Cronbach 1950
Cronbach, L. J., Further Evidence on Response Sets and Test Design. Educational and Psychological Measurement, 10, 3-31, 1950.

In this paper, and in his earlier paper (Cronbach 1946), Cronbach discusses the effects of ``response sets'' (personal ways of responding to test items). The nature of response sets, methods to control their influence on test validity and the design of better tests are all discussed in this paper. Also presents evidence that supports the hypothesis that multiple choice tests are ``nearly free'' from response sets.

Jessel and Sullins 1975
Jessel, J. C. and Sullins, W. L., The Effects of Keyed Response Sequencing of Multiple-Choice Items on Performance and Reliability. Journal of Educational Measurement, 12(1), 45-48, 1975.

In this paper the authors report that their study does not support the notion that, in a multiple choice test, the correct answer has to be randomly sequenced and appear in each position an equal number of times.

Harnish and Linn 1981
Harnish, D. L. and Linn, R. L., Analysis of Item Response Patterns: Questionable Test Data and Dissimilar Practices. Journal of Educational Measurement, 18(3), 133-146, 1981.

In this paper the authors present a comparative study of Response Pattern Indices (indices that measure the degree to which the response pattern for an individual is unusual). These indices can be used to identify students for whom the test is inappropriate, who need more study, who make careless mistakes, or who possess sporadic study habits.



Alternatives to Multiple Choice Tests


Assessment Using Communication Grids

Johnstone 1988
Johnstone, A. H., Methods of Assessment Using Grids. Lab Talk, (10), 4-6, 1988.

Here an array of information is presented in the form of a grid: a set of numbered boxes. In response to a question the pupils are asked to consider the content of each box and decide which box (or combination of boxes) constitutes the most appropriate answer to the question. In some circumstances, the order in which boxes are chosen is important. A box may contain pictures, words, ideas, equations, formulae, structures, definitions, numbers and operators. In addition, a series of questions can be set using the same grid. A possible scoring scheme is also presented, see Figure 5.1:

Figure 5.1: Grid Scoring Scheme.
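The scoring scheme itself is given only in the figure. As a rough illustration of how a grid response might be marked, the sketch below uses one scheme often described for such grids: the fraction of the correct boxes that were chosen, minus the fraction of the incorrect boxes that were chosen. This is an assumption and may differ from the scheme in Figure 5.1; the function name and box numbering are hypothetical, and the order in which boxes are chosen is ignored.

```python
def grid_score(chosen, correct, n_boxes):
    """Score one grid question on a scale from -1 to 1.

    chosen  -- box numbers selected by the pupil
    correct -- box numbers that make up the right answer
    n_boxes -- total number of boxes, numbered 1..n_boxes
    Assumed scheme: (correct chosen / correct available)
                    - (incorrect chosen / incorrect available).
    """
    chosen, correct = set(chosen), set(correct)
    all_boxes = set(range(1, n_boxes + 1))
    if not chosen <= all_boxes or not correct <= all_boxes:
        raise ValueError("box numbers must lie between 1 and n_boxes")
    if not correct or correct == all_boxes:
        raise ValueError("need at least one correct and one incorrect box")
    incorrect = all_boxes - correct
    hits = len(chosen & correct) / len(correct)
    false_picks = len(chosen & incorrect) / len(incorrect)
    return hits - false_picks
```

Under this assumed scheme a pupil who picks every box, or no box at all, scores 0, so indiscriminate selection is not rewarded.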

Mackenzie 1997
Mackenzie, D., TRIAD: A computer based assessment software. University of Derby, UK: Centre for Interactive Assessment Development, 1997.

This commercial software is similar to communication grids.

Egan 1972
Egan, K., Structural Communication-A New Contribution to Pedagogy. Programmed Learning and Educational Technology, 9(2), 63-78, 1972.

Egan's communication grids are presented in this paper.





Assessment Using Branched True False Tests

Johnstone et al. 1981
Johnstone, A. H., McAlpine, E. and MacGuire, P. R. P., Branching Trees and Diagnostic Testing. Journal for Further and Higher Education in Scotland, 2(1), 4-7, 1981.

In this paper the authors present a computerised branched true false test. They suggest that with this form of testing it is possible to test for wrong and mis-linked knowledge and wrong strategies, and to assess the effectiveness of teaching/learning processes. Through the use of this form of test the authors report that they were able to identify a student with a misconception which persisted from `O' Grade through to final honours level.





Here are the definitions of some words used in the field of assessment.

Key Option:
The correct choice (option) in a multiple choice test item.
Item:
An individual question or exercise in a test.
Criterion-Referenced Test:
Here the performance of an individual is measured against a standard or criterion rather than against the performance of others who take the same test.
Norm-Referenced Test:
Here the performance of an individual is measured against other students. Results from norm-referenced tests provide information that compares a student's achievement with that of a representative sample.
Distractor:
An incorrect choice in a multiple choice item (also called a foil).
Facility Index (or Value):
The proportion of a class answering a given question correctly; measured on a scale of 0 $\rightarrow$ 1.
Discrimination Index (or Value):
The extent to which an item differentiates between high-scoring and low-scoring examinees; measured on a scale of 0 $\rightarrow$ 1.
Intrinsic Item:
An item that requires a distribution of belief over the options on a multiple choice test and does not have a unique answer.
Certainty of Response Index (CRI):
A measure of the degree of certainty with which a student answers each question.
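The two item statistics defined above (the proportion of a class answering correctly, and the extent to which an item separates high and low scorers) can be sketched as follows. The definitions do not fix a formula for discrimination; the upper group minus lower group version below, with groups of 27% of the class, is one common choice and is an assumption here, as are the function names.

```python
def facility_index(item_correct):
    """Proportion of the class answering the item correctly.

    item_correct -- list of 1 (right) or 0 (wrong), one per student
    """
    return sum(item_correct) / len(item_correct)


def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-group minus lower-group facility for one item.

    total_scores -- each student's total test score, same order
    fraction     -- assumed size of the top and bottom groups
    """
    n = len(item_correct)
    if n != len(total_scores) or n < 2:
        raise ValueError("need matching lists of at least two students")
    k = max(1, round(n * fraction))
    # Student indices ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:k], order[-k:]
    right_high = sum(item_correct[i] for i in high)
    right_low = sum(item_correct[i] for i in low)
    return (right_high - right_low) / k
```

With this upper/lower formula a faulty item, one that low scorers answer correctly more often than high scorers, gives a negative value.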


David Palmer 2002-11-06