Assessment
Assessment is generally used:
- to provide information, to both teacher and student, about a
student's subject understanding so as to guide future study.
- to certificate students for possible entry into further courses
or employment selection.
- to elicit weaknesses in instruction.
Research into assessment is dominated by multiple choice objective
testing. Other common forms of testing include communication grids,
branched true/false tests and ``mind maps''.
Definitions of some words that are commonly used in the field of
assessment (and in the following text) are given at the end of this
section.
Multiple Choice Scoring Schemes
Friel and Johnstone 1978a
Friel, S. and
Johnstone, A. H., Scoring Systems Which Allow For Partial
Knowledge. Journal of Chemical Education, 55(11), 717-719, 1978.
Provides a short
introduction to differential weighting and confidence testing.
It then compares four scoring systems: one consistent with
(Willey 1960), two variants of it, and the standard form. The
authors conclude that a scoring system which gives credit for
partial knowledge may not alter students' rank ordering and may
be less discriminatory, but a student's overall mark may more
accurately represent their subject understanding.
Bao and Redish 2001
Bao, L. and Redish, E.
F., Concentration Analysis: A Quantitative Assessment of
Student States. American Journal of Physics, 69(7), S45-S55, 2001.
In this paper
the authors present a measure to discern how students' multiple
choice test responses are distributed. They suggest that
concentration analysis can be used in the design and development
of a research based multiple choice test.
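As a rough illustration of what such a measure looks like, the sketch below computes the concentration factor in the form in which it is usually quoted from this paper: C = (sqrt(m) / (sqrt(m) - 1)) * (sqrt(sum n_i^2) / N - 1 / sqrt(m)), where m is the number of options, n_i is the number of students choosing option i, and N is the total number of responses. The function name and the example counts are illustrative assumptions; the formula should be checked against the paper before being relied upon.

```python
import math


def concentration_factor(counts):
    """Concentration of responses over the options of one item.

    counts[i] is the number of students who chose option i.
    Returns roughly 0 when responses are spread evenly over the
    options and roughly 1 when every student picks the same option.
    """
    m = len(counts)
    total = sum(counts)
    if m < 2 or total == 0:
        raise ValueError("need at least two options and one response")
    root_m = math.sqrt(m)
    # Length of the response vector, normalised by the number of responses.
    spread = math.sqrt(sum(n * n for n in counts)) / total
    return root_m / (root_m - 1) * (spread - 1 / root_m)


# Hypothetical response counts for a five-option item.
everyone_agrees = concentration_factor([100, 0, 0, 0, 0])
evenly_split = concentration_factor([20, 20, 20, 20, 20])
two_popular_options = concentration_factor([50, 50, 0, 0, 0])
```

A value near 1 on a wrong option would suggest a shared incorrect model, whereas a value near 0 is consistent with responses that are close to random.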
Johnstone 1987
Johnstone, A. H., Can the Slipper
Fit?-Grade-Related Criteria for School Science. School Science Review,
68(245), 737-744, 1987.
Presents a discussion of
norm referenced and criterion referenced modes of assessment.
Dressel and Schmid 1953
Dressel, P. L. and Schmid,
J., Some Modification of Multiple Choice Items. Educational and Psychological Measurement,
13, 574-595, 1953.
In this study the authors trialled four
variants of the standard multiple choice test. These were the
- Free-Choice Test: Here the students were required to mark
as many choices as they thought necessary to ensure that
they had not omitted the correct answer.
- Degree of Certainty Test: Here the student had to indicate, on
a scale of 1-4, how certain he was that his one choice was correct.
- Multiple-Answer Test: Here any item might have more than one
correct answer. The student had to mark the choices he thought were
correct. He would get credit for correct answers and be penalised
for any incorrectly marked answers.
- Two-Answer Test: Here two of the five responses were correct. A
student's total score consisted of all their correct responses.
The authors report that there was some evidence that students were
required to examine the test item more critically when one of
the modified versions was used.
Willey 1960
Willey, C. F., The Three-Decision
Multiple-Choice Test: A Method of Increasing the Sensitivity of
the Multiple-Choice Item. Psychology Review,
7, 475-477, 1960.
In this paper the author presents a
special five-option multiple choice question where three options must
be selected: the option which is thought to be definitely correct
and the two options which are thought to be definitely incorrect.
The marking system is as follows:
- 3 marks if the correct answer is correctly designated as
definitely correct.
- 2 marks if the correct answer is not designated as
definitely correct or definitely incorrect.
- 0 marks if the correct answer is designated as definitely incorrect.
The author suggests that this method discriminates between the
conscientious and the superficial, or impulsive, examinee.
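The marking system listed above can be written as a small function. This is a minimal sketch: the function name, argument names and the validation of the examinee's selections are assumptions added for illustration, and only the 3/2/0 marks come from the annotation.

```python
def three_decision_score(key, definitely_correct, definitely_incorrect):
    """Score one five-option item under the 3/2/0 scheme described above.

    key                  -- label of the option that is actually correct
    definitely_correct   -- the one option the examinee marked as correct
    definitely_incorrect -- the two options the examinee marked as incorrect
    """
    rejected = set(definitely_incorrect)
    if len(rejected) != 2 or definitely_correct in rejected:
        raise ValueError("expected one chosen option and two distinct rejected options")
    if definitely_correct == key:
        return 3  # correct answer designated as definitely correct
    if key in rejected:
        return 0  # correct answer designated as definitely incorrect
    return 2      # correct answer left undesignated


# Hypothetical responses to an item whose key is option "B".
full_marks = three_decision_score("B", "B", ["A", "C"])
partial_marks = three_decision_score("B", "A", ["C", "D"])
no_marks = three_decision_score("B", "A", ["B", "D"])
```

Under this scheme an examinee who is unsure of the key but can confidently eliminate two distractors still earns 2 marks, which is how partial knowledge is rewarded.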
Arnold and Arnold 1970
Arnold, J. C. and Arnold, P.
L., On Scoring Multiple Choice Exams Allowing for Partial
Knowledge. Journal of Experimental Education, 39(1), 8-13, 1970.
Using elementary game
theory the authors present a multiple choice examination scoring
procedure which gives credit for partial knowledge and controls
the expected gain/penalty due to guessing. The authors present a
comparison of their scoring system with four alternative
procedures. In general, each scoring system had little effect on
the higher and lower scoring students. However the relative
positions of the middle grades differed considerably from one test
to the next. This was attributed to the greater influence of
guessing factors on the ``middle'' scores.
Aitkin 1967
Aitkin, L. R., Effect on Test Score
Variance of Differential Weighting of Item Responses. Psychological Reports,
21(10), 585-590, 1967.
Presents, mathematically, the effects
of differential response weightings on total test variance of
multiple choice objective tests.
Rippey 1970
Rippey, R. M., Rationale for
Confidence-Scored Multiple-Choice Tests. Psychological Reports,
27(5), 91-98, 1970.
Presents a scoring scheme which requires
an examinee to indicate their confidence when answering multiple
choice questions. Also advocates the inclusion of intrinsic items
in multiple choice tests, because these items suggest that ``not
all questions worth asking have single, impeccably defined
answers''. (Intrinsic items require a distribution of belief over
the options on a multiple choice test and do not have a unique
answer).
Hasan et al. 1999
Hasan, S., Bagayoko,
D. and Kelley, E. L., Misconceptions and the Certainty of Response
Index (CRI). Physics Education, 34(5), 294-299, 1999.
The
Certainty of Response Index (CRI) provides a measure of the
degree of certainty with which a student answers a multiple choice
question. Here the student indicates, on a scale of 0-5, how
certain he is that his answer is correct, i.e. that it rests on well
established knowledge, concepts or laws. The authors recommend, and
have used, this method to differentiate between students'
misconceptions and lack of knowledge.
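One way to picture how a CRI separates misconceptions from lack of knowledge is to cross the correctness of an answer with the stated certainty. The sketch below is an assumption-laden illustration: the cut-off of 2.5 on the 0-5 scale, the function name and the labels used for correct answers are not taken from the annotation above.

```python
def classify_response(is_correct, cri, threshold=2.5):
    """Classify one answer from its correctness and its CRI (0-5).

    threshold is an assumed cut-off between "low" and "high" certainty.
    """
    if not 0 <= cri <= 5:
        raise ValueError("CRI must lie on the 0-5 scale")
    confident = cri > threshold
    if is_correct:
        # Labels for correct answers are illustrative assumptions.
        return "knowledge" if confident else "lucky guess"
    # A confident wrong answer points to a misconception;
    # an unconfident wrong answer points to missing knowledge.
    return "misconception" if confident else "lack of knowledge"


# Hypothetical answers from four students to the same question.
labels = [
    classify_response(True, 5),
    classify_response(True, 1),
    classify_response(False, 4),
    classify_response(False, 0),
]
```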
Friel and Johnstone 1988
Friel, S. and Johnstone,
A. H., Making Test Scores Yield More Information. Education in chemistry,
28(3), 46-49, 1998.
In this paper the authors recommend the
use of caution indices (Sato 1975), in particular the
modified caution index (Harnish and Linn 1981), rather than simply using
facility and discrimination
values to assess student and question performance. That is, by
employing caution indices it is possible to identify anomalous
response patterns to a particular question, or from a particular
student.
Handy and Johnstone 1973a
Handy, J. and
Johnstone, A. H., Reproducibility in Objective Testing. Education in chemistry,
10(2), 47-48, 1973.
Provides evidence that using common
questions, rather than pre-tests, provides a more accurate
indicator of performance. That is, common questions can be used as
internal standards to compare the respective difficulty levels of
two exams, or the respective abilities of two groups of students.
Also presents a simple generalised formula to penalise a student's
score for guessing.
\[ S = C - \frac{W}{n - 1} \qquad (5.1) \]
Where S is the corrected score, C is the number of correct
responses chosen, W is the number of incorrect responses chosen, and
n is the number of possible responses in each question. If all
questions have been answered by all students there is no
difference between the ranking of corrected and uncorrected scores.
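The correction and the remark about rankings can be checked with a short worked example. The sketch assumes the usual form of the guessing correction, S = C - W/(n - 1), which matches the symbol definitions given above; the function name, student names and scores are hypothetical.

```python
from fractions import Fraction


def corrected_score(correct, wrong, n_options):
    """Guessing-corrected score S = C - W / (n - 1)."""
    return correct - Fraction(wrong, n_options - 1)


# A 20-question test with 5 options per question.
# Case 1: every student answers every question, so W = 20 - C.
all_answered = {"ann": (16, 4), "bob": (12, 8), "cat": (8, 12)}
raw_rank = sorted(all_answered, key=lambda s: all_answered[s][0], reverse=True)
corrected_rank = sorted(
    all_answered,
    key=lambda s: corrected_score(*all_answered[s], 5),
    reverse=True,
)

# Case 2: "dee" omits 8 questions, "eve" answers (and partly guesses) them all.
dee = corrected_score(12, 0, 5)  # 12 correct, 0 wrong, 8 omitted
eve = corrected_score(13, 7, 5)  # 13 correct, 7 wrong
```

In the first case the corrected scores are 15, 10 and 5, so the ranking is unchanged, because S is an increasing function of C when nothing is omitted. In the second case eve has the higher raw score (13 against 12) but the lower corrected score (11.25 against 12), which shows that the correction only affects rankings when some questions are left unanswered.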
Factors Affecting Test Outcomes
Handy and Johnstone 1973b
Handy, J. and
Johnstone, A. H., How Students Reason in Objective
Tests. Education in chemistry, 10(3), 99-100, 1973.
Here the authors conclude that
answers to multiple choice questions are mostly selected validly,
with minimal blind guessing and that failure to answer
comprehension questions chiefly arises through deficiencies in
knowledge.
Friel and Johnstone 1978b
Friel, S. and
Johnstone, A. H., A Review of the Theory of Objective
Testing. School Science Review, 59(209), 733-738, 1978.
In this
general review of multiple choice testing the following points are
discussed:
- The effects of guessing.
- The effects of changing the initial response.
- The effect of item (question) order alteration.
- The optimum number of choices.
- The position response set: the number and order of a set of
choices.
- The assessment of partial knowledge.
- Differential weighting
- Confidence testing
Friel and Johnstone 1979
Friel, S. and Johnstone,
A. H., Does Position Matter? Education in chemistry, 56(6), 175-175, 1976.
In
this investigation the authors conclude that the position of the
most plausible distractor, relative to the correct answer,
significantly alters the difficulty of a multiple choice
question. In particular the difficulty is decreased when the
distractor is placed immediately before the correct answer.
Johnstone et al. 1983
Johnstone, A. H.,
Macguire, P. R. P., Friel, S. and Morrison, E.
W., Criterion-Reference Testing in Science-Thoughts, Worries
and Suggestions. School Science Review,
(6), 628-633, 1983.
In this paper the authors discuss
problems associated with multiple choice tests as instruments for
criterion-reference testing. The following criterion-referenced
tests:
- batteries of true-false items,
- structural communication grids,
- concept linkages, and
- multiple choice tests which test for partial knowledge
are then presented, with advantages and disadvantages, as
alternatives to the standard multiple choice exam.
Cassels and Johnstone 1984
Cassels, J. R. T. and
Johnstone, A. H., The Effect of Language on Student Performance on
Multiple Choice Tests in Chemistry. Journal of Chemical Education,
61(7), 613-615, 1984.
In this study, matched questions (the
same questions, used in alternative tests) were
used to assess the influence of language on multiple choice
outcomes. The following results were reported:
- Key words: Substitution of simpler words brought about
improved performance (e.g. choking for pungent).
- Terms of quantity: Pairings of words such as ``most
abundant'' appear easier to understand than ``least abundant''.
- Negative forms: In general, the removal of negative
questioning (e.g. ``Which statement is true'', rather than,
``Which statement is not true'') appears to improve performance.
- Large numbers of words and arrangement of clauses: Long
complex sentences proved to be more difficult than short
questions written in short sentences.
- Minor changes in parts of speech: The choice of active or
passive voice has little effect.
The authors then present, with references, theoretical and
experimental research which may explain why the above results were
obtained: in particular the thinking processes necessary to solve a
question (i.e. number of thinking stages, ``chunking'' ability and
capacity of working memory, see 9).
Tamir 1990
Tamir, P., Justifying the Selection of
Answers in Multiple Choice Items. International Journal of Science Education,
12(5), 563-573, 1990.
Notes that, for a given multiple
choice question, one third of all students choosing the correct
option did so for the wrong reason. Advocates the use of ``best
answer'' multiple choice items, in conjunction with a requirement
that students provide a written justification as to why they
chose a particular option. This enables the identification
of ``misconceptions, missing links and inadequate reasoning
among students who correctly answered the best answer''.
Johnstone 1981
Johnstone, A. H., Diagnostic Testing in
Science. In Lewy, A. and Nevo, D. (Eds.), Evaluation Roles in
Education, London: Gordon and Breach, 1981.
Here Johnstone
discusses the role of objective testing as a diagnostic tool for the
identification of problems/weaknesses in teaching and learning of
science concepts. The advantages, distorting factors and strategies
for the effective use of diagnostic testing (in this instance
multiple choice and communication grids) are covered.
Marcus 1963
Marcus, A., The Effect of Correct
Response Location on the Difficulty Level of Multiple-Choice
Questions. Journal of Applied Psychology, 47(1), 48-51, 1963.
In this study the
author agrees with (Cronbach 1950) that the position of the correct
option, in a multiple choice question, does not affect the
difficulty level. The author does suggest that the unequal
attractiveness of distractors, and the sequential effects from item
to item, may adversely influence the response to an individual question.
Cronbach 1950
Cronbach, L. J., Further Evidence on
Response Sets and Test Design. Educational and Psychological Measurement, 10, 3-31, 1950.
In this
paper, and in his earlier paper (Cronbach 1946), Cronbach discusses the
effects of ``response sets'' (personal ways of responding to test
items). The nature of response sets, methods to control their
influence on test validity and the design of better tests are all
discussed in this paper. Also presents evidence that supports the
hypothesis that multiple choice tests are ``nearly free'' from
response sets.
Jessel and Sullins 1975
Jessel, J. C. and Sullins,
W. L., The Effects of Keyed Response Sequencing of Multiple-Choice
Items on Performance and Reliability. Journal of Educational Measurement,
12(1), 45-48, 1975.
In this paper the authors report that
their study does not support the notion that, in a multiple
choice test, the correct answer has to be randomly sequenced and
appear in each position an equal number of times.
Harnish and Linn 1981
Harnish, D. L. and Linn, R.
L., Analysis of item response patterns: questionable test data and
dissimilar practices. Journal of Educational Measurement, 18(3), 133-146, 1981.
In this
paper the authors present a comparative study of Response Pattern
Indices (indices that measure the degree to which the response
pattern for an individual is unusual). These indices can be used to
identify students for whom the test is inappropriate, or who need more
study, make careless mistakes, or possess sporadic study habits.
Alternatives to Multiple Choice Tests
Assessment Using Communication Grids
Johnstone 1988
Johnstone, A. H., Methods of
Assessment Using Grids. Lab Talk, (10), 4-6, 1988.
Here an
array of information is presented in the form of a grid: a set
of numbered boxes. In response to a question the pupils are
asked to consider the content of each box and decide which box
(or combination of boxes) constitutes the most appropriate
answer to the question. In some circumstances, the
order in which boxes are chosen is important. A box may
contain pictures, words, ideas, equations, formulae, structures,
definitions, numbers and operators. In addition, a series of
questions can be set using the same grid. A possible scoring
scheme is also presented (see Figure 5.1).
Mackenzie 1997
Mackenzie, D., TRIAD: A computer based
assessment software. University of Derby, UK: Centre for
Interactive Assessment Development, 1997.
This commercial software
is similar to communication grids.
Egan 1972
Egan, K., Structural Communication-A
New Contribution to Pedagogy. Programmed Learning and Educational Technology,
9(2), 63-78, 1972.
Egan's communication grids are presented
in this paper.
Assessment Using Branched True False Tests
Johnstone et al. 1981
Johnstone,
A. H., McAlpine, E. and MacGuire, P. R. P., Branching Trees and
Diagnostic Testing. Journal for Further and Higher Education in
Scotland, 2(1), 4-7, 1981.
In this paper the authors present a
computerised branched true false test. They suggest that with this
form of testing it is possible to test for wrong and mis-linked
knowledge and wrong strategies, and to assess the effectiveness of
teaching/learning processes. Through the use of this form of test
the authors report that they were able to identify a student with a
misconception which persisted from `O' Grade through to final
honours level.
Here are the definitions of some words used in
the field of assessment.
- Key Option: The correct choice (option) in a multiple
choice test item.
- Item: An individual question or exercise in a test.
- Criterion-Referenced Test: Here the performance of an
individual is measured against a standard or criterion rather than
against the performance of others who take the same test.
- Norm-Referenced Test: Here the performance of an
individual is measured against other students. Results from
norm-referenced tests provide information that compares a student's
achievement with that of a representative sample.
- Distractor: An incorrect choice in a multiple-choice
item (also called a foil).
- Facility Index (or Value): The proportion
of a class answering a given question correctly; measured on a scale of
0 to 1.
- Discrimination Index (or Value): The
extent to which an item differentiates between high-scoring and
low-scoring examinees; measured on a scale of 0 to 1.
- Intrinsic Item: An item that requires a distribution of belief
over the options on a multiple choice test and does not have a
unique answer.
- Certainty of Response Index (CRI): Provides a measure of the
degree of certainty with which a student answers each
question.
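The facility and discrimination indices defined above can be illustrated with a short calculation. The facility index follows the definition directly. For the discrimination index the sketch assumes one common upper-group/lower-group form (proportion correct in the top-scoring group minus proportion correct in the bottom-scoring group, each group being 27% of the class); this choice, the function names and the data are assumptions, and this form can be negative for a badly behaved item.

```python
def facility_index(item_correct):
    """Proportion of the class answering the item correctly (0 to 1)."""
    return sum(item_correct) / len(item_correct)


def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-group minus lower-group proportion correct (assumed form).

    item_correct[i] is 1 if student i answered the item correctly, else 0.
    total_scores[i] is student i's total score on the whole test.
    """
    # Student indices ordered from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=total_scores.__getitem__)
    k = max(1, round(len(order) * fraction))
    lower, upper = order[:k], order[-k:]
    upper_correct = sum(item_correct[i] for i in upper)
    lower_correct = sum(item_correct[i] for i in lower)
    return (upper_correct - lower_correct) / k


# Hypothetical class of eight students.
item_correct = [1, 1, 1, 0, 1, 0, 0, 0]
total_scores = [18, 17, 15, 14, 12, 9, 7, 5]
facility = facility_index(item_correct)
discrimination = discrimination_index(item_correct, total_scores)
```

Here half the class answers correctly (facility 0.5), and the two highest scorers both get the item right while the two lowest scorers both get it wrong, giving the maximum discrimination of 1.0.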
David Palmer
2002-11-06