University of Leicester

Context and Inference in Language Testing

posted on 2015-09-23, 10:32 authored by Glenn Fulcher
It is arguably the case that “The purpose of language testing is always to render information to aid in making intelligent decisions about possible courses of action” (Carroll, 1961, p. 314). This holds true whether the decisions are primarily pedagogic, or affect the future education or employment of the test taker. If fair and useful decisions are to be made, three conditions must hold. Firstly, valid inferences must be made about the meaning of test scores. Secondly, score meaning must be relevant and generalizable to a real-world domain. Thirdly, score meaning should be (at least partially) predictive of post-decision performance. If any of these conditions are not met, the process of assessment and decision making may be questioned not only in theory, but in the courts (Fulcher, 2014a).
It is therefore not surprising that, historically, testing practice has rested on the assumption that language competence, however defined, is a relatively stable cognitive trait. This is expressed clearly in classic statements of the role of measurement in the ‘human sciences’, such as this by the father of American psychology, James McKeen Cattell:
“One of the most important objects of measurement…is to obtain a general knowledge of the capacities of a man by sinking shafts, as it were, at a few critical points. In order to ascertain the best points for the purpose, the sets of measures should be compared with an independent estimate of the man’s powers. We thus may learn which of the measures are the most instructive” (Cattell, 1890, p. 380).
The purely cognitive conception of language proficiency (and all human ability) is endemic to most branches of psychology and psychometrics. This strong brand of realism assumes that variation in test scores is a direct causal effect of the variation of the trait within an individual (see the extensive discussion of validity theory in Fulcher, 2014b).
This view of the world entails that any contextual feature that causes variation is a contaminant that pollutes the score. This is referred to as ‘construct-irrelevant variance’ (Messick, 1989, pp. 38–9). The standardization of testing processes, from presentation to administration and scoring, is designed to minimize the impact of context on scores.
“In some ways, a good test is like an experiment, in the sense that it must eliminate or at least keep constant all extraneous sources of variation. We want our tests to reflect only the particular kind of variation in knowledge or skill that we are interested in at the moment” (Carroll, 1961, p. 319).
There are also ethical and legal imperatives that encourage this approach to language testing and assessment. If the outcomes of a test are high-stakes, it is incumbent upon the test provider to ensure that every test taker has an equal chance of achieving the same test score if they are of identical ability. Score variation due to construct-irrelevant factors is termed ‘bias.’ If any test taker is disadvantaged by variation in the context of testing, and particularly if this is true of an identifiable sub-group of the test-taking population, litigation is likely.
Language tests are therefore necessarily abstractions from real life. The degree of removal may be substantial, as in the case of a multiple-choice test, or less distant, in the case of a performance-based simulation. However, tests never reproduce the variability that is present in the real world. One analogy that illustrates the problem of context is that of tests for life guards. Fulcher (2010, pp. 97–100) demonstrates the impossibility of reproducing in a test all the conditions under which a life guard may have to operate – weather conditions, swell, currents, tides, distance from shore, victim condition and physical build. The list is potentially endless. Furthermore, health and safety regulations would preclude replicating many of the extremes that could occur within each facet. The solution is to list constructs that are theoretically related to real-world performance, such as stamina, endurance, courage, and so on. The test of stamina (passive drowning victim rear rescue and extraction from a swimming pool, using an average weight/size model) is assumed to be generalizable to many different conditions, and to predict the ability of the test taker to successfully conduct rescues in non-pool domains. The strength of the relationship between the test and real-world performance is an empirical matter.
Recognizing the impact of context on test performance may initially look like a serious challenge to the testing enterprise, as score meaning must thereafter be constructed from much more than individual ability. McNamara (1995) referred to this as ‘opening Pandora’s box’, allowing all the plagues of the real world to infect the purity of the link between a score and the mind of the person from whom it was derived. While this may be true in the more radical constructivist treatments of context in language testing, I believe that validity theory is capable of taking complex context into account while maintaining score generalizability for practical decision-making purposes.
In the remainder of this chapter I first consider three stances towards context: atomism, neobehaviourism, and interactionism. This classification is familiar from other fields of applied linguistics, but in language testing each has distinctive implications. Each is described, and then discussed under the two sub-headings of generalizability and provlepsis. Generalizability is concerned with the breadth or scope of score meaning beyond the immediate context of the test. Provlepsis is taken from the Greek Προβλέψεις, and I use it to refer to the precision with which a score user may use the outcome of a test to look into the future and make predictions about the likely performance of the test taker. Is the most appropriate analogy for the test a barometer, or a crystal ball? I conclude by considering how it is possible to take context seriously within a field that by necessity must decontextualize to remain ethical and legal.



Fulcher, NG, Context and Inference in Language Testing, ed. King, J, 'The Dynamic Interplay Between Context and the Language Learner', Palgrave Macmillan, 2015

Author affiliation

College of Social Science, School of Education


  • AM (Accepted Manuscript)

Published in



Palgrave Macmillan



The file associated with this record is embargoed for 36 months from publication in accordance with the publisher's archiving policy. The full text may be available in the publisher links above.


King, J.


