Construct validation of tests

Construct validation has recently become the most vital concern among experts in language testing. This piece of work covers test validity in general and types of test validation, with extra focus on construct validity and the construct validity of the C-test as a particular example.

Function and concept of validity

“The purpose of validation in language testing is to ensure the defensibility and fairness of interpretations based on test performance” (McNamara, 2000: 48). Validity, however, is still a vague concept and suffers from a lack of clear definition of exactly what it is, let alone how it could be established (Underhill, 1987). It is often misunderstood and seen as a property of the test itself, but, as Cronbach (1971) claims, one can only properly validate the interpretation of data collected from a specific administration of a test. This is because the test itself, or its data, may be valid if administered with one procedure in one situation, but not with another procedure in a different situation.

A standard definition of validity is that a test is valid if it can be established that it measures just the ability that it is expected to measure (Hughes, 1996), i.e. what is often called the intended ‘trait’, ‘variable’, or ‘construct’. Now, given that the purpose of designing our MCFGV test items is to test whether or not the testees possess lexical knowledge, the validity question for us is: to what extent do the test items measure vocabulary? However, since language knowledge or abilities are not directly quantifiable, the decision on an individual's ability must depend on indirect information revealed by an observable instrument (e.g. a test). This immediately allows for various sources of invalidity to intervene. In fact, Nevo (1989) states that a review of the literature on language testing suggests that test-makers' assumptions concerning what they are testing often do not coincide with the actual processes undergone by the respondent when taking the test.

TTS as a source of invalidity

A number of sources can weaken test validity. Palmer and Groot (1981) concluded that the elements affecting test validity include the test itself, the setting in which the test is administered, the inferences intended to be drawn from the test and the examinees’ characteristics. Here we are interested in the characteristics of the examinees, especially their test-taking strategies (TTS), as a source of invalidity, since some research suggests that knowledge of TTS tends to earn students higher scores. Although the possible effect of successful use of test-taking strategies on validity applies to all types of tests, the multiple-choice question is particularly susceptible. The multiple-choice test "is the type of exam that rewards 'test-wise' students with the most extra points for their 'test-wiseness'." (Kesselman and Peterson, 1981: 36). When a testee picks an answer in a multiple-choice test, s/he provides no evidence of the cognitive processes by which alternatives are selected or rejected. A test-taker may pick an incorrect answer based on correct reasoning, perhaps by thinking of reasons not anticipated by the test maker. On the other hand, a correct answer may be picked by applying certain strategies, rather than language knowledge. This means that independent factors, which are not supposed to be involved in the ability measured, may obstruct or contribute to the information revealed. Hence, we must question the extent of the validity of the scores that such a test gives us.

Cohen (1984) mentioned findings that support this claim. He reported that students were given reading comprehension questions with four multiple-choice options, without the reading passage to which they were supposedly related. The probability of selecting a correct answer by chance in such a situation is 25%. However, the subjects scored nearly twice that, clearly relying on some shortcoming in the design of the items which allowed knowledge other than that derived from comprehending the text to be used. Another example is that of items involving filling in a gap. Testees may analyse the context of the gap and use TTS to fill it, which may be quite independent of the intended knowledge targeted by the tester. Such methods bypass the knowledge that the test items are intended to measure, and consequently affect the validity of examination scores. Storey (1997) claims that

The idiosyncratic nature of human behaviour means that the amount of context actually utilized in completing a gap may be quite independent of what analysis might indicate is the amount required to complete it. Thus a cloze item might appear to require intersentential reference to complete it. But a subject may in fact use a quite different strategy to close the gap, one which is based on other clues in the immediate context of the gap, on prior, general knowledge or on a whole range of other test-taking strategies or behaviours. (Storey, 1997: 215).

Brozo et al. (1984) and Carter (1986) claim that teachers and college instructors are largely unaware of the extent of flawed construction appearing in teacher-made tests, possibly providing an unfair advantage to the student who has good test-taking strategies. For example, in some tests which involve speed as a factor, test-takers may vary in test performance due to variation in time management ability, rather than varying degrees of linguistic knowledge. Complex test instructions or formats could also be problematic for some students. Furthermore, in test construction, various clues are sometimes included in items, which may lead some test-takers to select the correct answer, even though they do not actually possess the knowledge tested.

Means of test validation

As Cronbach (1980) asserts, "The job of validation is not to support an interpretation, but to find out what might be wrong with it" (cited by Bachman, 1990: 257). There are four main approaches to demonstrating the validity (or lack of it) of a test. These are:

(a) Face validation, which concerns checking the surface suitability of a test as perceived by those involved in its construction or use. Usual instruments used to check this type are questionnaires and interviews with relevant people. The MCFGV test clearly has face validity to teachers and university instructors in Saudi Arabia, since that format is widely accepted in exams all through school and university. See Chapter III for details.

(b) Content validation, which is the investigation of whether the set of items and tasks selected for a test is representative of the larger set of which the test is assumed to be a sample. An instrument to check this type could be a questionnaire given to relevant experts.

(c) Concurrent validation, which entails comparing a test with another test of the same variable that has already been independently validated. Correlational analysis of data from subjects who take both tests is the usual instrument used to check this type of validity.

(d) Construct validation, which concerns researching whether test performances are consistent with the theory behind the construct. A number of methods could be used to check this type, including research directly on the test-taking processes (see II.4.5). The aim of the current exploratory research was to discover the TTS used in tackling MCFGV test items. However, since we used an introspective method to monitor the test-taking process, the data collected might also throw some light on ‘construct validity’, to which we now turn.

Construct validity

The term ‘construct’ was first introduced formally in 1955 by Cronbach & Meehl, who considered it a hypothesised trait of people that is supposed to be reflected in test performance (Kunnan, 1995). Although ‘test construct validity’ is commonly used and investigated by researchers, it is still, as Alderson (2003) puts it, a theory-free concept. Scholfield (2000) also claims that it is misleadingly named, since the notion of a 'construct' is involved in all validity work, not just when one examines 'construct validity'. The concept of ‘construct validity’ is, therefore, not exact in the literature. The traditional view of construct validity is the narrow view that it is just one way of checking a test, as mentioned in the preceding section. However, this type of test validation is now seen by some experts, in a more comprehensive way, as the overarching term for validity checking in general (Messick, 1990), especially since “testing theories have changed and it is validity rather than reliability that is now considered to be of prime importance” (Banerjee and Clapham, 2003: 115). Brown (2002) views it as the extent to which a test measures the psychological construct that it is supposed to measure, which is identical with our definition of validity in general. Similarly, Clapham (2003) claims that

The term 'construct validity' refers to the overall construct or trait being measured. It is an inclusive term which, according to some testing practitioners, covers all aspects of validity, and is therefore a synonym for 'validity'. If a test is supposed to be testing the construct of listening, it should indeed be testing listening, rather than reading, writing and/or memory.

Given the prevailing ‘modern’ view and following Palmer and Groot (1981), we will, therefore, regard the methods used to check construct validity as “a process of investigating what a test measures” (Palmer and Groot, 1981: 4).

Means of construct validation

Construct validity can be checked by means of one or more of a number of quantitative and qualitative approaches. Following Cronbach, Scholfield (2000) describes three types of conventional quantitative construct validation:

1- Correlational procedure, which concerns the degree to which a supposed test of a particular trait correlates with measures of other relevant variables with which theory would predict it would correlate. E.g. one would assume that scores on a valid vocabulary test should show a high positive correlation with scores on a general proficiency test. Hence we will check whether scores on our test correlate positively with scores on TOEFL and Nation’s test (see Chapter VI for details).

2- Experimental procedure, where a researcher gives the test to be validated to subjects before and after they are exposed to different conditions that should induce change, or uses repeated measures via pre- and post-tests with certain pedagogical treatments in between, which theory assumes must produce a change in scores. In our case this could have been performed (though we did not do this) by giving the test to be validated to some subjects, then teaching the test words, then giving the test again. Scores should go up if the test is a valid test of knowledge of the words.

3- Non-experimental procedure, such as when an EFL test is given to native speakers and EFL learners, where one would expect adult, educated, native speakers to achieve high scores. Jafarpur (1995) used this procedure to test the construct validity of an EFL test and found that native speakers did not achieve perfect scores. Hence, arguably, the test was not valid.
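The correlational procedure in (1) can be illustrated with a minimal sketch, assuming paired scores are available for the same testees on the test being validated and on an already validated measure. The function name `pearson_r` and all scores below are invented for illustration only; they are not data from this study or from any study cited here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each score from its own test mean.
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    cov = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    ss_x = sum(dx * dx for dx in dev_x)
    ss_y = sum(dy * dy for dy in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one score list has no variance")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for six testees (invented, not real data).
vocab_scores = [12, 18, 25, 31, 36, 40]
proficiency_scores = [410, 450, 480, 520, 560, 590]

r = pearson_r(vocab_scores, proficiency_scores)
print(f"r = {r:.2f}")
```

A coefficient near +1 would be consistent with the theoretical prediction, but, like the other quantitative procedures in this list, it says nothing about how the testees arrived at their answers.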

The key point for us, however, is that none of 1, 2 or 3 would provide insight into the test-taker’s mind, to uncover directly what s/he was doing during the test-taking process and so explain why the result could come as it did for Jafarpur, for example.

By contrast, some researchers recommend using qualitative procedures to validate tests. Levenston et al. (n. d.) claim that using discourse analysis to analyse the nature of the interdependence between the omitted lexical item and the given passage in a cloze test would allow a better understanding of the processes involved in filling in the gaps. They maintain that discourse analysis can uncover “how the reader actually processes the text to arrive at his completion in a cloze test” (Levenston et al., n. d., 203). However, to explore ‘how the test-taker actually processes the test’, analysis of verbal protocols produced by testees is a more direct approach, which is more likely to reflect the actual process than an approach that depends on ‘inferring’ it. McDonough (1995) extends this way of validating a test by comparing the learners’ strategies revealed in the test situation with those used for the same intended trait in a non-test situation. Validity is reflected in the degree of similarity or difference in the strategies used in the two situations. In the current study, while we embraced the use of think-aloud verbal protocols to investigate MCFGV tests, we did not perform a parallel study in a non-test situation.

Test-taker’s perspective in validation of tests

Section II.4.2 showed that an important source of test invalidity can be testees’ exploitation of TTS. However, most methods of validation leave us in the dark about TTS. We need, therefore, to use an approach that enables us to gain insight into the testees’ test-taking processes to observe how the test is taken in order to directly assess the actual construct tested.

McDonough (1995) suggests considering the testees’ perspective to address the basic validity questions:

How well does the test actually measure what it is supposed to measure? Second, how good a theory of reading (or whatever) is the tester's construct? In other words, how well does the tester's construct match developments in theorizing about the skill involved outside testing? Recently several authorities have suggested that one way of answering these questions is to look at the strategies of test takers using the think-aloud method (McDonough, 1995: 107).

An example of the relevance of the test-taker’s perspective has been noticed by the present writer, from when he sat a real IELTS test. The introspection of the process gave evidence of invalidity, which think-aloud elicitation by a researcher would be likely to access, whereas other methods of validation could miss this evidence. In some of the test items of the listening comprehension section in that test, each of the four multiple-choice response options was about three quarters of a line in length. Since the time provided for the response was very short, the researcher noticed that answering such items, in fact, needed speed-reading competence as well as listening comprehension ability. In other words, some examinees may have comprehended the given listening material very well, but they may have been unable to give the right responses because the written items, which supposedly tested listening comprehension, actually measured a second skill at the same time.

The importance of assessing the test-takers’ perspective is further supported by the work of many experts (e.g. Hunter et al., 1985; Dolly and Williams, 1986; Kiester and Kiester, 1989; Allan, 1992; Scruggs and Mastropieri, 1995; Yi'an, 1998) who argue that when students are taking a test, any test, they are really being tested on two things: (a) how much they know about a subject, and (b) how adept they are at taking tests. If this is true, then the test may fail to discriminate between test-takers according to their ability in the tested area and may also fall short of measuring the extent of the testees' knowledge in the given task (II.4.2). With regard to the current study, test items have presumably been constructed to require particular lexical knowledge to fill in a gap. However, subjects may have used quite independent strategies to make a choice for the gap, involving kinds of knowledge other than the one targeted. The informative direct method that could illuminate this dark area, as Clapham (2003) claims, is

to ask test-takers to introspect while they take a test, and to say what they are doing as they do it, so that the test constructors can learn about what the test items are testing, as well as whether the instructions are clear, and so on.

This approach of using introspective protocols generated by test-takers allows for direct exploration of the actual underlying processes of the test and discovery of the influence of TTS on the scores obtained. A comprehensive picture of the advantages and disadvantages of using this method is provided in II.8. Details about other studies that have used it in test situations are in II.9 and II.4.7.

All in all, test construct validity has been the focus of extensive thought, debate and research in language testing; yet, one angle that has been somewhat neglected in this issue is the test-takers' perspective. Without exploring the processes evoked when testees tackle multiple-choice questions, or any type of test format, test theorists are hindered in their attempt to verify and increase validity. Thus, construct validation needs to be extended beyond the test maker's 'expectation' or experts' judgment of the validity of items, and even beyond quantitative analysis, to include getting typical target students to answer the items, tapping their process and analysing their reports, to see what sort of knowledge really is being used to answer the items and, hence, how valid the items are.

Construct validity of the C-test

This form of testing was first introduced in 1981 by Klein-Braley and Raatz and has given rise to much dispute, which makes it a good case to look at in detail as an example of construct validation. The format of “a C-test demands exact word gap-filling by mutilating words in regular ways rather than by omitting words at regular intervals” (McDonough, 1995: 114). The standard procedure is that the second half of every second word is deleted and the first and last sentences are left intact. However, Grotjahn (personal communication, 2003) reports that this format is flexible, so if the proficiency level of the testees is low, the deletion could take place with every third word, or more, and can be performed on the last third of a word. He also states that the mutilation can be made at the beginning or in the middle of a word if the language of the test has some restrictions on deleting certain parts of the words, e.g. because of a characteristic morphological system of prefixes or suffixes.
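The standard deletion procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by any of the authors cited: the function names `mutilate` and `make_c_test` and the parameter `every` are hypothetical, sentences are split naively on end punctuation, only unaccented Latin letters count as words, one-letter words are skipped and not counted, and for words with an odd number of letters the longer part is deleted. The sample text is invented.

```python
import re


def mutilate(word):
    """Keep the first part of a word and mark each deleted letter with an underscore."""
    keep = len(word) // 2  # assumption: for odd lengths the longer part is deleted
    return word[:keep] + "_" * (len(word) - keep)


def make_c_test(text, every=2):
    """Mutilate every `every`-th word, leaving the first and last sentences intact."""
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 3:
        return text  # no sentences between the first and the last one
    count = 0

    def replace(match):
        nonlocal count
        word = match.group(0)
        if len(word) < 2:
            return word  # assumption: one-letter words are skipped and not counted
        count += 1
        return mutilate(word) if count % every == 0 else word

    # Counting continues across sentence boundaries within the middle section.
    middle = [re.sub(r"[A-Za-z]+", replace, s) for s in sentences[1:-1]]
    return " ".join([sentences[0]] + middle + [sentences[-1]])


sample = "The sun rose early. Birds sang in the garden today. We went home."
print(make_c_test(sample))
print(make_c_test(sample, every=3))
```

Setting `every=3` corresponds to the easier variant in which every third word is mutilated; deleting only the last third of a word, or mutilating the beginning or middle of a word, would need a different `mutilate` function and is not covered by this sketch.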

Different studies have advanced different arguments about what the C-test measures, resulting in considerable dispute among professionals in language testing and a need for several validation studies. A number of experts maintain that C-tests are a valid measure of general language proficiency (Klein-Braley, 1985; Raatz, 1985; Dornyei and Katona, 1992; Klein-Braley, 1997; Eckes and Grotjahn, 2003). The validation of the C-test in these studies depended mainly on the traditional correlational procedure (II.4.3.c), which checked the degree to which testees’ C-test scores correlated with their scores on a general language proficiency test that had already been validated. In addition to the claim that this form of testing measures general L2 proficiency, other features are also claimed:

There is no denying that it is an attractive choice for school and department that are responsible for assessing the overall proficiency of their students efficiently and cost-effectively. C-tests are a convenient option in resource poor environments for a number of reasons: they have a relatively simple design, they are low-cost in terms of developing, piloting, or photocopying, they are short and easy to administer and can be marked quickly and efficiently by just a few raters. (Kontra and Kormos, 2003: …)

In contrast, other professionals doubt the claim that the C-test is a valid measure of general language proficiency. Alderson (2002), for example, firmly asserts that

The notion that there is a Holy Grail of language testing, a magic procedure which could produce universally valid measures of language ability, had, I thought, been finally laid to rest… However, I worry that we have not learned from history: we risk reviving alchemists’ claims of universal validity for another methods, this time the C-Test procedure. We should be on our guard against this danger. (Alderson, 2002: 15)

Stemmer (1991) empirically investigated the question: what does the C-test measure? There were three French C-test texts and four methods were used to collect and analyse the data:

1. a questionnaire to collect some background information about the subjects (e.g. age, sex, L1, L2 of parents).

2. discourse analysis of the three texts used in the test to infer the particular type of language knowledge involved in producing answers for the test items.

3. the actual outcome, i.e. the test performance on each item analysed for the numbers of items answered correctly, incorrectly or left blank.

4. think-aloud reports intended to collect data on the test-taking process from 30 subjects. The testees were 17-25 years old, studying French as a foreign language and their mother tongues were 8 to 9 different European languages. Before a testee did the test individually, s/he listened to a short tape recording of a person thinking aloud while doing part of a C-test, as a training session for method 4. One may argue, however, that this phase might have influenced the data collected, as the subjects may have mimicked that sample when they were tackling the study test; therefore, we did not do this in our study. The think-aloud reports were audio-taped and analysed to gain insight into the subjects’ problem-solving behaviour, but the researcher did not report the language in which the think-aloud reports were performed. This could hide a major shortcoming, since it is unlikely that the researcher understood all the L1s of the testees, and their responses may have been limited if given in the L2. She also did not report what motivation the testees had for taking the test.

The study revealed that more function words were answered correctly than content words. Since function words contribute less to the meaning than do content words, this finding was claimed to indicate that the C-test operates more on a grammatical level than on advanced reading comprehension ability or general language proficiency. Hence, “the higher level comprehension skills are not involved in C-test solving” (Stemmer, 1991: 329).

Jafarpur (1995) found a similar result. In his study, 20 versions of an English C-test were distributed to 325 Iranian non-native speakers of English and 202 undergraduate English native speakers. Each version was taken by about 10 native and 16 non-native test-takers. A questionnaire was used to collect data about the face validity (II.4.3.a) of the C-test from EFL learners and instructors. Subjects also took a conventional cloze test so that their scores on the C-test and cloze test could be compared using Pearson product-moment correlation coefficients. Moreover, the mean scores of each group were subjected to analysis of variance and t-tests (II.4.5.3). The results indicated that the C-test suffers from a number of problems that cast doubt on its validity as a measure of general L2 proficiency. It was found that native speakers did not achieve perfect scores and that C-tests did not possess face validity. “To sum up, the results of this investigation indicated that C-testing does not achieve the claim made on its behalf [that it is a valid measure for general language proficiency]” (Jafarpur, 1995: 209).

In a recent unpublished study, Kontra and Kormos (2003) investigated the construct validity of C-tests using think-aloud. The aim was to see the actual activity involved in answering the test items. The participants were 10 Hungarian university EFL students (1 male and 9 female) aged 19-22. Their English proficiency was upper-intermediate and advanced. A short training session in thinking aloud was given to each participant, but the researchers did not provide details about the content of this session. The subjects were then asked to take a C-test consisting of three texts, with 20 gaps each. They were asked to think aloud while they were doing the test. The language in which the participants commented on their thought processes was up to them, but most of them used Hungarian. They performed the test individually, in the presence of one of the researchers, which could be considered an unusual situation for taking a test. The verbal data were tape-recorded, transcribed and analysed, but few details were given about these phases. The researchers

conclude that in the investigated setting the C-test is a valid measure of a number of components of foreign language competence and that with the appropriate methods in the analysis of the results, it can be used reliably to test the proficiency of upper-intermediate and advanced students of English (Kontra and Kormos, 2003: ...)

All in all, despite its short history, the C-test has received a great deal of serious attention regarding its construct validity, about which there is still no consensus. What is important for us is the light the debate throws on the varied ways in which validation can be attempted, as several dissimilar approaches have been used to check the construct validity of the C-test. Since we are concerned with the strategies used in a test situation, the approach of most interest is the introspective method, which we have seen was not always used (Klein-Braley, 1997; Jafarpur, 1995; Eckes and Grotjahn, 2003). This method of data collection will be discussed in detail in II.8.
