STANDARDIZED TESTING: FRIEND OR FOE?

by Diane L. Jackson

A Paper Submitted in Partial Fulfillment Of The Requirements for

ED814 Assessment and Measurement in the 21st Century

Capella University, March, 2001

TABLE OF CONTENTS

Abstract
Introduction
Problems with Standardized Testing
How Can Standardized Testing Procedures be Improved?
Conclusion
Appendix A: Research Aids
Appendix B: Overview of the ITGS Assessment Process
Appendix C: Understanding Education Statistics
Appendix D: Assessment and Measurement Associations and/or Groups
Appendix E: Notes from Measurement and Assessment in Schools
Selected Bibliography

INTRODUCTION

Educators have always assessed students. However, as our research knowledge about assessment has increased, our use of the tool has not improved. Assessment has always been a controversial issue: there are concerns about standardized tests that are not aligned to the standards; high-stake testing; too much testing; a loss of curricular content due to the limited nature of the question format; and, ultimately, confused interpretations of the goals of locally and nationally mandated tests.

That Americans frequently test students is not at issue, as there appears to be an ongoing trend toward both increased frequency and more types of standardized testing. According to Phelps (2001), CSTEEP's Madaus has claimed "that American students [are] already the most heavily tested in the world" (online). What appears to have initiated the most recent testing movement was the launch of Sputnik in 1957. At the time, it was felt that the Russians had beaten us into space because their educational system was superior to ours. The media backed the premise and the public began to believe it was true.

Thereafter followed a period of quick-fix education reforms that has continued to this day. The only constant during this time of internal upheaval was the National Assessment of Educational Progress (NAEP), initiated as a general indicator of the overall quality of the nation's education. The 1970s saw the introduction of the minimum competency movement because, as there were few standards with which to align testing, it was felt better to acknowledge that students were at least achieving a "minimum standard" as defined by government and testing officials.

During the 1980s, the nation was rocked by the publication of A Nation at Risk, which did little to dispel the growing belief that our nation's schools were in need of a major overhaul. At the time, however, there was no one organization in charge of the public system. Our forefathers had thought it prudent to put control of education into the hands of each state's legislature. The Department of Education itself had become part of the Cabinet only at the end of President Carter's administration. It was President Bush and the nation's governors who, in 1990, put forward six national goals that were to be achieved by the year 2000. They included:

By the year 2000, all children in America will start school ready to learn.
By the year 2000, the high school graduation rate will increase to at least 90% from the current rate of 74%.
By the year 2000, American students will leave grades 4, 8, and 12 having demonstrated competency in challenging subject matter, including English, mathematics, science, history, and geography. In addition, every school in America will ensure that all students learn to use their minds, in order to prepare them for responsible citizenship, further learning, and productive employment in a modern economy.
By the year 2000, American students will be first in the world in mathematics and science achievement.
By the year 2000, every adult American will be literate and will possess the skills necessary to compete in a global economy and to exercise the rights and responsibilities of citizenship.
By the year 2000, every school in America will be free of drugs and violence and will offer a disciplined environment conducive to learning. (Elam, p. 43.)

Although President Clinton, one of the governors present in 1990, continued to accept the statements, and added goals for teacher training and parent participation (Elam, p. 47), the year 2000 has come and gone without a resolution to the problems facing our national education system. In fact, a quick perusal of the goals shows more than half of them to be Arcadian at best. There is, in the world, no perfect education system. That the United States has chosen to educate all of its children to an equal standard already sets it apart from many nations where only an elite few may be given the opportunity to finish high school and/or continue on with a tertiary education.

As Gerald Bracey, a research psychologist, points out in his critique of Lawrence Stedman's analysis of the Sandia Report:

Lurking under Stedman's assertions about the low quality of American schools is the assumption that such quality exists somewhere. It does not. At least, not in the best data we have to date. It is fine for us as a nation to work toward the kind of quality that Stedman aspires to, but as we do, we should keep in mind that such an educational system does not exist at present anywhere in the world. When articles contain sentences … about American kids being 'lapped' by those in other countries, or when a president contends that American kids are being 'slam dunked' by those abroad, or even when test scores are presented solely in terms of averages, many feel an irresistible urge to see the competitor nations having educational systems of monolithically high quality. But they do not… (p. 137)

The popularity and continued use of the National Assessment of Educational Progress (NAEP) forced many to realize there should be a connection between the test and the curriculum being taught. Standards began to be established, particularly in maths and science. The mathematics standards proposed by the National Council of Teachers of Mathematics in the 1980s provided the impetus for states to develop their own modifications of those standards.

The writing of the standards has been fraught with problems for many states, largely because there is disagreement as to what should actually be taught. The Educational Testing Service published an online article stating:

42 states had content standards in [Mathematics] in 1998. Science is second with 41 states and [these] emerged from the work of the National Science Teachers Association, the American Association for the Advancement of Science, and the National Research Council. There are now 40 states with social studies/history standards; English and Language Arts follow, with 37 states having established standards. About half the states now have standards in foreign languages, health, and physical education. (online)

Standardized testing, always a controversial issue, became even more so during the 1990s when constructivists argued that the annual testing ritual was, in fact, anathema to real learning. Phelps (2001) writes that they "oppose school practices that 'fix' behavior" (p. 15) because they believed that learning was far more likely to occur when students were actively engaged with tasks and materials so they could make their own connections between topics and subjects. This type of learning, they argued, did not fit in well with pre-established standards and could not be accurately assessed in a multiple-choice environment.

PROBLEMS WITH STANDARDIZED TESTING

In addition to those of the constructivists, many other criticisms surfaced. With such a complex issue at stake, there can be no completely correct interpretation as to whether standardized tests are right or wrong. Although this has become an emotional issue for many of the individuals and organizations involved, a nation's education system is only a reflection of the nation itself. Problems in the society are, of course, mirrored in any national program. As such, both advocates and critics of the standardized testing system will have valid points which will need to be researched and understood if we are, as a nation, to overcome some of the deficiencies and to provide the best education for our students which is, after all, our ultimate goal.

A brief review of the literature reveals at least nine problems that seem to be at the heart of the opposition to a continued reliance on standardized testing. These include the following:

Even though the research clearly shows students should not be tested on material that has not been taught, many standardized tests are still not aligned with standards.
States have developed high-stake testing procedures that unfairly penalize either schools or individual students on the basis of test results.
There is too much testing and teachers often do not receive adequate information that would help in their future teaching planning.
Higher thinking skills are not tested because most standardized formats are limited to multiple choice or other objective type questions.
Since Sputnik, there has been an emphasis on mathematics, science, and reading with the result that only certain subjects are tested.
Universities have not adapted their teacher education programs to provide beginning teachers with an appropriate understanding of assessment and measurement techniques.
The design of standardized tests does not provide representative authentic measurements and is not aligned with current learning theories, e.g., constructivism.
The continued use of standardized tests causes a change in both teaching style and content matter.
The research is not clear whether or not standardized tests are adequate predictors of future learning abilities.

The author of this paper firmly agrees with Worthen, White, Fan, & Sudweeks (1999) who state that "tests and other assessment instruments are essential to the educational process, but only to the extent that they are well designed and appropriately applied by qualified persons" (p. 5). As such, each of the aforementioned points will be briefly discussed before offering remedial suggestions.

Align Tests with Standards.

Although standards are being developed, the problem remains that there is no national standard. Historically, the public education system has been under the control of individual states as well as locally-elected Boards of Education. Each unit was perceived as being unique and, perhaps, better than other units. This has, unfortunately, produced a fragmented system of education that cannot possibly, in its current form, yield national standards that are interpreted consistently in all states by all schools.

Even within the school districts themselves, schools, particularly secondary schools, are isolated entities. This attitude has been pervasive and is easily seen in the unwillingness of many teachers to share material, to open their classroom, or to ask questions.

For many teachers, admitting to needing help is tantamount to saying they are incompetent. The teacher must know everything; not to know something is less than ideal and would break down the teacher-as-knowledge-bearer model that has been promoted in many school systems. As Dwight Allen points out in Schools for a New Century, making curricular decisions at the school level is a waste of time, and such decisions usually result from either individual bias or political motives.

It seems ludicrous that the same curriculum, offered to students at approximately the same grade level and with approximately the same learning backgrounds, should be taught for longer or shorter times in different districts for no other reason than to maintain the illusion of local control … [Is] it reasonable for 16,000 different school districts to make separate, largely uninformed decisions based on local opinion? So many people of goodwill … are left to make decisions for which they are inadequately informed, or worse yet, misinformed … that national standards of achievement are sacrificed for the appearance of having locally defined curriculum components. (p. 108)

Certainly, if the curricular decisions are not aligned with any national standards, how can we hope that the students will be tested on material that has actually been taught? According to the Educational Testing Service (ETS), "29 of [the] states also reported in 1997 that their assessments were not yet aligned with standards. So frequently, the system is divided against itself -- new content standards with old tests that do not reflect the new content and curriculum" (online).

High-Stake Testing.

As the ETS writes, "what we want from standardized testing is better information for teachers, administrators, policymakers, and the public. Testing used presently too rarely results in better information to aid instruction and achievement" (online). High-stake testing, however, is becoming the norm for some states and there are certainly notorious examples of how the testing itself is being used as a reform measure perhaps because results are quick and public.

One example of such a test is the Texas Assessment of Academic Skills (TAAS) which, although observers note "a greater focus on academic learning" (Phelps, p. 9), has been strongly criticized by, amongst others, FairTest, which has awarded the test a "rating of 2 on a scale of 1 to 5 with 1 being the worst score possible" (Phelps, p. 8). Although improvement has been noted, FairTest says that these improvements are more likely due to an enriched curriculum than to an adequate assessment strategy. Of particular concern is the reliance on multiple-choice items and the high-stakes graduation test.

Strategies have been devised to 'fix' test results. Anecdotal evidence suggests the following:

1. It would appear that Texas, along with some of the other high-stakes testing states, has begun to encourage questionable test-taking practices. For example, certain individuals who might be expected to lower the overall average, based on past performance, would be encouraged to remain home on the exam administration day.

2. Another strategy is to classify weaker students as learning disabled or as not being proficient enough in English to complete the examination. Students classified in such a manner are excused from taking the test which inflates the average score, and contributes toward the erroneous perception of marked improvement.

3. Students can sometimes fight back in unexpected ways with regard to high-stake testing. In one case, when a school administration refused to consider a request put forth by the senior class, the students retaliated by making a pact to purposely fail the standardized test offered in the spring, during their last semester. The test, used in part to establish budgetary allocations for the following scholastic year, had no bearing on individual students. The students did, in fact, fail the test and, the following year, the school was left to figure out how to budget with approximately $60,000 less than the previous year.

Cynically, one might infer from this story that, in the case of high-stakes testing, it is better to ensure testing outcomes affect individual students as well as the school community at large! However, more pertinent is the realization that the students, themselves, question the validity of having to sit for examinations that have no personal relevancy.
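The inflation described in the second strategy above, in which weaker students are excused from the test, is a matter of simple arithmetic. The following minimal sketch illustrates it under stated assumptions: the ten scores are invented for illustration and do not come from any Texas data, and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical scores for ten students (invented for illustration only).
scores = [42, 48, 55, 61, 66, 70, 74, 79, 85, 90]


def average_after_excluding_lowest(all_scores, n_excluded):
    """Return the mean score after the n lowest scorers are excused."""
    kept = sorted(all_scores)[n_excluded:]
    return mean(kept)


# Average when every student sits the test.
full_average = mean(scores)

# Average when the two weakest students are excused from the test.
inflated_average = average_after_excluding_lowest(scores, 2)

print(f"All students tested: {full_average:.1f}")
print(f"Two lowest scorers excused: {inflated_average:.1f}")
```

In this invented example the reported average rises although no student has learned anything more, which is the "erroneous perception of marked improvement" the anecdote describes.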

In another example (Phelps, 2001), fourteen parents filed suit against Johnston County in North Carolina because their children were held back based on individual test performance alone. Their argument, of course, was that the tests were not valid for measuring individual performance, as they were intended to be used to compare school districts and schools rather than individuals. In an interesting ruling, the parents lost the case because it was found that the County in question had actually given the individuals the entire test rather than selected parts.

The North Carolina tests do match state curriculum standards, however, and cover a representative sample of it. Because the state uses the tests to evaluate districts and schools, individual students usually see only one third of each subject area exam; by sampling this way, the state can cut testing time and costs. Had Johnston County held students back for poor performance on a test that covered only one third of the curriculum, that would have been unfair. Instead, the district put the three separate pieces of the exam together to form complete exams that covered the entire curriculum. (p. 11)

Infamously, the Virginia State Board of Education established a test for grades 3, 5, 8, and 11 which 98% of the schools failed (Bracey, p. 153). The tests were ultra high-stake, with schools losing their accreditation for failing and students not being able to graduate. This type of test is exactly what Richard Rothstein of the Economic Policy Institute believes is wrong. He has devised an accountability system that promotes, among other things, an emphasis on improvement "rather than some preset goal…" (Bracey, p. 166).

As with all testing issues, studies have also shown positive results for students taking exit examinations. As Phelps (2001) has found, exit-examinations may lead to increased earning power for students.

…test-taking students earned an average of 3% to 5% more per hour than their counterparts from schools with no minimum competency tests. And the differences were greater for women, with as much as 6% higher earnings for those who had taken the tests. Other evidence of the success of high stakes state testing programs continues to surface. (p. 22)

How Much Testing.

In America, many students take various locally and state-mandated tests every year. However, there are some pivotal years, namely the 4th, 8th, and 11th grades, in which virtually every student is tested. These grades correspond with the traditional primary, middle, and high school breakdowns and are frequently used in international comparisons. Some areas, however, repeatedly offer the same examination both to increase proficiency and to identify those students with difficulties. For example, the North Carolina exit examination is initially offered in the student's sophomore year, which means students have three chances to pass (Phelps, p. 11). Other states, for example Virginia, even test twice during the elementary years (Bracey, p. 153). Since the tests are ordered for various purposes, it is difficult at this point to say whether or not there is too much testing. What is certain is that the testing could be more efficiently conducted and that international students are not tested as frequently.

Although international students may not be tested as frequently, there is definitely a consequence for failing to achieve an acceptable standard. In England, a critical examination was referred to as the "11 plus" and determined what type of secondary education all publicly educated students would receive. Several Asian countries, including Korea and Japan, have intense examinations offered to students during this time which effectively determine the course of their lives. These cultures are not set up to educate every student to the same standard. It is recognized that certain students will be more intellectually capable than others.

Even though international students are used as comparisons, Freeman (1995) reports that such comparisons are, often, very wrong.

…but the Educational Testing Service warned against using international test results for comparisons because of cultural differences. In many countries, an elite group takes the tests, whereas our test population includes all students in the age group. (p. 2)

Higher Order Thinking Skills.

FairTest often criticizes states on the overabundant use of multiple-choice format questions because, it is argued, these items "demand only factual recall and 'lower order' thinking" (Phelps, p. 18). However, Phelps (2001) states that this is an unfair claim because the quality of a question is not in the answer but in the structure of the question. For example, there are many ways to solve any particular question e.g. calculations, sketches, diagrams, pen/paper, etc. What gets recorded, however, is just a mark that will never reflect the type of thinking that was used.

All the optical scanner will read in the end, however, is a sheet of circles, some filled in with pencil and others not. [The] calculations, sketches, and diagrams the student used to solve the problems are left behind in the test booklet, on scratch paper, or in the student's head. Just because the optical scanner and computer do not see the 'process' evidence of 'higher order' thinking, however, does not mean it did not take place. (p. 19)

Given that standardized tests must include a representative sample of knowledge, including neither the easiest nor the most difficult because of the resulting 'messy' statistics, it does not seem probable that many thought provoking questions could be asked. Bracey (1999) finds that these statements are often used by those who wish to justify the nature of multiple-choice questions when, in reality, they can only occur if the test participant has acquired sufficient knowledge about a topic to warrant an analysis or synthesis-type question.

I recall those kinds of items from my own graduate school days. The stem of the item would cover almost an entire page, and about four questions would be based on that stem. The whole two hour test might contain only six to eight items. To answer correctly, you had to be thoroughly versed in, say, the differences between Hull's and Tolman's learning theories (meaning, of course, that you had to know both theories well in order to derive the differences) or in the assumptions that must be met for certain statistical analyses. (p. 151)

Interestingly, although critics complain about the number and type of objective questions used in a test, the fact remains that a majority of teachers themselves do not necessarily supply any more higher order thinking questions than are provided by the standardized tests. In fact, the author can remember standing in front of the Optical Mark Reader (OMR) with more than 200 test cards and then jumping with glee when, two minutes later, the marking was completed! Faced with a time crunch and large numbers, teachers will opt for the easiest marking possible during a scheduled examination session.

Only Certain Subjects Tested.

Ever since Sputnik, there appears to have been an emphasis on Mathematics, Sciences, and Reading. The Educational Testing Service indicates that this may be due to the fact that, at the time of national panic, college professors "stepped forward to redefine mathematics education, and the rest of the curriculum, creating a new math, inquiry teaching, and many courses strange to the taste of most teachers and parents" (online). International tests like the Third International Mathematics and Science Study (TIMSS) choose those subjects because the content, particularly in maths and physics, is relatively the same in any country.

Other courses, although important, have not been as quick to adopt standards and are not tested as often. Certainly, this can be a problem if an individual has talents in other areas of the curriculum. For example, students at the Key Renaissance School in Indiana do not have superior results in traditional, standardized test situations. However, the school has been at the forefront of innovative techniques designed to align the student and school more closely together by, among other things, identifying an adult mentor in the community with whom to work co-operatively. Other countries, including the United Kingdom, France, Spain, Germany, Japan, and Korea, test in other subject areas. Testing such a limited choice of subjects can lead to the misinterpretation of results for certain school districts.

Teacher Education Programs.

Accountability is not only a critical issue for administrators of secondary and primary schools, but for the nation's tertiary institutions as well. Minimum competency tests have been the norm now for several years in several states; however, the emphasis on "minimum" does little to measure the progress made from one year to the next. Anecdotally, I recall the MCT I took in 1987 in the state of Wisconsin. To this day, I remember leaving the room ashamed I had chosen an occupation that so little valued excellence among its members. The questions I was asked could, in my mind, have been answered by any successful high school graduate and had no bearing on whether or not I would become a successful teacher.

Kleiner (2001) reports that some states are now penalizing education programs whose graduates are not able to receive a passing mark.

In Texas last fall, 35 of 85 education schools fell short of the state rule that 70% of the graduates must pass the certification exam; if these institutions don't meet the standard within three years, they will lose the right to prepare teachers. Both Massachusetts and New York are weighing 80% pass-rate directives. Unless they make significant progress, at least one third of the 60 programs in Massachusetts and two dozen of the 113 in New York are in danger of missing the mark. (online)

Amazingly, with a few notable exceptions, most university teacher programs have not been re-assessed to better serve a new generation of students. According to Tienken & Wilson (2001) "35 out of 50 states do not require teachers take a course, or demonstrate competency, in the area of assessment" (p. 1). Unless teachers take charge of more areas within their capacity, they will remain as non-professionals in a world where professionalism is highly valued.

Current Learning Theory Controversy.

Jensen (1998) writes that current brain research is indicating just how adaptive our brains are and that one of the problems in traditional classrooms is that we do not allow that adaptability to develop. One of the issues, for him, is that most standardized tests take a singular approach to a single answer: because the stem of the question is of a very simple format, there is no alternative for the brain other than to find the "correct" answer.

Constructivists would agree in the sense that, generally speaking, students do not have any input in terms of what or how a topic is taught. For that matter, the actual subject itself is usually chosen by an adult rather than the student. As brain research and new learning theories are pointing out, the brain is able to group different types of information together in a cohesive fashion and this does not always occur in the traditional simple-to-complex format. In other words, most classroom teachers will select a topic, determine what is important about that topic or choose the appropriate relevant standards, and work out a continuum of what must first be taught at a basic level, at a more intermediate level, and, finally, at an advanced or mastery level. The projects and/or assignments given to the students will reflect this progression from the simple to the complex. However, our brain doesn't necessarily follow this pattern and this is where the constructivists argue that standardized testing is a waste of time and resources.

For constructivists, the world has meaning but this meaning is put together by individuals themselves. It cannot be dictated and it cannot be taught per se. All that can be accomplished is to provide the learning environment conducive to the brain's ability to find and use patterns. Once a pattern is established, the brain can then organize that information for memory and/or recall until such a point when newly acquired information contradicts that pattern and the process must begin anew.

To be perfectly fair, there are not many constructivists who believe in total control by the students. Instead, educators rely on a mixture of cognitive and constructivist theories to provide a balance of learning opportunities. In this way, students are provided with enough material necessary to form a knowledge base and, from there, different patterns can be established based on the individual's own learning temperament.

Tests Cause Change in Teaching Style.

Critics of high-stake standardized tests argue that teachers change both their teaching styles and content material in order to achieve higher marks on the examinations.

Instead of leading to stronger academic achievement, it is said to interfere with good teaching and learning. In this contention, the critics embrace a sort of domino theory. Pressure to produce higher scores leads teachers to focus on material that will be covered by the tests and to exclude everything else. The curriculum is thereby narrowed, which means that some subjects are ignored. [As] a result, tests scores get inflated while real learning suffers. (Phelps, p. 4)

Of all the criticism levelled at standardized testing, the author finds this particular one difficult to understand. If, in fact, the tests are aligned to the standards, which is happening more frequently, and the standards were used to help design the curriculum, then wouldn't the teaching of the course be closely related to the examination? One should think of the whole learning exercise as more of a continuum with standards, curriculum, assessment, and evaluation all inter-related.

However, if, as Allen (1992) indicates, "teachers are locked into structures -- staffing arrangements, classroom sizes, curricula expectations -- that make no rational sense" (p. 61), there may be other issues at stake. For example, the standards may not be known to the teacher; the curricula may be dictated by another individual; or the past test results may not have been passed on to the teacher so that course revision could take place. Such a disempowered teacher might, in fact, clutch at any technique to help students succeed, even if that meant "teaching to the exam".

Adequate Predictors of Future Potential.

Although standardized tests have different purposes, some are better known to the public than others. One of the best known is the Scholastic Aptitude Test (SAT). Certainly, the traditional argument for students to take this, as well as the PSAT, was to predict their success or lack of success at the tertiary level. The SAT has been strongly criticized in the past, with allegations of test bias against minorities and females.

However, some believe the test-bias is not present and the difference in gender scores can be explained by the selection of courses in high schools. Traditionally, females take fewer maths and science courses than males and, therefore, test lower on the math component of the exam.

The racial bias has been explained away by the Testing Service as due to the inferior nature of many high schools heavily populated by minority groups. As Bracey (1998) points out, this is not unexpected. "Educational reformers talk as if the typical American school is in need of major repair … but the schools that really need it are those with the least resources and the worst social environments" (p. 128).

Recently, however, the validity of the test as a predictive instrument has been questioned. At least one university has done some comparative studies between students with varying SAT scores and has found no significant difference in the grades of those students after one year of university work. However, the author does not know whether the students were enrolled in similar level courses or whether other factors were involved that might have influenced the outcome.

Other studies have found the predictive validity of the SAT to account for only about 6% to 8% of the variation in first year college grades after other issues have been taken into account (Phelps, p. 13). While some college admission specialists say that even 6% to 8% is worthwhile given the cost of maintaining a student at the university, others argue that it isn't worth the trouble and that the admission specialists can get a more accurate picture of the student's potential from other measures including portfolios, essays, recommendations, and interviews.
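To give a sense of scale for the 6% to 8% figure, the share of variance explained can be converted into a correlation coefficient. The sketch below is a rough illustration only: it assumes the figure behaves like the squared correlation of a single predictor, whereas the studies cited report it after other factors have been taken into account, so the conversion is a simplification and not the method those studies used. The function name is hypothetical.

```python
import math


def implied_correlation(variance_explained):
    """Correlation whose square equals the given share of variance.

    Simplifying assumption: a single predictor, so r = sqrt(R^2).
    """
    return math.sqrt(variance_explained)


# The two ends of the range quoted in the text.
for share in (0.06, 0.08):
    r = implied_correlation(share)
    print(f"{share:.0%} of variance explained -> r of about {r:.2f}")
```

Under that assumption the range corresponds to a correlation of roughly one quarter, which helps explain why admission specialists can reasonably disagree about whether the test is worth the trouble.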

HOW CAN STANDARDIZED TESTING PROCEDURES BE IMPROVED?

A concerted policy of redefining the goals of testing and assessment, of including testing and measurement courses in education programs and as staff development options, and of initiating a responsible public relations education policy will need to be implemented if we are to enter the 21st century with a renewed understanding of the nature of testing and assessment. Some of these programs are already being implemented. Before addressing each of the three issues individually, it would, perhaps, be beneficial to remind ourselves why testing has been advocated by so many and criticized by others.

Robert Linn, in his 1995 lecture at ETS, explained why standardized testing has become such a phenomenon:

Tests and assessment are relatively inexpensive compared to changes that involve increasing instructional time, reducing class size, attracting more able people to teaching, hiring teacher aides, or enacting programmatic change involving substantial professional development for teachers.

Testing and assessment can be externally mandated far easier than anything that involves change regarding what happens inside the classroom.

Testing and assessment changes can be rapidly implemented, often within the term of elected local officials.

Results are visible and can be reported to the press with immediate ramifications for both positive and negative results. (on-line)

It would appear that the primary motivation behind so much testing is political. It was inevitable that this should happen because educational policies are directed from a local level, with few exceptions. The federal government has begun to play a larger role as evidenced by the implementation of the 1990 Six National Goals as well as the promotion of the Department of Education to a cabinet-level posting.

However, the concern is still that there is no national curriculum and, as long as local politics are heavily involved, assessment results will continue to be misinterpreted and misrepresented. Allen (1992) promotes the idea of a mixed national and local curriculum to nullify some of these problems. A national curriculum could be set up for a fixed percentage of the curriculum while a smaller percentage would be locally devised, dependent on the needs of the area. In this way, the national assessments could still be maintained as a method of comparing different areas of the country while the "local flavor" could still be maintained and controlled.

As Huelskamp (1993) states, the "nation must clarify and agree on the changes needed and must find strong leadership for improvement efforts" (p. 8). Our continued fractious environment, at both federal and local levels, does nothing to facilitate the necessary education reforms. A beginning would, however, have to include the union of federal and local concerns for a combination national/local curriculum. A second task would be the redefinition of testing and assessment goals.

Redefinition of Testing and Assessment Goals.

During the 1980s, the National Assessment Governing Board (Phelps, 2001) bowed to public and media demand and established three performance levels - "basic, proficient, and advanced" - which were then used to show the American public how its children were doing in relation to how they ought to be doing. Although there had initially been strong arguments against releasing state by state NAEP scores, the change was made and, reluctantly, local scores were released.

The pressure from the public, which had for years shown itself to be in favor of more testing as well as high stakes testing, resulted in this adaptation to the NAEP that had first been conducted in 1969 for the express purpose of displaying national results. Annual polls conducted by Phi Delta Kappan/Gallup continually show this to be true as evidenced by the following statements:

(Elam, 1991) Americans want to know what progress is being made toward attaining the national goals for education. By a margin of better than 3-1, they favor preparation and publication of 'public school report cards' for individual schools, for each school district, for each state, and for the nation. Again, every category of respondent supports the idea of report cards on the schools, but parents of school children are particularly enthusiastic. (p. 44)

(Elam, 1994) More than eight in ten (83%) responded that a standardized national curriculum was either very important or quite important; similarly, about seven in ten (73%) thought standardized national exams were either very or quite important (p. 48)

(Elam, 1995) It is no surprise, then, that the vast majority of respondents in this year's poll (87%) favor setting higher standards in the basic subjects than are now required in order to move from grade to grade. Nearly as many (84%) favor setting higher standards for high school graduation. (p. 46)

(Rose, 1997) Testing and its role in school improvement is a frequent subject of debate. Respondents this year were asked their opinion of the level of emphasis on testing in their local public schools. Forty-eight percent responded that the emphasis is about right. The rest were divided between too much and too little. (p. 46)

(Rose, 1998) President Clinton has proposed a voluntary national testing program in which students at the fourth- and eighth-grade levels would be tested in order to measure the performance of the nation's public schools. … Seventy one percent say they favor the idea, and the support is uniform across all demographic groups. (p. 51)

It is not, however, just the public who favor national exams; concurrent polls conducted by Phi Delta Kappa regarding teachers' attitudes toward public schools find that teachers do as well.

The number of teachers who believe that high school students should pass a standardized national exam before receiving their diplomas has grown steadily, from 48% in 1984 to 54% in 1989 to 67% in 1996. This year slightly more elementary teachers (70%) than secondary teachers (65%) favor such a graduation exam. (Langdon, p. 248)

The fact remains, however, that while there is popular support for a national curriculum and national testing, that testing must be fair. As Phelps (2001) states, "the fact that tests and test results can be misused is beyond dispute … there is also no denying that tests are imperfect measurement devices" (p. 27); however, that does not mean they should be abandoned. What is important to keep in mind is that the test's goals should be clear and unequivocal. For example, national tests that compare different states are validated for that purpose and local district and/or individual tests should be validated for those purposes.

In general, the "measurement of achievement gets more difficult the smaller the unit of analysis. It is easier to measure at the district level than at the school level and easier at the school level than at the classroom level" (Bracey, p. 166). Various states employ strategies that try to reverse the trend of misinterpretation. One such state is South Carolina.

In an effort to distinguish between what students have learned in school versus what they know, South Carolina initiated the 1984 Education Improvement Act. The Act set up procedures whereby the state's schools were divided into five groups of similar socioeconomic backgrounds. They then set about to compare the schools within each one of those groups on an annual basis after first establishing baseline records (ETS, online). By organizing a total record keeping program, the students' progress could be documented from year to year and important instructional strategies applied where needed. This, of course, is only one example of where states have initiated responsible policies designed not to send panic waves through the general public but to properly utilize the assessments to help provide a better education.

Testing and Measurement in Education Programs.

The ETS fears that teachers are not well prepared to conduct quality assessments.

This position will leave a lot of people concerned that while testing and grading is left up to the teachers, they have not been well prepared to conduct quality assessments. They are taught little about day to day classroom assessment approaches in school. Nor is much professional development offered. Assessment is part of teaching instruction, and teachers must learn to adequately assess students. Given continued emphasis on standardized testing to hold teachers and schools accountable, the alternative of equipping teachers to do their jobs will continue to be neglected. Teachers and teaching need help. We can have external verification of how well the students in a class or school are doing through sample based standardized assessment that are properly designed and aligned. (online)

Although ETS does not sound particularly hopeful, Kleiner (2001) indicates that, in fact, the nation's universities are now realizing the importance of providing a more professional generation of teachers. At this point, we must stop blaming everyone else and decide what each one of us can do to improve the situation.

Some universities have been stirred to action by increasing legislation in states that is demanding accountability at the tertiary level as well. This legislation is based on the number of graduates passing certification exams, the establishment of a National Certification Program, and the continuance of a school's accreditation. If the legislation did not frighten them, surely the spectre of last October's testing debacle in Massachusetts made it clear that if they did not become accountable, their programs might be closed.

Last October, when nearly half of their peers failed the Massachusetts teacher certification tests, 77% of Boston University's education graduates got passing grades. Not bad, compared with the 28.6% who passed from Framingham State College, or the 25% from the University of Massachusetts, Boston. But not acceptable, says BU Chancellor John Silber, who is also chairman of the state Board of Education. '[They] don't know grammar, they have poor vocabulary,' he fumes. 'If they can't spell, how are they going to correct the spelling of children?' Silber has vowed to shut down BU's education department unless its pass rate improves to 90%. (Kleiner, p. 1)

As if it wasn't enough that the number of students who successfully passed the examination was alarmingly low, the test developer, National Evaluation System, has refused to supply any reliability or validity data. Boston College researchers, however, estimated the reliability to be extremely low and are concerned as to why the data has not been released.

Boston College researchers estimated the reliability of the tests to be … .27 for reading and .36 for writing. Even when the researchers removed some outlier scores, they were unable to get the coefficients above .50. … when we're talking reliability, we're aiming at .90 or better. (Bracey, p. 159)

Although no validity data is available, Worthen, et al. (1999) would be equally concerned about the estimated low reliability scores. They recommend a reliability coefficient of at least .80 for teacher-made tests used in critical decisions about individual students and state that standardized achievement or aptitude tests should be around .90 or higher. What is left unanswered, of course, is whether the students did poorly because the test was badly designed or because they did not know the material.
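One way to see why these coefficients matter is the standard error of measurement from classical test theory, which is the score standard deviation multiplied by the square root of one minus the reliability. The sketch below is illustrative only: the standard deviation of 10 points is a hypothetical value, not a figure reported for the Massachusetts tests, and the function name is an assumption.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if score_sd < 0:
        raise ValueError("score_sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical score standard deviation, chosen only for illustration.
ASSUMED_SD = 10.0

# Reliabilities quoted in the text: the Boston College estimates (.27, .36)
# and the recommended thresholds (.80, .90).
for reliability in (0.27, 0.36, 0.80, 0.90):
    sem = standard_error_of_measurement(ASSUMED_SD, reliability)
    print(f"reliability {reliability:.2f}: SEM of about {sem:.2f} points")
```

With the assumed standard deviation, a reliability of .27 leaves a standard error of about 8.5 points, compared with about 3.2 points at .90, so a pass/fail decision near the cut score would be far less dependable on the less reliable test.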

In their re-design of teacher education programs, consideration of the use of technology and a look at how international schools handle standardized assessment would be useful. The 32nd Phi Delta Kappa/Gallup Poll reported that the public believes technology has improved education (69%) and that more should be invested in technology (82%). Of course, investing in technology means not only the hardware but also the training and software necessary to create a viable program.

There are numerous examples of successful technology programs (Foshay, 2000, PLATO); however, educators should also be looking to become more technically oriented with regards to assessment. "The focus of professional development should be on teaching and learning strategies that make a difference in daily practice -- on activities that translate into stronger student performance" (McKenzi, online). See Appendix A for a listing of web resources that might provide the basics for such a study.

A difference in daily practice includes not only the general curriculum material but an improved testing medium as well. "For educators, the most relevant computer applications in measurement are (1) using computers as an alternative medium for test administration and (2) using computers for designing and delivering individually tailored testing" (Worthen et al., p. 22). To date, most educators are satisfied using computers to do the work they used to do by hand, but they are missing the point. Computers can be used for so much more. Applied use of computers in a planned assessment program should enable teachers to become more proficient and professional. Otherwise, as Larry Cuban questions in McKenzi (2001), "'How can it be .. that so much school reform has taken place over the last century, yet schooling appears pretty much the same as it's always been?'" (online)

A benefit of using a computerized test is the ability to individualize each student's test based on their answers. In this way, the student remains challenged with questions that are neither too difficult nor too easy. One school district that has implemented computerized testing is South Madison, a small community located near Indianapolis. Coyle (2001) finds the tests have created a new atmosphere in the school with teachers becoming more flexible to help students and to look more carefully at each student's performance.

The test data also allows for more meaningful accountability. … Our new computerized tests provide a more accurate picture. We can answer questions with information about actual student growth, not just the snapshot information reflected in the state test results. (online)

Surely South Madison has adopted, of its own initiative, what universities should be teaching and what other school districts should be implementing. It is clear that at least for that school district, the state mandated tests have their role but they will not be used as a basis for decisions about individuals. They will have established their own norms and will be able to improve learning amongst their own students through the use of adaptive computer testing.
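The idea of tailoring a test to each student's answers can be sketched with a simple up/down rule: a correct answer leads to a harder item and an incorrect answer to an easier one. This is a deliberately simplified illustration and is not the procedure used in South Madison or described by Worthen et al.; the difficulty scale of 1 to 9, the function name, and the simulated student are all hypothetical.

```python
from typing import Callable, List


def run_adaptive_test(
    answers_correctly: Callable[[int], bool],
    num_items: int = 10,
    levels: int = 9,
    start_level: int = 5,
) -> List[int]:
    """Simulate a simple up/down adaptive test.

    answers_correctly takes a difficulty level and returns True when the
    student answers an item of that difficulty correctly.
    Returns the difficulty level of each item presented, in order.
    """
    level = start_level
    presented = []
    for _ in range(num_items):
        presented.append(level)
        if answers_correctly(level):
            level = min(levels, level + 1)  # harder item next
        else:
            level = max(1, level - 1)  # easier item next
    return presented


# Hypothetical student who can answer items up to difficulty 6.
print(run_adaptive_test(lambda level: level <= 6))
# prints [5, 6, 7, 6, 7, 6, 7, 6, 7, 6]
```

After the first few items, the simulated test settles around the boundary between what the student can and cannot answer, which is the sense in which the questions are neither too difficult nor too easy.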

In addition to the implementation of computerized testing in a planned assessment program, a look at international curricula might provide alternative methods for the standard multiple-choice format. One of those alternatives is the International Baccalaureate (IB), which is a two-year program offered to juniors and seniors. The IB is a complete program, and its largest growth is in America (where it is wrongly compared to the Advanced Placement program) and Canada. Although the number of students doing the IB cannot in any way compare to the number of public school students in America, its examination format would certainly be workable.

As there are many courses offered and the assessment procedures will differ in detail, the author will present information about one of the courses, Information Technology in a Global Society (ITGS), which is basically a computer course presented from a humanities point of view. The course is set up with a specific list of objectives and syllabus content. There are suggestions for course textbooks, etc. but nothing is mandated. The final assessment consists of two parts: an internal and external component. See Appendix B for an overview of the assessment process in this course.

This is just one example of the type of 'standardized' examination that could be provided to American students. It offers a combination of internal and external marking, is moderated at least two times, and results in relevant feedback to the teacher in time for the beginning of the next year.

NOTE: Appendices C, D, and E, located at the end of this paper, include several texts useful for understanding educational statistics, web references for associations concerned with testing, and notes from the textbook, Measurement and Assessment in Schools, that, taken together, will form the basis for a future web site.

Responsible Public Relations Education Policy.

In the absence of a national public relations education policy to help educate the public (and educators), the author has found that one organization and one individual have consistently tried to counteract the negative impact of bad educational news. The organization, the Phi Delta Kappa Foundation, has conducted public opinion polls, in collaboration with Gallup, for the last 33 years. Each year, the report is released in September and contains both the questions and interpretations of the statistics. Often, percentages from earlier polls are quoted to show general trends. These statistics are then referred to in many articles.

The report also includes information as to how the poll was conducted and, sometime after 1991 but definitely by 1994, began reporting more statistical information. For example, in 1994, a short explanation of cross comparisons is present along with a fairly detailed description of sampling tolerances. Perhaps this was included because the general public was not aware that a sample of 1500 set up as an "unclustered, directory-assisted, random-digit telephone sample, based on a proportionate stratified sampling design" (Elam, p. 56) could accurately depict the opinions of the whole nation.
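A rough calculation shows why a sample of about 1500 can stand in for the nation. The sketch below uses the textbook margin-of-error formula for a simple random sample at 95% confidence; this is a simplifying assumption, since the poll's stratified design and its published sampling tolerances are not reproduced here, and the function name is hypothetical.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a simple random sample.

    proportion=0.5 gives the widest (most cautious) margin;
    z=1.96 corresponds to 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


# Sample size mentioned in the text.
print(f"n = 1500: plus or minus {margin_of_error(1500):.1%}")
```

Under these assumptions the margin is about plus or minus 2.5 percentage points, and it depends on the size of the sample rather than on the size of the population being described.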

The individual whose research, comments, stories, and editorials supporting public school education abound in every medium is G.W. Bracey, an educational psychologist. His career changed in 1990, when he read Richard Cohen's column, "Johnny's Miserable SATs" in the Denver Post (Bracey, 2000, p. 133). At that time, he began his own statistical evaluation of some SAT statistics, sent it off to Education Week, and realized educational statistics were being misrepresented and misinterpreted across the nation by educators, politicians, and columnists. His major paper is published in October every year, after the PDK/Gallup Poll is released.

One of the areas he criticizes is the comparison between American students and international students. Even one of the national goals set up by President Bush alluded to this comparison: By the year 2000, American students will be first in the world in mathematics and science achievement. The author, who was educated in Taiwan, Okinawa, and Panama and has worked in schools in Costa Rica, Nicaragua, Sri Lanka, Lesotho, and, currently, Saudi Arabia, would agree that the systems being compared are, in many ways, so dissimilar as to make the comparisons unintelligible. Primarily, the factors that are so often incorporated into our national statistics to ensure that there are no socioeconomic conditions skewing the results are rarely applied in international comparisons.

In fact, the two subjects that are most prone to misinterpretation and always compared are mathematics and science. In America, they are both taught in discrete units; however, overseas, many schools teach the subjects in an integrated manner, particularly mathematics. For example, instead of learning algebra, geometry, trigonometry, and calculus separately, all four aspects are taught together which, interestingly, puts American-taught students at a disadvantage. Even an international student at a grade equivalent to sophomore status will have had some exposure to all four areas.

Science, as well, can be taught in an integrated fashion during the early years in high school but, by the time the student is the equivalent of a junior or senior, specializations will be taught. Even though a student may not take biology or chemistry during his/her last years of high school, a broad overview may already have been presented in the earlier years. Bracey (2000) presents a far more accurate interpretation of the results of the Third International Mathematics and Science Study (TIMSS) than is found in most publications. Hopefully, Bracey and Phi Delta Kappa will continue to promote the public education cause until the rest of the public becomes more statistically literate and the teachers are educated to a better standard.

CONCLUSION

Our forefathers thought it prudent to put control of education into the hands of each state's legislature. It was not until President Carter's presidency that the Department of Education became part of the Cabinet. A decade later, it was President Bush and the nation's governors who, in 1990, put forward six national goals that were to be achieved by the year 2000. They included:

By the year 2000, all children in America will start school ready to learn.
By the year 2000, the high school graduation rate will increase to at least 90% from the current rate of 74%.
By the year 2000, American students will leave grades 4, 8, and 12 having demonstrated competency in challenging subject matter, including English, mathematics, science, history, and geography. In addition, every school in America will ensure that all students learn to use their minds, in order to prepare them for responsible citizenship, further learning, and productive employment in a modern economy.
By the year 2000, American students will be first in the world in mathematics and science achievement.
By the year 2000, every adult American will be literate and will possess the skills necessary to compete in a global economy and to exercise the rights and responsibilities of citizenship.
By the year 2000, every school in America will be free of drugs and violence and will offer a disciplined environment conducive to learning. (Elam, p. 43.)

Although the year 2000 has come and gone, we still have not been able to address all the problems with the public education system. A brief review of the literature provides at least nine problems that seem to be contentious issues. These include the following:

Even though the research clearly shows students should not be tested on material that has not been taught, many standardized tests are still not aligned with standards.
States have developed high-stake testing procedures that issue penalties to either schools or individuals themselves based on test results that are not fair.
There is too much testing and teachers often do not receive adequate information that would help in their future teaching planning.
Higher thinking skills are not tested because most standardized formats are limited to multiple choice or other objective type questions.
Since Sputnik, there has been an emphasis on mathematics, science, and reading with the result that only certain subjects are tested.
Universities have not adapted their teacher education programs to provide beginning teachers with an appropriate understanding of assessment and measurement techniques.
The design of standardized tests does not provide representative authentic measurements and is not aligned with current learning theories, e.g., constructivism.
The continued use of standardized tests causes a change in both teaching style and content matter.
The research is not clear whether or not standardized tests are adequate predictors of future learning abilities.

A concerted policy of including a redefinition of testing and assessment goals, a re-organization of teacher education programs to include testing and measurement, and a responsible public relations education policy will need to be implemented if we are to continue into the 21st century with a renewed understanding of the nature of testing and assessment. This will lead to a re-establishment of education as a profession rather than just a form of employment. Any nation’s education system is but a reflection of itself and, given that the commercial, public media will continue to promote the poor image of the public education system, it remains for the educators to turn the tide and to focus once again on the original goal, that of the establishment of a viable system to educate all of the country’s youth. We should not be supporting programs that benefit just the rich, the intelligent, and the native speakers but also those of limited language ability, those with learning disabilities and those whose lives outside the school environment preclude a successful educational experience.

APPENDIX A

RESEARCH AIDS

Note: The following resources were obtained during an Introduction to Research seminar sponsored by Capella University, December, 2000.

Research Design:

Methodologist's Toolchest: A tool for developing research proposals and research designs. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13042

Data Gathering:

Sphinx Survey: A software program that represents the one stop shop for creating, developing, administering, and analyzing your survey or research questionnaire. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13040

Best: Consists of a module for data collection including an ability to record start and stop times during behavioral observation. http://www.skware.com/

Ronin Results for Research: The main function of this program is telephone interviewing and large quantitative research projects. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13038

MediaLab: You can create sophisticated questionnaires that can run graphics, sound and video files as well as Powerpoint presentations all as part of the questionnaire. http://www.empirisoft.com/medialab/index.htm

Qualitative Data Coding and Analysis

QSR NVivo 1.2: This program allows for the coding and analysis of qualitative data such as text, graphics, video and audio. http://www.sagepubl.com/Shopping/Software.asp?t=ldesc&id=13035

WinMax: Stands out for its unique interface system. Codes are easily stored in a hierarchical system similar to Windows Explorer. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13039

Atlas.ti: A coding program that allows for hyperlinking to text, other files, graphic files, audio files and multimedia. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13022

Ethnograph: One of the first programs to pioneer computer assisted qualitative data analysis. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13024

HyperResearch: A simple coding and analysis program like WinMax or Atlas.ti but it also handles graphic files, audio files and multimedia. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13025

Code-A-Text: Allows coding of text-based and multimedia materials. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13032

Decision Explorer: Basically a tool that allows you to take brainstorming sessions and map the concepts and ideas presented. http://www.sagepub.com/Shopping/software.asp?t=ldesc&id=13033

Diction 5.0: Allows you to quantitatively compare and process text. http://www.sagepub.com/Shopping/Software.asp?t=ldesc&id=13034

Quantitative and Statistical Analysis:

GBStat: A statistical analysis program that is powerful and fairly simple to use. http://www.gbstat.com/

Minitab: A statistical and data analysis package that is also fairly simple to use and encompasses many of the features of more expensive statistical programs. http://www.minitab.com/

SPSS: A product that could be considered the "standard of the industry". http://www.spss.com/

Microsoft Excel: The "standard of the industry" for spreadsheets can also do many statistical analysis functions. http://www.microsoft.com/office/excel/default.htm

Report and Presentation Tools: Reference Manager, Format Ease, Endnote and Pro Cite are all bibliography and reference tools.

Reference Manager: http://.risinc.com/
Format Ease: http://www.formatease.com/
Endnote: http://www.niles.com/
Pro Cite: http://www.isiresearchsoft.com/
L&H Voice Express: This is a dictation and transcription program. http://www.lhs.com/voicexpress/
Hypercam: Real-time screen capture. http://www.hyperionics.com/
Hypersnap DX: Screenshot shareware program. http://www.hyperionics.com/

APPENDIX B

OVERVIEW OF THE ITGS ASSESSMENT PROCESS

Internal Component:

Portfolio. The student will have written a number of papers (700-1000 words) throughout the two years. These papers will follow a particular format and will require external resources but will all involve an aspect of computers and society. Each paper will be dated, assessed, and kept for two years. At the end of two years, the student will choose the four papers that he/she thinks best represent his/her improvement during the course. These papers will then be collected along with their assessment scores, and the teacher will record the score, using pre-set criteria, that most accurately reflects the performance of the student at that moment in time. So it is not an average score but a score that shows his/her present attainment.

These papers will then be sent to an Assistant Examiner who has no connection with the school. The Assistant Examiner will examine the papers for any inconsistencies and write up a summary as to whether or not the teacher accurately marked the papers. This person, who will have been in contact with an Examiner, will then send the papers on to the Examiner. The Examiner will then meet with other Examiners from around the world. The grade boundaries for that year will be set using the pre-set criteria but every year there is a slight fluctuation, either up or down, in the particular grade boundaries.

The Examiner will then write a general report for all schools involved in the program with suggested areas of improvement, etc. A school reporting for the first time will receive an individual report specifically listing areas that have been satisfactorily addressed as well as areas for improvement.

The Portfolio counts as 20% of the final mark.

Project. The Project will be of a practical nature and will be composed of three parts: a journal indicating plans, problems, and progress; a report (2000 words) that will follow a particular format; and the actual product itself. The student may think of any project as long as there is a beneficial component for society (as defined by the student). Past projects have involved web-site design, pamphlets and brochures, graphic design, etc. The product must be able to be evaluated by others. Once the Project is completed, it will be assessed and sent to an Assistant Examiner where the process listed above for the Portfolios will be followed. The Project counts as 30% of the final mark.

External Component: These are the written examination papers. Around the world, they are conducted at a certain time, under very stringent guidelines, including a 24-hour ban on any individual looking at the tests to preclude anyone who might want to advise a student in another time zone about the test! These exams count as 50% of the total mark for the course. The papers are neither marked nor seen by the teacher. Once the examinations are completed, the papers are immediately sent off to an Assistant Examiner, where the same procedure as that of the Portfolios and Projects is followed. The difference with these examinations is that there is a proposed mark scheme the Assistant Examiner has received; however, he/she is in constant contact with the Examiner for any changes that have resulted from the students' answers.

Paper 1. This is a multiple-choice examination that lasts for 1 hour. Questions are aligned to the syllabus and are changed every year.

Paper 2. This is a guided answer essay format that lasts 1 1/2 hours. There are two parts to the examination: Part A and Part B. In Part A, one guided essay-type question must be answered. In Part B, there are four choices of guided essay-type questions. Three essays must be answered. Questions are changed every year.
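The weighting described in this appendix (Portfolio 20%, Project 30%, external examination papers 50%) can be illustrated with a short calculation. This is a sketch only: it assumes each component has already been converted to a percentage score, treats the two examination papers as a single external score because their individual weights are not given above, and uses hypothetical names and example scores. It does not reproduce the IB's moderation or grade-boundary procedures.

```python
from typing import Dict

# Component weights as stated in this appendix.
WEIGHTS: Dict[str, float] = {
    "portfolio": 0.20,
    "project": 0.30,
    "external": 0.50,
}


def weighted_final_mark(scores: Dict[str, float]) -> float:
    """Combine component percentage scores (0-100) using the stated weights."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student scores, for illustration only.
example = {"portfolio": 80.0, "project": 70.0, "external": 60.0}
print(f"Final mark: {weighted_final_mark(example):.1f}")  # prints Final mark: 67.0
```

In this example the internally assessed work lifts the final mark above the external examination score, which shows how the format balances teacher-marked and externally marked evidence.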

APPENDIX C

UNDERSTANDING EDUCATION STATISTICS

1. Understanding Education Statistics: It’s Easier (and more important) Than You Think published by Educational Research Service (ERS), Arlington, Virginia

2. US Department of Education: CD-ROM, Education Statistics on Disk, 1996 Edition (phone 800-424-1616). Contents: 1996 Digest of Education Statistics; 1996 Condition of Education; parts of the 1995, 1994, 1993, and 1992 The Condition of Education; Projections of Education Statistics to 2006; 1996 Youth Indicators; Historical Trends: State Education Facts 1969 to 1989; State Comparisons of Education Statistics: 1969-1970 to 1993-1994; 120 Years of American Education: A Statistical Portrait; and Education in States and Nation: Indicators Comparing US States with Other Industrialized Nations in 1991.

3. John F. (Jack) Jennings: Center on Education Policy. Pamphlets: http://www.ctredpol.org

4. Put to the Test: An Educator’s and Consumer’s Guide to Standardized Testing by Gerald W. Bracey.

APPENDIX D

ASSESSMENT AND MEASUREMENT ASSOCIATIONS AND/OR GROUPS

Cato Institute: www.cato.org
Center for Applied Research and Educational Improvement - http://carei.coled.umn.edu
Center for Leadership in School Reform - http://www.clsr.org or http://www.ncrel.org
Center for Research on Evaluation, Standards, and Student Testing (CRESST) - http://www.rand.org
Center for Research on Students Placed at Risk (CRESPAR) - http://www.csos.jhu.edu
Center for the Study of Testing, Evaluation, and Educational Policy (CSTEEP) - http://www.csteep.bc.edu
Center on Educational Policy - http://www.ctredpol.org
Community Update: http://www.ed.gov/G2K/community
Corporation for Research in Educational Computing (CREN) - http://www.cren.net
Economic and Social Research Council (ESRC) - http://www.esrc.ack.uk
Education and Disinformation Detection and Reporting Agency: http://www.america-tomorrow.com/bracey
Education Commission of the States (ECS) - http://www.leg.state.nv.us
Education Policy Institute - http://www.educationpolicy.org
Education Statistics Services Institute - http://nces.ed.gov
Educational Testing Service (ETS) - http://www.ets.org
Educational Research Service: http://www.ers.org
Higher Education Funding Council for England (HEFCE) - http://www.hefce.ac.uk
Masie Center – learning and technology research group - http://www.masie.com
Mathematics Equal Opportunity: http://www.ed.gov/pub/math
National Assessment of Educational Progress (NAEP) - http://nces.ed.gov
National Center for Education Statistics (NCES) - http://www.educationplanet.com
National Center for Fair and Open Testing (FairTest) - http://www.fairtest.org
Northwest Evaluation Association (NWEA) - http://www.nwea.org
Northwest Regional Educational Laboratory - http://www.nwrel.org
Organization for Economic Co-operation and Development (OECD) – Education at a Glance - http://www.oecdwash.org
Phi Delta Kappa’s Center for Evaluation, Development, and Research in Bloomington, Indiana - http://www.pdkintl.org
Prichard Committee for Academic Excellence, Lexington, KY - http://www.prichardcommittee.org
Rand Corporation - http://www.rand.org
Richard Rothstein of the Economic Policy Institute - http://www.brook.edu
Sandia Report: Analysis - http://uwm.edu/Dept/CERAI
Sandia Report: Robert Huelskamp - http://www.teacherevaluation.net/Essay.polit.html or http://www.pdkintl.org/edres/resbul12.htm
Second International Assessment of Educational Progress (IAEP-2) - http://www.ericae.net
Sloan Center for Asynchronous Learning Environments – University of Illinois (Urbana-Champaign) - http://sloan.ece.uiuc.edu
Technology Counts '99 - http://www.edweek.org
The Edison Project by Whittle Communications, Inc. - http://www.aft.org
The Third International Mathematics and Science Study (TIMSS) - http://timss.bc.edu
Western Co-operative for Educational Telecommunication - http://www.wiche.edu

APPENDIX E

NOTES FROM Measurement and Assessment in Schools.

(The following information has been included to act as a basic assessment guide for educators.)

ACQUIRING THE BASIC TOOLS OF INTERPRETATION.

ORGANIZING MEASUREMENT DATA

1. Frequency Distribution: To construct a frequency distribution by hand, list the possible scores in order and then make a tally each time a particular score occurs. The resulting display shows the frequency with which scores are distributed across the possible range – hence the name frequency distribution. In situations where there are too many discrete score points to display, scores can also be grouped into score ranges, and the frequency within each score range can be tallied and displayed. Alternatively, the information can be represented graphically as a frequency polygon, in which each of the possible scores on the test is represented along the horizontal axis. For each score, the height of the dot above the horizontal baseline corresponds to the frequency with which that score occurred, and the dots are all connected by a line. Another way of representing the same information is a histogram or bar graph. In this display, values that fall within a certain range are shown in a bar whose height indicates the number of scores in that range. The resulting histogram, or bar graph, is particularly useful when the range of possible scores is large. In such cases, groups of scores can be combined so that the resulting graph is more readable. … Generally, a frequency polygon is preferred for depicting continuous data (that is, data that occur along some unbroken continuum, such as test scores) and a histogram or bar graph is better for depicting non-continuous data (for example, type of automobile – Ford, Chevrolet, Volvo, or Porsche) that cannot be ordered along a continuum. The most important consideration, however, is whether the display communicates effectively.
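
The tallying procedure described above can be sketched in a few lines of Python. This is only an illustration: the quiz scores are hypothetical, and the helper name score_range and the grouping width of 2 are assumptions made for the example, not part of the source text.

```python
from collections import Counter

# Hypothetical quiz scores (out of 10) for a class of 15 students.
scores = [7, 8, 6, 9, 7, 10, 8, 7, 5, 8, 9, 7, 6, 8, 7]

# Tally how often each score occurs.
frequency = Counter(scores)

# One row per possible score, lowest to highest, with a simple text bar.
for score in range(min(scores), max(scores) + 1):
    count = frequency[score]
    print(f"{score:>3} | {'*' * count} ({count})")


def score_range(score, width=2):
    """Return the (low, high) score range that a score falls into."""
    low = (score // width) * width
    return (low, low + width - 1)


# Grouped distribution, useful when there are too many discrete score points.
grouped = Counter(score_range(s) for s in scores)
for (low, high), count in sorted(grouped.items()):
    print(f"{low:>2}-{high:<2} | {'*' * count} ({count})")
```

The first display corresponds to the hand tally; the second shows the same scores combined into ranges, which is the form a histogram would plot.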

THE NORMAL CURVE AND OTHER DISTRIBUTIONS.

1. Bimodal Distributions. The mode is the value within a distribution that occurs most frequently. Because frequency of occurrence is indicated by the height of the curve, the mode occurs at the highest part of the curve. … bimodal means two modes … because most human characteristics approximate a normal distribution, you should be curious and perhaps even cautious when you encounter a bimodal distribution. Generally speaking, a bimodal distribution is more likely to occur when the number of data points included in the sample is small. Given a reasonably large sample, if you encounter a bimodal distribution, it may indicate some problems in how the variable is defined or measured, or it may simply indicate that your sample consists of two distinctively different subgroups with respect to that variable.

2. Skewed Distributions. The most frequent way in which data depart from normality is when a distribution is skewed. A skewed distribution has the majority of the data points clustered at one end. Distributions may be positively skewed, where most of the scores are bunched toward the low end with a long tail pointing toward the high/positive end, or negatively skewed, where most of the scores are bunched toward the high end with a long tail pointing toward the low/negative end of the distribution. Recognizing seriously skewed distributions is important in measurement. For example, when a distribution is positively skewed, it shows that most of the group scored poorly but a few students did well. Tests that result in a positively skewed distribution are said to have a floor effect because the majority of the scores are found at the floor, or bottom, of the distribution. A negatively skewed distribution provides a similar warning … this is called a ceiling effect, suggesting that the test may have been too easy.

3. Norm-referenced vs criterion-referenced measurements. In norm-referenced measurement, we want to know where a student scores in relation to other students. The group whose test scores serve as the basis for comparison is called the norm group. Norm-referenced measurement involves giving meaning to an individual’s test score by comparing it to the scores of others taking the same test. With criterion-referenced measurement, on the other hand, we are interested in how a student’s test performance compares to some absolute standard without comparison to others’ performance. Criterion-referenced measures are typically used to determine whether students have ‘mastered’ specific instructional content or to describe students’ progress through well-defined curricula. Norm-referenced tests that exhibit floor or ceiling effects are usually not an accurate measure of students’ ability. A… With most criterion-referenced tests, however, you should expect a ceiling effect if the content has been effectively taught. The presence of a floor effect suggests that many of the test items are too difficult since the scores do not differentiate between the average and the least capable students because most of the students do poorly.

MEASURES OF CENTRAL TENDENCY

1. Mean. The most frequently used. Average score. Arithmetic average: the sum of all scores divided by the number of scores. This can be greatly influenced by extreme scores, especially if you are dealing with a small group, whereas the median and mode are not. In general, the mean is the most stable measure of central tendency. In other words, if you were to measure the same characteristic among a group of individuals on two different occasions, the two mean scores would tend to differ less than the two medians or the two modes.

2. Median. The median of a distribution is the value in the middle when all scores are ordered from lowest to highest. When there is an even number of scores and no one score is in the middle, the median is the arithmetic average of the two middle scores. If you want to minimize the influence of extreme scores, the median provides the best indicator of central tendency. That is, if you have to use only one number to describe a distribution, the median will be closer to the truth for more people. For example, if a test is very easy, students’ scores will cluster at the top of the distribution. In such cases, the median is a better indicator of average achievement than the mean.

3. Mode. The value that occurs the most frequently. The mode tends to be very unstable when the sample size is small. Generally, where scores exist along a continuum or range, both the median and mean provide a better indication of the average.
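
As a small illustration of how the three measures respond to an extreme score, the sketch below uses Python's standard statistics module. The scores are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from statistics import mean, median, mode

# Hypothetical test scores for a small group of seven students.
scores = [70, 72, 75, 75, 78, 80, 82]

# The same group with one extreme low score added.
with_extreme = scores + [10]

for label, data in (("original", scores), ("with extreme score", with_extreme)):
    print(f"{label}: mean={mean(data)}, median={median(data)}, mode={mode(data)}")
```

In this example the mean drops from 76 to 67.75 when the single extreme score is added, while the median and the mode both stay at 75.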

MEASURES OF DISPERSION. Although a well-chosen measure of central tendency is informative, it generally does not tell us nearly enough about the distribution of scores. Whereas measures of central tendency provide information about a test’s overall difficulty, measures of dispersion (often called measures of variability) provide information about differences among scores. In other words, measures of dispersion describe distances among scores within a group. The larger the value of a measure of dispersion, the greater the variability or heterogeneity among the scores.

1. Range. Simply the difference between the highest and lowest scores. Although easy to compute, it has one serious deficiency, it is determined only by two extreme scores and it can be dramatically altered by adding or dropping just one of those extreme scores. Such instability means it can be very misleading to use the range as a basis of comparing dispersions across two or more groups.

2. Standard Deviation. The most widely used measure of dispersion is the standard deviation, which is symbolized by s, SD, or σ. The SD provides a number that indicates how far each score typically deviates from the mean. Because all scores in the group contribute to the computation of this number, the standard deviation is less influenced by extreme scores than the range. In every normal distribution, the standard deviation can be used to indicate the extent to which scores vary or are dispersed around the mean. By definition, for all normal curves, 34.13% of the scores fall in the part of the curve that is between the mean and one standard deviation from the mean. Thus, 68.26% of the scores in a normal distribution fall between -1.0 standard deviation and +1.0 standard deviation. Most IQ tests are developed to have a mean of 100 and a standard deviation of 15. Hence approximately 68% of the population has IQ scores between 85 and 115. The SAT has a mean of approximately 500 with a standard deviation of 100. Based on this information, we know that about 68% of those who take the SAT will score between 400 and 600, about 16% will score above 600, and about 16% will score below 400.

3. Variance. A measure related to Standard Deviation.

Application Problem 3: The scores of a norm-referenced standardized math test are normally distributed with a mean of 67 and a standard deviation of 6.0. Approximately what percentage of a group of 1000 examinees (assuming that the group is comparable to the norm group) would you expect to score between 67 and 73?

Answer: 34%
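
The answer can be checked numerically. The following is a minimal sketch using NormalDist from Python's standard statistics module (Python 3.8 or later); the variable names are illustrative only.

```python
from statistics import NormalDist

# Norm group from Application Problem 3: mean 67, standard deviation 6.0.
norm_group = NormalDist(mu=67, sigma=6.0)

# Proportion of scores between the mean (67) and one SD above the mean (73).
share = norm_group.cdf(73) - norm_group.cdf(67)
print(f"Share between 67 and 73: {share:.4f}")  # about 0.3413

# Expected number of examinees in that band out of a group of 1000.
expected_count = round(1000 * share)
print(f"Expected examinees out of 1000: {expected_count}")  # 341
```

The same call with a mean of 100 and a standard deviation of 15 gives roughly 68% of scores between 85 and 115, which matches the IQ example given earlier.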

Application Problem 4: Horatio is enrolled in a chemistry class with 50 students. He has received the following scores on the various assignments for the first half of the year. On which test or assignment did he do the best relative to the other members of the class? On which did he do the worst?

CLASS SCORES HORATIO'S SCORES

	Lowest	Highest	Mean	SD	Actual Points	% Correct
Assignment 1	22	48	35	5.0	30	60
Assignment 2	8	23	14	3.0	20	80
Assignment 3	48	75	60	6.0	66	83
Midterm	56	91	70	7.0	70	70
Final	128	174	150	10.0	130	75

Answer: Since the question asks how Horatio did in relation to the other class members, the key to finding the correct answer is to convert all the scores to standard deviation units with respect to the distribution of scores in the class. Under the given assumption that the scores are approximately normally distributed, Horatio's relative standing on the five occasions may be obtained by using the standard deviation units of his scores and comparing them with a normal distribution. The following are obtained:

Assignment 1: -1.0 (16^th percentile)

Assignment 2: +2.0 (98^th percentile)

Assignment 3: +1.0 (84^th percentile)

Midterm: 0 (50^th percentile)

Final: -2.0 (2^nd percentile)
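
The conversion used in this answer can be reproduced with a short Python sketch. The class means, standard deviations, and Horatio's points are taken from the table above; the dictionary layout and variable names are assumptions made for the example, and the percentiles rely on the stated assumption that class scores are approximately normal.

```python
from statistics import NormalDist

# (class mean, class SD, Horatio's actual points) from the table above.
results = {
    "Assignment 1": (35, 5.0, 30),
    "Assignment 2": (14, 3.0, 20),
    "Assignment 3": (60, 6.0, 66),
    "Midterm": (70, 7.0, 70),
    "Final": (150, 10.0, 130),
}

# Express each score in standard deviation units relative to the class.
z_scores = {
    name: (points - class_mean) / sd
    for name, (class_mean, sd, points) in results.items()
}

# Convert each z score to a percentile under a standard normal curve.
percentiles = {name: 100 * NormalDist().cdf(z) for name, z in z_scores.items()}

best = max(z_scores, key=z_scores.get)
worst = min(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: z = {z:+.1f}, percentile = {percentiles[name]:.0f}")
print(f"Best relative standing: {best}; worst: {worst}")
```

Note that Horatio's highest percentage correct (83% on Assignment 3) is not his best relative performance; Assignment 2 is, because his score there lies two standard deviations above the class mean.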

THE CORRELATION COEFFICIENT: A MEASURE OF ASSOCIATION. Correlation describes how scores in one distribution relate to scores in another, or how one variable is related to, or associated with, another. It addresses questions such as: Do people who are good at math tend to have a lower level of musical ability? If you have good grades in high school, will you get good grades in college?

1. Definitions and Examples. Two variables are correlated if they are related or tend to "go together". For example, tall people tend to weigh more than short people and so height and weight are correlated.

a. A correlation coefficient has two components: direction and magnitude. With respect to direction, correlations can be either positive or negative. When high scores on one variable tend to be associated with high scores on another variable, and low scores on one variable tend to be associated with low scores on another variable, the correlation is positive. For example, the correlation between height and weight is positive because, in general, tall persons tend to weigh more and short persons tend to weigh less. When high scores on one variable tend to be associated with low scores on the other, the correlation is negative.

b. A correlation also has magnitude to indicate the strength of relationship. Correlation coefficients range from -1.0 to +1.0. The closer a correlation is to zero, the less two variables are related. A high correlation can be either a high negative or a high positive correlation, for example, a negative -.80 indicates the same strength of relationship between two variables as a positive +.80 even though the two correlation coefficients differ in terms of their direction. A correlation of -1.0 or +1.0 is also referred to as perfect correlation because you can perfectly predict a person's score on one variable by knowing her score on the other. High correlations are generally considered to have absolute values of .70 or higher (a well-constructed achievement test should correlate around .90 or above); moderate correlations are in the range of .40 to .70 (aptitude test scores generally correlate with later grades about .50); and low correlations are less than .30 (correlation between manual dexterity and scores on a test of reading comprehension would probably be close to 0.0).

2. Interpreting Correlation Coefficients. These must be interpreted carefully because a few abnormal data ranges can greatly change the correlation coefficient. Single instances of data that cause the correlation coefficient to change are known as outliers. In almost all decisions concerning correlation coefficients, it is worthwhile to examine a scatterplot to make sure that the correlation is not being unduly influenced by one or two aberrant scores. As a data set becomes larger, however, the influence of outliers becomes smaller or negligible.

Application Problem 5. Based on a large sample of students, the correlation between hours spent on homework and scores on the semester math test is calculated to be +.73. Assuming there are no outliers in the data set, provide an appropriate interpretation for this correlation.

Answer: This correlation is positive. It indicates that time spent on homework and performance on the test tend to go "hand in hand"; that is, those who spent more time on homework tend to do better on the test than those who spent less time on homework. Furthermore, the correlation is high, indicating that this tendency is strong. Because this correlation is based on a large sample of students, it is unlikely that a few outliers might have caused such a strong correlation.

a. Curvilinearity: The correlation coefficient discussed here is called a Pearson correlation coefficient, and it applies only to linear (straight line patterns) relationships between variables. If the relationship between two variables is not linear but curvilinear (resembling a curve), a Pearson correlation coefficient obtained from such data will usually underestimate the actual relationship. When such curvilinear relationships exist, it is essential to examine scatterplots of data to make sure that the presence of a curvilinear relationship is not misleading us. For example, people with low or high anxiety during a test will not do as well as those with moderate anxiety.

b. Restricted Range. Before drawing conclusions about an observed correlation between two variables, it is important to know whether data are available across the entire spectrum of possible scores for each of the variables. Where the range of scores has been reduced through selection, absenteeism, or some other factor, it is possible to have a very low observed correlation even though the two variables would be substantially correlated if the total population of scores were available. For example, the relationship between those students in graduate school who have taken the GRE and those who have not done so. Estimating the correlation between two variables in which data for one or both represent a very restricted range usually leads to an underestimate, sometimes a severe underestimate, of the actual relationship.

c. Number of Cases. The smaller the data set, the greater the effect of such outliers. Especially in situations with ten or fewer cases, correlations may often be spuriously high or low, and should be treated with caution.

d. Magnitude of Correlation Coefficients. There are no simple rules for deciding how large a correlation coefficient must be to be practically meaningful. Larger correlation coefficients are indicative of a stronger association between two variables.
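
The cautions about outliers and small data sets can be illustrated with a short Python sketch. The function pearson_r below is a plain implementation of the Pearson correlation written for this example, and the homework hours and test marks are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cross_products = sum(a * b for a, b in zip(dx, dy))
    return cross_products / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical data for five students: hours of homework and test marks.
hours = [1, 2, 3, 4, 5]
marks = [2, 1, 4, 3, 5]
r_without = pearson_r(hours, marks)

# Add one aberrant case: a student who studied 6 hours but scored 0.
r_with = pearson_r(hours + [6], marks + [0])

print(f"r without the outlier: {r_without:.2f}")  # 0.80
print(f"r with the outlier:    {r_with:.2f}")     # 0.03
```

With only five or six cases, a single aberrant score moves the coefficient from high to nearly zero, which is why a scatterplot should be examined before interpreting a correlation based on a small sample.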

INTERPRETATION OF TEST SCORES. Because standardized norm-referenced tests are one of the most frequently encountered types of data for educational purposes, a wide variety of methods for reporting and interpreting such scores has been developed.

NORMS. In many test situations, test scores in their original form, that is, raw scores, usually lack comparability. One way to make test scores more interpretable is to create a norm for scores on that test by administering the test to a representative, and usually large, sample of the population with whom the test will typically be used. In a measurement situation, if test scores are interpreted relative to the norm, it is norm-referenced measurement, in contrast to criterion-referenced measurement. As discussed earlier, the major characteristic of norm-referenced measurement is that a person's performance on the test is compared to that of her peers, and whether the performance is good or poor depends on where the person stands relative to those in the norm group. In this sense, norm-referenced measurement does not inform us what the student can or cannot do in a certain content area; instead, it tells us whether the student's performance is better or worse than the performance of her peers. Many standardized achievement tests that are currently available can be used to compare the performance of an individual with other students or to describe exactly what skills the student has or has not mastered.

1. What are the characteristics of good norms?

a. Relevance. Decide on the situation. Applying to a university means that it's important to compare the national norms rather than the local norms.

b. Representativeness. Standardized tests should include demographic information about the norming sample to help decide whether it is representative of your situation.

c. Currency. Outdated norms can be misleading.

d. Comparability. Similar tests with similar norming samples can make useful comparisons.

2. Norms are not standards. Standardized tests are given and marked in a standardized manner. Norms are descriptions of how some comparison group has performed. Standard denotes a level of achievement that has been established as a goal for all students. A standard is a description of what ought to be and a norm is a description of what is.

SCORES YIELDED BY STANDARDIZED TESTS

1. Grade equivalent scores. A GE of 5.0 means the student's score was comparable to the average score of students in the norm group at the beginning of the fifth grade. But because of how they are composed (a test administered to a norm group of a variety of grades including content area from all levels tested), they should only be used to determine whether a student is below, above, or average for the present grade.

2. Percentile. A percentile conveys information about the relative position of an individual's score in a distribution, and is defined as a score in a distribution below which that percentage of examinees scores. For a normal distribution, an IQ score of 115 is at the 84^th percentile because 84% of the distribution falls below that point. The major disadvantage of percentile scores is that they divide the population or distribution of scores into unequal units, especially near the ends of a score distribution, which appears logical but may result in misleading conclusions. For example, a gain of 10 percentile points near the high end of the distribution represents more progress than a gain of 10 percentile points in the middle of the group. Percentile scores are also sometimes confused with percentage scores, which are often used in day-to-day testing situations in schools. A percentile score is expressed in terms of percentage of persons, and it is not related to the percentage of correct items. A percentile score of 100 does not imply a perfect raw score; neither does a percentile score of zero imply a zero raw score.

3. Z scores and other standard scores. A z score tells us how many standard deviations a raw score is below or above the mean. A negative z score indicates that the raw score is below the mean; a positive z score indicates that the raw score is above the mean and a z score of zero says that the raw score is equal to the mean. A z score for any person on a particular test can be computed by subtracting the mean score on the test from the person's score and dividing that by the standard deviation of the test. The z score usually allows us to make performance comparisons across tests which have different means and different standard deviations. By definition, a z score distribution has a mean of 0 and a standard deviation of 1.0. A z score can be transformed to another standard score with a different mean and a different standard deviation. This is done by multiplying the z score with your desired standard deviation and adding to the product the value of your desired mean. (Any standard score = (z * desired SD) + Desired Mean)

4. Normal curve equivalents. NCE scores were created to capitalize on the advantages of percentile scores, while avoiding (1) the disadvantages of dealing with unequal units at different points in the distribution and (2) the confusion between percentiles and percentage correct. NCEs have a mean of 50 and a standard deviation of 21.06.

5. Stanines. The word comes from a combination of standard and nine originally developed in the US Air Force. Stanines are computed by dividing a distribution into nine units, each with its own prespecified proportion of a distribution. The middle stanine straddles the median of a distribution.
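
The relationships among these score types can be sketched in Python. The z score formula, the standard score transformation, and the NCE mean and standard deviation come from the text above. The stanine boundaries are an assumption not stated in the source: the sketch uses the common convention that each stanine is half a standard deviation wide, with stanine 5 centered on the mean. All function names are illustrative, and the percentile conversion assumes a normal distribution.

```python
from bisect import bisect_right
from statistics import NormalDist

# Assumed z-score boundaries between stanines 1 to 9 (half an SD per stanine).
STANINE_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies above or below the mean."""
    return (raw - mean) / sd


def standard_score(z, desired_mean, desired_sd):
    """Any standard score = (z * desired SD) + desired mean."""
    return z * desired_sd + desired_mean


def percentile(z):
    """Percentage of a normal distribution falling below the given z score."""
    return 100 * NormalDist().cdf(z)


def nce(z):
    """Normal curve equivalent: mean 50, standard deviation 21.06."""
    return standard_score(z, 50, 21.06)


def stanine(z):
    """Stanine (1 to 9) for a z score, using the assumed boundaries above."""
    return bisect_right(STANINE_CUTS, z) + 1


z = z_score(115, mean=100, sd=15)  # an IQ score of 115
print(f"z = {z:.1f}")
print(f"SAT-style score = {standard_score(z, 500, 100):.0f}")
print(f"percentile = {percentile(z):.0f}")
print(f"NCE = {nce(z):.2f}")
print(f"stanine = {stanine(z)}")
```

For a z score of 1.0 the sketch reports the 84^th percentile, an SAT-style score of 600, an NCE of 71.06, and (under the assumed boundaries) a stanine of 7.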

RELATIONSHIP AMONG TYPES OF SCORES

Application Problem 6 and Answer. For a large group of fifth graders tested at the beginning of the school year (assuming the score distribution is approximately normal), which one of the following is most equivalent to the percentile score of 84? Why are the other four not equivalent to the percentile score of 84?

a. A z score of 1.0 (equivalent to the 84^th percentile)

b. A Grade Equivalent Score of 5.0 (comparable to the average performance of the beginning fifth graders)

c. A z score of -1.0 (equivalent to the 16^th percentile)

d. A Normal Curve Equivalent Score of 50 (approximately equivalent to the 50^th percentile)

e. A stanine score of 5 (approximately equal to the 40^th-60^th percentile)

Some standardized reports provide information about a confidence interval around the national percentile and stanine scores. The confidence interval is related to the concept of measurement error, which is the amount of confidence we can have that the two scores are really different. Another score, the Objective Performance Index may be provided which indicates the student's mastery or non-mastery of individual objectives under each academic area.

p. 90. Computer-generated reports of standardized achievement tests represent a tremendously valuable but largely untapped resource for teachers. Too often, standardized achievement testing is a district program in which teachers have little interest or involvement. Properly used, standardized achievement tests give teachers important instructional information.

RELIABILITY OF EDUCATIONAL MEASURES. Reliability refers to the consistency of the results obtained, not the instrument itself. It is the reliability of the test scores obtained by using a test that is the criterion for evaluating the test use.

MEASUREMENT ERROR, TRUE SCORES, AND RELIABILITY

Measurement Error. Each test score invariably consists of two major components: true score and measurement error. The true score is the actual ability or performance level an examinee possesses on the trait or dimension the test is designed to measure. Measurement error is that part of the obtained score that is contributed by factors irrelevant to what is being measured. The question is always what the extent of the measurement error is. It is theoretically possible to get a reasonable estimate: a person's estimated true score is the average of his obtained scores from repeated administrations of the same test under the same testing conditions, disregarding the possible practice effect he would receive in repeatedly taking that test. So the obtained score is equal to the true score plus the measurement error.
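
The idea that averaging repeated administrations approaches the true score can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any real test: the true score of 75, the error standard deviation of 5, and the use of purely random, normally distributed error are all hypothetical choices made for the example.

```python
import random
from statistics import mean

random.seed(2001)  # fixed seed so the sketch gives the same result each run

TRUE_SCORE = 75.0  # hypothetical true score of one examinee
ERROR_SD = 5.0     # hypothetical spread of random measurement error


def obtained_score():
    """Obtained score = true score + random measurement error."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)


# A single administration may land well away from the true score.
single = obtained_score()

# The average over many hypothetical administrations settles near it.
repeated = [obtained_score() for _ in range(10_000)]
estimate = mean(repeated)

print(f"One obtained score: {single:.1f}")
print(f"Average of {len(repeated)} obtained scores: {estimate:.2f}")
```

Because the simulated error is random, positive and negative errors tend to cancel in the average. A systematic error, by contrast, would shift every obtained score in the same direction and would not average out.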

Random vs Systematic Error. Systematic measurement error occurs when a test consistently measures something not intended. By definition, random errors are random; that is, the amount and the direction of error differs unsystematically from one measurement to another and from one person to another. Random error reduces the reliability of test scores. Systematic measurement error reduces the validity of test scores, and interpretations based on such scores, thus it falls under the discussion of measurement validity.

DIFFERENT APPROACHES FOR ESTIMATING RELIABILITY

1. Test-retest reliability. A correlation between scores from two administrations of the same test to the same students. Alternatively, such reliability estimates may be called measure of stability or coefficient of stability.

2. Parallel form reliability. A correlation between scores of the same students on two equivalent forms of the same test. Some other names for this type of reliability estimate are alternate form reliability, equivalent form reliability, measure of equivalence, or coefficient of equivalence.

3. Internal consistency reliability. Correlation or consistency indices among items on a single test. Other names for this approach are measures of homogeneity and measures of inter-item consistency. This includes the split-half method, Cronbach's coefficient alpha method, and the Kuder-Richardson method.

4. Interrater reliability: A correlation between scores provided by two different scorers on the same test. Interscorer reliability is another name for this.

PROCEDURES FOR CALCULATING VARIOUS RELIABILITY ESTIMATES. Uses the Pearson Product-Moment Correlation (or Pearson r).

TEST-RETEST (SAME FORM) METHOD

Calculating a Test-Retest Reliability Coefficient.

Determine the appropriate interval between test and retest
Administer the test and obtain scores
Administer the retest and obtain the scores
Calculate the correlation coefficient between the two sets of scores.

Issues Related to Test-Retest Reliability.

1. Interval between two test administrations

2. What is the correct time interval?

3. Potential advantages of test-retest reliability. Because the same test is given twice, no random error can be attributed to different items being selected for use in different forms of the test. It avoids the difficult and time consuming tasks of constructing alternative forms deserving of the label "parallel". The test-retest reliability is easy to calculate. Appropriate for timed/speed tests. Often called the "index of stability".

4. Major uses of test-retest reliability. Use when giving speed tests; no parallel form is available; or when content is routine and familiar enough.

5. Measurement error source for test-retest reliability. Whenever measurement stability across time is the concern, test-retest reliability is appropriate and should be estimated.

Application Problem 1B. What is/are some major consideration(s) when determining the length of interval between test administrations when the test-retest approach is used to estimate measurement reliability?

Answer. The major considerations concerning the length of interval are (1) sufficient length to minimize memory effect; (2) the length should not be so long that the trait under measurement is likely to have changed; (3) the stability of the trait being measured: is it reasonably stable during a given length of interval?

PARALLEL FORM METHOD

Calculating a Parallel Form Reliability Coefficient

Administer Form 1 of the test.
Administer Form 2 of the test to the same students.
Score both Forms 1 and 2.
Correlate the two sets of scores by obtaining the Pearson r.

Issues Related to Parallel Form Reliability

1. With or without an interval?

2. Measurement error source estimated by parallel form reliability. In general, the measurement error contributed by content difference tends to be larger than the measurement error contributed by time, as in the test-retest method. Whenever more than one form of the same test exists, we should be concerned about measurement error contributed by content sampling and parallel form method is appropriate for such estimation.

3. Major Uses of parallel form reliability. Primarily used to help establish the equivalence of parallel forms. Usually not practical for most teachers. Care must be taken in interpreting parallel form reliability coefficients since they are likely to yield lower estimates of reliability than does the test-retest method. Because this reliability method controls two major potential sources of error, it is likely to be a more conservative and stringent estimate of reliability.

Application Problem 2. From a pool of 300 items that test knowledge of American history, you have drawn 100 to make a classroom test. A colleague uses your 300-item pool to devise another 100-item American history test so that both of you can have alternative, equivalent forms for use at the beginning and end of your history units. You have examined the two tests and judged their content to be very much equivalent. Four of your students volunteer to take your exam and the new exam (on the same day) so that you and your colleague can examine the equivalence of the two tests. Their scores are shown.

                Form 1 (yours)    Form 2 (colleague's)
    Sandra      89                76
    Sam         97                82
    Tamara      93                90
    Tom         89                84

Calculate the means of the students' performance on the two test forms. What would you conclude about equivalence of the forms from those means? What is the parallel form reliability coefficient for the two forms of this test? How would you interpret such a reliability estimate?

Answer: Mean of Form 1 = 92 and mean of Form 2 = 83. The nine-point gap between the means suggests that Form 2 is the harder of the two forms. The coefficient is .30, which indicates only a small correlation between the two sets of scores, so the tests are not equivalent.
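The arithmetic behind this answer can be checked with a short script. The following is only a minimal sketch in Python: the function name pearson_r is an illustrative choice, and the only data used are the four students' scores given in the problem.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations of each score from its own mean.
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    ss_x = sum(dx * dx for dx in dev_x)
    ss_y = sum(dy * dy for dy in dev_y)
    return cross / sqrt(ss_x * ss_y)

# Scores from Application Problem 2 (Sandra, Sam, Tamara, Tom).
form_1 = [89, 97, 93, 89]
form_2 = [76, 82, 90, 84]

mean_1 = sum(form_1) / len(form_1)   # 92.0
mean_2 = sum(form_2) / len(form_2)   # 83.0
r = pearson_r(form_1, form_2)        # 20 / sqrt(44 * 100), about 0.30

print(mean_1, mean_2, round(r, 2))
```

With only four students the coefficient is very unstable, so the sketch illustrates the calculation rather than settling the question of equivalence.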

INTERNAL CONSISTENCY RELIABILITY. In practice the biggest drawback of test-retest and parallel form reliability is that both require two test administrations. For teacher-made tests, methods of estimating reliability that require only a single test administration are more feasible. Such methods are typically referred to as measures of internal consistency or measures of homogeneity since they are based on estimates of how well a test is correlated with itself.

SPLIT-HALF METHOD FOR ESTIMATING RELIABILITY

Reliability and Test Length. Generally speaking, the longer a test is, the more reliable it is. Since reliability is affected by test length, the correlation between the two halves in the split-half approach will underestimate the reliability of the whole test. For this reason, the correlation must be adjusted so that it more truthfully represents the reliability of the whole test. The Spearman-Brown prophecy formula is intended to correct for this underestimate when reliability is calculated on half tests, giving an approximate estimate of what the reliability of the test would be had it not been artificially shortened into halves. The reliability of the total test (R) is equal to 2 times the correlation between the two halves (r) divided by the sum of 1 plus that correlation: R = 2r / (1 + r).

Calculating a Split-Half Reliability Coefficient.

Determine how the test is to be divided.
Add all the scores of odd-numbered items for each student to obtain that student's total score for that half of the test.
Add all the scores of even-numbered items for each student.
Treat the separate totals for odd and even items as separate tests for each student, creating two scores for each student, and thus "two sets" of test scores for your class.
Compute a Pearson r to determine the correlation between these two tests.
Apply the Spearman-Brown prophecy formula to correct the split half reliability estimate thus obtaining the reliability of the whole test.
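The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed procedure: the item scores are hypothetical (five students, six right/wrong items), the function names are invented for the example, and an odd/even split is assumed.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def spearman_brown(r_half):
    """Step up a half-test correlation to full length: R = 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """Return (half-test correlation, corrected whole-test reliability)."""
    # Items 1, 3, 5, ... sit at list positions 0, 2, 4, ...
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)

# Hypothetical data: one row per student, one column per item (1 = right).
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

r_half, r_full = split_half_reliability(item_scores)
print(round(r_half, 3), round(r_full, 3))
```

A different way of splitting the items (first half versus second half, for instance) would give a different estimate, which is the lack of uniqueness noted among the disadvantages of the method.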

Issues Related to Split-Half Reliability Estimates

1. Advantages and disadvantages of the split-half method. The advantage is that it requires only one test administration and thus avoids any effects of practice, memory, or differential test administration or scoring, along with some other practical problems. The disadvantages are that it is impossible to determine whether or not the two halves are completely parallel, and there is no unique reliability estimate because different results will be obtained when different methods of splitting the test are used. Not to be used with speed tests.

2. Speed test and power test. A speed test contains easier questions but is timed so that no one can finish all the questions. A power test is dependent on how well questions are answered, not on how fast they can be answered. Don't use split-half reliability for speed tests.

3. Measurement error source in split-half approach. Time is not an error source because the test is taken as a whole; however content (item sampling) might be inconsistent.

CRONBACH'S COEFFICIENT ALPHA AND KUDER-RICHARDSON FORMULAS. Designed to offset the limitations of the split-half reliability measurements. These take into account all possible ways of dividing the test and do not need the Spearman-Brown prophecy formula to correct for the shortened length of the test.

Cronbach's Alpha Method for Estimating Reliability. Coefficient alpha can be used with all types of test items, whether test items are dichotomously scored as right or wrong (multiple-choice items) or are weighted (essays where partial credit is awarded). Use a computer as the formula is tedious.

Kuder-Richardson Formula 20 (KR-20). Used to calculate internal consistency for test items that are dichotomously scored only.

Kuder-Richardson Formula 21 (KR-21). Because KR-20 was difficult to follow without a calculator, etc., the KR-21 came into being for classroom teachers as this formula requires only knowledge about the number of items in a test (K), the mean, and standard deviation (SD) of the raw test scores. Generally speaking, KR-21 provides an underestimate of KR-20.
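These notes do not reproduce the three formulas, so the sketch below uses their standard textbook forms, which is an assumption about what the source intended. The item scores are hypothetical, the function names are illustrative, and population variances are used throughout.

```python
from statistics import mean, pvariance

def kr20(item_scores):
    """KR-20 for right/wrong items: (K/(K-1)) * (1 - sum(p*q) / total variance)."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    sum_pq = 0.0
    for i in range(k):
        p = mean(row[i] for row in item_scores)   # proportion answering item i correctly
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

def kr21(item_scores):
    """KR-21 needs only K, the mean, and the variance of the raw total scores."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * pvariance(totals)))

def cronbach_alpha(item_scores):
    """Coefficient alpha: (K/(K-1)) * (1 - sum of item variances / total variance)."""
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

# Hypothetical data: one row per student, one column per item (1 = right).
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

print(round(kr20(item_scores), 3),
      round(kr21(item_scores), 3),
      round(cronbach_alpha(item_scores), 3))
```

On dichotomous items, alpha and KR-20 coincide, and KR-21 comes out no higher than KR-20, which matches the remark that KR-21 generally underestimates KR-20.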

Issues Related to Internal Consistency Measures of Reliability

1. Item homogeneity. Kuder-Richardson's and Cronbach's formulas are measures of test internal consistency, that is, the degree of item homogeneity. Test items are homogeneous if they measure the same underlying ability or construct. For this reason, these formulas are not appropriate when a test spans different subjects (e.g., mathematics, science, and reading).

2. Speed test and missing items. Not to be used with speed tests because speed tests are, by their nature, easier and yield inflated reliability estimates. If unanswered items are counted as wrong, it is likewise not a good idea to use these methods, because the missing values will inflate the reliability estimate.

3. Measurement error source. Time is not a factor but content and content homogeneity are. The more heterogeneous the items are, the lower reliability estimate will be obtained.

4. Major uses of internal consistency measures of reliability. Whenever test internal consistency or item homogeneity is the concern, these internal consistency measures are appropriate. Cronbach's alpha is the most versatile, and KR-21 is an underestimate of KR-20.

INTERRATER RELIABILITY

Test Objectivity. A test is said to be objective if two or more reasonable persons, given a scoring key and/or scoring criteria, would agree on how to score each item, thus agreeing on the number of points each examinee should get on a test. If such agreement is not routinely achieved, the test is said to be subjective.

Interrater Reliability and Its Calculation.

Administer the test to a group of students.
Ask two independent scorers to score the same group of tests so that each student will get two scores on his test from the two independent scorers.
Calculate the correlation coefficient between the two sets of scores from the two scorers, and the correlation coefficient is the interrater reliability coefficient.

Use of Interrater Reliability and Its Measurement Error Source. Applicable whenever test scoring procedures contain certain degrees of subjectivity. Multiple choice or other objective items don't need this.

USE AND INTERPRETATION OF RELIABILITY ESTIMATES

General Guidelines.

1. Test scores used for decisions about an individual student require a higher degree of reliability than those used for making decisions about groups of students. When teacher-made tests are used in critical decisions about individual students, they should possess reliability coefficients of .80 or higher. By contrast, coefficients as low as .50 are acceptable if the tests are used to make decisions about groups. Remember that the higher the reliability, the less error is associated with the test.

2. Higher reliability coefficients are essential if decisions based on test scores have important, lasting consequences that cannot be reversed or disconfirmed by other sources of information.

3. Lower reliability coefficients are tolerable for tests used in decisions that are of less consequence, are reversible, have only temporary impact, and can be confirmed by other sources of information.

4. Reliability coefficients for standardized achievement or aptitude tests should be around .90 or higher.

5. Lower reliability coefficients may be acceptable if the test is handicapped by factors or circumstances that would tend to lower its reliability, whereas higher coefficients may be judged inadequate if the test is advantaged by factors or circumstances that automatically enhance reliability.

HOW TO INCREASE THE RELIABILITY OF A TEST

Reliability of Scoring. Reliable scoring occurs when (1) different scorers agree with one another as they score the same test items or (2) a single scorer assigns the same scores to the same test, if scored on different occasions. The first of these is called interscorer reliability and the second is called intrascorer reliability. Usually only essay type questions suffer from scorer unreliability.

Group Variability. Generally speaking, the greater the variability of the scores, the higher the reliability estimate. If one wished to increase the reliability coefficient of a test, then it should be apparent that the test should be administered to a group with the maximum variability that is appropriate (that is, a group no more diverse than subsequent groups with which the test would normally be used).

Difficulty Level of the Test. Tests that are too easy result in clusters of scores close together at the top end of the scale. This lowers the reliability coefficient because as noted earlier, when scores tend to cluster, even small changes in scores between tests can produce major shifts in relative positions of the examinees. Try to make sure the difficulty level is appropriate for the students tested.

Number and Quality of Test Items. In general, the more items on a test, the higher the reliability. Longer tests provide students with a better opportunity to show their true knowledge of content, whereas shorter tests increase the probability of their attaining high scores simply because of the selection of the small number of items included in the test. Since test reliability and test length generally go hand in hand, unusually high reliabilities for relatively short tests may signal that something is amiss.

RELIABILITY AND STANDARD ERROR OF MEASUREMENT

Standard Error of Measurement. The reliability coefficient is a group statistic that depends on the variability of scores in the group tested, and its purpose is to estimate the reliability of all scores yielded by a test. If one wishes to interpret an individual score, however, reliability coefficients are much less directly useful than a closely related concept known as the standard error of measurement (SEM), which relates measurement reliability to the accuracy of our interpretation of an individual's score. If a student were given an IQ test 1000 times, the best estimate of the student's true score would be the mean of the resulting scores. The standard deviation of this distribution is often referred to as the standard error of measurement. SEM is useful because it provides a tool to estimate the score range within which the student's true score might fall. Since testing a student that many times is impossible, the SEM is instead estimated from the reliability estimate. The formula states that the SEM is equal to the SD multiplied by the square root of 1 minus the reliability.

Score Bands and Profiles. Since an obtained score is only an approximation of the true score due to measurement error, it is often better to report each student's achievement as a band or an interval along the scale of possible scores rather than as a single point. The practice of reporting and interpreting scores in terms of such bands or intervals helps to guard against the tendency to interpret small differences in the obtained scores of two students as representing true differences in their actual achievement levels.
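The SEM formula and the idea of a score band can be illustrated briefly. In the sketch below the standard deviation (15), the reliability (.91), and the obtained score (100) are hypothetical values chosen so the arithmetic is easy to follow; the function names and the choice of a band one SEM wide on each side are likewise assumptions made for the example.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def score_band(observed, sem, width=1.0):
    """Interval of `width` SEMs on each side of the obtained score."""
    return observed - width * sem, observed + width * sem

# Hypothetical test: SD of 15 points and a reliability coefficient of .91.
sem = standard_error_of_measurement(sd=15, reliability=0.91)   # 15 * 0.3 = 4.5
low, high = score_band(observed=100, sem=sem)                  # 95.5 to 104.5

print(round(sem, 2), round(low, 1), round(high, 1))
```

Two students whose obtained scores are 100 and 103 on such a test would have overlapping bands, which is the kind of small difference the band is meant to keep from being over-interpreted.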

THE RELIABILITY OF CRITERION-REFERENCED MEASURES

Reliability of Classification Decisions. Reliability coefficients are for norm-referenced interpretations. Criterion-referenced measurements are not designed to measure the difference between individuals. Instead, one computes the proportion of agreement, a single number that summarizes the consistency of mastery/nonmastery classifications. This statistic sets up a Mendel-like two-by-two matrix. For instance, if all students have been given two tests and scored on both, there will be some individuals who reached the mastery level on both tests, some who did not reach it on either test, and some who reached it on one test but not the other.
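The proportion of agreement can be worked through in a few lines. The sketch below is illustrative only: the ten pairs of scores and the mastery cutoff of 75 are hypothetical, and the function name is invented for the example. Agreement is taken as the share of students classified the same way (master on both tests, or nonmaster on both).

```python
def proportion_of_agreement(scores_1, scores_2, cutoff):
    """Share of students given the same mastery/nonmastery label by both tests."""
    # Cells of the two-by-two matrix, keyed by (master on test 1, master on test 2).
    cells = {(True, True): 0, (True, False): 0, (False, True): 0, (False, False): 0}
    for s1, s2 in zip(scores_1, scores_2):
        cells[(s1 >= cutoff, s2 >= cutoff)] += 1
    consistent = cells[(True, True)] + cells[(False, False)]
    return consistent / len(scores_1), cells

# Hypothetical scores for ten students on two tests of the same objective.
test_1 = [85, 72, 90, 60, 78, 55, 88, 69, 93, 74]
test_2 = [82, 76, 91, 58, 70, 62, 85, 75, 89, 68]

p_agree, cells = proportion_of_agreement(test_1, test_2, cutoff=75)
print(p_agree, cells)
```

Here four students are masters on both tests and three are nonmasters on both, so seven of the ten classifications are consistent.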

VALIDITY: THE CORNERSTONE OF GOOD MEASUREMENTS. Inadequate measurement instruments are tolerated in our schools because (1) dependable information is lacking for many tests and (2) school practitioners are untrained in the technical concepts and terminology necessary to understand what that information means. Validity refers to the degree to which a test measures that which it is intended to measure. Validity is not a property of the instrument itself; rather, it is an indication of the extent to which the interpretation of test results for a particular measurement situation is appropriate for the given purpose. Therefore, validity indicates how well a test measures what it is supposed to measure for a particular use of the test and whether the scores based on the test are free from the influence of extraneous factors. Validity is meaningful only as it pertains to the particular use for which the test results are intended; therefore, one should not speak of a test as "valid" or "invalid" in general but only with respect to the purpose for which its results are intended.

AN INTEGRATED CONCEPT OF VALIDITY. The quality of the test and the tasks that make up the test certainly have an important influence on the validity of the score-based inferences, but several other factors also influence the validity of the scores and inferences based on the scores, including (1) the nature of the group tested, (2) the conditions under which the test is administered, (3) the scoring criteria and procedures used, and (4) how the scores are used. The three most commonly collected types of validity evidence are (1) content-related evidence, (2) criterion-related evidence, and (3) construct-related evidence.

APPROACHES TO ESTABLISHING VALIDITY

CONTENT VALIDATION. Refers to the extent to which the test's items represent the entire body of content, often called the content universe or domain, that the test is designed to measure. The basic issue in content validation is representativeness; that is, how adequately do the test items represent the entire body of content that the test user intends to make inferences about. In the context of content validation, the word content refers to both the subject matter topics the test intends to cover and the cognitive processes that examinees are expected to apply to the subject matter.

Collecting Content Validation Evidence.

Collecting this evidence typically involves two major activities. First, to design a systematic plan for including items in a test so that different aspects of content and behavior domains will be adequately covered. Second, to make an informed judgement about the degree to which the subject matter covered by each item and the cognitive process that it elicits match the process and topic specified in the corresponding objective.

Evidence of content validity can be strengthened by adhering to the following steps:

1. Describe and specify as clearly as possible the domain of behaviors to be measured. For educational tests, this would typically require analysis of curriculum guidelines, courses of study, syllabi, textbooks, and other related items. This is to ensure that all relevant content areas will be covered.
2. The domain of behavior outlined in step 1 should be analyzed and subcategorized into more specific topics, subject matter areas, or clusters of instructional objectives. For example, if we were to test how well you have learned and can apply the content of this section on validity, we might decide to subdivide the content into the following categories: face validity, content validity, criterion-related validity, and construct validity. Or we might have three categories of objectives: knowledge, understanding, and application.
3. Draw up a set of test specifications that show not only the content areas or topics to be covered during the instructional processes, or objectives to be tested, but also the relative emphasis to be placed on each.
4. Decide how many questions to include in the test.
5. Determine how many items will need to be developed in each cell to make sure there is representative coverage of all content areas and categories of instructional objectives.
6. Construct or select test items appropriate for each cell.
7. Have another teacher or a content expert construct a second set of items, using the same table of specifications. Reviewing similarities and differences between the two sets will help identify unwitting biases you might bring to the item writing task as well as strengthen the final set of test items that are selected.

CRITERION-RELATED VALIDATION. Refers to the extent to which one can infer from an individual's score on a test how well she will perform some other external task or activity that is supposedly measured by the test in question. The degree to which scores on the test being validated can predict performance on the criterion is determined by (1) administering the test being validated to a representative group of individuals for whom scores on the criterion can be obtained and (2) then computing a correlation coefficient that statistically describes the degree to which the two sets of scores are related. The resulting correlation coefficient is called a validity coefficient. External criteria can be of two types: criterion measures taken at approximately the same time as the test is administered, or criterion measures taken significantly later. Correlation with the former produces evidence of concurrent validity; correlation with the latter yields evidence of predictive validity.

Evidence of Predictive Validity. Refers to how well a measure predicts or estimates future performance on some criterion other than the test itself.

Administer and score the test you will use for the prediction.
Wait for an appropriate time.
Measure the external criterion on which you are attempting to predict performance.
Correlate the scores on the predictor test with measurements on the external criterion.
Interpret the resulting validity coefficient.

Evidence of Concurrent Validity. The criterion here is another task, measured at about the same time, that is designed to provide different information about an individual's expertise. For example, a mechanical aptitude test might yield dissimilar data as compared to a hands-on experience used as concurrent data. In concurrent validity, both performance on the test being validated and that on the criterion can be obtained at the same time, and the former is used in place of the latter because of such factors as ease, convenience, and cost effectiveness.

Expectancy Tables to Illustrate Criterion-related Validity. Construct a table that illustrates the relationship between the scores on the predictor test and the scores on the measure of the criterion. Such a table is usually called an expectancy table.
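One way such a table might be built is sketched below. Everything in the example is an assumption made for illustration: the ten predictor scores, the pass/fail criterion, the three score bands, and the function name. Each row of the table gives, for students in one predictor band, the percentage who reached each criterion outcome.

```python
from collections import Counter, defaultdict

def expectancy_table(predictor_scores, criterion_levels, bands):
    """Percentage of students at each criterion level within each predictor band."""
    counts = defaultdict(Counter)
    for score, level in zip(predictor_scores, criterion_levels):
        for label, (low, high) in bands.items():
            if low <= score <= high:
                counts[label][level] += 1
                break
    table = {}
    for label in bands:
        total = sum(counts[label].values())
        # Leave the row empty if no student fell in this band.
        table[label] = (
            {level: 100 * n / total for level, n in counts[label].items()}
            if total else {}
        )
    return table

# Hypothetical predictor test scores and later course outcomes for ten students.
predictor = [35, 42, 48, 55, 58, 63, 67, 72, 78, 85]
outcome = ["fail", "fail", "pass", "fail", "pass",
           "pass", "pass", "pass", "pass", "pass"]
bands = {"30-49": (30, 49), "50-69": (50, 69), "70-89": (70, 89)}

table = expectancy_table(predictor, outcome, bands)
for label, row in table.items():
    print(label, row)
```

Read row by row, the table shows how the chance of passing rises with the predictor score in this made-up group, which is the relationship a validity coefficient summarizes in a single number.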

Validity and Reliability of the Criterion. It is essential that the validity and reliability of the criterion be well established. A criterion will not be useful if it is unstable or invalid. For example, success in life is not a good criterion.

CONSTRUCT VALIDATION. A construct is an unobservable, postulated attribute of individuals that we create in our minds to help us explain or theorize about human behavior. Since constructs do not exist outside the human mind, they are not directly measurable. Construct validation is the process of collecting evidence to support the assertion that a test measures the construct claimed by the test developer.

FACE VALIDITY. Refers to the degree to which a measurement instrument appears to measure what it is intended to measure, to those who administer and/or take the test.

CONSEQUENTIAL ASPECTS OF VALIDITY. Consequential validity concerns the unintended, usually negative, consequences of testing in a particular situation.

FACTORS THAT CAN REDUCE VALIDITY

Interpreting and Improving Validity Coefficients. Content and construct validity approaches typically do not yield statistical validity coefficients and therefore cannot be interpreted in as precise and universally understood terms as can the criterion related validity approaches -- predictive and concurrent.

For Those Who Want to Dig Deeper.

Predictive validity coefficients will typically be lower than concurrent validity coefficients.
The size of a predictive validity coefficient will be affected by the reliabilities of both the criterion and the predictor.
Validity coefficients derived from scores of homogeneous groups will be lower than those from scores of heterogeneous groups.
Increasing the length of the predictor test will slightly increase predictive validity.

VALIDITY OF CRITERION REFERENCED MEASURES. Standards for Educational and Psychological Testing refines the definition of validity to be a unitary concept and places increased emphasis on the idea that construct validity evidence encompasses both content validity and criterion related validity.

RELIABILITY, VALIDITY, AND THE USEFULNESS OF A MEASURE

A measure must be both valid and usable to practitioners and both concerns must be kept in mind when such measures are designed or selected. In fact, usefulness is the ultimate criterion educators should apply in choosing or developing every measure that will be used in our schools.

SELECTED BIBLIOGRAPHY

(September, 2000). Policy Implications of the 32^nd Annual Phi Delta Kappa/Gallup Poll, Phi Delta Kappan, 49-52.

Allen, D.W., (1999). Schools for a New Century: A Conservative Approach to Radical School Reform, Westport, CT: Praeger Publications.

Bracey, G.W., (January/February, 1995). The Assessor Assessed: A ‘Revisionist’ Looks at a Critique of the Sandia Report, Journal of Educational Research, 88(3), 136-145.

Bracey, G.W., (June, 2000). ‘Diverging’ American and Japanese Science Scores, Phi Delta Kappan, 791-792.

Bracey, G.W., (March, 1996). 75 Years of Elementary Education, Education Digest, 61(7), 26-30.

Bracey, G.W., (March, 1999). Getting Along Without National Standards, Phi Delta Kappan, 548-549.

Bracey, G.W., (May, 1998). Tips for Readers of Research – No Causation from Correlation, Phi Delta Kappan, 711-712.

Bracey, G.W., (May, 1999). The Forgotten 42%, Phi Delta Kappan, 711-712.

Bracey, G.W., (May, 2000). The TIMSS Final Year Study and Report: A Critique, Educational Researcher, 29(4), 4-10.

Bracey, G.W., (November, 1994). Don’t Let Misunderstood Data Do a Number on You!, Education Digest, 60(3), 47-52.

Bracey, G.W., (November, 1997). What Happened to America’s Public Schools? American Heritage, 48(7), 38-47.

Bracey, G.W., (November, 1997). The Sources of Statistics, Phi Delta Kappan, 248-249.

Bracey, G.W., (November, 1998). Test Scores of Nations and States, Phi Delta Kappan, 247-248.

Bracey, G.W., (October, 1996). The Sixth Bracey Report on the Condition of Public Education, Phi Delta Kappan, 127-138.

Bracey, G.W., (October, 1999). The Ninth Bracey Report on the Condition of Public Education, Phi Delta Kappan, 147-168.

Bracey, G.W., (October, 2000). The 10^th Bracey Report on the Condition of Public Education, Phi Delta Kappan, 133-144.

Bracey, G.W., (September, 1994). The Media’s Myth of School Failure, Educational Leadership, 80-83.

Bracey, G.W., (Summer, 1998). Johnny’s Grades Aren’t So Bad, Wilson Quarterly, 22(3), 126-129.

Bracey, G.W., Rotten Apple: The Back to Statistics 101 and My Arithmetic is Shaky Too Award, Phi Delta Kappan, 124.

Bracey, G.W., Rotten Apple: The Statistics from Thin Air Award, Phi Delta Kappan, 122.

Coyle, J., Final Answer? Computer Testing’s Real Payoff, This District Found, is Fast and Flexible Data. Retrieved on March 16, 2001 from http://www.electronic-school.com/2001/03/0301f8.html

Elam, S.M., Rose, L.C., (September, 1995). The 27^th Annual Phi Delta Kappa/Gallup Poll of the Public’s Attitudes Toward the Public Schools, Phi Delta Kappan, 77(1), 41-62.

Elam, S.M., Rose, L.C., Gallup, A.M., (September, 1991). The 23^rd Annual Gallup Poll of the Public’s Attitudes Toward the Public Schools, Phi Delta Kappan, 41-56.

Elam, S.M., Rose, L.C., Gallup, A.M., (September, 1994). The 26^th Annual Phi Delta Kappa/Gallup Poll of the Public’s Attitudes Toward the Public Schools, Phi Delta Kappan, 41-56.

Foshay, R., Ph.D., (January, 2000). A Guide for Implementing Technology, PLATO, Inc., 1-36.

Freeman, J., (February, 1995). What’s Right with Schools? ERIC Digest. Retrieved on March 1, 2001 from the World Wide Web: http://www.ed.gov/databases/ERIC_Digests/ed378665.html

Huelskamp, R.M., (September, 1993). The Second Coming of the Sandia Report, Education Digest, 59(1), 4-9.

Jensen, E., (1998). Teaching with the Brain in Mind, Alexandria, VA: Association for Supervision and Curriculum Development.

Kleiner, C., Change Comes to Teacher Education, U.S. News Online. Retrieved on March 16, 2001 from the World Wide Web: http://www.usnews.com/usnews/edu/beyond/grad/gbed.htm

Langdon, C.A., (November, 1996). The Third Phi Delta Kappa Poll of Teachers’ Attitudes Toward the Public Schools, Phi Delta Kappan, 244-250.

McKenzie, J., (January, 2001). Head of the Class: How Teachers Learn Technology Best. Retrieved on March 16, 2001 from the World Wide Web: http://www.electronic-school.com/2001/0l/0101f2.html

Phelps, R.P., Why Testing Experts Hate Testing, Thomas B. Fordham Foundation, Retrieved on March 3, 2001 from the World Wide Web: http://www.edexcellence.net/library/phelps.htm.

Rose, L.C., and Gallup, A.M., (September, 1998). The 30^th Annual Phi Delta Kappa/Gallup Poll of the Public’s Attitudes Toward the Public Schools, Phi Delta Kappan, 41-56.

Rose, L.C., Gallup, A.M., Elam, S.M., (September, 1997). The 29^th Annual Phi Delta Kappa/Gallup Poll of the Public’s Attitudes Toward the Public Schools, Phi Delta Kappan, 41-56.

Tienken, C. and Wilson, M., (2001). Using State Standards and Tests to Improve Instruction, Practical Assessment, Research & Evaluation 7(13), 1-8. Available online: http://ericae.net/pare/getvn.asp?v=7&n=13.

Too Much Testing of the Wrong Kind; Too Little of the Right Kind in K-12 Education. Retrieved on March 3, 2001 from the World Wide Web: http://www.ets.org/research/textonly/pic/testing/tmtintro.html

Worthen, B.R., White, K.R., Fan, X., Sudweeks, R.R., (1999). Measurement and Assessment in Schools, New York, NY: Addison Wesley Longman, Inc.
