Renowned Statistical Researcher “No Longer Comfortable” With Being Associated With Educational Testing
The field of psychometrics is the theory and technique of psychological measurement. Psychometrics is currently used to assess educational achievement, but it is also used to measure other traits such as abilities, attitudes, and personality. The theory behind it is that we can objectively measure someone’s performance through a series of tests and then assign a rating for proficiency or mastery. Gene Glass, of Arizona State University, the National Education Policy Center, and the University of Colorado Boulder, was one of the top psychometricians in the United States, having coined the term “meta-analysis” for the statistical technique of looking for patterns in data collected across a number of studies. He worked for many years on educational testing, including a stint advising the NAEP in 1980. He always had misgivings about how criterion-referenced tests were being used to measure educational achievement, having noted in a 1978 paper that the concept of a “cut score,” below which anyone is deemed non-proficient, was an arbitrary absurdity. Mr. Glass now believes we have set the value of educational testing too high for his name to continue to be associated with it.
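In its simplest fixed-effect form, the meta-analysis technique Glass pioneered pools effect sizes from several studies, weighting each study by the inverse of its sampling variance so that more precise studies count for more. A minimal sketch, using entirely hypothetical numbers:

```python
# Fixed-effect meta-analysis in miniature: pool effect sizes from several
# studies, weighting each by the inverse of its sampling variance.
# All numbers below are hypothetical, purely for illustration.
effects = [0.30, 0.45, 0.12, 0.50, 0.28]    # one standardized effect size per study
variances = [0.04, 0.09, 0.02, 0.10, 0.05]  # sampling variance of each effect

weights = [1 / v for v in variances]        # more precise studies get more weight
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
print(round(pooled, 3))                     # inverse-variance weighted mean
```

With these made-up inputs the pooled estimate lands near 0.25, pulled toward the low-variance studies; real meta-analyses add confidence intervals and heterogeneity checks on top of this core calculation.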
His August 17 blog post, “Why I Am No Longer a Measurement Specialist,” explains why he is distancing himself from the current obsession with, and blind faith in, educational testing.
In the last three decades, the public has largely withdrawn its commitment to public education. The reasons are multiple: those who pay for public schools have less money, and those served by the public schools look less and less like those paying taxes.
The degrading of public education has involved impugning its effectiveness, cutting its budget, and busting its unions. Educational measurement has been the perfect tool for accomplishing all three: cheap and scientific looking.
His paper “Standards and Criteria Redux,” first published in the Journal of Educational Measurement in 1978 (Vol. 15, 237-261), explains the history of psychometric testing and explores the implementation of cut scores.
Glass credits Robert Glaser with applying the field of psychometrics to education and coining the term “criterion-referenced test.” Such a test measures a child’s knowledge and is not meant to compare one child’s performance to another’s. Glaser (1963) wrote:
Underlying the concept of achievement measurement is the notion of a continuum of knowledge acquisition ranging from no proficiency at all to perfect performance. An individual’s achievement level falls at some point on this continuum as indicated by the behaviors he displays during testing. The degree to which his achievement resembles desired performance at any specified level is assessed by criterion-referenced measures of achievement or proficiency. The standard against which a student’s performance is compared when measured in this manner is the behavior which defines each point along the achievement continuum. The term “criterion,” when used in this way, does not necessarily refer to final end-of-course behavior. Criterion levels can be established at any point in instruction where it is necessary to obtain information as to the adequacy of an individual’s performance.
Along such a continuum of attainment, a student’s score on a criterion-referenced measure provides explicit information as to what the individual can or cannot do.
Robert Mager was working at the time (1962) on behavioral objectives and how they could be related to a standard. Mager believed that once we had a “minimum acceptable performance for each objective, we will have a performance standard against which to test our instructional programs; we will have a means for determining whether our programs are successful in achieving our instructional intent” (p. 44).
This background is important because the early work of Mager and Glaser was distorted in the 1970s by a number of education researchers to support the idea that there was a single point below which a child’s performance could not be called acceptable. Those researchers advocated for a performance standard: they put a line somewhere on Glaser’s continuum. They would rate, rank, or grade the child based on the number of correct answers to objectives like “the student must be able to correctly solve at least seven simple linear equations within a period of thirty minutes” or “given a human skeleton, the student must be able to correctly identify by labeling at least 40 of the. . . bones; there will be no penalty for guessing” (Mager, 1962, p. 44).
That might be reasonable if we had some assurance that the questions were all equally difficult, or written clearly enough for all students. In 1977, results from a grade-seven assessment by the New Jersey Department of Education showed that “Pupils averaged 86% on vertical addition, but only 46% on horizontal addition.” Does this mean that students who got the horizontal addition questions wrong don’t know how to add, or just that they don’t do as well adding that way? Which way of adding is more important? And if they convert a horizontal problem to a vertical one in the margin to calculate it, are they cheating?
Glaser called this language of performance standards “pseudoquantification, a meaningless application of numbers to a question not prepared for quantitative analysis.” By 1980, when Glass was on a NAEP panel, the panel was being pressured to place this line on the NAEP results. There was much discussion and resistance to doing so, as many panel members believed more in Glaser’s continuum of performance. Glass wrote,
The project was under increasing pressure to “grade” the NAEP results: Pass/Fail; A/B/C/D/F; Advanced/Proficient/Basic. Our committee held firm: such grading was purely arbitrary, and worse, would only be used politically. The contract was eventually taken from our organization and given to another that promised it could give the nation a grade, free of politics. It couldn’t.
Why did the committee feel such grading was arbitrary? Often, when setting cut scores, the initial discussion is fairly arbitrary. One expert may feel that 7/10 questions is acceptable; another may want to hold students to a “higher standard” and demand 9/10. Glass’s paper recalled the 1975 attempt to apply performance standards (cut scores/grades) to the citizenship and social studies assessment results developed by the National Council for the Social Studies.
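The consequence of that arbitrariness is easy to see numerically. A small sketch, with hypothetical scores rather than any real assessment data: the same class, the same ten-item test, but the “percent proficient” figure swings widely depending on where the cut score is drawn.

```python
# Illustrative sketch (hypothetical scores, not from Glass's paper):
# the same students and the same test yield very different
# "proficiency" rates under different arbitrary cut scores.
scores = [4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10]  # raw scores out of 10

def percent_proficient(scores, cut):
    """Share of students scoring at or above the cut score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)

for cut in (7, 8, 9):
    print(f"cut {cut}/10 -> {percent_proficient(scores, cut):.0f}% proficient")
```

Here a cut of 7/10 declares two-thirds of the class proficient, while 9/10 declares only a quarter, though nothing about the students’ actual knowledge has changed. Nothing in the data itself tells you which line is the “right” one, which is exactly the arbitrariness the committee objected to.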
A fully representative panel of nine judges (3 minorities, 5 women, 3 under the age of 30) was formed. Each judge was shown an assessment item and then asked, “Realistically what level of performance nationally for the age level being considered would satisfy you for this exercise? (1) less than 20% correct, (2) 20-40%, (3) 41-60%, (4) 61-80%, or (5) more than 80%?” The panel rendered over 5,000 judgments in a three-day sitting, and it has been reported that “…panel members agreed more often than not, but at times spread their responses across all the available categories” (Fair, 1975, p. 45). About half of the exercises were given a “satisfactory performance level” of “more than 80%.” About 35% of the exercises would satisfy the panel if between 60% and 80% of the examinees answered correctly. The desired performance levels were generally above the actual rates of correct response. What is to be made of the gap? Ought it to be read as evidence of the deficiency of the educational system; or is it testament to the panel’s aspirations, American hustle and the indomitable human spirit (“Man’s reach should exceed his grasp, etc.”)?
Glass noted in his latest post that others are catching on to the problem with blind adherence to test scores and the high-stakes consequences attached to them.
There has been resistance, of course. Teachers and many parents understand that children’s development is far too complex to capture with an hour or two taking a standardized test. So resistance has been met with legislated mandates. The test company lobbyists convince politicians that grading teachers and schools is as easy as grading cuts of meat. A huge publishing company from the UK has spent $8 million in the past decade lobbying Congress. Politicians believe that testing must be the cornerstone of any education policy.
The results of this cronyism between corporations and politicians have been chaotic. Parents see the stress placed on their children and report them sick on test day. Educators, under pressure they see as illegitimate, break the rules imposed on them by governments. Many teachers put their best judgment and best lessons aside and drill children on how to score high on multiple-choice tests. And too many of the best teachers exit the profession.
When measurement became the instrument of accountability, testing companies prospered and schools suffered. I have watched this happen for several years now. I have slowly withdrawn my intellectual commitment to the field of measurement.
So Glass is out. So are good teachers. Many parents are out too. The ones still in are the “experts” at testing companies and educational suppliers who see a never-ending stream of revenue from everyone chasing the imaginary cut score.
A final thought from Glass’s 1978 paper:
The reader can justifiably ask, “What manner of discourse is being engaged in by these experts?” How is one to regard such statements as “the student must be able to correctly solve at least seven simple linear equations in thirty minutes” or “90 percent of all students can master what we have to teach them.” If such statements are to be challenged, should they be challenged as claims emanating from psychology, statistics, or philosophy? Do they maintain something about learning or something about measurement? Are they disconfirmable empirical claims or are they merely educational rhetoric spoken more for effect than for substance?