Voices in Education

Assessments That Measure What Matters
My father-in-law was a classical pianist. He immigrated to the United States from Austria in the early forties. His first official act was to apply to the American Federation of Musicians for a union card, which he needed in order to work. To get this card he had to pass a simple test: the examiner pointed to a piano and asked him to play something. He chose Beethoven’s Sonata no. 8 in C minor, nicknamed the Pathétique. He played the first, very soft chord and the test was over—clearly he was a musician.

Many years later, I found myself on a school committee, interviewing a man who was applying for a job as a physics teacher in the local high school. I asked him to explain to me the phenomenon of “zero gravity” experienced by astronauts in orbit. Three sentences later I knew that the applicant, despite being certified to teach physics, had no real understanding of the subject.

Just as a thermometer measures its own temperature, an educational assessment by definition can only evaluate the test-taking ability of the subject. To measure, say, the temperature of a roast, one must ensure that the thermometer is in contact with the meat and then wait until both have reached thermal equilibrium. Only then does the thermometer’s reading correspond to the temperature one is interested in. The analogy is obvious: an assessment of a learner’s skill, knowledge, or understanding is only valid if the performance required (i.e., taking the test) is a reliable marker for the constructs of interest.

No one, I assume, would argue that my father-in-law should have been required to take a multiple-choice test on fingering, say, or to identify middle C on the staff. Such low-level skills would not be expected to correlate closely with musical ability. And evidently the ability to pass a certification test was not a reliable indicator of the would-be physics teacher’s grasp of the subject.

We see it all the time: students learn test-taking skills without learning the content. And no wonder—test results have become so important that teachers spend inordinate amounts of time teaching precisely those superficial and largely irrelevant skills.

Why is it proving so difficult to design assessments that reliably measure what we really care about? One reason, I submit, is that we are stuck with an outmoded, static medium (hint: it’s called paper); another is that we take for granted that the only way to assess learning is to ask questions and score the answers, rather than posing complex, multi-step challenges and analyzing a subject’s actions.

The teacher I interviewed had passed a test that probably contained questions like this one: “The International Space Station orbits the earth at a height of 220 miles. Calculate its acceleration and that of the astronauts inside it.” Such a question requires the recall of an equation or two, and the ability to operate a calculator—skills the teacher evidently possessed. But what if the task had involved demonstrating weightlessness in orbit, using a computer model?
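For the record, the arithmetic behind that question fits in a few lines. Here is a minimal sketch, assuming the textbook values for Earth’s gravitational parameter and radius (the variable names are mine, not anything drawn from an actual certification test):

    # Acceleration of the ISS (and of the astronauts inside it) 220 miles up.
    # Both fall freely toward Earth with the same acceleration, which is why
    # the astronauts feel weightless -- not because gravity has vanished.
    GM_EARTH = 3.986e14           # Earth's gravitational parameter, m^3/s^2
    R_EARTH = 6.371e6             # mean radius of Earth, m
    altitude = 220 * 1609.34      # 220 miles, converted to meters

    r = R_EARTH + altitude        # distance from Earth's center, m
    a = GM_EARTH / r**2           # Newton: a = GM / r^2
    print(round(a, 2), "m/s^2")   # about 8.81 m/s^2, roughly 90% of surface g

Recalling the equation and pushing the calculator buttons gets you the number; it says nothing about whether you understand why the astronauts float.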

Imagine such a model—let’s call it the “Solar System Construction Kit”—that can contain various objects (planets, moons, space stations) that move about on the screen under their mutual gravitational attraction. The display has to be able to zoom in and out over many scales, and there should be a way to speed up or slow down time, so while you’re at it, imagine those features, too. Using this tool, we could create a sun, nine planets (or is it eight these days?) and their various satellites, and then we could challenge our prospective physics teacher to add a space station with an astronaut and use the resulting model to demonstrate and explain the “zero weight” phenomenon.
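To make the thought experiment concrete, here is a bare-bones sketch of the kind of computation such a kit might be doing behind the screen: a few bodies advancing under their mutual gravitational attraction, one small time step at a time. The names and numbers are illustrative only; this is not a description of any existing product.

    # Toy model: every body attracts every other body (Newtonian gravity only).
    # A real construction kit would add the display, zooming, and a time dial.
    G = 6.674e-11                     # gravitational constant, m^3 kg^-1 s^-2

    class Body:
        def __init__(self, mass, x, y, vx, vy):
            self.mass, self.x, self.y, self.vx, self.vy = mass, x, y, vx, vy

    def step(bodies, dt):
        """Advance every body by one time step of dt seconds."""
        for b in bodies:
            ax = ay = 0.0
            for other in bodies:
                if other is b:
                    continue
                dx, dy = other.x - b.x, other.y - b.y
                r = (dx * dx + dy * dy) ** 0.5
                a = G * other.mass / r**2     # acceleration toward 'other'
                ax += a * dx / r
                ay += a * dy / r
            b.vx += ax * dt
            b.vy += ay * dt
        for b in bodies:
            b.x += b.vx * dt
            b.y += b.vy * dt

    # Earth plus a space station 220 miles up, started at roughly orbital speed.
    earth = Body(5.972e24, 0.0, 0.0, 0.0, 0.0)
    station = Body(4.2e5, 6.725e6, 0.0, 0.0, 7700.0)   # ~7.7 km/s, tangential
    for _ in range(5400):                              # about one orbit at dt = 1 s
        step([earth, station], dt=1.0)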

If you were observing him, you would be able to figure out very quickly whether or not he understood the physics. You might start by noticing where he put the space station. Does he place it near Earth, where it belongs, or far away (perhaps in hopes of finding a place where there is “no gravity”), or possibly somewhere between Earth and the Moon where he hopes the gravitational forces will cancel out? Does he launch his model space station into orbit, or does it crash or wander endlessly through space? If he does manage to insert it into a stable orbit, let’s program the computer to detect that fact and react by asking him to draw arrows representing the accelerations of the space station and its occupant. Oh, and we might also ask him to calculate them.
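How might the computer detect a stable orbit? One simple possibility, offered purely as an illustration rather than a description of any existing system, is to check whether the station is on a bound path around Earth: its specific orbital energy must be negative, and it must stay clear of the atmosphere. The sketch below assumes the same Body objects as the toy model above.

    def is_bound_orbit(station, earth, GM=3.986e14, r_min=6.5e6):
        """Illustrative check: negative specific orbital energy means a bound
        (roughly elliptical) orbit; a radius under r_min means the station has
        dipped into the atmosphere and is effectively crashing."""
        dx, dy = station.x - earth.x, station.y - earth.y
        r = (dx * dx + dy * dy) ** 0.5
        if r < r_min:
            return False
        vx, vy = station.vx - earth.vx, station.vy - earth.vy
        energy = (vx * vx + vy * vy) / 2 - GM / r   # specific orbital energy, J/kg
        return energy < 0                            # bound if the energy is negative

Polled once per time step, a check like this could tell the system when to pose the follow-up questions about acceleration arrows.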

As the subject does all this, of course, the computer will be monitoring and analyzing his actions, much as a human examiner might look over his shoulder, asking probing questions from time to time and evaluating his understanding of the relevant physics concepts.

This sort of thing is not pie in the sky; a number of systems already do it (I discuss some of them in my chapter in the book New Frontiers in Formative Assessment, to be published in December by Harvard Education Press) and many more are on the way. After all, if Amazon and Google can figure out what books you might like or whom you’re likely to vote for, we ought to be able to determine how much you know and can do.

And maybe even teach you something along the way!


About the Author: Paul Horwitz is a senior scientist at The Concord Consortium. He is also a contributor to New Frontiers in Formative Assessment, edited by Pendred E. Noyce and Daniel T. Hickey (Harvard Education Press, 2011).