Monday, March 26, 2012

Bad Tests, Teacher Evaluations and Incentives

I taught 95 students how to decode the standardized test questions and answers without reading the passage.  Then I gave them an 8-question, no passages reading test, and where they should have gotten 25 per cent (random guessing with four answer options), the group average was over 60 per cent.  The identifiability of patterns in questions and answers calls into doubt the accuracy and usefulness of Washington state's standardized reading multiple choice questions.

More than 40 states have joined the so-called Common Core State Standards and the associated testing consortium that allows the members to coordinate their standard-setting and assessment.  This is the closest we've yet gotten to a national education standard.  
One consequence of this development is that we can move even more determinedly forward to connect students' standardized test scores to teacher evaluations.  Along the way, though, many teeth are gnashing over the mechanics of just how to make this linkage.  For instance, a significant impetus last fall for the teacher strike in the district where I live (Tacoma, WA) derived from the contention over the way to join evaluations and scores, and the resolution of that strike created a(nother!) committee to figure it out.  Their recommendations are still forthcoming.
But the anxiety over implementation is only part of the story.  We would do well to take a serious look at the test side of the equation by itself.  Education reformers and the business community, among others, come out big for testing, with business cheering for increased accountability of teachers and schools by way of the politically elusive test-evaluation connection.
The debate about teacher evaluations and test scores proceeds along predictable lines.  Test opponents are tagged as unionists only interested in keeping cushy jobs.  Test supporters are thought woefully out of touch about how classrooms really function.  All the while, the test itself, that thing and process on which so much of the acrimony hangs, sits rather unassumingly by.  We talk little about the test or the test process, and, by implication, place great faith in the device and its accuracy and reliability in assessing students' knowledge and capacities.
But just how much faith should we put in the test and in procedures that use the scores as evidence to determine anything beyond whether a student did well or poorly on that specific test?  Can we rely on the tests to actually and accurately measure knowledge and capability in a particular subject area?  
It turns out that for the reading test at least, the answers may be disquieting.  The reading MSP (Measurement of Student Progress, Washington's state standardized test) exhibits patterns which make it more an examination of 'test taking' than of reading.  Sampling a few test questions indicates that we can discern a set of predictable patterns in the question-making and the answer construction.  These patterns give savvy test takers an advantage and at the same time make the test a less than useful or accurate measure of a student's reading performance, or of how well a particular teacher is doing his or her work.
The following tutorial is based on a 3-question OSPI (WA's state education agency) 'released item.'  Released items are test material available at the OSPI web site; each consists of a passage and question set that was earlier beta-tested on real students, unbeknownst to them, as a non-scored section of a real MSP test.  The tutorial will prepare you to "take the MSP" by using the patterns identified here.  
Following the tutorial are two sets of four questions from other released items.  See if you can't make a pretty good guess about the answers, or at least narrow your choice down to the two best answers, or identify the easy way to answer the question.  (Correct answers follow at the end.)
Yes, to best evaluate the presence and identifiability of patterns, this test process will proceed without any reading passages.  Just the title, the questions and the answer options.
Most 8th grade students (the group I teach) have up to 5 years of experience with Washington's standardized tests, and when I explained the patterns described below, many realized they already had a general, if somewhat unconscious, awareness of them.  All patterns explained here have been identified through 6 years of working with OSPI released items, and listening to students observe--in ways only teenagers can--the similarity between released items and actual MSP items.  Teachers, along with everyone besides students and the state bureaucrats, are forbidden from looking at actual MSP tests.
Clearly, released items are not test items, but with a review process as tortuous as the one each item must pass through (consider all the bias and sensitivity screening prospective test items go through), released items and actual test items are unlikely to be significantly different.  If they are different, then the students' years of experience with real MSP questions shouldn't transfer to success on released items.
Just what are these patterns, then?  The best explanation comes from looking at the examples below.

1.   What is the main idea of "Excerpt from Iditarod Dream"?

  A. Sled dog racing is a thrilling and dangerous sport.
  B. Sled dog racing requires teamwork and training.
  C. Sled dog racing requires specialized equipment.
  D. Sled dog racing can be a family activity.
First, this is a main idea question, so we need to have a sentence that is 'worthy' of serving as a main idea.  It's hard to explain, but ask a nearby 8th grader; he or she will understand that some of these just don't 'feel' like MSP-type main idea answers.  They're not serious or important or high-quality enough, or at least they're not as serious as some other options.  
'Specialized equipment' isn't as important a point as either 'teamwork and training' or 'thrilling and dangerous.'  'Family activity' is almost nonsensical in that it violates our expectations of what we might think or hear about dog sledding.  While there may indeed be a family out there that makes sledding one of their activities, this would be an oddity.  The MSP doesn't usually make main points out of oddities.
'Teamwork and training' or 'thrilling and dangerous' are the best options, then.  But the MSP often includes readings with a kind of moral element.  There are an unusual number of uplifting or inspiring stories.  The subject may be a little-known figure gallant in service to others, or a determined soul who has surmounted obstacles to achieve something and/or (better yet) learned some important life lesson; either way, MSP questions go through a vetting process that renders controversial or negative material unlikely to make the final cut.  
Thinking of it in this way, 'thrilling and dangerous' has just a hint of the selfish and irresponsible.  'Teamwork and training,' by contrast, is the kind of emphasis the MSP can and likes to support.  I'd probably go with that...and I'd turn out to be right.

According to "Excerpt from Iditarod Dream," why does Dusty decide to help the other racers build a fire?

  A. He uses the fire light to see the trail markers.
  B. He thinks the fire will help him stay awake.
  C. He is following the rule of the wilderness.
  D. He needs to cook the dogs' frozen meat.
The MSP can tend toward the 'unusual' option.  C jumped out immediately because it's of a different quality from the others, which are all specific and concrete things.  C, by contrast, is an interestingly oblique answer that hints of something 'higher' than the other three.  The combination of uniqueness and grandness makes C too hard to pass up, and doing so would yield a wrong answer--C is correct.
According to "Excerpt from Iditarod Dream," how would Dusty most likely react to entering another dog sled race?

  A. He would be hopeful because he came so close to winning.
  B. He would be nervous because he had trouble staying on the trail at night.
  C. He would be excited because he knew how it felt to cross the finish line in the lead.
  D. He would be anxious because he ran out of supplies and needed more for the next race.
At first blush, this ostensibly 'prediction' question seems unanswerable without reading the passage.  Indeed, how can we predict anything with such a dearth of knowledge of the situation?  Further, each answer option for this question contains a detail that we can only guess at, so we're left with a higher degree of uncertainty than in the previous questions.  But ultimately we are trying to get the correct test answer here, not predict something about Dusty, so things are not as hopeless as they seem.  
First, cover all the answers from the word 'because' onward.  You are left with a list of adjectives about how Dusty would feel.  The old advice to 'look for the stronger word,' and the current advice to think about uplift and inspiration could be of some help.  Granted, not every test item will work this way, but following these two 'rules,' C--excited--breaks out to an early lead in our race to decide.  Option A has the tinge of the overly competitive.  The MSP probably tends to de-emphasize things like winning.  Just look at how the test renders 'winning' in option C--'knew how it felt to cross the finish line in the lead.'  They seem to be at pains to avoid a word that sits uncomfortably in the social culture of collaborative education.  8th graders may not follow or care about the culture of education, but they do pick up on patterns, and the combination of that quirky way of saying 'win' and the most upbeat adjective--'excited'--makes C a plausible option.  
Granted, this explanation is much more abstruse and convoluted, so do some more work by covering every answer from the adjective back to the beginning of the sentence, leaving exposed what are really the first parts of four conditional statements.  For instance, option A can be rearranged to say "He came close to winning, so he will be hopeful."  
You'll note that not all the events can occur in the story.  How could Dusty come close to winning and cross the finish line in the lead?  He can't, so either A or C is incorrect.  It's unlikely that both A and C are incorrect, as Dusty had to either win or not win, and the answer set would be strangely vexing if one of the causal elements (the latter part of the statement) were true, but that answer were wrong.  It would indeed be a more challenging test if readers had to actually make inferences about Dusty's feeling--by, say, dealing with several true statements.  But such questions are not as easily graded as MSP questions need to be.  
Using the 'deep' or 'serious' test, D is the least likely--it does not have the feel of a high level of thinking.  B is a contender, but its chances are reduced by the difficulty of both A and C then being incorrect.  I'm going with C, the odd wording for 'win' being too strong a pull to avoid.  
(At another time and place it would be worth considering how this unusual wording is really meant to distract some testers.  Some students will be vexed by the difference between 'win' and 'cross the finish line in the lead,' and so will not be sure if this is the right answer.  This vexation will help ensure that some students will get the answer wrong, thereby creating the necessary 'distribution' of answers and scores.  This opens a whole different problem--the effort to measure whether individual students are meeting a standard by misusing testing devices and procedures whose design actually distributes students across the outcome spectrum.)
Now, when you eventually do read the passage, all you really have to do is simply confirm which of the events described in the latter part of each sentence actually happened.  Did Dusty win?  If so, it's C.  Most of the time the option set will contain only one accurate description of an event which actually occurs in the story, making the corresponding answer option the obvious choice.   
The student taking the test does not really have to predict, s/he just has to look for which of the events described in the answer options really did happen.  Almost certainly only one occurred, but in the effort to make the test something more than matching (the story event to the correlated answer option) some slightly inaccurate permutation of one of the other events will appear as an answer.  In this case, the oddly inaccurate one is the 'going off the trail' option, and the correct answer is C.  (I confess, I've still not read the passage accompanying these questions, but my 7th grade daughter confirmed these details.)
Interestingly, this question was categorized as 'comprehension,' which presumably ranks below 'analysis' on the intellectual spectrum.  The question is framed to look like a prediction question but really isn't.  The student's ability to comprehend which detail (from the latter half of the answer options) actually happened in the story is really what's getting tested.  They needn't predict anything.  This question was the hardest to answer without reading the passage, but many testers (including my daughter) were able to narrow it down to two answers and C was one of them. 
With the revelation of these patterns fresh in mind, I determined to figure out whether real test-taking students discerned the same patterns, or if I just unlocked my own odd, but ultimately individual, insight into the test.
I administered two different versions (one version is reproduced below) of 8-question tests with no reading passages to 95 8th grade students, and one 7th grader--my daughter.  I provided the title of the passage, followed by 4 questions on the passage, each question with 4 answer options.  Presumably, each student's average score would be in the area of 2 correct answers out of 8 (1 out of 4), as would the overall average of all students.  
Having identified and explained the test patterns to the students, I predicted that scores would be significantly higher than random chance would produce.  Indeed, the average for the 95 students was 5.1 correct answers, about 2 ½ times the expected outcome from chance.  The standard deviation of the set was 1.38.  The t-test p value for these results was 0.0001.  In other words, the probability that 95 testers would average 5.1 correct answers when they should have averaged 2 (according to chance) is exceedingly low.
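For readers who want to check the arithmetic, here is a minimal sketch in Python.  It uses only the summary figures reported above (95 students, 8 questions, 4 options each, a mean of 5.1 and a standard deviation of 1.38).  The function names are invented for this illustration, and it assumes a one-sample t-test against the chance score, which may not be exactly the test behind the reported p value.

```python
import math


def chance_score(num_questions, num_options):
    """Expected number of correct answers from pure random guessing."""
    return num_questions / num_options


def one_sample_t(sample_mean, expected_mean, sample_sd, n):
    """t statistic for a one-sample t-test, computed from summary statistics."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - expected_mean) / standard_error


# Figures reported in the post; the test type is an assumption of this sketch.
expected = chance_score(num_questions=8, num_options=4)
ratio = 5.1 / expected
t_stat = one_sample_t(sample_mean=5.1, expected_mean=expected,
                      sample_sd=1.38, n=95)

print(f"expected by chance: {expected:.1f} correct")
print(f"observed / expected: {ratio:.2f}")
print(f"t statistic: {t_stat:.1f} (degrees of freedom: {95 - 1})")
```

Under these assumptions the t statistic comes out near 22 on 94 degrees of freedom, which sits so far out in the tail that a very small p value is what one would expect.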
These findings raise a variety of questions about the MSP multiple choice questions, and none of the likely answers are good.  Fundamentally, is this as good a reading test as we hope and want it to be?  Or is it less a reading test than a test on test-taking?  
If students can identify patterns of questions and their answers and get much better than expected scores just from knowing and seeing those patterns, then this particular standardized test is not really testing reading ability.  We could certainly claim that savvy (i.e., 'smart') students will more likely figure out the patterns and get the advantage on the test, and that such savvy is positively related to reading ability, but this adds another layer of uncertainty into the assessment process.  
If the test writers have corrected these patterns in the actual test items, the rest of us would never know, as gaining access to the test is not an easy process.  The screening process for test questions, with at least three phases of content, bias and sensitivity filtering, narrows the range of plausibly acceptable items, and increases the probability that the released (rejected) items are essentially similar to the actual test items.  When I described the patterns in the released items, my 8th grade students certainly seemed familiar with them.  This indicates some compatibility between the rejected and the accepted test items.
Given all this--the substantially better than expected student scores, a test-writing process that probably generates a narrow range of question/answer design alternatives, the secrecy of the test production--we can only wonder at just how useful this test really is.  But I can say this: if my professional evaluation is going to be tied to such a test, I'm teaching every one of these tricks.  


Anonymous said...

Well, the objective of the tests is clearly not to allow for testee opinion, as there is only one "right" multiple choice answer. The test is not for comprehension, as the test answers are clearly the opinion of the testers. Therefore the only thing the test can do is assess test-taking ability, which is apparently valued over all in the new educational philosophy. Indeed, test-taking skills will get your students into college, med school, graduate programs... just about anywhere except the state and federal legislatures, where the anti-intellectual bullies are having a fine time showing their contempt for real learning. I say the logic of test-taking is more valuable a skill than dividing polynomial functions by hand, and as long as we have the latter, we had better be teaching the former.

Anonymous said...

I really appreciate this post. Just today I had a conversation with a colleague about the issues related to the MSP (it’s testing week at our elementary). What, really, do we want to know? If we really want to know how the children in Washington are doing, we simply need to ask their teachers. The teachers I work with don’t need test scores to know how their children are performing. We assess that every day. The problem is that the legislature and others don’t trust teachers.

My son is in second grade. I don’t need to look at his report card to know how he is doing in school. I attend his parent-teacher conference twice a year and I see the papers he brings home. I haven’t looked at his report card from the last trimester because it doesn’t give me any information I don’t already know. It’s a paperwork requirement. I know I will hear from his teacher if there is a significant issue.

I work in special education. The MSP is particularly brutal for this population. The state and feds know that this student is on an IEP. The student has IEP goals and we report progress on these IEP goals every trimester. What is a score on the MSP going to tell us? Bottom line, the time and money spent on the MSP would be better spent on giving our students the time (and the school money) needed to actually give them the education necessary to make it in this crazy world. It’s time to stop the testing and start trusting teachers. Send us a survey or call us up, we’ll give you more information about our students’ aptitude than any test will ever inform.