Teacher evaluations based on test scores: Part 2

I received so much information in response to my query about teacher evaluations based on test scores that I decided to share it in three posts. The first post, Washington State Bill Proposals HB 2427/SB 6203 and HB 2451: Teacher evaluations based on test scores, featured parents around the country responding with their own experiences.

In this post, we will look at what researchers have to say on the subject.

Linda Darling-Hammond at Stanford University responded to my query with a paper titled Getting Teacher Evaluation Right: A Background Paper for Policy Makers, which she researched and wrote with Audrey Amrein-Beardsley at Arizona State University, Edward H. Haertel at Stanford University, and Jesse Rothstein at the University of California, Berkeley. The Executive Summary states:

Consensus that current teacher evaluation systems often do little to help teachers improve or to support personnel decision making has led to a range of new approaches to teacher evaluation. This brief looks at the available research about teacher evaluation strategies and their impacts on teaching and learning.

Prominent among these new approaches are value-added models (VAM) for examining changes in student test scores over time. These models control for prior scores and some student characteristics known to be related to achievement when looking at score gains. When linked to individual teachers, they are sometimes promoted as measuring teacher “effectiveness.”

Drawing this conclusion, however, assumes that student learning is measured well by a given test, is influenced by the teacher alone, and is independent of other aspects of the classroom context.

Because these assumptions are problematic, researchers have documented problems with value-added models as measures of teachers’ effectiveness. These include the facts that:

1. Value-Added Models of Teacher Effectiveness Are Highly Unstable: Teachers’ ratings differ substantially from class to class and from year to year, as well as from one test to the next.

2. Teachers’ Value-Added Ratings Are Significantly Affected by Differences in the Students Who Are Assigned to Them: Even when models try to control for prior achievement and student demographic variables, teachers are advantaged or disadvantaged based on the students they teach. In particular, teachers with large numbers of new English learners and others with special needs have been found to show lower gains than the same teachers when they are teaching other students.

3. Value-Added Ratings Cannot Disentangle the Many Influences on Student Progress: Many other home, school, and student factors influence student learning gains, and these matter more than the individual teacher in explaining changes in scores.

Other tools have been found to be more stable. Some have been found both to predict teacher effectiveness and to help improve teachers’ practice. These include:

• Performance assessments for licensure and advanced certification that are based on professional teaching standards, such as National Board Certification and beginning teacher performance assessments in states like California and Connecticut.

• On-the-job evaluation tools that include structured observations, classroom artifacts, analysis of student learning, and frequent feedback based on professional standards.

In addition to the use of well-grounded instruments, research has found benefits of systems that recognize teacher collaboration, which supports greater student learning.

Finally, systems are found to be more effective when they ensure that evaluators are well-trained, evaluation and feedback are frequent, mentoring and coaching are available, and processes, such as Peer Assistance and Review systems, are in place to support due process and timely decision making by an appropriate body.

And an excerpt from the final summary:

With respect to value-added measures of student achievement tied to individual teachers, current research suggests that high-stakes, individual-level decisions, as well as comparisons across highly dissimilar schools or student populations, should be avoided. Valid interpretations require aggregate-level data and should ensure that background factors, including overall classroom composition, are taken into account.
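To make the mechanics concrete, here is a minimal toy sketch of the value-added idea the paper describes: regress each student's current score on their prior score, then average the residuals by teacher. This is not any state's actual model, and all numbers are simulated for illustration only.

```python
# Toy value-added sketch: regress current score on prior score,
# then treat each teacher's mean residual as their "value-added".
# All data below is simulated; the true teacher effects are built in.
import random

random.seed(0)

# Simulate 3 hypothetical teachers, 30 students each.
students = []
for teacher_id, true_effect in enumerate([2.0, 0.0, -2.0]):
    for _ in range(30):
        prior = random.gauss(50, 10)
        noise = random.gauss(0, 5)          # everything the model cannot see
        current = 5 + 0.9 * prior + true_effect + noise
        students.append((teacher_id, prior, current))

# Ordinary least squares of current on prior, computed by hand.
n = len(students)
mean_p = sum(s[1] for s in students) / n
mean_c = sum(s[2] for s in students) / n
slope = (sum((s[1] - mean_p) * (s[2] - mean_c) for s in students)
         / sum((s[1] - mean_p) ** 2 for s in students))
intercept = mean_c - slope * mean_p

def value_added(teacher_id):
    """Mean residual (actual minus predicted score) for one teacher's students."""
    resids = [c - (intercept + slope * p)
              for t, p, c in students if t == teacher_id]
    return sum(resids) / len(resids)

scores = [value_added(t) for t in range(3)]
print(scores)
```

The researchers' point is that in real classrooms the "noise" term above is not random: it bundles class composition, home factors, and school effects, so the mean residual attributed to a teacher is far less clean than this toy makes it look.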

And this from the Economic Policy Institute (EPI) report Problems with the Use of Student Test Scores to Evaluate Teachers, an excerpt from the summary:

While those who evaluate teachers could take student test scores over time into account, they should be fully aware of their limitations, and such scores should be only one element among many considered in teacher profiles.

Some states are now considering plans that would give as much as 50% of the weight in teacher evaluation and compensation decisions to scores on existing poor-quality tests of basic skills in math and reading. Based on the evidence we have reviewed above, we consider this unwise.

If the quality, coverage, and design of standardized tests were to improve, some concerns would be addressed, but the serious problems of attribution and nonrandom assignment of students, as well as the practical problems described above, would still argue for serious limits on the use of test scores for teacher evaluation.

And in the Review of “Learning About Teaching” by Jesse Rothstein for the National Education Policy Center (NEPC):

The Bill & Melinda Gates Foundation’s “Measures of Effective Teaching” (MET) Project seeks to validate the use of a teacher’s estimated “value-added”—computed from the year-on-year test score gains of her students—as a measure of teaching effectiveness. Using data from six school districts, the initial report examines correlations between student survey responses and value-added scores computed both from state tests and from higher-order tests of conceptual understanding. The study finds that the measures are related, but only modestly. The report interprets this as support for the use of value-added as the basis for teacher evaluations. This conclusion is unsupported, as the data in fact indicate that a teacher’s value-added for the state test is not strongly related to her effectiveness in a broader sense. Most notably, value-added for state assessments is correlated 0.5 or less with that for the alternative assessments, meaning that many teachers whose value-added for one test is low are in fact quite effective when judged by the other. As there is every reason to think that the problems with value-added measures apparent in the MET data would be worse in a high-stakes environment, the MET results are sobering about the value of student achievement data as a significant component of teacher evaluations.
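Rothstein's point about a 0.5 correlation can be illustrated with a small simulation: even when two tests agree that much, a large share of teachers rated in the bottom quartile on one test land in the upper half on the other. The numbers below are purely simulated and are not the MET data.

```python
# Toy illustration: two noisy value-added measures of the same underlying
# effectiveness, constructed so their correlation is 0.5 by design.
import random

random.seed(1)

pairs = []
for _ in range(10_000):
    shared = random.gauss(0, 1)        # common "true effectiveness" component
    a = shared + random.gauss(0, 1)    # value-added on test A
    b = shared + random.gauss(0, 1)    # value-added on test B
    pairs.append((a, b))
# corr(a, b) = 1 / (1 + 1) = 0.5 by construction

a_sorted = sorted(p[0] for p in pairs)
b_sorted = sorted(p[1] for p in pairs)
a_q1 = a_sorted[len(a_sorted) // 4]    # bottom-quartile cutoff on test A
b_med = b_sorted[len(b_sorted) // 2]   # median on test B

low_on_a = [p for p in pairs if p[0] <= a_q1]
also_high_on_b = [p for p in low_on_a if p[1] > b_med]
share = len(also_high_on_b) / len(low_on_a)
print(f"{share:.0%} of bottom-quartile teachers on test A are above the median on test B")
```

In a run like this, roughly a fifth to a quarter of the "low" teachers on one measure come out above average on the other, which is the kind of disagreement Rothstein argues makes high-stakes use of a single score so troubling.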

I will end this look at research and policy papers with two personal views.

First, from a bookstore owner, in Picture Books No Longer a Staple for Children:

The economic downturn is certainly a major factor, but many in the industry see an additional reason for the slump. Parents have begun pressing their kindergartners and first graders to leave the picture book behind and move on to more text-heavy chapter books. Publishers cite pressures from parents who are mindful of increasingly rigorous standardized testing in schools.

“Parents are saying, ‘My kid doesn’t need books with pictures anymore,’ ” said Justin Chanda, the publisher of Simon & Schuster Books for Young Readers. “There’s a real push with parents and schools to have kids start reading big-kid books earlier. We’ve accelerated the graduation rate out of picture books.”

And from a teacher’s point of view, an excerpt from Why I Will Not Teach to the Test:

In the midst of controversy surrounding “value added” teacher assessment, which flared recently following the Los Angeles Times’ public teacher rankings, the real issue is often overlooked: The state tests being used to evaluate student progress—and, in turn, the effectiveness of teachers—virtually ensure mediocrity.

Consider the following California 10th-grade-history standard: “Relate the moral and ethical principles in ancient Greek and Roman philosophy, in Judaism, and in Christianity to the development of Western political thought.” How long do you think it would take to teach this standard before a classroom of 16-year-olds reached a thorough understanding? Weeks? Months? Consider another social studies standard: “Compare and contrast the Glorious Revolution of England, the American Revolution, and the French Revolution and their enduring effects worldwide on the political expectations for self-government and individual liberty.” How much time for this unit? A semester? A year? I am sure that history teachers would love to have the opportunity to delve deeply into these standards, but the state test does not permit deeper instruction. Why? Because these two standards come from a much longer list of standards that will be measured on the exam. Teachers in California know the results of this exam may now be used as a factor in their evaluations, so they are forced to accelerate their instruction into “sprint and cover” mode.

My highest priority is to design lessons that enable my kids to think critically and to give them the skills they will need to live productive lives. I want my students to grow up to be problem-solvers, not test-takers. I want them to be innovators, not automatons.

What harm comes from a sprint-and-cover approach? A study published in the journal Science Education in December 2008 looked at two sets of high school science students. One set “sprinted”; the other set had teachers who slowed down, went deeper, and did not cover as much material. The results? The first group of students actually scored higher on the state tests at the end of the year. This is not surprising, as their teachers covered more of the test material. I am sure it made their parents, teachers, and administrators happy. What is more interesting, however, is that the students who learned through the slower, in-depth approach actually earned higher grades once they made it to college. This, too, is not surprising. These students were taught to think critically.

Dora