Regraph of the day: Gates Foundation on teacher evaluation


Having produced a graph or two, I’m aware of the challenges that come with presenting data fairly. Choices need to be made about what type of graph to use, how to scale it, and so on. As such, I’ve grown sensitive to the choices others make in presenting graphs. Today I’m going to tackle a recent problematic example.

Let’s start with the graph:

(Image originally from the Gates Foundation’s Measures of Effective Teaching Project.)

This is from the Gates Foundation, which wants to design a better teacher evaluation system. In the words of their report, this graph of English/Language Arts scores “compare[s] the actual 2010-11 school year achievement gains for randomly assigned classrooms with the results that were predicted based on the earlier measures of teaching effectiveness.” They conclude that their predictions were fairly accurate. Let’s dive into that.

The first thing to notice is the scale of each axis. Both axes measure the same quantity (standard deviations in test scores), yet the horizontal axis is stretched far more than the vertical one. Stretching the horizontal axis shrinks the cloud's vertical spread relative to its horizontal spread, so the points appear to hug the trend more tightly than they actually do. To see what I mean, here's a rescaled version of the graph that puts the two axes in equal proportion.
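To make the distortion concrete, here's a toy sketch (made-up numbers, not the MET data) that computes how much vertical spread a point cloud shows on screen relative to its horizontal spread, under two axis scalings: a stretched horizontal axis versus equal scaling.

```python
# Toy illustration: the same data, rendered at two different axis scalings.
# A small point cloud in data units (both axes span the same 0.5-unit range).
xs = [0.0, 0.1, 0.2, 0.3, 0.4]
ys = [0.05, 0.35, 0.10, 0.30, 0.20]  # noisy, weak trend

def pixel_spread_ratio(x_range_units, y_range_units, width_px, height_px):
    """On-screen vertical spread of the cloud divided by its horizontal spread."""
    x_px_per_unit = width_px / x_range_units
    y_px_per_unit = height_px / y_range_units
    x_spread_px = (max(xs) - min(xs)) * x_px_per_unit
    y_spread_px = (max(ys) - min(ys)) * y_px_per_unit
    return y_spread_px / x_spread_px

# A stretched horizontal axis (wide, short panel), like the original chart:
stretched = pixel_spread_ratio(0.5, 0.5, 800, 200)
# Equal scaling (square panel), like the regraph:
equal = pixel_spread_ratio(0.5, 0.5, 400, 400)

print(stretched, equal)  # 0.1875 vs 0.75
```

The identical data looks four times "tighter" in the wide panel, which is exactly the visual trick at work here.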

Things don’t look quite so dramatic anymore. Now, check out that white line. A line on a scatter plot is usually a line of best fit. Not this time: that’s the prediction line, the line the data would follow if the earlier measures predicted gains perfectly. Of course, drawing it encourages your eye to see the dots as following it. So let’s take it out.
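The distinction matters because a best-fit line is computed from the data, while a prediction line is imposed on it. A toy least-squares sketch (made-up numbers, not the MET data) shows how far apart the two can be:

```python
# Toy data where the imposed prediction line is y = x (slope 1),
# but the least-squares fit to the actual points is much shallower.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.5, 0.6, 1.4, 1.2, 1.8]

def ols_slope(xs, ys):
    """Ordinary least-squares slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

fitted = ols_slope(xs, ys)
print(fitted)  # 0.32, far from the assumed slope of 1
```

Overlaying the slope-1 line on data like this would make the fit look better than it is; only the fitted slope tells you what the points actually do.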

Next, notice the two extreme points at the high and low ends. Each represents 5% of teachers, and together they encourage you to read the rest of the dots as a line connecting them. But what happens if we take those out, leaving the middle 90% of teachers?

We’ve got a muddled middle, suggesting these predictions aren’t very good for 9 out of 10 teachers. One last point: Condensing the whole data set down to these 5% average points further “cleans” the data. To see what I mean, check out this post by Gary Rubinstein.
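To see why two far-out points can dominate the picture, here's a toy sketch (made-up numbers, not the MET data): the middle points are essentially noise, yet adding two extreme points on the line drives the correlation close to 1.

```python
# Toy data: five noisy middle points plus two extreme points on the line y = x.
x_full = [-10.0, -1.0, -0.5, 0.0, 0.5, 1.0, 10.0]
y_full = [-10.0, 0.5, -0.8, 0.9, -0.6, 0.3, 10.0]

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r_full = pearson(x_full, y_full)          # extremes included
r_middle = pearson(x_full[1:-1], y_full[1:-1])  # middle points only

print(r_full, r_middle)  # ~0.99 with extremes, ~0 without
```

The two endpoints alone manufacture a near-perfect correlation; drop them and there is essentially no relationship left, which is the "muddled middle" problem in miniature.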

All of which is to say we still don’t know much about how to evaluate teachers in a way that predicts test scores. Not that it would likely be a game changer if we did.