Sunday, September 16, 2012

The Dutch elections and opinion polls: Size is important.

The 12th of September was election day for The Netherlands. A lot has been said on the opinion polls that were abundantly present in the media this year. Here are a couple of my thoughts.

To start with, the media, or at least a part of it, was very quick to dismiss the polls. Peter Vandermeersch of the NRC newspaper tweeted that "the first casualties of the elections in The Netherlands were the polls. Can we stop with this now and forever?". His reaction was remarkable in that he made it at 9.40 pm, at a moment when the actual results were not in yet. He must have based his reaction on ... the Exit Poll that was presented at 9.00 pm.

Another thing that strikes me is that, unlike many other countries, The Netherlands has a tradition of reporting opinion polls in terms of the number of seats a political party will get, rather than simple proportions of the electorate. As a consequence, opinion polls are judged in terms of the number of seats they got wrong. The excellent website, www.peilloos.nl, has an overview here. The question is, is this fair? Clearly, the number of seats they got wrong is, politically speaking, an easy-to-interpret yardstick. Let's try and formalize this a little bit. Suppose we have $P$ parties, and $S$ seats to distribute. Call $s_p$ the estimated number of seats for party $p$, and $E(s_p)$ the actual result of the elections for that party. We can then write the usual Dutch yardstick as:
$$D=\sum_{p=1}^P |s_p - E(s_p)|$$
It is not clear what value of $D$ is considered to be acceptable. Apparently, for this election all $D$'s were higher than during the last elections. Basically an opinion poll is judged based on how politically substantial the difference is. From a political point of view, I think this is fair.
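As a small illustration of the yardstick $D$, here is a minimal Python sketch. The party names and seat numbers are purely hypothetical (they are not the 2012 figures), and the function name `seats_wrong` is my own.

```python
def seats_wrong(poll_seats, actual_seats):
    """Return D: the sum over all parties of |s_p - E(s_p)|."""
    if poll_seats.keys() != actual_seats.keys():
        raise ValueError("poll and result must cover the same parties")
    return sum(abs(poll_seats[p] - actual_seats[p]) for p in actual_seats)


# Hypothetical 150-seat parliament with four parties (illustration only).
actual = {"A": 60, "B": 50, "C": 25, "D": 15}
poll = {"A": 54, "B": 52, "C": 28, "D": 16}

print(seats_wrong(poll, actual))
```

Note that when the poll and the result both distribute the same $S$ seats, every seat wrongly given to one party is a seat wrongly taken from another, so $D$ is always an even number.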
But what happens if we look at this from a statistical point of view? To start with, instead of taking the absolute value, typically differences are squared. This amounts to penalizing larger differences more than smaller differences. Furthermore the (squared) differences are expressed relative to the actual results. Indeed, a difference of 1 seat is more important for a smaller party than it is for a bigger party. Finally, statisticians will appreciate that somehow you need to take sample size into account, and therefore would rather use the number of people in the sample who have indicated that they would vote for a certain party. Let's call $f_p$ the number of respondents, in a sample of size $n$, who indicated that they would vote for party $p$. The actual election results then need to be rescaled to the same total sample size, giving $E(f_p)$, with:
$$E(f_p) \approx {E(s_p) \over S}\times n.$$
Notice that we are making a rather big assumption here, and hence the use of the $\approx$ sign rather than a $=$ sign: we assume the seats are allocated proportionally to the votes received. Formally speaking this assumption is equivalent to assuming a Gallagher Index close to zero:
$$G=\sqrt{{1\over 2} \sum_{p=1}^P(V_p-{E(s_p) \over S}\times 100)^2} \approx 0,$$
with $V_p$ being the percentage of votes for party $p$. I'm not a political scientist, let alone an expert in the Dutch electoral system, so I have no clue whether this assumption is valid or not.
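To make the Gallagher Index concrete, here is a minimal Python sketch of the formula above. The vote percentages and seat counts are hypothetical, and `gallagher_index` is simply a name I chose.

```python
import math


def gallagher_index(vote_pct, seats):
    """Gallagher Index: sqrt(1/2 * sum_p (V_p - 100 * E(s_p) / S)^2).

    vote_pct maps party -> percentage of votes (V_p),
    seats maps party -> number of seats obtained (E(s_p)).
    """
    total_seats = sum(seats.values())  # S
    squared = sum(
        (vote_pct[p] - 100.0 * seats[p] / total_seats) ** 2 for p in seats
    )
    return math.sqrt(0.5 * squared)


# Hypothetical numbers (illustration only): 150 seats, four parties.
seats = {"A": 60, "B": 50, "C": 25, "D": 15}
votes = {"A": 39.0, "B": 33.5, "C": 17.0, "D": 10.5}

print(round(gallagher_index(votes, seats), 3))
```

A value of exactly 0 would mean that the seat shares equal the vote shares for every party, which is the situation the approximation for $E(f_p)$ relies on.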

Statistically inclined readers have by now of course realized that I'm talking about the $\chi^2$ test-statistic:
$$\chi^2=\sum_{p=1}^P {(f_p - E(f_p))^2 \over E(f_p)}.$$
The advantage of using $\chi^2$ is that it has nice statistical properties that allow you to more easily calculate probabilities (amongst others). That way you can take away some of the subjectivity involved in interpreting $D$. The price you pay for that is that the $\chi^2$-measure itself is probably more difficult to interpret than $D$. But other than that we see that there is not too much difference in the approaches used by statisticians and media folks.
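Here is a minimal Python sketch of the $\chi^2$ statistic computed from seat counts. It assumes, as in the approximation above, that seats are proportional to votes, and (an assumption of this sketch) that the poll's respondent counts $f_p$ can be reconstructed from its predicted seats in the same way. All numbers are hypothetical.

```python
def chi_square_from_seats(poll_seats, actual_seats, n):
    """Chi-square statistic for a poll of (assumed) sample size n.

    f_p    is reconstructed as s_p / S * n     (assumption of this sketch),
    E(f_p) is approximated as  E(s_p) / S * n  (seats proportional to votes).
    Parties with zero actual seats would give a division by zero.
    """
    total_actual = sum(actual_seats.values())
    total_poll = sum(poll_seats.values())
    stat = 0.0
    for party, actual in actual_seats.items():
        expected = actual / total_actual * n            # E(f_p)
        observed = poll_seats[party] / total_poll * n   # f_p
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 150-seat parliament with four parties (illustration only).
actual = {"A": 60, "B": 50, "C": 25, "D": 15}
poll = {"A": 54, "B": 52, "C": 28, "D": 16}

print(chi_square_from_seats(poll, actual, n=1000))
```

Under these assumptions the statistic is proportional to $n$: the same seat differences give a $\chi^2$ that is twice as large when the sample is twice as large.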

Let's look at the last opinion polls in The Netherlands on the 11th of September, right before the elections. First let's consider the measure $D$ used in the Dutch media. As said, www.peilloos.nl has an overview here. In the bottom right corner of the overview table you will see that both "De Stemming/Intomart Gfk" and "TNS NIPO" got 24 seats wrong. "Politieke Barometer/Ipsos Synovate" and "Maurice de Hond/Peil.nl" were wrong on 18 seats. The Exit Poll ("Ipsos Synovate") was closest with 6 seats wrong.
Now let's see what happens if we use a more traditional statistical criterion rather than $D$. As said, the properties of the $\chi^2$ test-statistic allow us to easily calculate some probabilities that allow us to judge whether the observed differences are significant or not (given the sample size). Unfortunately I could not immediately find the sample sizes used on 11/9 of all polls. I'm sure the folks of www.peilloos.nl have them somewhere on their site, but instead of looking for them I re-expressed the problem as follows:
What would be the maximum sample size that would allow us to say that the observed differences between the election outcome and the prediction from the poll could reasonably be attributed to sample variation?
Intuitively we see that if we have a very small sample, say about 50, even big differences would seem acceptable in that they can be attributed to sample variation. Likewise, with very big samples, say a few thousand, we would expect sample variation to play much less of a role, and hence we would expect a much better result and thus accept smaller differences than in the smaller samples. Somewhere in between lies a threshold sample size for which we can say that the observed differences can reasonably be attributed to sample variation for all sample sizes smaller than or equal to the threshold sample size. We define "reasonably" as a $p$-value of at least 0.05 for the $\chi^2$ test-statistic, so the threshold is the sample size at which the $p$-value equals 0.05. Notice that we assume simple random sampling, regardless of the actual sampling method used in the different polls. The Exit poll, for instance, is, as far as I understand it, a type of cluster sample.
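The threshold idea can be sketched in Python as follows. This is only an illustration under the assumptions already stated (simple random sampling, seats proportional to votes), with hypothetical seat numbers. To keep the block free of external libraries I wrote a small helper, `chi2_sf`, for the upper tail of the $\chi^2$ distribution; it is not taken from any package, and the default range of sample sizes is just a choice made for this sketch.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1.

    Starts from the closed forms for df = 1 (erfc) or df = 2 (exp) and uses
    the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def p_value(poll_seats, actual_seats, n):
    """Goodness-of-fit p-value for an assumed sample size n."""
    total_actual = sum(actual_seats.values())
    total_poll = sum(poll_seats.values())
    stat = 0.0
    for party, actual in actual_seats.items():
        expected = actual / total_actual * n           # rescaled result
        observed = poll_seats[party] / total_poll * n  # rescaled poll
        stat += (observed - expected) ** 2 / expected
    return chi2_sf(stat, df=len(actual_seats) - 1)


def threshold_sample_size(poll_seats, actual_seats,
                          n_min=50, n_max=4000, alpha=0.05):
    """Largest n in [n_min, n_max] whose p-value is still >= alpha."""
    acceptable = [n for n in range(n_min, n_max + 1)
                  if p_value(poll_seats, actual_seats, n) >= alpha]
    return max(acceptable) if acceptable else None


# Hypothetical 150-seat parliament with four parties (illustration only).
actual = {"A": 60, "B": 50, "C": 25, "D": 15}
poll = {"A": 54, "B": 52, "C": 28, "D": 16}

print(threshold_sample_size(poll, actual))
```

Because the statistic grows linearly with $n$ under these assumptions, the $p$-value can only decrease as $n$ increases, so a single threshold exists for each poll.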

I'm sure you can calculate this analytically, but what I've done is take the results of the 5 polls and estimate the number of respondents for each of the $P$ parties based on the number of seats, assuming a Gallagher Index of 0. I've done that for all sample sizes ($n$) between 50 and 4000. Then I have done a $\chi^2$ goodness-of-fit test, using the rescaled actual results as the reference population. This resulted in 5 $\times$ 3951 $p$-values that I've then plotted as a function of $n$. You can see the results below:
By definition the lines don't cross, so we can say that if the polls all had the same sample size (and assuming a simple random sample), the exit poll would clearly be better than all others. From the polls of the 11th of September, Maurice De Hond seems to be somewhat better than the one from Ipsos Synovate, while using the $D$-criterion both scored 18. The two with the highest score for $D$, i.e. De Stemming and TNS NIPO, also scored the worst using the $\chi^2$ criterion, but with TNS NIPO scoring somewhat better. The differences between the two measures are caused by the different way bigger differences are accounted for and by the relative expression of the difference in the case of the $\chi^2$.
Another way of interpreting this graph is as follows:
If the polls (the exit poll excluded) were the result of taking a simple random sample of around 500 or fewer, we should have concluded that the observed differences could be attributed to sample variation and hence were reasonable given that sample size. For sample sizes above 500 we can say that factors other than sample variation should be called in to explain the differences. There are some marginal differences between the 4 polls in that the threshold sample size for "De Stemming" is about 400 and the threshold sample size for Maurice De Hond is around 700.
For the exit poll we can conclude that, if it was taken with a simple random sample (which is not the case) larger than about 3500, we should also have concluded that something else than sample variation was going on.
The Exit poll is, as I understand it, based on a cluster sample of about 40 electoral districts in which voters are asked to "redo" their vote. I understand there are over 40000 participants in the Exit poll. Clearly neither the sample size of 40 nor the sample size of 40000 can be used in this exercise because of the clustering effect. Other than that, for those that have used random sampling, the actual sample size should be used to evaluate the actual performance. But notice also that for all but the exit poll, all differences vanish after a sample of about 1000. As I suspect that all polls have sample sizes of at least 1000, we can indeed conclude that the argument of sample variation can't be used to explain the differences with the actual election results.

There are a lot of assumptions in this analysis but nonetheless, I believe that the theoretical figure of 3500 for the exit poll, and 500 for all others, allow us to better appreciate what the role of randomness in polling can amount to.
