Thursday, September 27, 2012

The road safety chart in De Morgen 27/09/2012

The front page of "De Morgen" today (27/09/2012) features an article on public transportation and its relationship with road safety. The headline says "The less public transport, the more victims in traffic". To support this claim, the front page shows a chart that I found difficult to understand. I had to look at it for several minutes before I understood what was going on. But I'll let you be the judge:

First let me say that I applaud the fact that such an important subject is covered on the front page. Secondly, I'm always happy when I see that an article supported by statistical material is so prominent in the news. That said, there are quite a few problems with the chart:
  • The circles represent the different statistics, but each has its own scale, so you need to be very careful how you compare them.
  • The independent variables (different measures of public transport use in the cities) are ordered from small to large, except the first one (the percentage of inhabitants using a car for work- and school-related trips). I guess that's because all the other independent variables measure the popularity of public transport in some way and the journalist wanted a hit parade of public-transport-friendly cities.
  • The dependent variable is the number of victims per 10,000 inhabitants. Its pies are all bigger than the other circles, although there is no reason for this other than that it is the variable of interest.
  • The dependent variable is not ordered from low to high or from high to low; instead, the position of the pies represents latitude and longitude. It took me a while to realize that, because there is no underlying map of Belgium or Flanders behind it.
  • The colors represent the cities, but more cities (and hence colors) appear in the chart for the number of victims than in the charts for the independent variables.
  • Even for a trained eye it is not at all obvious that there is a strong correlation between use of public transport and road safety.  
  • I'm not a transport expert, but I seriously doubt that the percentage of inhabitants with a subscription to a public transport service is a valid indicator for Hasselt. That city has basically free public transportation. So either that number is not correct, or it is low precisely because public transport is free for the inhabitants of Hasselt.
By the way, I don't question the thesis that good public transportation has an impact on road safety as measured by the number of casualties; I just think the graph is not clear and, as far as I can judge, does not support the thesis.

Is there a better way of representing the data? It hardly takes anything fancy: in my mind, a simple table still does best in this case:

At least it reveals that lots of figures are missing. Another, more colourful option is a series of horizontal bar charts with the bars sorted on the main variable of interest (i.e. victims). In this case I'm using the complement of the percentage of inhabitants that use a car for work or school, so that all independent variables have the same direction. The graph is produced with Tableau, the visualization software that iVOX, the company I work for, is experimenting with for its reporting needs.
This chart shows that the correlation the journalist wants to convey is not all that clearly present.
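For readers who want to experiment, here is a minimal sketch of the same idea in Python with matplotlib instead of Tableau. The city names and figures are invented placeholders, not the data from the article.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented placeholder data: victims per 10,000 inhabitants and one
# indicator, the complement of the % using a car for work/school trips.
cities  = np.array(["City A", "City B", "City C", "City D"])
victims = np.array([6.1, 4.3, 7.8, 5.0])
non_car = np.array([55.0, 68.0, 48.0, 60.0])  # 100 - % car use

# Sort every panel on the main variable of interest (victims).
order = np.argsort(victims)

fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].barh(cities[order], victims[order])
axes[0].set_title("Victims / 10,000 inh.")
axes[1].barh(cities[order], non_car[order])
axes[1].set_title("% not using a car")
plt.tight_layout()
plt.show()
```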

Finally, you could assume that all independent variables are indicators of a latent variable that represents the usage and availability of public transport. There are many statistical techniques that can be used for that. In this case, with such a low number of observations and so many missing values, I prefer a very simple approach (a minimal code sketch follows the list):
  • First, I use the complement of the percentage of inhabitants that use a car for school or work.
  • Secondly, I rescaled all independent variables to z-scores. A z-score is obtained by subtracting the average from each observation and dividing by the standard deviation. The variables can then be compared between cities and across the different measures. Negative z-scores are values below the average; positive z-scores are values above it.
  • To deal with the missing values, I calculated an overall score that measures the "public transport"-friendliness of a city by taking the median over all non-missing z-scores.
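For concreteness, here is a minimal sketch of these three steps in Python. The figures are invented placeholders, not the data from the article; missing values are coded as NaN.

```python
import numpy as np

# Invented placeholder data: rows = cities, columns = indicators of
# public transport use. The first column is already the complement of
# the % using a car for work or school; NaN marks missing values.
indicators = np.array([
    [55.0, 12.5, np.nan],
    [68.0, np.nan, 41.0],
    [48.0, 20.0, 55.0],
    [np.nan, 8.0, 30.0],
])

# z-scores per indicator: subtract the column mean and divide by the
# column standard deviation, ignoring missing values.
z = (indicators - np.nanmean(indicators, axis=0)) / np.nanstd(indicators, axis=0)

# Compound "public transport"-friendliness per city: the median over
# that city's non-missing z-scores.
compound = np.nanmedian(z, axis=1)
print(compound)
```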
The correlation between that "public transportation" compound variable and the relative number of road victims is -.21. So it is in the predicted direction, but it is also low: variation in our compound score of public transport explains only about 4.5% of the variation in road safety.
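The explained-variation figure is simply the squared correlation:
$$
r^2 = (-0.21)^2 = 0.0441,
$$
i.e. roughly the 4.5% quoted above.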

Finally, we've plotted the road safety figures on a map. The surface of the circles represents the relative number of victims, and the color coding represents the compound "public transportation" variable, with red representing a low value and the other end of the color scale a high value of public transportation use in a city, taken over all indicators. Values in the middle are grey. Cities that did not have any indicators for public transportation, but only a value for road safety, are grey as well.
The map shows that, based on the data presented by the journalist, there is some variation in road safety, but that public transportation plays only a minor role in explaining it.
Again, it's perfectly possible that there is a relationship between the use of public transportation and road safety. It's just that the material presented in today's article in De Morgen does not support this conclusion very strongly. Furthermore, the way the statistics were shown created more confusion than it helped support the journalist's claim.

Sunday, September 16, 2012

The Dutch elections and opinion polls: Size is important.

The 12th of September was election day for The Netherlands. A lot has been said on the opinion polls that were abundantly present in the media this year. Here are a couple of my thoughts.

To start with, the media, or at least part of it, was very quick to dismiss the polls. Peter Vandermeersch of the NRC newspaper tweeted that "the first casualties of the elections in The Netherlands were the polls. Can we stop with this now and forever?". His reaction was remarkable in that he made it at 9.40 pm, at a moment when the actual results were not in yet. He must have based his reaction on ... the Exit Poll that was presented at 9.00 pm.

Another thing that strikes me is that, unlike many other countries, The Netherlands has a tradition of reporting opinion polls in terms of the number of seats a political party will get, rather than simple proportions of the electorate. As a consequence, opinion polls are judged by the number of seats they got wrong. The excellent website www.peilloos.nl has an overview here. The question is: is this fair? Clearly, the number of seats a poll got wrong is, politically speaking, an easy-to-interpret yardstick. Let's try to formalize this a little bit. Suppose we have $P$ parties and $S$ seats to distribute. Call $s_p$ the estimated number of seats for party $p$, and $E(s_p)$ the actual result of the elections. We can then write the usual Dutch yardstick as:
$$
D=\sum_{p=1}^P |s_p - E(s_p)|
$$
It is not clear what value of $D$ is considered acceptable. Apparently, for this election all $D$'s were higher than during the last elections. Basically, an opinion poll is judged on how politically substantial the difference is. From a political point of view, I think this is fair.
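As an illustration, $D$ is trivial to compute. A small Python sketch, with invented seat numbers rather than actual poll results:

```python
def seat_difference(predicted, actual):
    """Dutch yardstick D: the sum of absolute seat differences over parties."""
    return sum(abs(s - e) for s, e in zip(predicted, actual))

# Invented example: four parties, 150 seats in total.
predicted = [45, 38, 30, 37]
actual    = [41, 40, 33, 36]
print(seat_difference(predicted, actual))  # 4 + 2 + 3 + 1 = 10
```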
But what happens if we look at this from a statistical point of view? To start with, instead of taking the absolute value, differences are typically squared, which penalizes larger differences more heavily than smaller ones. Furthermore, the (squared) differences are expressed relative to the actual results. Indeed, a difference of 1 seat is more important for a smaller party than it is for a bigger party. Finally, statisticians will appreciate that you somehow need to take sample size into account, and would therefore rather use the number of people in the sample who have indicated that they would vote for a certain party. Let's call $f_p$ the number of respondents, in a sample of size $n$, who say they would vote for party $p$. The actual election results then need to be rescaled to the same total sample size, giving $E(f_p)$:
$$
 E(f_p) \approx {E(s_p) \over S}\times n.
$$ 
Notice that we are making a rather big assumption here, hence the use of the $\approx$ sign rather than an $=$ sign: we assume the seats are allocated proportionally to the votes received. Formally speaking, this assumption is equivalent to assuming a Gallagher index close to zero:
$$
G=\sqrt{{1\over 2} \sum_{p=1}^P(V_p-{E(s_p) \over S}\times 100)^2} \approx 0,
$$
with $V_p$ being the percentage of votes for party $p$. I'm not a political scientist, let alone an expert in the Dutch electoral system, so I have no clue whether this assumption is valid or not.
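To make the definition concrete, here is a small sketch that computes $G$; the vote shares and seat counts are invented.

```python
import math

def gallagher_index(vote_pct, seats, total_seats):
    """G = sqrt( 1/2 * sum over parties of (V_p - seat share in %)^2 )."""
    return math.sqrt(0.5 * sum(
        (v - s / total_seats * 100) ** 2
        for v, s in zip(vote_pct, seats)
    ))

# Invented example: vote percentages and seats out of 150.
votes = [27.0, 25.5, 20.0, 27.5]
seats = [41, 38, 30, 41]
print(gallagher_index(votes, seats, 150))  # close to 0: roughly proportional
```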

Statistically inclined readers will by now of course have realized that I'm talking about the $\chi^2$ test statistic:
$$
\chi^2=\sum_{p=1}^P {(f_p - E(f_p))^2 \over E(f_p)}.
$$ 
The advantage of using $\chi^2$ is that it has nice statistical properties that allow you, amongst other things, to calculate probabilities more easily. That way you can take away some of the subjectivity involved in interpreting $D$. The price you pay is that the $\chi^2$ measure itself is probably more difficult to interpret than $D$. But other than that, there is not too much difference between the approaches used by statisticians and by media folks.
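Here is a sketch of the whole computation with scipy: reconstruct the respondent counts from the predicted seats, rescale the actual result to the same sample size, and run the goodness-of-fit test. The seat numbers and the sample size are invented.

```python
import numpy as np
from scipy.stats import chisquare

S = 150                                    # seats to distribute
n = 1000                                   # assumed poll sample size
poll_seats   = np.array([45, 38, 30, 37])  # invented poll prediction
actual_seats = np.array([41, 40, 33, 36])  # invented election result

# f_p: respondents per party, reconstructed from the predicted seats.
f_obs = poll_seats / S * n
# E(f_p): the election result rescaled to the sample size, assuming
# seats are allocated proportionally to votes (Gallagher index ~ 0).
f_exp = actual_seats / S * n

stat, p = chisquare(f_obs, f_exp)
print(stat, p)
```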

Let's look at the last opinion polls in The Netherlands on the 11th of September, right before the elections. First, consider the measure $D$ used in the Dutch media. As said, www.peilloos.nl has an overview here. In the bottom right corner of the overview table you will see that both "De Stemming/Intomart Gfk" and "TNS NIPO" got 24 seats wrong. "Politieke Barometer/Ipsos Synovate" and "Maurice de Hond/Peil.nl" were wrong on 18 seats. The Exit Poll ("Ipsos Synovate") was closest, with 6 seats wrong.
Now let's see what happens if we use a more traditional statistical criterion rather than $D$. As said, the properties of the $\chi^2$ test statistic allow us to easily calculate probabilities with which to judge whether the observed differences are significant or not (given the sample size). Unfortunately, I could not immediately find the sample sizes that all the polls used on 11/9. I'm sure the folks of www.peilloos.nl have them somewhere on their site, but instead of looking for them I re-expressed the problem as follows:
What is the maximum sample size at which the observed differences between the election outcome and the prediction from the poll can still reasonably be attributed to sample variation?
Intuitively, if we have a very small sample, say about 50, even big differences would seem acceptable in that they can be attributed to sample variation. Likewise, with very big samples, say a few thousand, we would expect sample variation to play much less of a role, and hence we would expect a much better result and accept fewer differences than in the smaller samples. Somewhere in between lies a threshold sample size: for all sample sizes smaller than or equal to it, the observed differences can reasonably be attributed to sample variation. We define "reasonably" as the sample size having a $p$-value of 0.05 associated with the $\chi^2$ test statistic. Notice that we assume simple random sampling, regardless of the actual sampling method used in the different polls. The Exit poll, for instance, is, as far as I understand it, a type of cluster sample.

I'm sure you can calculate this analytically, but what I've done is take the results of the 5 polls and estimate the number of respondents for each of the $P$ parties based on the number of seats, assuming a Gallagher index of 0. I've done that for all sample sizes ($n$) between 50 and 4000. I then performed a $\chi^2$ goodness-of-fit test, using the rescaled actual results as the expected distribution. This resulted in 5 $\times$ 3951 $p$-values, which I have plotted as a function of $n$. You can see the results below:
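A minimal sketch of that search for one poll, again with invented seat numbers: increase $n$ until the $p$-value drops below 0.05, and keep the last $n$ for which the differences were still "reasonable".

```python
import numpy as np
from scipy.stats import chisquare

S = 150
poll_seats   = np.array([45, 38, 30, 37])  # invented poll prediction
actual_seats = np.array([41, 40, 33, 36])  # invented election result

threshold = None
for n in range(50, 4001):
    f_obs = poll_seats / S * n             # reconstructed respondents
    f_exp = actual_seats / S * n           # rescaled election result
    if chisquare(f_obs, f_exp).pvalue < 0.05:
        break                              # p decreases with n, so stop here
    threshold = n                          # largest n still "reasonable"

print("threshold sample size:", threshold)
```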
By construction the lines don't cross, so we can say that if the polls all had the same sample size (and assuming a simple random sample), the exit poll would clearly be better than all the others. Of the polls of the 11th of September, Maurice de Hond's seems to be somewhat better than the one from Ipsos Synovate, while on the $D$-criterion both scored 18. The two with the highest score for $D$, i.e. De Stemming and TNS NIPO, also scored worst on the $\chi^2$ criterion, but with TNS NIPO doing somewhat better. The differences between the two measures are caused by the different way bigger differences are accounted for, and by the fact that $\chi^2$ expresses the differences in relative terms.
Another way of interpreting this graph is as follows:
If the polls (the exit poll excluded) were the result of taking a simple random sample of around 500 or fewer, we should conclude that the observed differences can be attributed to sample variation and hence are reasonable given that sample size. For sample sizes above 500, factors other than sample variation have to be called in to explain the differences. There are some marginal differences between the 4 polls, in that the threshold sample size for "De Stemming" is about 400 and the threshold sample size for Maurice de Hond is around 700.
For the exit poll we can conclude that, had it been taken with a simple random sample (which is not the case) larger than about 3500, we should likewise have concluded that something other than sample variation was going on.
The Exit poll is, as I understand it, based on a cluster sample of about 40 electoral districts in which voters are asked to "redo" their vote. I understand there are over 40,000 participants in the Exit poll. Clearly, neither the sample size of 40 nor the sample size of 40,000 can be used in this exercise, because of the clustering effect. Other than that, for the polls that used random sampling, the actual sample size should be used to evaluate the actual performance. But notice also that for all but the exit poll, all differences vanish beyond a sample of about 1000. As I suspect that all polls have sample sizes of at least 1000, we can indeed conclude that sample variation can't be used to explain the differences with the actual election results.

There are a lot of assumptions in this analysis, but nonetheless I believe that the theoretical figure of 3500 for the exit poll, and 500 for all the others, allows us to better appreciate what the role of randomness in polling can amount to.