Sunday, June 10, 2012

Uplace and the binomial distribution

Currently there is a debate in Belgium about the construction of Uplace, a huge shopping mall on the outskirts of Brussels. A marketing professor at the Vlerick School, Gino Van Ossel, did a survey which showed that, contrary to popular belief, quite a big group was actually in favor of the plans rather than opposed to them. So far so good, but the professor's methodology was questioned in the media. The arguments used against the survey findings were not very strong, I believe, and I will not discuss them here.

However, a part of the reasoning used by Gino Van Ossel looked rather odd to me and made me think about a more general problem that I would like to discuss here. Those of you who understand Dutch can find all the details on In short, he found that in a sample of 654, representing Belgium as a whole, 33% were in favor, while in the region where the shopping mall would be located, with a sample of 182, 46% were in favor. This went against the popular belief that people in the neighborhood were strongly against the shopping mall. As a consequence, all kinds of arguments were used to undermine the study. Most of those arguments were not very convincing, in my opinion. One of them was that, based on about 80 persons in favor of the shopping mall (46% of 182), you can't make general statements about that part of Belgium. Of course, from a statistical perspective you can, provided you accept a certain precision at a certain confidence level.
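To make "a certain precision at a certain confidence" concrete, here is a minimal sketch that computes the standard error and an approximate margin of error for the two survey proportions quoted above. The normal approximation, the 95% level (z = 1.96) and the function name `margin_of_error` are my assumptions for illustration; the survey itself may have used a different method.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Return (standard error, margin of error) for a sample proportion.

    Uses the normal approximation to the binomial; z=1.96 corresponds to
    an assumed 95% confidence level.
    """
    se = math.sqrt(p * (1 - p) / n)
    return se, z * se

# Figures quoted in the post: national sample and regional sample.
for label, p, n in [("Belgium", 0.33, 654), ("Uplace region", 0.46, 182)]:
    se, moe = margin_of_error(p, n)
    print(f"{label}: p={p:.2f}, n={n}, SE={se:.4f}, margin=+/-{moe:.4f}")
```

Under these assumptions the regional margin comes out at roughly 7 percentage points: wider than the national one, but far from useless.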

The part that caught my attention, however, was when Gino Van Ossel referred to an election study with a sample of 1024 in which statements were made about one particular electorate (the Green Party), which holds about 9% of the votes, or about 92 respondents in that sample. His argument was that if you accept statements based on such a small number of people in that study, you should also accept statements from other studies using a similarly small number of people.

And that's the point where I don't agree anymore. For a given confidence level, the precision depends not only on the sample size but also on the proportion of successes. Let's try to formulate that a bit more formally. Assume we have a small sample of size $n_1$ and a large sample of size $n_2$ from the same population ($n_1 < n_2$), but the number of successes is equal ($S_1=S_2=S$). The question is what happens to the standard error in both cases. Obviously $p_1={S \over n_1}>p_2={S\over n_2}$. For simplicity's sake we will, for the moment, assume that both $p_1$ and $p_2$ are below $0.50$. On the one hand, as $n_1<n_2$, the sample-size effect alone makes the standard error larger in the first case ($SE_1>SE_2$). On the other hand, as $p_2<p_1<0.50$, the product $p(1-p)$ is also larger in the first case, which pushes in the same direction ($SE_1>SE_2$).

Formally, we can say that:
$$SE_1=\sqrt{p_1(1-p_1)\over n_1}$$ and
$$SE_2=\sqrt{p_2(1-p_2)\over n_2},$$
with $SE_1$ and $SE_2$ representing the standard errors of the two cases. Let's now consider the ratio of these two standard errors:
$${SE_2\over SE_1}={\sqrt{p_2(1-p_2)\over n_2}\over\sqrt{p_1(1-p_1)\over n_1}}$$
For convenience's sake we'll square both sides and re-express
$${SE_2^2\over SE_1^2}={ n_1 p_2(1-p_2)\over n_2 p_1(1-p_1)}$$
Let's call the ratio of $n_2$ to $n_1$ $k$, i.e. $k={n_2\over n_1}$, so that $n_2=k n_1$ and $p_2={S\over k n_1}={p_1 \over k}$. This yields:
$${SE_2^2\over SE_1^2}={ {p_1 \over k}(1-{p_1 \over k}) \over k p_1(1-p_1)} $$
$${SE_2^2\over SE_1^2}={ 1-{p_1 \over k} \over k^2 (1-p_1)} $$
$${SE_2^2\over SE_1^2}={ k-p_1 \over k^3 (1-p_1)} $$
$${SE_2\over SE_1}=\sqrt{ k-p_1 \over k ^3(1-p_1)} $$
Thus in this case we can express the gain or loss in precision in terms of the initial proportion $p_1$ and the relative sample size $k$. In words: the ratio of the standard errors equals the square root of $k-p_1$ (the sample ratio minus the initial probability of a success) divided by $k^3(1-p_1)$ (the third power of the sample ratio times the initial probability of a failure).
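As a sanity check on the algebra, the closed form can be compared with the ratio computed directly from the two standard errors. This is only a sketch: the function names and the test numbers are hypothetical, chosen just to exercise the formula.

```python
import math

def se(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

def se_ratio(p1, k):
    """Closed-form SE_2 / SE_1 when successes are held fixed and n_2 = k * n_1."""
    if k <= 0 or not 0 < p1 < 1:
        raise ValueError("need k > 0 and 0 < p1 < 1")
    if p1 > k:
        raise ValueError("p1 > k: successes would exceed the sample size")
    return math.sqrt((k - p1) / (k ** 3 * (1 - p1)))

def se_ratio_direct(successes, n1, n2):
    """Same ratio computed directly from the two standard errors."""
    return se(successes / n2, n2) / se(successes / n1, n1)

# Illustrative numbers (hypothetical, chosen only to test the algebra).
S, n1, n2 = 30, 120, 300
print(se_ratio(S / n1, n2 / n1), se_ratio_direct(S, n1, n2))
```

Both routes should print the same value up to floating-point rounding.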

We'll use an example that is relatively close to the one discussed by Gino Van Ossel: $n_1=200$, $n_2=1000$, $S_1=S_2=S=80$, and thus $p_1=0.40$, $p_2=0.08$ and $k=5$. Plugging those numbers into the formula yields 0.2477. So a statement about the proportion of respondents opting for the Green Party will have a standard error about 4 times smaller than a statement about the proportion of respondents in favor of Uplace, even though the numbers of 'successes' are close to each other.
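Spelled out, the arithmetic behind that number is as follows. The last line is my own decomposition of the ratio into a pure sample-size effect and a pure proportion effect, added only to show how the two reinforce each other:
$$SE_1=\sqrt{0.40\times 0.60\over 200}=\sqrt{0.0012}\approx 0.0346$$
$$SE_2=\sqrt{0.08\times 0.92\over 1000}=\sqrt{0.0000736}\approx 0.0086$$
$${SE_2\over SE_1}=\sqrt{5-0.40\over 5^3\times 0.60}=\sqrt{4.6\over 75}\approx 0.2477$$
$${SE_2\over SE_1}=\sqrt{1\over k}\,\sqrt{p_2(1-p_2)\over p_1(1-p_1)}=\sqrt{1\over 5}\,\sqrt{0.0736\over 0.24}\approx 0.4472\times 0.5538\approx 0.2477$$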

Since in this case we want to consider two samples with the same number of successes, we can, from a binomial distribution perspective, reformulate the problem as follows: what is the change in standard error if we increase the sample size but keep the number of successes constant? In other words, what happens if we only add failures, not successes? Likewise, we can consider the case where, relative to the original situation, we decrease the sample by taking away failures (and thus leave the number of successes constant).
We can also see that if $p_1>k$ the ratio is not defined, because the expression under the square root becomes negative. As $p_1\le1$ this can only happen when $k<1$, i.e. when we decrease the sample size rather than increase it. Since we are keeping the number of successes constant, we can't decrease the sample any further once $k$ has dropped to $p_1$.
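The same bound can be read directly off the set-up, as a short check in the notation used above: the number of successes can never exceed the size of the sample that contains them.
$$S\le n_2=k\,n_1 \quad\Longleftrightarrow\quad p_1={S\over n_1}\le k$$
At the boundary $k=p_1$ the second sample consists of successes only, so $p_2=1$ and $SE_2=0$.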

Of course, we're not suggesting following this procedure in practice, as it would introduce bias, but it helps to explain why the comparison made by Prof. Van Ossel is not warranted.

That said, we can also think of what happens if we let the value of $p_1$ vary between 0 and 1. Similarly we can inspect what happens as $k$ goes from 0 to 1, i.e. the case of sample sizes smaller than the original, followed by what happens as $k$ increases to plus infinity, i.e. the case of an increasing sample size.  
The picture above illustrates this graphically. In this case we let $k$ vary only from 0 to 4. Left of the line where $k=1$ we effectively look at cases where the sample is decreased by taking out failures (and thus leaving the number of successes constant). It will come as no surprise that the standard error decreases as we increase the sample. As the initial probability $p_1$ becomes higher, the ratio becomes undefined over a wider range of small $k$ (wherever $k<p_1$), and hence it is not drawn there.
It might be difficult to see, but where $k=1$ the ratio of the standard errors is always 1. As we move to values of $k$ higher than 1, the value of the ratio falls below 1, indicating that increasing the sample size generally decreases the standard error. But not always. At very high levels of $p_1$, adding a failure can actually increase the standard error.
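The "but not always" can be made precise. Setting the squared ratio greater than 1 for $k>1$ and rearranging gives $p_1 > {k(k+1)\over k^2+k+1}$, which tends to $2/3$ as $k$ approaches 1. This condition is my own rearrangement of the formula above, not something read off the figure. The sketch below checks it numerically on a small grid; the function names and grid values are illustrative.

```python
import math

def se_ratio(p1, k):
    """SE_2 / SE_1 with successes fixed and n_2 = k * n_1 (requires p1 <= k)."""
    return math.sqrt((k - p1) / (k ** 3 * (1 - p1)))

def threshold(k):
    """Value of p1 above which enlarging the sample (k > 1) raises the SE."""
    return k * (k + 1) / (k ** 2 + k + 1)

for k in (1.05, 1.5, 2.0, 4.0):
    t = threshold(k)
    for p1 in (0.40, 0.70, 0.90, 0.99):
        r = se_ratio(p1, k)
        # The ratio exceeds 1 exactly when p1 lies above the threshold.
        assert (r > 1) == (p1 > t)
        print(f"k={k:<4} p1={p1:.2f} threshold={t:.3f} ratio={r:.3f}")
```

The intuition is that $p(1-p)$ peaks at $p=0.5$: when $p_1$ is high, adding failures pulls the proportion towards 0.5, and that can outweigh the benefit of the larger sample.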

That said, to come back to the initial problem regarding the Uplace shopping mall, the above illustrates that when you compare a sample of about 1000 with about 80 successes to a sample of 200 with 80 successes, you should take into consideration that the precision of the former is about 4 times higher than that of the latter.


  1. The formulas are not showing properly on my iPhone.

  2. I like how a simple observation in the news provoked you to analyze and demonstrate the logic behind this. Wish more people did this :)