A Frequentist and a Bayesian go to a bar ...


In the first installment of this blogpost, I illustrated that Fisher's rule of thumb of using $\frac{3}{n}$ for the upper limit of a 95% confidence/credible interval is a good approximation as soon as $n \geq 25$. This was inspired by a blogpost from John D. Cook on the subject.

At the end I made a remark about something odd that happens when $n=1$. Fisher's rule of thumb results in 1, which is not very informative. The Binomial solution is 0.95. With $n=1$ the binomial reduces to an actual Bernoulli trial:

$$P(S_1=0)= {1\choose 0}p^0(1-p)^1=1-p=0.05$$
$$p=1-0.05=0.95.$$
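
In R this boils down to a one-liner (my own check, not part of the original calculation): find the $p$ for which the probability of observing a single failure drops to 0.05.

uniroot(f = function(p) (1 - p)^1 - 0.05, interval = c(0, 1))$root   # approximately 0.95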
Yet, in the Bayesian analysis, the result is $p=0.78$. Why?


First let's recalculate that number in an even simpler, manual way than I showed in the first installment of this blogpost. We know that the distribution we're interested in is the Bernoulli distribution. The likelihood of a failure in that case is $1-p$. The prior is a uniform distribution. Imagine a grid going from 0 to 1 in steps of 0.0001. The prior assigns the same constant to all possible values of $p$; for simplicity's sake, let's take 1. As the likelihood is $1-p$, applying it to the grid yields a series starting at 1, followed by $1-0.0001=0.9999$, then 0.9998, 0.9997, and so on, down to 0. That last value reflects the fact that for $p=1$ the likelihood of observing a failure is 0. The posterior is then trivially the prior multiplied by the likelihood, which, in this case, is exactly the same series as the likelihood. Sampling from this posterior and taking the 95th percentile results in 0.77791, or approximately 0.78, as can be verified with the following one-liner:

# 95th percentile of draws from the grid posterior (weights proportional to 1 - p)
quantile(x=sample(x=seq(from = 0, to = 1, by = 0.0001),    # grid of candidate p values
                size=10000, 
                replace=TRUE, 
                prob=seq(from = 1, to = 0, by = -0.0001)),  # posterior weights: 1 - p
         probs=.95)
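
For readers who prefer to skip the sampling step, the same grid can be used deterministically. The following sketch is my own addition, not part of the original calculation: it accumulates the normalised posterior over the grid and reports the first grid point at which 95% of the probability mass is reached.

p_grid    <- seq(from = 0, to = 1, by = 0.0001)   # same grid as above
posterior <- (1 - p_grid) / sum(1 - p_grid)       # normalised posterior
p_grid[min(which(cumsum(posterior) >= 0.95))]     # approximately 0.7764, in line with the sampled 0.78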

The difference between the Frequentist and Bayesian approach is subtle. Let me illustrate this with the following tale:

Freddy (a Frequentist) and Barry (a Bayesian) go to a bar. After serving them a drink, Ronny, the bartender, has a little quiz for them. They observe one trial, with a failure as the outcome. They don't know $p$, but they need to predict the next outcome. The only thing they know is that there can only be a success (1) or a failure (0), and that the second trial uses the same unknown $p$ as the first trial. Freddy shouts "maximum likelihood" and answers $0$. Barry mumbles something about preferring to answer with a distribution, but the bartender insists on one answer only. Barry then grudgingly agrees, whispers "maximum a posteriori", and answers $0$. They witness a new trial and sure enough the outcome is 0 again. Freddy and Barry do a happy dance and continue drinking. "Not so fast", says the bartender, and he asks how sure they were after the first trial. Freddy, who is a Frequentist, answers first and says: "I have a 95% confidence interval from 0 to 0.95, so if we were to repeat this exercise 100 times under the same conditions, I would expect the true $p$ to be in a similarly constructed confidence interval in 95 of the 100 cases, so I'm pretty sure." Barry, who is a Bayesian, is a bit more thoughtful and takes his time to answer. He jots a few numbers on a napkin and finally says: "I have a 95% credible interval from 0 to 0.78, so there is a 95% probability that the true parameter value falls between 0 and 0.78." The bartender now needs to decide whom he will crown the winner. Luckily Ronny happens to know some R from a previous job, so he decides to simulate a whole series of quizzes. He heard Barry mention a uniform prior, so he decides, for each simulated quiz, to pick a $p$ uniformly at random between 0 and 1. Next, just like in the actual quiz, he selects only those quizzes whose first trial is a failure, and for each of these he runs a second trial using the same $p$. He then counts the number of times the second trial is a failure, because that's what Freddy and Barry would predict each time. Finally, he expresses the counts as proportions. To avoid any discussion he decides to also consider the complete set of outcomes, i.e. those with a failure in the first trial and those with a success in the first trial.
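
Ronny's actual script isn't shown here, but a minimal R sketch along the lines he describes could look like this (the seed and the variable names are mine):

set.seed(123)                                   # for reproducibility (my choice)
n_quiz <- 10000                                 # number of simulated quizzes
p      <- runif(n_quiz)                         # one uniformly drawn p per quiz
trial1 <- rbinom(n_quiz, size = 1, prob = p)    # first trial
trial2 <- rbinom(n_quiz, size = 1, prob = p)    # second trial, same p
mean(trial2[trial1 == 0] == 0)                  # failures in trial 2, given a failure in trial 1
mean(trial2 == 0)                               # failures in trial 2, irrespective of trial 1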

After running 10,000 simulated quizzes, Ronny gets a proportion of 0.6602 of second-trial failures among the quizzes in which the first trial was a failure. The proportion of simulated quizzes in which the second trial was a failure, irrespective of the outcome of the first trial, was 0.5028.

When Ronny sees that the proportion of simulated quizzes where the second trial is a failure irrespective of the outcome of the first trial is approximately 0.5, he quickly realizes why:
$$ E(X)=\int_0^1 E(X|p)f(p)dp,$$
in which $X$ is the Bernoulli random variable,  representing the outcome of the second trial in an experiment, $E(X|p)$ is the expected value of the Bernoulli random variable $X$ given a specific value of $p$, and $f(p)$ is the probability density function of $p$. Because the expected value of a Bernoulli random variable $X$ given a specific value of $p$ is trivially equal to $p$, and since $p$ is uniformly distributed over $[0, 1]$, $f(p)=1$, we now get:
$$E(X)=\int_0^1p \times 1~dp$$
$$ E(X)=\left[\frac{p^2}{2}\right]_0^1=\frac{1^2}{2}-\frac{0^2}{2}=\frac{1}{2}=0.5.$$
So, under these conditions, if you repeated the quiz many times, you would expect about as many failures as successes, just like Ronny observed.
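
If you prefer to let R do the integral, a quick check (my addition) with the built-in integrate() function reproduces the 0.5:

integrate(function(p) p * 1, lower = 0, upper = 1)   # E(X) under a uniform f(p): 0.5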

If you first select only the cases where the first trial was a failure - just like in the original quiz - things get a tiny bit more complicated. Let's call the outcome of the first Bernoulli trial $X_1$ and the second $X_2$. We now want to know $E(X_2|X_1=0)$, from which we can then derive the (expected) proportion of failures in the second trial among those selected quizzes. Let's start by using the law of total expectation again, this time conditioning on $p$ given the observed failure:
$$E(X_2|X_1=0)=\int_0^1E(X_2|p)\times f(p|X_1=0)dp.$$
Given $p$, the second trial no longer depends on the first, so $E(X_2|p)=p$. That leaves $f(p|X_1=0)$, which we can work out separately:
$$f(p|X_1=0)=\frac{f(X_1=0|p)\times f(p)}{f(X_1=0)},$$
using Bayes' rule. $f(X_1=0|p)$ is the probability of $X_1$ being a failure given $p$, which is $1-p$. In this context $f(X_1=0)$ is the marginal probability of $X_1$ being a failure. To express that part, let's start with the law of total probability:
$$f(X_1=0)=\int_0^1f(X_1=0|p)\times f(p)dp$$
All elements in that equation were discussed before so we can perform the integration:
$$f(X_1=0)=\int_0^1 (1-p) \times 1 dp=\left[p - \frac{p^2}{2}\right]_0^1=\left[1 - \frac{1}{2}\right] -\left[0 - \frac{0^2}{2}\right]=\frac{1}{2}.$$
All of this leads to:
$$f(p|X_1=0)=\frac{(1-p) \times 1 }{\frac{1}{2}}=2(1-p).$$
Now we can go back to $E(X_2|X_1=0)$ and write:
$$E(X_2|X_1=0)=\int_0^1 p \times 2(1-p) dp,$$
$$E(X_2|X_1=0)=2\int_0^1 (p-p^2) dp,$$
$$E(X_2|X_1=0)=2\left[\frac{p^2}{2}-\frac{p^3}{3}\right]_0^1=2 \left[ \left(\frac{1}{2}-\frac{1}{3}\right) - \left(\frac{0}{2}-\frac{0}{3} \right)\right]=2\left[\frac{1}{6}\right]=\frac{1}{3}.$$

If $E(X_2|X_1=0)=\frac{1}{3}$, the probability of a failure for $X_2$, given a failure for $X_1$, is $1-\frac{1}{3}=\frac{2}{3}\approx0.6667$, again very close to what Ronny observed.
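The same integrate() check (again my addition) works here as well:

integrate(function(p) p * 2 * (1 - p), lower = 0, upper = 1)   # E(X_2 | X_1 = 0): 1/3
1 - 1/3                                                        # probability of a failure in trial 2: 2/3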

Before we continue, let me add a comment from Romke Bontekoe on an earlier version of this blogpost. He remarked that Ronny was not the first to figure this out. Pierre-Simon Laplace, who lived from 1749 to 1827, had already established the rule of succession, which states that:
$$P(X_{n+1}=1|X_1+X_2+ \dots+X_n=s)=\frac{s+1}{n+2}, $$
in which $s$ is the number of successes and $n$ the number of trials. Applied to this case, where we are interested in the probability of a failure, given 1 previous failure, we get:
$$P(X_2=0| X_1=0)=1-P(X_2=1| X_1=0)$$
$$P(X_2=0| X_1=0)=1-\frac{0+1}{1+2}=1-\frac{1}{3}=\frac{2}{3}.$$
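
To make the rule concrete in R (the helper function below is just my own naming, not anything from Laplace or the original post):

rule_of_succession <- function(s, n) (s + 1) / (n + 2)   # P(next trial is a success)
1 - rule_of_succession(s = 0, n = 1)                     # P(failure after one failure): 2/3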

So now Ronny understands where his results are coming from, but does that help him decide who the winner will be? He decides to look at the 95th percentile of the distribution of the generated $p$'s, both when all $p$'s are considered and when only those $p$'s are considered that returned a failure on the first trial. The results are 0.9478 and 0.7829 respectively. These numbers are very close to the 0.95 and 0.78 that Freddy and Barry had mentioned. Ronny thus concludes that, while both Freddy and Barry answered 0, Barry gets the advantage for properly taking the result of the first trial into account. Freddy didn't learn anything from that first trial. If we did the same exercise, but had witnessed a success in trial 1, and only continued with the $p$'s that led to that success, Freddy would still insist on a 95% confidence interval from 0 to 0.95 for a failure in trial 2, while Barry would adjust his credible interval so that it would now go from 0 to 0.9746 instead of from 0 to 0.78.
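
Reusing the objects from the simulation sketch above (again my own code, not Ronny's), those percentiles can be computed as:

quantile(p, probs = 0.95)                 # all p's: close to 0.95
quantile(p[trial1 == 0], probs = 0.95)    # p's after a first-trial failure: close to 0.78
quantile(p[trial1 == 1], probs = 0.95)    # p's after a first-trial success: close to 0.9746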

What can this be used for? Not much, I will admit, but it shows that, sometimes, even if you only get a sample of 1, you can already come to some conclusions.
A second thing we can learn from this is that you don't need integrals for Bayesian analysis; often you can just rely on simulations or other alternatives. One of the reasons why I only picked up Bayesian statistics at a later age is that, when I was younger, as soon as a paper used integrals instead of summations, I would mentally block and most often give up. But, if you're careful about the books and articles you read, Bayesian statistics is also accessible to people who are less fond of integrals.

