Sunday, September 24, 2023

A Frequentist and a Bayesian go to a bar ...

(Note: you might want to refresh this page in your browser if the equations don't render correctly.) 

In the first installment of this blogpost, I illustrated that Fisher's rule of thumb of using $\frac{3}{n}$ for the upper limit of a 95% confidence/credible interval is a good approximation as soon as $n \geq 25$. This was inspired by a blogpost from John D. Cook on the subject.

At the end I made a remark about something odd that happens when $n=1$. Fisher's rule of thumb results in 1, which is not very informative. The binomial solution is 0.95. When $n=1$ the binomial reduces to its special case, the Bernoulli distribution:

$$P(S_1=0)= {1\choose 0}p^0(1-p)^1=0.05$$
$$= 1(1-p)=0.05$$
$$p=1-0.05=0.95.$$
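As a quick sanity check in R: the general binomial solution from the previous installment, $p=1-\exp(-2.995732/n)$, is the same as $1-0.05^{1/n}$, and for $n=1$ it indeed gives 0.95:

n <- 1
1 - 0.05^(1 / n)   # 0.95, the exact binomial upper limit for n = 1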
Yet, in the Bayesian analysis, the result is $p=0.78$. Why?


First let's recalculate that number in an even simpler, more manual way than I showed in the first installment of this blogpost. We know that the distribution we're interested in is the Bernoulli distribution, so the likelihood of a failure is $1-p$. The prior is a uniform distribution. Imagine a grid going from 0 to 1 in steps of 0.0001. The prior assigns the same constant to every possible value of $p$; for simplicity's sake, let's take 1. Applying the likelihood $1-p$ to the grid yields a series starting at 1, followed by $1-0.0001=0.9999$, then 0.9998, 0.9997, and so on, down to 0. That last value reflects the fact that the likelihood of observing a 0 when $p=1$ is 0. The posterior is then simply the prior multiplied by the likelihood, which in this case is exactly the same series as the likelihood. Sampling from this posterior and taking the 95th percentile results in 0.77791, or approximately 0.78, as can be verified with the following one-liner:

quantile(x=sample(x=seq(from = 0, to = 1, by = 0.0001), 
                size=10000, 
                replace=TRUE, 
                prob=seq(from = 1, to = 0, by = -0.0001)), 
         probs=.95)
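For this simple posterior the percentile can also be computed exactly: the normalized posterior density is $2(1-p)$ (as derived further down), its CDF is $1-(1-p)^2$, and setting that equal to 0.95 gives $p=1-\sqrt{0.05}\approx 0.7764$, which the sampling one-liner approximates:

1 - sqrt(0.05)   # exact 95th percentile of the posterior, ~0.7764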

The difference between the Frequentist and Bayesian approach is subtle. Let me illustrate this with the following tale:

Freddy (a Frequentist) and Barry (a Bayesian) go to a bar. After serving them a drink, Ronny, the bartender, has a little quiz for them. They observe one trial, with a failure as the outcome. They don't know $p$, but they need to predict the next outcome. The only thing they know is that there can only be a success (1) or a failure (0), and that the second trial uses the same unknown $p$ as the first. Freddy shouts "maximum likelihood" and answers 0. Barry mumbles something about preferring to answer with a distribution, but the bartender insists on one answer only. Barry then grudgingly agrees, whispers "maximum a posteriori", and also answers 0. They witness a new trial and sure enough the outcome is 0 again. Freddy and Barry do a happy dance and continue drinking.

"Not so fast," says the bartender, and asks how sure they were after the first trial. Freddy, who is a Frequentist, answers first and says: "I have a 95% confidence interval from 0 to 0.95, so if we were to repeat this exercise 100 times under the same conditions, I would expect the true $p$ to be in a similarly constructed confidence interval in 95 out of the 100 cases, so I'm pretty sure." Barry, who is a Bayesian, is a bit more thoughtful and takes his time to answer. He jots a few numbers on a napkin and finally says: "I have a 95% credible interval from 0 to 0.78, so there is a 95% probability that the true parameter value falls between 0 and 0.78."

The bartender now needs to decide whom he will crown the winner. Luckily Ronny happens to know some R from a previous job, so he decides to simulate a whole series of quizzes. He heard Barry mention a uniform prior, so he decides, for each simulated quiz, to randomly pick a $p$ between 0 and 1 with equal probability. Next, just like in the actual quiz, he selects only those trials that resulted in a failure, and for each of these he runs a new trial using the same $p$. He then counts the number of times the second trial is a failure, because that's what Freddy and Barry would predict each time. Finally he expresses the counts as proportions. To avoid any discussion he decides to also consider the complete set of outcomes, i.e. both those with a failure in the first trial and those with a success in the first trial.

After running 10,000 simulated quizzes Ronny finds that, among the quizzes where the first trial was a failure, the proportion where the second trial was also a failure is 0.6602. The proportion of simulated quizzes where the second trial was a failure irrespective of the outcome of the first trial was 0.5028.
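The blogpost doesn't show Ronny's script, but a minimal sketch of such a simulation could look like this (the object names and the seed are mine, and the exact proportions will vary from run to run):

set.seed(1)                                      # for reproducibility; any seed will do
n_quizzes <- 10000
p      <- runif(n_quizzes)                       # one p per simulated quiz, uniform on [0, 1]
trial1 <- rbinom(n_quizzes, size = 1, prob = p)  # first trial
trial2 <- rbinom(n_quizzes, size = 1, prob = p)  # second trial, same p

# proportion of second-trial failures among quizzes where the first trial failed
mean(trial2[trial1 == 0] == 0)
# proportion of second-trial failures irrespective of the first trial
mean(trial2 == 0)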

When Ronny sees that the proportion of simulated quizzes where the second trial is a failure irrespective of the outcome of the first trial is approximately 0.5, he quickly realizes why:
$$ E(X)=\int_0^1 E(X|p)f(p)dp,$$
in which $X$ is the Bernoulli random variable,  representing the outcome of the second trial in an experiment, $E(X|p)$ is the expected value of the Bernoulli random variable $X$ given a specific value of $p$, and $f(p)$ is the probability density function of $p$. Because the expected value of a Bernoulli random variable $X$ given a specific value of $p$ is trivially equal to $p$, and since $p$ is uniformly distributed over $[0, 1]$, $f(p)=1$, we now get:
$$E(X)=\int_0^1p \times 1~dp$$
$$ E(X)=\left[\frac{p^2}{2}\right]_0^1=\frac{1^2}{2}-\frac{0^2}{2}=\frac{1}{2}=0.5.$$
So, under these conditions, if you repeated the quiz many times you would expect about as many failures as successes, just like Ronny observed.

If you first only select the cases where the first trial was a failure - just like in the original quiz - things get a tiny bit more complicated. Let's call the outcome of the first Bernoulli trial $X_1$ and the second $X_2$. We want to know $E(X_2|X_1=0)$ so that we can then derive the (expected) proportion of failures in the second trial. Let's start by using the law of total expectation, conditioning on $p$:
$$E(X_2|X_1=0)=\int_0^1E(X_2|X_1=0,p)\times f(p|X_1=0)dp$$
Given $p$, the second trial is independent of the first, so $E(X_2|X_1=0,p)=E(X_2|p)=p$. Let's work out the second factor, $f(p|X_1=0)$, separately:
$$f(p|X_1=0)=\frac{f(X_1=0|p)\times f(p)}{f(X_1=0)},$$
using Bayes' rule. Here $f(X_1=0|p)$ is the probability of observing a failure on the first trial given $p$, which is $1-p$, and $f(X_1=0)$ is the marginal probability of a failure on the first trial. To express that last part let's start with the law of total probability:
$$f(X_1=0)=\int_0^1f(X_1=0|p)\times f(p)dp$$
All elements in that equation were discussed before so we can perform the integration:
$$f(X_1=0)=\int_0^1 (1-p) \times 1 dp=\left[p - \frac{p^2}{2}\right]_0^1=\left[1 - \frac{1}{2}\right] -\left[0 - \frac{0^2}{2}\right]=\frac{1}{2}.$$
All of this leads to:
$$f(p|X_1=0)=\frac{(1-p) \times 1 }{\frac{1}{2}}=2(1-p).$$
Now we can go back to $E(X_2|X_1=0)$ and write:
$$E(X_2|X_1=0)=\int_0^1 p \times 2(1-p) dp,$$
$$E(X_2|X_1=0)=2\int_0^1 (p-p^2) dp,$$
$$E(X_2|X_1=0)=2\left[\frac{p^2}{2}-\frac{p^3}{3}\right]_0^1=2 \left[ \left(\frac{1}{2}-\frac{1}{3}\right) - \left(\frac{0}{2}-\frac{0}{3} \right)\right]=2\left[\frac{1}{6}\right]=\frac{1}{3}.$$

If $E(X_2|X_1=0)=\frac{1}{3}$, the probability of a failure in the second trial, given a failure in the first, is $1-\frac{1}{3}=\frac{2}{3}\approx 0.6667$, again very close to what Ronny observed.  
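The integral can also be checked numerically in base R:

# E(X_2 | X_1 = 0) = integral of p * 2(1 - p) over [0, 1]
integrate(function(p) p * 2 * (1 - p), lower = 0, upper = 1)$value   # 0.3333...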

Before we continue let me add a comment from Romke Bontekoe on an earlier version of this blogpost. He remarked that Ronny was not the first to figure this out. Pierre-Simon Laplace, who lived from 1749 to 1827, had established the rule of succession, which states that:
$$P(X_{n+1}=1|X_1+X_2+ \dots+X_n=s)=\frac{s+1}{n+2}, $$
in which $s$ is the number of successes and $n$ the number of trials. Applied to this case, where we are interested in the probability of a failure, given 1 previous failure, we get:
$$P(X_2=0| X_1=0)=1-P(X_2=1| X_1=0)$$
$$P(X_2=0| X_1=0)=1-\frac{0+1}{1+2}=1-\frac{1}{3}=\frac{2}{3}.$$
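Laplace's rule is easy to turn into a small helper function (the name rule_of_succession is mine):

rule_of_succession <- function(s, n) (s + 1) / (n + 2)   # P(next trial is a success)
1 - rule_of_succession(s = 0, n = 1)                     # P(X_2 = 0 | X_1 = 0) = 2/3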

So now Ronny understands where his results are coming from, but does that help him decide who the winner will be? He decides to look at the 95th percentile of the distribution of the generated $p$'s, both when all $p$'s are considered and when only those $p$'s are considered that returned a failure on the first trial. The results are 0.9478 and 0.7829 respectively. These numbers are very close to the 0.95 and 0.78 that Freddy and Barry had mentioned. Ronny thus concludes that, while both Freddy and Barry answered 0, Barry gets the advantage for properly taking into account the result of the first trial; Freddy didn't learn anything from that first trial. If we did the same exercise but had witnessed a success in trial 1, and only continued with the $p$'s that led to that success, Freddy would still insist on a 95% confidence interval from 0 to 0.95 for a failure in trial 2, while Barry would adjust his credible interval so that it would go from 0 to 0.9746 instead of from 0 to 0.78.
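Reusing the objects from the simulation sketch above, Ronny's percentiles can be reproduced approximately:

quantile(p[trial1 == 0], 0.95)   # ~0.78: 95th percentile of p given a failure in trial 1
quantile(p[trial1 == 1], 0.95)   # ~0.97: 95th percentile of p given a success in trial 1
quantile(p, 0.95)                # ~0.95: 95th percentile of p ignoring the first trial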

What can this be used for? Not much, I will admit, but it shows that, sometimes, even if you only get a sample of 1 you can already come to some conclusions. 
A second thing we can learn from this is that you don't need the integrals for Bayesian analysis; often you can just rely on simulations or other alternatives. One of the reasons why I only picked up Bayesian statistics at a later age is that, when I was younger, as soon as a paper used integrals instead of summations, I would mentally block and most often give up. But if you're careful about the books and articles you read, Bayesian statistics is also accessible to people who are less fond of integrals.    


Tuesday, September 19, 2023

A note on observing zero successes

Say that you have a sample of size $n=1000$ and you observed $S_n=100$ successes. Traditionally you would use $\hat p=\frac{S_n}{n}=\frac{100}{1000}=0.1$ as a point estimate of the population proportion $p$. From a frequentist perspective you would probably also report a confidence interval:

$$p_-=\hat p - z_\alpha\sqrt{\frac{\hat p(1-\hat p)}{n}}=0.1-1.96\sqrt{\frac{0.1 \times 0.9}{1000}}=0.08140581,$$

and

$$p_+=\hat p + z_\alpha\sqrt{\frac{\hat p(1-\hat p)}{n}}=0.1+1.96\sqrt{\frac{0.1 \times 0.9}{1000}}=0.1185942,$$

using $z_\alpha=1.96$ for a 95% confidence interval (assuming the sampling fraction is small, i.e. the universe size $N$ is large relative to $n$; I will also not go into how such a confidence interval needs to be interpreted).
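In R this interval can be computed in a couple of lines (a small sketch, using 1.96 as in the text):

n <- 1000; s <- 100
p_hat <- s / n
z <- 1.96
p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)   # 0.08140581 0.11859419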

So far, so good. 

Now say you have observed zero successes, i.e. $S_n=0$, and you want to apply the procedure above. To start with, you can't: with $\hat p=0$ the estimated standard error is zero and the interval collapses to a single point, violating the non-zero sample proportion assumption. There are some alternatives such as the Wilson score interval or the Clopper-Pearson interval. I will not discuss those; instead I will focus on how Bayesian analysis can help.

Say that someone wants to know what proportion of human beings can fly and say that you observe zero successes in a sample of $n=1000$. Many people would be willing to conclude that people can't fly. This is partly because people have very strong priors about the human capacity to fly. From an inductive reasoning point of view, you can never, with full certainty, conclude that humans can't fly until you have checked that for each individual in the population. The absence of evidence does not necessarily mean evidence of absence. 

In business statistics we don't have the time and resources to check all individuals of the population; furthermore, the complete absence of something is seldom of great concern from a business perspective. 

Nonetheless, sometimes, even in business, the absence question can be important. Think about the presence or absence of errors, for example. Clearly, in a lot of cases you can't check every unit for errors, so you often sample cases. If it is of business interest to know whether the whole population of interest is free of errors, it's clear you will have to work with confidence intervals, credible intervals or some other way of expressing uncertainty.

What can you do? 

As explained by John D. Cook in his blogpost Estimating the chances of something that hasn’t happened yet (See https://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/), you could argue that we want to find the largest proportion $p$ such that:

$$P(S_n=0)= {n\choose 0}p^0(1-p)^n=0.05$$

or

$$(1-p)^n=0.05.$$

The 0.05 comes from the fact that we want a 95% confidence interval. 

We want to solve this last equation for $p$. First we take the logarithm of both sides:

$$\log((1-p)^n)=\log(0.05),$$

and we work out further to:

$$n \log(1-p)= -2.995732 \approx -3$$

We then apply the Taylor series expansion for $\log(1-p)$ and get:

$$n \left(-p -\frac{p^2}{2}-\frac{p^3}{3} - \dots \right) \approx -3$$

For small values of $p$  the higher-order terms become negligible and we can truncate the series after the first term, so that:

$$n (-p) \approx -3,$$

which leads to the rule of thumb that:

$$p \approx \frac{3}{n}$$

This rule of thumb is attributed to Ronald A. Fisher, who described it in his book Statistical Methods for Research Workers, published in 1925.  

Notice that from a frequentist perspective, more specifically how confidence intervals should be interpreted, this rule of thumb is not without problems. Why this is the case would lead us too far, but there's a Bayesian argument to be made that leads to the same rule of thumb.

In the blogpost I mentioned earlier John D. Cook gives a Bayesian derivation:

Suppose you start with a uniform prior on p. The posterior distribution on p after having seen 0 successes and N failures has a beta(1, N+1) distribution. If you calculate the posterior probability of p being less than 3/N you get an expression that approaches 1 – exp(-3) as N gets large, and 1 – exp(-3) $\approx$ 0.95.

This derivation really relies on knowing how to derive an analytical expression for the posterior distribution in this particular case. If you don't have that knowledge, I present a more intuitive illustration here. 
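Before moving to that illustration, Cook's statement is easy to verify numerically with the Beta CDF, for example for $N=100$:

N <- 100
pbeta(3 / N, shape1 = 1, shape2 = N + 1)   # posterior P(p < 3/N), ~0.954
1 - exp(-3)                                # limiting value, ~0.950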

We start again from a uniform prior distribution. The likelihood is simply the binomial. We don't know which $p$ to use, so we use grid approximation with 10,000 values of $p$ evenly distributed between 0 and 1. We multiply prior and likelihood to obtain an unstandardized posterior, then standardize it. Next we sample from the posterior and summarize by taking the 95th percentile.

Below you can find example code that illustrates how simple it is to implement this idea in R:

n <- 100

# grid of 10,000 candidate values for p between 0 and 1
p_grid <- seq(from = 0, to = 1, length.out = 10000)

# uniform prior: the same constant for every value on the grid
prior <- rep(1, 10000)

# binomial likelihood of observing 0 successes in n trials, for each p
likelihood <- dbinom(0, size = n, prob = p_grid)

# unstandardized posterior = prior * likelihood, then standardize
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)

# sample from the posterior and take the 95th percentile
samples <- sample(p_grid, prob = posterior, size = 1e5, replace = TRUE)
q95 <- as.numeric(quantile(samples, 0.95))
print(q95)

The result is close to Fisher's rule of thumb $p=\frac{3}{100}=0.03$. 

Next to the Bayesian approach with grid-approximation (abbreviated as Bayes), and Fisher's rule of three (abbreviated as Fisher), we also work out the binomial case (abbreviated as Binomial).

To do this we go back a few equations and work it out further:

$$\log(1-p)= \frac{-2.995732}{n},$$

and exponentiate both sides:

$$1-p= \exp\left(\frac{-2.995732}{n}\right),$$

which finally leads to:

$$p= 1- \exp\left(\frac{-2.995732}{n}\right).$$

I did this for $n$ going from 1 up to 50 and summarized it in the chart below.
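For readers who want to reproduce the comparison, here is a minimal sketch (the helper bayes_q95 just wraps the grid-approximation recipe shown above; the names are mine):

# 95th percentile of the grid-approximation posterior for a given n (uniform prior)
bayes_q95 <- function(n) {
  p_grid    <- seq(from = 0, to = 1, length.out = 10000)
  posterior <- dbinom(0, size = n, prob = p_grid)   # flat prior drops out
  posterior <- posterior / sum(posterior)
  samples   <- sample(p_grid, size = 1e5, replace = TRUE, prob = posterior)
  as.numeric(quantile(samples, 0.95))
}

n        <- 1:50
fisher   <- pmin(3 / n, 1)              # rule of three, capped at 1
binomial <- 1 - exp(-2.995732 / n)      # exact binomial solution
bayes    <- sapply(n, bayes_q95)        # grid approximation

matplot(n, cbind(fisher, binomial, bayes), type = "l", ylab = "upper limit")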


First notice that the blue line with the results of the Bayesian analysis is more wobbly than the others. That's because we're sampling and using grid approximation in that approach. But for all practical purposes we see that all three approaches are pretty much equal to each other as soon as $n>25$. Finally, the attentive reader will have observed that something odd happens when $n=1$. This will be discussed in a follow-up blogpost!