Tuesday, September 19, 2023

A note on observing zero successes

Say that you have a sample of size $n=1000$ and you observed $S_n=100$ successes. Traditionally you would use $\hat p=\frac{S_n}{n}=\frac{100}{1000}=0.1$ as a point estimate of the population proportion $p$. From a frequentist perspective you would probably also report a confidence interval:

$$p_-=\hat p - z_\alpha\sqrt{\frac{\hat p(1-\hat p)}{n}}=0.1-1.96\sqrt{\frac{0.1 \times 0.9}{1000}}=0.08140581,$$

and

$$p_+=\hat p + z_\alpha\sqrt{\frac{\hat p(1-\hat p)}{n}}=0.1+1.96\sqrt{\frac{0.1 \times 0.9}{1000}}=0.1185942,$$

using $z_\alpha=1.96$ for a 95% confidence interval (assuming that the sampling fraction is small, i.e. that the population size $N$ is large relative to $n$; I will not go into how such a confidence interval should be interpreted).
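In R, this textbook calculation looks something like the following. This is a minimal sketch of my own; the variable names are not from the original post.

# Wald 95% confidence interval for the example above
n <- 1000
s <- 100
p_hat <- s / n                          # point estimate of p
se <- sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
c(lower = p_hat - 1.96 * se, upper = p_hat + 1.96 * se)
# roughly 0.0814 and 0.1186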

So far, so good. 

Now say you have observed zero successes, i.e. $S_n=0$, and you want to apply the procedure above. You can't: with $\hat p=0$ the estimated standard error is zero and the interval collapses to a single point. There are alternatives, such as the Wilson score interval or the Clopper–Pearson interval. I will not discuss those; instead I will focus on how Bayesian analysis can help.
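As a brief aside (my own, not part of the argument below): base R's binom.test() returns a Clopper–Pearson interval even when zero successes are observed, so the frequentist route is not completely blocked.

# Clopper–Pearson 95% confidence interval for 0 successes out of 1000
binom.test(x = 0, n = 1000)$conf.int
# roughly (0, 0.0037)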

Say that someone wants to know what proportion of human beings can fly, and say that you observe zero successes in a sample of $n=1000$. Many people would be willing to conclude that people can't fly. This is partly because people have very strong priors about the human capacity to fly. From an inductive reasoning point of view, however, you can never conclude with full certainty that humans can't fly until you have checked every individual in the population. Absence of evidence is not necessarily evidence of absence.

In business statistics we don't have the time and resources to check every individual in the population. Furthermore, the complete absence of something is seldom of great concern from a business perspective.

Nonetheless, sometimes, even in business, the absence question can be important. Think about the presence or absence of errors, for example. Clearly, in many cases you can't check every unit for errors, so you sample cases instead. If it is of business interest to know whether the whole population of interest is error-free, it's clear you will have to work with confidence intervals, credible intervals or some other way of expressing uncertainty.

What can you do? 

As explained by John D. Cook in his blogpost "Estimating the chances of something that hasn't happened yet" (see https://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/), you could argue that we want to find the largest proportion $p$ such that:

$$P(S_n=0)= {n\choose 0}p^0(1-p)^n=0.05$$

or

$$(1-p)^n=0.05.$$

The 0.05 comes from the fact that we want a 95% confidence interval.

We want to solve this last equation for $p$. First we take the logarithm of both sides:

$$\log((1-p)^n)=\log(0.05),$$

and work this out further to:

$$n \log(1-p)= -2.995732 \approx -3$$

We then apply the Taylor series expansion for $\log(1-p)$ and get:

$$n \left(-p -\frac{p^2}{2}-\frac{p^3}{3} - \dots \right) \approx -3$$

For small values of $p$  the higher-order terms become negligible and we can truncate the series after the first term, so that:

$$n (-p) \approx -3,$$

which leads to the rule of thumb that:

$$p \approx \frac{3}{n}$$
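As a quick sanity check (my own, not part of the original derivation), plugging $p=3/n$ back into the exact equation for, say, $n=1000$ shows how good the approximation is:

# Check the rule of three: probability of zero successes when p = 3/n
n <- 1000
p <- 3 / n
(1 - p)^n    # about 0.0496, close to the target 0.05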

This rule of thumb is attributed to Ronald A. Fisher, who described it in his book Statistical Methods for Research Workers, published in 1925.

Notice that from a frequentist perspective, more specifically regarding how confidence intervals should be interpreted, this rule of thumb is not without problems. Explaining why would lead us too far, but there is a Bayesian argument that leads to the same rule of thumb.

In the blogpost I mentioned earlier, John D. Cook gives a Bayesian derivation:

"Suppose you start with a uniform prior on p. The posterior distribution on p after having seen 0 successes and N failures has a beta(1, N+1) distribution. If you calculate the posterior probability of p being less than 3/N you get an expression that approaches 1 – exp(-3) as N gets large, and 1 – exp(-3) $\approx$ 0.95."
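If you prefer a numerical check of this statement over the analytic argument, pbeta gives it directly. This is a small sketch of my own, using the beta(1, N+1) posterior Cook mentions:

# P(p < 3/N) under the beta(1, N+1) posterior, for growing N
N <- c(10, 100, 1000, 10000)
pbeta(3 / N, shape1 = 1, shape2 = N + 1)
# the values approach 1 - exp(-3), i.e. about 0.9502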

This derivation relies on knowing how to analytically derive the posterior distribution in this particular case. If you don't have that background, I present a more intuitive illustration here.

We start again from a uniform prior distribution. The likelihood is simply the binomial. We don't know which $p$ to use, so we use grid approximation with 10,000 values of $p$ evenly spaced between 0 and 1. We multiply prior and likelihood to get an unstandardized posterior, then standardize it. Next we sample from the posterior and summarize by taking the 95th percentile.

Below you can find example code illustrating how simply this idea can be implemented in R:

# Sample size with zero observed successes
n <- 100

# Grid of 10,000 candidate values for p between 0 and 1
p_grid <- seq(from = 0, to = 1, length.out = 10000)

# Uniform (flat) prior
prior <- rep(1, 10000)

# Binomial likelihood of observing 0 successes in n trials, for each value of p
likelihood <- dbinom(0, size = n, prob = p_grid)

# Unstandardized posterior, then standardize so it sums to 1
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)

# Sample from the posterior and summarize with the 95th percentile
samples <- sample(p_grid, prob = posterior, size = 1e5, replace = TRUE)
q95 <- as.numeric(quantile(samples, 0.95))
print(q95)

The result is close to Fisher's rule of thumb $p=\frac{3}{100}=0.03$. 
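As a cross-check (my own addition), you can also read the same quantile directly off the beta(1, n+1) posterior that John D. Cook mentions, without any grid or sampling:

# 95th percentile of the beta(1, n+1) posterior for n = 100
qbeta(0.95, shape1 = 1, shape2 = 100 + 1)
# about 0.0292, in line with the sampling result and with 3/100 = 0.03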

Besides the Bayesian approach with grid approximation (abbreviated as Bayes) and Fisher's rule of three (abbreviated as Fisher), we also work out the exact binomial case (abbreviated as Binomial).

To do this we go back a few equations, to $n \log(1-p)= -2.995732$, and work it out further:

$$\log(1-p)= \frac{-2.995732}{n},$$

and exponentiate both sides:

$$1-p= \exp\left(\frac{-2.995732}{n}\right),$$

which finally leads to:

$$p= 1- \exp\left(\frac{-2.995732}{n}\right).$$

I did this for $n$ going from 1 up to 50 and summarized it in the chart below.


First notice that the blue line with the results of the Bayesian analysis is more wobbly than the others. That's because we're sampling and using grid approximation in that approach. But for all practical purposes, the three approaches are pretty much equal to each other as soon as $n>25$. Finally, the attentive reader will have noticed that something odd happens when $n=1$. This will be discussed in a follow-up blogpost!
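For completeness, the comparison behind the chart could be reproduced along the following lines. This is my own reconstruction rather than the original code; in particular the colours (apart from blue for Bayes) and plotting details are assumptions.

# Compare the three upper bounds for n = 1, ..., 50
ns <- 1:50

# Fisher's rule of three
fisher <- 3 / ns

# Exact binomial solution of (1 - p)^n = 0.05
binomial <- 1 - exp(log(0.05) / ns)

# Bayesian grid approximation with sampling, as in the code above
bayes <- sapply(ns, function(n) {
  p_grid <- seq(from = 0, to = 1, length.out = 10000)
  prior <- rep(1, 10000)
  likelihood <- dbinom(0, size = n, prob = p_grid)
  posterior <- likelihood * prior
  posterior <- posterior / sum(posterior)
  samples <- sample(p_grid, prob = posterior, size = 1e5, replace = TRUE)
  as.numeric(quantile(samples, 0.95))
})

# Plot the three curves (Bayes in blue, as in the chart)
plot(ns, bayes, type = "l", col = "blue", ylim = c(0, 1),
     xlab = "n", ylab = "upper bound for p")
lines(ns, fisher, col = "red")
lines(ns, binomial, col = "darkgreen")
legend("topright", legend = c("Bayes", "Fisher", "Binomial"),
       col = c("blue", "red", "darkgreen"), lty = 1)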


