It would be several years into his new life in England that a middle-aged De Moivre would take a real abiding interest in Jacob Bernoulli’s work on the Law of Large Numbers. To see what his interest led to, let’s visit Bernoulli’s theorem and the thought experiment that lead Bernoulli to its discovery.

In *Ars Conjectandi*, Bernoulli had imagined a large urn containing r black tickets and s white tickets. Both r and s are unknown to you and so is the true fraction p = r/(r+s) of black tickets in the urn. Now suppose you draw n tickets from the urn randomly with replacement and your random sample contains **X**_bar_n black tickets. Here, **X**_bar_n is the sum of n i.i.d. random variables. Thus, **X**_bar_n/n is the ratio of black tickets that you observe. In essence**, X**_bar_n/n is your estimate of the true value of p.

The number of black tickets **X**_bar_n found in a random sample of black and white tickets has the familiar binomial distribution. That is:

**X**_bar_n ~ Binomial(n, p)

Where n is the sample size, and p=r/(r+s) is the true probability of a single ticket being a black ticket. Of course, p is unknown to you since in Bernoulli’s experiment, the number of black tickets (r) and white tickets (s) are unknown to you.

Since **X**_bar_n is binomially distributed, its expected value E(**X**_bar_n) = np and its Var(**X**_bar_n) = np(1 — p). Again, since p is unknown, both the mean and variance of **X**_bar_n are also unknown.

Also unknown to you is the absolute difference between your estimate of p and the true value of p. This estimate is the error |**X**_bar_n/n — p|.

Bernoulli’s great discovery was to show that as the sample size n becomes very large, the odds of the error |**X**_bar_n/n — p| being smaller than any arbitrarily small positive number ϵ of your choosing become incredibly large. As an equation, his discovery can be stated as follows:

The above equation is the **Weak Law of Large Numbers**. In the above equation:

P(|**X_**bar_n/n — p| <= ϵ) is the probability of the estimation error being at most ϵ.

P(|**X**_bar_n/n — p| > ϵ) is the probability of the estimation error being greater than ϵ.

‘c’ is a very large positive number.

The WLLN can be stated in three other forms highlighted in the blue boxes below. These alternate forms result from doing some simple algebraic gymnastics as follows:

Now notice the probability in the third blue colored box:

P(μ — δ ≤ **X**_bar_n ≤ μ + δ) = (1 — α)

Or plugging back μ =np:

P(np — δ ≤ **X**_bar_n ≤ np + δ) = (1 — α)

Since **X**_bar_n ~ Binomial(n,p), it is straightforward to express this probability as a difference of two binomial probabilities as follows:

But it is at this point that things stop being straightforward. For large n, the factorials inside the two summations become enormous and near about impossible to calculate. Imagine having to calculate 20!, leave alone 100! or 1000!. What is needed is a good approximation technique for factorial(n). In *Ars Conjectandi,* Jacob Bernoulli made a few weak attempts at approximating these probabilities, but the quality of his approximations left a lot to be desired.

## Abraham De Moivre’s big idea

In the early 1700s, when De Moivre first began looking at Bernoulli’s work, he immediately sensed the need for a fast, high quality approximation technique for the factorial terms in the two summations. Without an approximation technique, Bernoulli’s great accomplishment was like a big, beautiful kite without a string. A law of great beauty but of little practical use.

De Moivre recast the problem as an approximation for the sum of successive terms in the expansion of (a + b) raised to the nth power. This expansion, known as the binomial formula, goes as follows:

De Moivre’s reasons for recasting the probabilities in the WLLN in terms of the binomial formula were arrestingly simple. It was known that if the sample sum **X**_bar_n has a binomial distribution, the probability of X_bar_n being less than or equal to some value n can be expressed as a sum of (n+1) probabilities as follows:

If you compare the coefficients of the terms on the R.H.S. of the above equation with the coefficients in the terms in the expansion of (a+b) raised to n, you’ll find them to be remarkably similar. And so De Moivre theorized, if you find a way to appropriate the factorial terms in the R.H.S. of (a+b) raised to n, you have paved the way for approximating P(**X**_bar_n ≤ n), and thus also the probability lying at the heart of the Weak Law of Large Numbers, namely:

P(np — δ ≤ **X**_bar_n ≤ np + δ) = (1 — α)

For over 10 years, De Moivre toiled on the approximation problem creating increasingly accurate approximations of the factorial terms. By 1733, he had largely concluded his work when he published what came to be called **De Moivre’s theorem** (or less accurately, the De Moivre-Laplace theorem).

At this point, I could just state De Moivre’s theorem but that will spoil half the fun. Instead, let’s follow along De Moivre’s train of thought. We’ll work through the calculations leading up to the formulation of his great theorem.

Our requirement is for a fast, high accuracy approximation technique for the probability that lies at the heart of Bernoulli’s theorem, namely:

P(|**X**_bar_n/n — p| ≤ ϵ)

Or equivalently its transformed version:

P(np — δ ≤ **X**_bar_n ≤ np + δ)

Or in the most general form, the following probability:

P(x_1 ≤ **X** ≤ x_2)

In this final form, we have assumed that **X** is a discrete random variable that has a binomial distribution. Specifically, **X** ~ Binomial(n,p).

The probability P(x_1 ≤ **X** ≤ x_2) can be expressed as follows:

Now let p, q be two real numbers such that:

0 ≤ p ≤ 1, and 0 ≤ q ≤ 1, and q = (1 — p).

Since **X** ~ Binomial(n,p), E(**X**) = μ = np, and Var(**X**) = σ² = npq.

Let’s create a new random variable **Z** as follows:

**Z** is clearly the standardized version of **X**. Specifically, **Z** is a **standard normal random variable**. Thus,

If **X** ~ Binomial(n,p), then **Z** ~ N(0, 1)

Keep this in mind for we’ll visit this fact in just a minute.

With the above framework in place, De Moivre showed that *for very large values of n*, the probability:

P(x1 ≤ **X** ≤ x2)

can be approximated by evaluating the following specific type of integral:

The ≃ sign means the L.H.S. asymptotically equals the R.H.S. In other words, as the sample size grows to ∞, L.H.S. = R.H.S.

Did you notice something familiar about the integral on R.H.S? It’s **the formula for the area under a standard normal variable’s probability density curve** from z_1 to z_2.

And the formula inside the integral is the **Probability Density Function** of the **standard normal random Z**:

Let’s split apart the integral on the R.H.S. as a difference of two integrals as follows:

The two new integrals on the R.H.S. are the respectively the cumulative densities P(**Z** ≤ z_2), and P(**Z** ≤ z_1).

The **Cumulative Density Function **P(**Z** ≤ z) of a standard normal random variable is represented using the standard notation:

𝛟(z)

Therefore, the integral on the L.H.S. of the above equation is equal to:

𝛟(z_2) — 𝛟(z_1).

Bringing it all together, we can see that the probability:

P(x1 ≤ **X** ≤ x2)

asymptotically converges to 𝛟(z_2) — 𝛟(z_1):

Now recall how we defined **Z** as a standardized **X **:

And thus we also have the following:

While formulating his theorem, De Moivre defined the limits x_1 and x_2 as follows:

Substituting these values of x_1 and x_2 in the previous set of equations, we get:

And therefore, De Moivre showed that for very large *n*:

Remember, what De Moivre really wanted was to approximate the probability on the L.H.S. of Bernoulli’s theorem:

Which he succeeded in doing by making the following simple substitutions:

Which produces the following asymptotic equality:

In a single elegant stroke, De Moivre showed how to approximate the probability in Bernoulli’s theorem for large sample sizes. And large sample sizes is what Bernoulli’s theorem is all about. There is however some subtext to De Moivre’s achievement. The integral on the R.H.S. does not have a closed form and De Moivre approximated it using an infinite series.

## An illustration of De Moivre’s Theorem

Suppose there are exactly three times as many black tickets as white tickets in the urn. So the true fraction of black tickets, p, is 3/4. Suppose also that you draw a random sample with replacement of 1000 tickets. Since p=0.75, the expected value of black tickets is np = 750. Suppose the number of black tickets you observe in the sample is 789. What is the probability of drawing such a random sample?

Let’s set out the facts:

We wish to find out:

P(750 — 39 ≤ **X_**bar_n <= 750 + 39)

We’ll use De Moivre’s Theorem to find this probability. As we know, the theorem can be stated as follows:

We know that n=1000, p=0.75, **X**_bar_n=789, and δ=39. We can find k as follows:

Plugging in all the values:

In approximately 99.56% of random samples of size 1000 tickets each, the number of black tickets will lie between 711 and 789.