This blog post gives an overview of several probability distributions. In particular it will cover the discrete uniform, the Bernoulli, the binomial and the geometric distributions. For each distribution the probability mass function (PMF), the expected value and (where practical) the variance will be given.

But first let's dive into what these distributions and their properties represent.

# Distributions

Often when performing a test we are interested in the result based on an input parameter. For example what is the chance that something will happen (the result) when trying it N times (the input parameter)? Such a question resembles a mathematical function with input and output. Indeed knowing such a function is knowing about the distribution. One function alone is not enough though. A distribution is described by several functions (properties of the distribution) each using particular input parameters.

Once there are a bunch of properties describing the distribution there are more interesting things that can be done. One thing is the realization that a lot of problems can be generalized and can be solved in standardized ways. It may not come as a surprise that rolling a four sided die and a six sided die have similar ways of finding the distribution and properties such as the expected value.

All the distributions that we will discuss are discrete probability distributions in which the distribution tells something about the chance a particular outcome will happen.

Let's first go over the three properties of distributions, before talking about the distributions themselves.

# Probability Mass Function

The PMF describes the probability that the outcome is a certain value. "Mass" here refers to the probability being calculated for a single outcome (the mass) as a discrete value, rather than describing the probability over an interval as is done for continuous values.
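As a small illustration, a PMF can be written down as a mapping from each outcome to its probability. The sketch below assumes a fair six-sided die; the name `die_pmf` is made up for this example.

```python
from fractions import Fraction

# PMF of a fair six-sided die: every outcome has probability 1/6
die_pmf = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

# the probability of one particular outcome
p_three = die_pmf[3]

# the probabilities of all outcomes together always sum to 1
total = sum(die_pmf.values())
```

Using `Fraction` instead of floats keeps the probabilities exact, which makes it easy to see that they add up to exactly 1.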

# Expected value

This is simply the average value that we expect after performing the experiment many times. Also take a look at this blog post which goes deeper into the expected value.
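To get a feeling for "performing the experiment many times", here is a small simulation sketch. It assumes a fair six-sided die and uses a fixed seed (an arbitrary choice) so that the run is repeatable.

```python
import random

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability

# roll a fair six-sided die 100,000 times
rolls = [rng.randint(1, 6) for _ in range(100_000)]

# the average of many rolls ends up close to the expected value 3.5
average = sum(rolls) / len(rolls)
```

The more rolls you simulate, the closer `average` tends to get to 3.5, even though no single roll can ever be 3.5.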

# Variance & friends

Absolute deviation, variance and standard deviation all tell us something about how the outcomes are spread out from the average (expected) value. There is always the question of why you would use one or the other. I invite you to google “absolute deviation vs standard deviation” and find out for yourself. In lectures about probability, however, the variance is covered most. The variance is expressed in squared units, while the two deviations are in the same units as the original numbers.

Just looking at a single measure of spread of a distribution, it's hard to tell what it represents. Looking at the plot of a PMF works much better. However, the numbers can be very useful when comparing distributions. An excellent piece of writing about the importance of variance can be found here. The three measures of spread are defined as follows (mathematically and in Python 3):

Absolute deviation: \(\frac{1}{n}\sum_{i=1}^n |x_i-m(X)|\)

Variance: \(\operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2 \right] = \operatorname{E}\left[X^2 \right] - \operatorname{E}[X]^2\)

Standard deviation: \(\sqrt{\operatorname{Var}(X)}\)

```
import random
from math import sqrt
outcomes = random.sample(range(1, 101), 10) # making a list of 10 numbers between 1 and 100
mu = sum(outcomes) / len(outcomes) # mu is the average
absolute_deviation = sum([abs(x - mu) for x in outcomes]) / len(outcomes)
variance = sum([(x - mu)**2 for x in outcomes]) / len(outcomes)
standard_deviation = sqrt(variance)
```

The Python code is written in a way that the probability for each outcome is the same. That's why we can divide by the total number of samples. Now remember that the expected value is defined as the sum of the values times their probability.

\(\operatorname{E}[X] =\sum_{i=1}^k x_i\,p_i=x_1p_1 + x_2p_2 + \cdots + x_kp_k\)

With the variance being defined as \(\operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2 \right]\), \(X\) takes the value of \((X - \mu)^2\), therefore each little \(x_i\) takes the value of \((x_i - \mu)^2\). And the formula for variance becomes

\(\operatorname{Var}(X) =\sum_{i=1}^k (x_i-\mu)^2\,p_i=(x_1-\mu)^2p_1 + (x_2-\mu)^2p_2 + \cdots + (x_k-\mu)^2p_k\)

What follows is a Python 3 example where each outcome has a different probability.

```
import random
outcomes = random.sample(range(1, 101), 10)
# "ps" here stands for "probabilities"
ps_numerators = random.sample(range(1, 101), 10)
ps_denominator = sum(ps_numerators)
ps = [x / ps_denominator for x in ps_numerators]
# ps is now a list of 10 probabilities (each numerator / ps_denominator)
# together these 10 numbers sum to 1
# mu is the expected value: each outcome weighted by its own probability
mu = sum([x * p for (p, x) in zip(ps, outcomes)])
variance = sum([(x - mu)**2 * p for (p, x) in zip(ps, outcomes)])
# which is the same as
variance_2 = sum([(x - mu)**2 * p for (p, x) in zip(ps_numerators, outcomes)]) / ps_denominator
```

It really pays off to understand both the equations and the Python code. Going forward in statistics, the variance is so important that you should understand this concept really well.

# The discrete uniform distribution

The discrete uniform distribution is one of the simplest distributions. So simple in fact that you wonder why people even bothered to name it. Nevertheless, this distribution describes an experiment in which each outcome is equally likely, such as rolling a (perfect) six sided die. The chance of rolling each number is the same for all numbers (\(\frac{1}{6}\)). For an n-sided die it follows that:

\(PMF = \frac{1}{n}\)

\(E[X] = \frac{1 + 2 + \cdots + n}{n} = \frac{1 + n}{2}\)

More generally, in case the numbers don't start at 1 but run over the consecutive integers from \(a\) to \(b\):

\(E[X] = \frac{a + b}{2}\)

I won't go into the variance, as it's more complicated and this distribution is not as important as the ones which are still to follow.
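The formulas for the PMF and the expected value are easy to check with a few lines of Python. This is only a sketch: the helper names `uniform_pmf` and `expected_value` are made up for this post, and the outcomes are assumed to be the consecutive integers from `a` to `b`.

```python
from fractions import Fraction

def uniform_pmf(a, b):
    """PMF of a discrete uniform distribution on the integers a..b (inclusive)."""
    n = b - a + 1  # number of equally likely outcomes
    return {x: Fraction(1, n) for x in range(a, b + 1)}

def expected_value(pmf):
    """Sum of every value times its probability."""
    return sum(x * p for x, p in pmf.items())

# a six sided die: should match (1 + 6) / 2
e_die = expected_value(uniform_pmf(1, 6))

# outcomes that don't start at 1: should match (10 + 20) / 2
e_shifted = expected_value(uniform_pmf(10, 20))
```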

# Bernoulli distribution

The Bernoulli distribution resembles the discrete uniform distribution with only two choices, except that the two outcomes don't have to be equally likely. Instead of a 6-sided die you can imagine a coin flip. If the probability is not a perfect 50/50 but maybe 70/30, it speaks for itself that when one side has 70% chance the other must have 100% - 70% = 30% chance. In terms of \(p\), the chance of success is \(p\) and for failure it's \(1 - p\). It's very useful to take 1 for success and 0 for failure as it simplifies the math (for example the formula for variance).

\(PMF = p, 1 - p\) (for the two points, success and failure)

\(E[X] = p\) (this is trivial to prove yourself)

\(Var(X) = (0-p)^2 * (1 - p) + (1-p)^2 * p = p-p^2 = p(1 - p)\) proof
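Both formulas can be checked directly from the definitions of expected value and variance. The sketch below assumes the 70/30 coin from above, with 1 for success and 0 for failure.

```python
from fractions import Fraction

p = Fraction(7, 10)  # assumed example: 70% chance of success

# PMF with 1 for success and 0 for failure
bernoulli_pmf = {1: p, 0: 1 - p}

# expected value: sum of value times probability
mean = sum(x * prob for x, prob in bernoulli_pmf.items())

# variance: sum of squared distance to the mean times probability
variance = sum((x - mean) ** 2 * prob for x, prob in bernoulli_pmf.items())
```

Here `mean` comes out as \(p\) and `variance` as \(p(1 - p)\), which is 21/100 for this coin.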

# Binomial distribution & geometric distribution

When a Bernoulli trial (the coin flip) is performed more than once there are a whole lot of questions that we could ask, such as:

- What is the chance that I get 6 successes when I try 10 times?
- How many times do I need to try when I want to have a chance of 90% to get 80 successes?
- On average how many times do I need to try to get the first success?
- What is the chance that I get between 2 and 4 successes when I try 8 times?
- When I try 100 times, the chance to get at least X successes is 72%, how big is X?

Each of these questions asks for a new (derived) probability, which is of course not the same as the original probability \(p\) that we get a success on that single coin flip.

## Binomial distribution

For some of these questions it's important to have a finite number of trials. After all, when flipping a coin an infinite number of times, the number of heads is ... also infinite? Not really a useful question for the more practical uses of statistics. Having a finite number of trials is one of the conditions that must be met to have a binomial distribution.

To understand why the binomial distribution is called the way it is, it's more useful to first start with the question we want to have an answer to. That question is “what is the probability of having \(k\) successes when doing \(n\) Bernoulli trials”. Let's answer some other questions and gradually work our way towards the main question. The first question being: with \(n = 10\) trials, what is the chance that the first trial was a success followed by nine failures?

\(p^1 * (1 - p)^9\)

For our original question we don't care about the position of the success trial, just that we want to calculate the chance for one success. Since there are 10 positions where the success can occur, we get:

\(10 * p^1 * (1 - p)^9\)

It turns out that generalizing to \(k\) successes instead of one we find the formula for the PMF, which is:

\({n\choose k} * p^k * (1-p)^{n-k}\)

Where \({n\choose k}\) is the binomial coefficient. If you are interested in why it's called “the binomial coefficient”, it's because it's the coefficient used (\(A\) in the formula below) when expanding a binomial (a polynomial of two terms).

\((x + y)^n = Ax^{n-0} y^0 + Ax^{n-1} y^1 + \dots + Ax^{n-n} y^n = \binom n0 x^{n-0} y^0 + \binom n1 x^{n-1} y^1 + \dots + \binom nn x^{n-n} y^n\)

So you need this distribution when you want to know the probability of “k out of n Bernoulli trials”, but since the binomial coefficient was in there they named it after that.
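The PMF translates almost literally into Python. The sketch below uses `math.comb` (available since Python 3.8) for the binomial coefficient and assumes a fair coin, \(p = 0.5\), since the questions above don't fix a value for \(p\). The function name `binomial_pmf` is made up for this post.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.5  # assumed: a fair coin

# one success in 10 tries, the same as 10 * p**1 * (1 - p)**9
one_success = binomial_pmf(1, 10, p)

# "What is the chance that I get 6 successes when I try 10 times?"
six_successes = binomial_pmf(6, 10, p)

# "What is the chance that I get between 2 and 4 successes when I try 8 times?"
between_2_and_4 = sum(binomial_pmf(k, 8, p) for k in range(2, 5))
```

Questions about a range of outcomes are answered by adding up the PMF over every outcome in that range.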

\(E[X] = np\)

Proving this by using the formula of the expected value and the PMF is a bit unwieldy (the part where it says “and the binomial theorem”). However the second part of the section on Wikipedia gives a more intuitive answer. It's similar to playing a lottery game twice with the same chance. If each game you expect to win 600, how much do you expect to win when playing twice? 1200 you say, and you would be right. This linearity is a basic property of the expected value. In the same way, you expect to win \(1*p + 0 * (1 - p) = p\) per “lottery” (one Bernoulli trial), so for \(n\) trials it will be \(p + p + \cdots + p\) (n times) \(= n * p\).

The variance adds up in a similar way, provided the trials are independent of each other (which Bernoulli trials are): the variance of a sum of independent random variables is the sum of their variances. The variance of one Bernoulli trial was \(p(1 - p)\), therefore the variance of the binomial distribution is \(np(1 - p)\).
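Instead of working through the proof, we can at least check \(np\) and \(np(1 - p)\) numerically from the PMF. The values \(n = 10\) and \(p = 0.3\) below are arbitrary choices for this sketch.

```python
from math import comb

n, p = 10, 0.3  # arbitrary example values

# PMF for every possible number of successes k = 0..n
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

# expected value and variance straight from their definitions
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))

# the closed-form results to compare against
mean_formula = n * p
variance_formula = n * p * (1 - p)
```

Up to floating point rounding, `mean` matches `mean_formula` and `variance` matches `variance_formula`.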

## Geometric distribution

When flipping a coin until the first success, the number of flips will be random (affected by a certain probability). The probability that the first success happens on the first trial is \(p\). The probability that it happens on the second trial is: a failure \(1 - p\) on the first trial followed by a success \(p\). Every time the success is postponed, another failure occurred first. For the first success on trial \(k\) we find the \(PMF = (1 - p)^{k-1}p\), where each failure adds a factor (ratio) of \(1 - p\). Euclid wrote about ratios in terms of geometric shapes, that's where this distribution got its name from.

\(E[X] = \frac{1}{p}\) read here how this formula was found and its explanation.

\(Var(X) = \frac{1 - p}{p^2}\) the proof for this is not trivial.

When writing this blog post I tried to explain the variance formula intuitively, but even that got rather complicated. I will just remark that the standard deviation \(\sqrt{\frac{1 - p}{p^2}} = \frac{\sqrt{1 - p}}{p}\) approaches the expected value \(\frac{1}{p}\) for small values of \(p\).
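Without going through the proofs, the formulas can be checked numerically. The sketch below assumes \(p = 0.2\) and cuts the infinite sum off after 2000 trials, where the remaining probability is negligible. It also compares the standard deviation with the expected value for a few values of \(p\). The names are made up for this post.

```python
from math import sqrt

def geometric_pmf(k, p):
    """Probability that the first success happens on trial k (k = 1, 2, ...)."""
    return (1 - p)**(k - 1) * p

p = 0.2  # assumed example value
ks = range(1, 2001)  # truncated: the tail beyond 2000 trials is negligible

mean = sum(k * geometric_pmf(k, p) for k in ks)                  # close to 1 / p
variance = sum((k - mean)**2 * geometric_pmf(k, p) for k in ks)  # close to (1 - p) / p**2

# standard deviation divided by expected value, for smaller and smaller p
ratios = {q: sqrt((1 - q) / q**2) / (1 / q) for q in (0.5, 0.1, 0.01)}
```

The ratio simplifies to \(\sqrt{1 - p}\), so it creeps towards 1 as \(p\) gets smaller.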

# Conclusion

We've seen some frequently used discrete probability distributions and their properties. Hopefully soon another blog post will use this knowledge to solve some interesting examples.