Introduction to Statistics: Probability

Author: Sarah Mancina

Before we can cover statistics, we first have to cover probability. Statistics helps us map from what we observe in our detector to the physical truth, while probability helps us map from the physical truth to what we could observe in our detector.

Data

Types of Data

To start with, we'll talk about the different types of data. This is pretty simple, but it's helpful to think about the basics of data to understand probability and statistics.

First we can break it down into quantitative and qualitative.

  • Quantitative: Numerical data. E.g. How tall are you? How many pairs of socks do you own?
  • Qualitative: Categorical data. E.g. What is your favorite color? Do you like pizza?

Qualitative data can be broken down into a couple of sub-categories:

  • Nominal: The data are linked to discrete labels which cannot be ordered. E.g. What is your favorite color? [Red, orange, yellow, green, blue, purple]
  • Binary: There are only two answers to the question. E.g. Do you like pizza? [no (0), yes(1)]
  • Ordinal: The labels can be ordered, but the difference in values is not meaningful beyond order. E.g. How much do you like pizza? [I hate it (0), I dislike it (1), I'm neutral (2), I like it (3), I love it (4)]

Quantitative data can be broken down into two sub-categories:

  • Discrete: Data that has distinct values and can be counted. E.g. How many pairs of socks do you own? [1, 2, 3, 4, ...]
  • Continuous: Data that is measured on a continuum. E.g. how tall are you? [5.10 ft, 5.15 ft, 5.20 ft, ...]

Displaying Data

How we display our data is a very important part of understanding and communicating our results. There are several ways to display data. Some tools are better suited for qualitative data and some for quantitative data. Common ways to display qualitative data are tables, bar charts, pie charts, and more. We will focus on quantitative data for the rest of this exercise.

Histograms

In physics we commonly use histograms to display our data. A histogram is a type of plot where the data is separated into discrete bins, and each bin is assigned a value proportional to the number of counts it contains. The value assigned to the bin is usually one of the three following values:

  • Counts per Bin: The total number of counts in the bin.
  • Relative Frequency: The normalized number of counts in each bin calculated through the following formula: $$f_i = \frac{n_i}{\sum^{bins}_{i}n_i}$$ where $f_i$ is the frequency for bin $i$ and $n_i$ is the number of counts in bin $i$.
  • Density: The frequency divided by the width of the bin as shown in the following formula: $$d_i = \frac{n_i}{w_i \sum^{bins}_{i}n_i}$$ where $d_i$ is the density for bin $i$, $n_i$ is the number of counts in bin $i$, and $w_i$ is the width of bin $i$. The density can be quite useful, especially when using uneven histogram bins. When using the density, the area of the bin, instead of the height, is proportional to the number of counts.
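To make the three conventions concrete, here is a minimal sketch (with an illustrative five-point dataset and uneven bins, not data from the exercises below) computing all three values by hand with numpy.histogram:

```python
import numpy as np

#toy dataset and uneven bin edges (illustrative values)
data = np.array([0.1, 0.2, 0.25, 0.6, 0.9])
edges = np.array([0.0, 0.5, 0.75, 1.0])

counts, _ = np.histogram(data, bins = edges) #counts per bin
freq = counts / counts.sum()                 #relative frequency
widths = np.diff(edges)
density = freq / widths                      #density

print(counts)                   #[3 1 1]
print(freq)                     #[0.6 0.2 0.2]
print((density * widths).sum()) #bin areas sum to 1
```

Note that the frequencies sum to 1, while for the density it is the bin areas (height times width) that sum to 1.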

To create and plot histograms in python we can use a couple of different tools from numpy and matplotlib. To read more about these objects you can go to their documentation sites here: numpy.histogram, numpy.histogram2d, matplotlib.pyplot.hist, and matplotlib.pyplot.hist2d.

In [1]:
#import relevant libraries
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

#change pyplot defaults to make plots easier to read
plt.rc('font', family = 'sans-serif', size = 16)
plt.rc('legend', fontsize = 16)
plt.rc('axes', labelsize = 24)
plt.rc('axes', titlesize = 24)
plt.rc('axes', linewidth = 1.5)
plt.rc('xtick', direction = 'in', labelsize = 16)
plt.rc('ytick', direction = 'in', labelsize = 16)

#jupyter notebook magic
%matplotlib inline

Let's generate some random data to histogram using numpy.random.uniform. This will draw random data points from a uniform distribution (a distribution where all points from the low value to the high value have the same probability) which we will discuss later.

In [2]:
#get 1000 random data points from a uniform (flat) distribution from 0 to 1
x = np.random.uniform(low = 0.0, high = 1.0, size = 1000)

#make a plot of these random variables
fig = plt.gcf() #grab current figure
fig.set_size_inches(8,8) #change size of figure to make larger

#plot a histogram from the data in the numpy array x
plt.hist(x)

#add labels to figure
plt.xlabel('x')
plt.ylabel('Counts per Bin')
plt.show()

This is the basic histogram plotted by matplotlib.pyplot.hist.

An important parameter in histogramming is the binning. If the bins are too thin/too numerous, then each bin will be subject to statistical fluctuations from the small number of events it contains. If the bins are too wide/too few, then the bins might lose some of the detail that the creator might want to show.

Most python histogramming tools can take either a number or an array of bin edges for the bins argument. If given a number, the histogram will bin the data points into that number of bins of even width from the minimum value in the array to the maximum value in the array. If given an array of bin edges, the histogram will bin the data points into the bins defined by those edges.
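A short sketch of the two calling conventions (the toy array here is illustrative):

```python
import numpy as np

data = np.array([0.05, 0.3, 0.35, 0.7, 0.95]) #illustrative toy array

#integer: 5 equal-width bins spanning min(data) to max(data)
counts_int, edges_int = np.histogram(data, bins = 5)

#array of edges: bins defined explicitly, independent of the data range
counts_arr, edges_arr = np.histogram(data, bins = np.linspace(0, 1, 6))

print(edges_int) #runs from 0.05 to 0.95
print(edges_arr) #[0.  0.2 0.4 0.6 0.8 1. ]
```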

In [3]:
#make a plot of these random variables
fig, axes = plt.subplots(nrows = 2 , ncols = 2) #grab figure and axes to make a subplot
fig.set_size_inches(16,16) #change size of figure to make larger


#plot a histogram from the data in the numpy array x and use 10 bins
axes[0][0].hist(x, bins= 10)

#add labels to figure
axes[0][0].set_xlabel('x')
axes[0][0].set_ylabel('Counts per Bin')
axes[0][0].set_title('10 Bins')

#plot a histogram from the data in the numpy array x and use 50 bins
axes[0][1].hist(x, bins= 50)

#add labels to figure
axes[0][1].set_xlabel('x')
axes[0][1].set_ylabel('Counts per Bin')
axes[0][1].set_title('50 Bins')

#plot a histogram from the data in the numpy array x and use 5 bins
axes[1][0].hist(x, bins= 5)

#add labels to figure
axes[1][0].set_xlabel('x')
axes[1][0].set_ylabel('Counts per Bin')
axes[1][0].set_title('5 Bins')

#plot a histogram from the data in the numpy array x and using set bin edges
bin_edges = np.linspace(0, 1., 10 + 1) #make array of 11 bin edges from 0 to 1 
axes[1][1].hist(x, bins= bin_edges)

#add labels to figure
axes[1][1].set_xlabel('x')
axes[1][1].set_ylabel('Counts per Bin')
axes[1][1].set_title('Array of Bin Edges')

plt.show()

The pyplot.hist function is good for plotting the histogram; however, if you want to manipulate the output of a histogram before you plot it, you can use the returned values from numpy.histogram. The objects returned are an array of the number of counts in each bin and an array of the bin edges. Let's take the plots above and plot the relative frequency instead of counts.

In [4]:
#make a plot of these random variables
fig, axes = plt.subplots(nrows = 2 , ncols = 2) #grab figure and axes to make a subplot
fig.set_size_inches(16,16) #change size of figure to make larger


#plot a histogram from the data in the numpy array x and use 10 bins
n, edges = np.histogram(x, bins = 10)
freq = n / float(len(x)) #divide n by total number of data events
x_mids = (edges[:-1] + edges[1:])/2. #calculate the middle values of each bin to give to the plot

#to plot the frequency we give the frequency of each bin as a weight for the mids
axes[0][0].hist(x_mids, weights = freq, bins = edges)

#add labels to figure
axes[0][0].set_xlabel('x')
axes[0][0].set_ylabel('Relative Frequency')
axes[0][0].set_title('10 Bins')

#plot a histogram from the data in the numpy array x and use 50 bins
n, edges = np.histogram(x, bins = 50)
freq = n / float(len(x)) #divide n by total number of data events
x_mids = (edges[:-1] + edges[1:])/2. #calculate the middle values of each bin to give to the plot

#to plot the frequency we give the frequency of each bin as a weight for the mids
axes[0][1].hist(x_mids, weights = freq, bins = edges)

#add labels to figure
axes[0][1].set_xlabel('x')
axes[0][1].set_ylabel('Relative Frequency')
axes[0][1].set_title('50 Bins')

#plot a histogram from the data in the numpy array x and use 5 bins
n, edges = np.histogram(x, bins = 5)
freq = n / float(len(x)) #divide n by total number of data events
x_mids = (edges[:-1] + edges[1:])/2. #calculate the middle values of each bin to give to the plot

#to plot the frequency we give the frequency of each bin as a weight for the mids
axes[1][0].hist(x_mids, weights = freq, bins = edges)


#add labels to figure
axes[1][0].set_xlabel('x')
axes[1][0].set_ylabel('Relative Frequency')
axes[1][0].set_title('5 Bins')

#plot a histogram from the data in the numpy array x and using set bin edges
bin_edges = np.linspace(0, 1., 10 + 1) #make array of 11 bin edges from 0 to 1 
n, edges = np.histogram(x, bins = bin_edges)
freq = n / float(len(x)) #divide n by total number of data events
x_mids = (edges[:-1] + edges[1:])/2. #calculate the middle values of each bin to give to the plot

#to plot the frequency we give the frequency of each bin as a weight for the mids
axes[1][1].hist(x_mids, weights = freq, bins = edges)


#add labels to figure
axes[1][1].set_xlabel('x')
axes[1][1].set_ylabel('Relative Frequency')
axes[1][1].set_title('Array of Bin Edges')

plt.show()

The scale of the relative frequency is dependent on the size of the bins. This is not the case for the density. Density can be a useful tool since integrals of the histogram can be related to probability; we'll touch on that later.

Exercise: plot the density of the data array above using 10 bins from 0 to 1. There are two simple ways to do this, one using numpy.histogram and one using matplotlib.pyplot.hist. If you are stuck, check the documentation for each of the histogram tools.

In [5]:
###SOLUTION

#----SOLUTION 1----
fig = plt.gcf() #grab current figure
fig.set_size_inches(8,8) #change size of figure to make larger

#plot a histogram from the data in the numpy array x and using set bin edges
bin_width = 0.1
bin_edges = np.arange(0, 1.01, bin_width) #make array of 11 bin edges from 0 to 1
n, edges = np.histogram(x, bins = bin_edges)
density = (n / float(len(x))) / bin_width #divide n by total number of data events and the bin width
x_mids = (edges[:-1] + edges[1:])/2. #calculate the middle values of each bin to give to the plot

#to plot the frequency we give the frequency of each bin as a weight for the mids
plt.hist(x_mids, weights = density, bins = edges)


#add labels to figure
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Manipulate Numpy Histogram')

plt.show()

#----SOLUTION 2----
fig = plt.gcf() #grab current figure
fig.set_size_inches(8,8) #change size of figure to make larger

#plot a histogram from the data in the numpy array x and using set bin edges
bin_edges = np.linspace(0, 1., 10 + 1) #make array of 11 bin edges from 0 to 1 
plt.hist(x, bins = bin_edges, density = True)

#add labels to figure
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Using Pyplot.Hist Parameters')

plt.show()

###END OF SOLUTION

Descriptive Statistics

There are a couple of different statistics you can draw from a data set to help describe it. The most common are below:

  • Arithmetic Mean: $\bar{x} = \frac{1}{N}\sum_{i = 1}^N x_i$
  • Median: The value lying at the midpoint of a distribution of observed values where there is equal probability of lying above or below.
  • Standard Deviation: Measure of the spread $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2}$
  • Correlation Coefficient: Measures the linear relationship between two variables $\rho = \frac{\overline{xy} - \bar{x}\bar{y}}{\sigma_x\sigma_y}$. $\rho = 0$ means there is no linear correlation, $\rho = 1$ means a perfect positive linear correlation, and $\rho = -1$ means a perfect negative linear correlation.
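All four of these statistics are available directly in numpy; a minimal sketch on illustrative arrays:

```python
import numpy as np

a = np.array([1., 2., 3., 4., 5.]) #toy data
b = 2*a                            #perfectly linearly related to a

print(np.mean(a))   #arithmetic mean: 3.0
print(np.median(a)) #median: 3.0
print(np.std(a))    #population standard deviation (the 1/N formula above)
print(np.corrcoef(a, b)[0, 1]) #correlation coefficient: 1.0
```

Note that np.std uses the $1/N$ (population) formula by default; pass ddof = 1 for the sample standard deviation.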

Exercise: Calculate the standard deviation of x, the uniformly distributed sample above. Use numpy methods to help simplify your code.

In [6]:
##SOLUTION

print(np.sqrt(np.sum((x - np.mean(x))**2)/len(x)))

##END OF SOLUTION
0.2829175581034826

Probability

Mathematical Probability

An experiment is any activity or process where the outcome is subject to uncertainty. The sample space is the set of all possible outcomes of an experiment. An event is a set of outcomes in the sample space.

Some notation for events:

  • The complement of event $A$, $\bar{A}$, is the set of all outcomes $a$ in the sample space $S$ where $a$ is not in $A$.
  • The union of events $A$ and $B$, $A \cup B$, is the set of all $a$ in the sample space $S$ such that $a$ is in $A$ or in $B$.
  • The intersection of events $A$ and $B$, $A \cap B$, is the set of all $a$ in the sample space $S$ such that $a$ is in both $A$ and $B$.

The foundation of probability is the set of Kolmogorov axioms, which are as follows.

  1. For any event $A$ in the sample space $S$, $$P(A)\geq0$$
  2. $P(S) = 1$; equivalently, $P(A) + P(\bar{A}) = 1$ where $\bar{A}$ is the complement of $A$ in $S$
  3. For mutually exclusive events, $A_1$, $A_2$, ... in $S$, $$P(A_1 \cup A_2 \cup ...) = \sum_{i=1}^{\infty}P(A_i)$$

Frequentist Probability

For this lecture, we will only cover the Frequentist definitions of probability and statistics. However, there is also Bayesian statistics which is a useful tool.

The frequentist definition of probability is: $$P(X) = \lim_{N\to\infty} \frac{n}{N},$$ where $n$ is the number of times outcome $X$ occurred over $N$ trials. The law of large numbers says that as $N$ approaches infinity in an experiment, the measurement of $P(X)$ approaches its true value.

Let's test this by simulating a simple dice roll.

In [7]:
print('True probability of rolling 1 for a fair dice:',1/6)

sizes = [10, 100, 1000, 10000, 100000, 1000000]
for size in sizes:
    x = np.random.randint(1,7,size) #Randomly draw values of 1 through 6
    p_1 = len(np.where(x == 1)[0])/float(size) #Calculate the frequentist probability estimate
    print('Calculated probability of rolling 1 for',size,'trials:',p_1)
True probability of rolling 1 for a fair dice: 0.16666666666666666
Calculated probability of rolling 1 for 10 trials: 0.0
Calculated probability of rolling 1 for 100 trials: 0.2
Calculated probability of rolling 1 for 1000 trials: 0.16
Calculated probability of rolling 1 for 10000 trials: 0.1588
Calculated probability of rolling 1 for 100000 trials: 0.16801
Calculated probability of rolling 1 for 1000000 trials: 0.166631

Probability Distributions

Sometimes variables have a well-defined probability distribution. If the variable is discrete, its distribution is called a Probability Mass Function (PMF); if the variable is continuous, its distribution is called a Probability Density Function (PDF). For a PMF, $$\sum_{i=0}^{max} PMF(x_i) = 1$$ and for a PDF, $$\int_{min}^{max} PDF(x)dx = 1.$$ We will go through some of the most common PMFs and PDFs.

We can also use probability distributions to find an expected value. For example, if we wanted to find the mean, the expected value of $x$, we would use the following formulas: $$\bar{x} = \sum_{i=min}^{max} x_i PMF(x_i)$$ and $$\bar{x} = \int_{min}^{max} x PDF(x)dx.$$ If we wanted to find the expected value of any function of $x$, $g(x)$, we can use the same method: $$\overline{g(x)} = \sum_{i=min}^{max} g(x_i) PMF(x_i)$$ and $$\overline{g(x)} = \int_{min}^{max} g(x) PDF(x)dx.$$
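As a concrete case, the expected value of a fair six-sided die follows directly from its PMF (each face has probability 1/6):

```python
import numpy as np

faces = np.arange(1, 7) #outcomes of a fair six-sided die
pmf = np.full(6, 1/6)   #uniform PMF

mean = np.sum(faces * pmf)       #E[x] = 3.5
mean_sq = np.sum(faces**2 * pmf) #E[x^2], the expectation of g(x) = x^2
print(mean, mean_sq)
```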

The cumulative distribution function (CDF) gives, for a given value $r$, the probability of all the outcomes $\leq r$: $$CDF(r) = \sum_{i=0}^{r} PMF(x_i)$$ or $$CDF(r) = \int_{min}^{r} PDF(x)dx.$$
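For the fair-die PMF, the cumulative distribution is just a running sum, which numpy computes with cumsum:

```python
import numpy as np

pmf = np.full(6, 1/6) #fair die PMF
cdf = np.cumsum(pmf)  #P(roll <= r) for r = 1..6

print(cdf)    #rises from ~0.167 to 1
print(cdf[2]) #P(roll <= 3) = 0.5
```

The last entry of any CDF is 1, by the normalization of the PMF.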

Binomial Distribution

A binomial distribution is a PMF which describes the probability of the number of successes in a sequence of independent experiments with a binary outcome (Bernoulli trials). An example would be rolling a die and calling a roll of 1 a success and anything else a failure. Let's use this example to build the binomial distribution.

If you roll a die 5 times, what is the probability that you roll a 1 twice? One way to roll a 1 twice is to roll the 1s in the first two rolls: $$P(\text{1 & 1 & not 1 & not 1 & not 1}) = 1/6 \times 1/6 \times 5/6 \times 5/6 \times 5/6.$$ But this isn't the only way to roll a 1 twice. Luckily, no matter which order the two 1s come in over the five rolls, the probability for each combination is the same. We must include the combinatorics to calculate the total probability: $$P(\text{Roll 1 twice}) = {5\choose 2}(1/6)^2 (5/6)^3$$ where ${5\choose 2} = \frac{5!}{2!3!}$.
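We can verify this arithmetic directly (math.comb requires Python 3.8+):

```python
from math import comb

p_one_order = (1/6)**2 * (5/6)**3  #one particular ordering of the rolls
p_total = comb(5, 2) * p_one_order #all orderings of two 1s in five rolls

print(comb(5, 2)) #10
print(p_total)    #about 0.161
```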

The general form of the Binomial distribution is as follows: $$P(k) = {n\choose k}p^k(1-p)^{n-k} \text{, where } {n\choose k} = \frac{n!}{k!(n-k)!}.$$ Where $k$ is the number of successes, $n$ is the number of trials, and $p$ is the probability of a single success.

We can find the moments of a distribution by solving for the expectation values. For the binomial distribution, the mean is $np$ and the standard deviation is $\sqrt{np(1-p)}$.
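A quick sanity check of these moments against a large random sample (the sample size and parameters here are arbitrary):

```python
import numpy as np

n, p = 20, 0.3 #arbitrary example parameters
sample = np.random.binomial(n, p, size = 200000)

print(sample.mean(), n*p)                 #both near 6.0
print(sample.std(), np.sqrt(n*p*(1 - p))) #both near 2.05
```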

Exercise: You are doing a magic trick with a standard deck of cards with no jokers. What is the probability, after repeating the trick 5 times, that at least one ace was drawn by the participant?
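One possible solution sketch, assuming the drawn card is replaced and the deck reshuffled between repetitions, so each trick is an independent Bernoulli trial with success probability $p = 4/52$:

```python
p_ace = 4/52            #probability the participant draws an ace in one trick
p_none = (1 - p_ace)**5 #binomial P(k = 0) for n = 5 trials
p_at_least_one = 1 - p_none

print(p_at_least_one) #about 0.33
```

Computing the complement, $1 - P(\text{no aces})$, avoids summing the binomial PMF over $k = 1$ through $5$.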

Poisson Distribution

The Poisson distribution is a PMF which gives the probability of a number of events occurring. The form is as follows: $$P(k) = \frac{e^{-\lambda}\lambda^k}{k!}$$ where $\lambda$ is the average number of events that occur. The mean of the distribution is therefore $\lambda$ and the standard deviation is $\sqrt{\lambda}$.
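As with the binomial case, these moments can be checked against a large random sample ($\lambda = 9$ here is arbitrary):

```python
import numpy as np

lam = 9.0 #arbitrary example rate
sample = np.random.poisson(lam = lam, size = 200000)

print(sample.mean(), lam)         #both near 9.0
print(sample.std(), np.sqrt(lam)) #both near 3.0
```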

Exercise: At a bus stop, buses arrive at an average rate of one bus per 20 minutes. What is the probability that exactly 5 buses come within an hour?
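One possible solution sketch: an average of one bus per 20 minutes gives $\lambda = 3$ buses per hour, and we evaluate the Poisson PMF at $k = 5$:

```python
from math import exp, factorial

lam = 3 #one bus per 20 minutes -> 3 buses per hour
k = 5
p = exp(-lam) * lam**k / factorial(k)

print(p) #about 0.10
```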

Normal Distribution

The normal distribution is one of the most common PDFs (partially because of the Central Limit Theorem). It is often a good assumption that measurements are normally distributed. The normal distribution is also known as the Gaussian distribution and has the following form: $$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x - \mu)^2/(2\sigma^2)}$$ where $\mu$ is the mean of the distribution and $\sigma$ is the standard deviation. The normal distribution has a characteristic bell-curve shape which is peaked at $\mu$, and its width is set by $\sigma$.
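As a quick check, the PDF can be coded directly and its normalization verified by numerical integration (the $\mu$ and $\sigma$ values here are arbitrary):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    #the Gaussian PDF written out directly
    return np.exp(-(x - mu)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)

#numerically integrate over +/- 5 sigma (mu and sigma are arbitrary here)
x = np.linspace(-10, 10, 10001)
dx = x[1] - x[0]
area = np.sum(normal_pdf(x, 0.0, 2.0)) * dx
print(area) #close to 1
```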

Exercise: Show that for large $n$ or large $\lambda$, the binomial or Poisson distribution can be approximated by a normal distribution. Hint: use np.random.normal, np.random.poisson, and np.random.binomial. Hint 2: What should you use for the standard deviation of the normal distribution?

In [20]:
def plot_distribution(dist_1, dist_2, bins = 20, labels = ['Binomial', 'Normal']):
    #make a plot comparing two distributions
    fig = plt.gcf() #grab current figure
    fig.set_size_inches(8,8) #change size of figure to make larger
    
    plt.hist(dist_1, bins = bins, 
             linewidth = 2, histtype = 'step', color = 'dodgerblue',
             density = True, label = labels[0])
    plt.hist(dist_2, bins = bins, 
             linewidth = 2, histtype = 'step', color = 'darkorange',
             density = True, label = labels[1])
    
    
### SOLUTION 1 ###
ntrials = np.linspace(1, 51, 5 + 1).astype(int) #binomial n must be an integer
p = 0.25
for n in ntrials:
    binom = np.random.binomial(n, p, size = 1000000)
    norm  = np.random.normal(loc = n*p, scale = np.sqrt(n*p*(1-p)), size= 1000000)
    
    minbin = n*p - 5*np.sqrt(n*p*(1-p))
    maxbin = n*p + 5*np.sqrt(n*p*(1-p))
    bins = np.linspace(minbin, maxbin, int(maxbin - minbin) + 1)
    
    plot_distribution(binom, norm, bins = bins, labels = ['Binomial', 'Normal'])
    
    plt.legend()
    plt.title(r'n = {:.0f}'.format(n))
    plt.xlabel('Variable')
    plt.ylabel('Density')
    plt.show()
### END OF SOLUTION ###
    
    
### SOLUTION 2 ###
lambdas = np.linspace(1, 51, 5 + 1) 
for l in lambdas:
    pois = np.random.poisson(lam = l, size = 1000000)
    norm = np.random.normal(loc = l, scale = np.sqrt(l), size= 1000000)
    minbin = l - 5*np.sqrt(l)
    maxbin = l + 5*np.sqrt(l)
    bins = np.linspace(minbin, maxbin, int(np.ceil(10*np.sqrt(l))) + 1)
    
    plot_distribution(pois, norm, bins = bins, labels = ['Poisson', 'Normal'])
    
    plt.legend()
    plt.title(r'$\lambda$ = {:.1f}'.format(l))
    
    plt.xlabel('Variable')
    plt.ylabel('Density')
    plt.show()
### END OF SOLUTION ###

Exercise: Making a Monte Carlo Simulation

A common tool in research is simulated data, often referred to as Monte Carlo (MC). MC relies on probability to create a fake sampling of data.

Let's create a sample of neutrinos. We'll assume that the expected number of neutrinos is 100, that the neutrinos are isotropically distributed in direction, and that the neutrino energies follow a normal distribution in log10(Energy) with a mean of 1 and a standard deviation of 2.

In [29]:
N = np.random.poisson(lam = 100, size = 1)[0]

cos_zeniths = np.random.uniform(-1, 1, size = N) #isotropic means uniform in cos(zenith)
zeniths  = np.arccos(cos_zeniths)
azimuths = np.random.uniform(0, 2*np.pi, size = N)

logE = np.random.normal(loc = 1, scale = 2, size = N)
energies = np.power(10,logE)

print('Number of Events: ',N)

fig = plt.gcf() #grab current figure
fig.set_size_inches(8,8) #change size of figure to make larger

plt.hist(np.degrees(zeniths), bins = np.linspace(0, 180, 5 + 1),
         histtype = 'step')
plt.ylabel('Counts per Bin')
plt.xlabel(r'Zeniths ($^\circ$)')

plt.show()


fig = plt.gcf() #grab current figure
fig.set_size_inches(8,8) #change size of figure to make larger

plt.hist(np.degrees(azimuths), bins = np.linspace(0, 360, 5 + 1),
         histtype = 'step')
plt.ylabel('Counts per Bin')
plt.xlabel(r'Azimuths ($^\circ$)')

plt.show()


fig = plt.gcf() #grab current figure
fig.set_size_inches(8,8) #change size of figure to make larger

plt.hist(energies, bins = np.logspace(-5, 6, 5 + 1),
         histtype = 'step')
plt.ylabel('Counts per Bin')
plt.xlabel('Energies')
plt.semilogx()

plt.show()
Number of Events:  83