$\pagebreak$

## Overview

* **Statistical Inference** = generating conclusions about a population from a noisy sample
    * goal = extend beyond the data to the population
    * statistical inference = the only formal system of inference we have
* many different modes, but **two** broad flavors of inference (inferential paradigms): ***Bayesian*** vs ***Frequentist***
* **Frequentist** = uses the long-run proportion of times an event occurs in independent, identically distributed repetitions
    * frequentist inference is what this class focuses on
    * believes that if an experiment is repeated many, many times, the resulting percentage of successes/of something happening defines that population parameter
* **Bayesian** = probability estimate for a hypothesis is updated as additional evidence is acquired
* **statistic** = number computed from a sample of data
    * statistics are used to infer information about a population
* **random variable** = outcome from an experiment
    * deterministic processes (variances/means) applied to random variables produce additional random variables, which have their own distributions

## Probability

* **Probability** = the study of quantifying the likelihood of particular events occurring
* given a random experiment, ***probability*** = a population quantity that summarizes the randomness
    * not in the data at hand, but a conceptual quantity that exists in the population and that we want to estimate

### General Probability Rules

* formalized by the Russian mathematician Kolmogorov; also known as "Probability Calculus"
* probability = a function that takes any set of outcomes and assigns it a number between 0 and 1
    * $0 \le P(E) \le 1$, where $E$ = event
* probability that nothing occurs = 0 (impossible; you have to roll the dice to create an outcome); probability that something occurs = 1 (certain)
* probability of an outcome or event $E$, $P(E)$ = ratio of the number of ways $E$ can occur to the number of all possible outcomes or events
* probability of something = 1 - probability of the opposite occurring
* probability of the **union** of any two sets of outcomes that have nothing in common (mutually exclusive) = sum of their respective probabilities

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
library(grid); library(png)
grid.raster(readPNG("figures/1.png"))
```

* if $A$ implies the occurrence of $B$, then $P(A) \le P(B)$

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/2.png"))
```

* for any two events, the probability that at least one occurs = the sum of their probabilities minus the probability of their intersection (in other words, probabilities cannot simply be added when the events have a non-trivial intersection)

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/3.png"))
```

* for independent events $A$ and $B$, $P(A \:\cap\: B) = P(A) \times P(B)$
* for an outcome that can occur through different combinations of events, where the combinations are mutually exclusive, $P(E_{total}) = \sum P(E_{part})$

### Conditional Probability

* let $B$ = an event so that $P(B) > 0$
* **conditional probability** of an event $A$, given $B$, is defined as the probability that both $A$ and $B$ occur divided by the probability that $B$ occurs $$P(A\:|\:B) = \frac{P(A \:\cap\: B)}{P(B)}$$
* if $A$ and $B$ are ***independent***, then $$P(A\:|\:B) = \frac{P(A)P(B)}{P(B)} = P(A)$$
* ***example***
    * for a die roll, $A = \{1\}$, $B = \{1, 3, 5\}$, then $$P(1~|~Odd) = P(A\:|\:B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3}$$
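* ***Note**: a minimal simulation sketch (not part of the original notes) to check the die-roll result empirically — only base R `sample()` is used*

```{r}
# simulate many die rolls and estimate P(roll = 1 | roll is odd)
set.seed(42)
rolls <- sample(1:6, 100000, replace = TRUE)
odd <- rolls[rolls %% 2 == 1]
# proportion of odd rolls that are 1; should be close to 1/3
mean(odd == 1)
```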
### Bayes' Rule

* definition $$P(B\:|\:A) = \frac{P(A\:|\:B)P(B)}{P(A\:|\:B)P(B)+P(A\:|\:B^c)P(B^c)}$$ where $B^c$ = the complement of event $B$, with $P(B^c) = 1 - P(B)$

$\pagebreak$

## Random Variables

* **random variable** = numeric outcome of an experiment
* **discrete** (what you can count/categories) = assign probabilities to every number/value the variable can take
    * coin flip, rolling a die, web traffic in a day
* **continuous** (any number within a continuum) = assign probabilities to the ranges the variable can take
    * BMI index, intelligence quotients
    * ***Note**: limitations of precision in taking the measurements may imply that the values are discrete, but we in fact consider them continuous*
* `rbinom()`, `rnorm()`, `rgamma()`, `rpois()`, `runif()` = functions to generate random variables from the binomial, normal, Gamma, Poisson, and uniform distributions
* density and mass functions (population quantities, not what occurs in the data) for random variables = best starting point to model/think about probabilities for the numeric outcomes of experiments (variables)
    * use data to estimate properties of the population $\rightarrow$ linking sample to population

### Probability Mass Function (PMF)

* evaluates the probability that the **discrete random variable** takes on a specific value - measures the chance of a particular outcome happening
* always $\ge$ 0 for every possible outcome
* sum over all possible values that the variable can take = 1
* ***Bernoulli distribution example***
    * $X = 0$ $\rightarrow$ tails, $X = 1$ $\rightarrow$ heads
    * $X$ here represents a potential outcome
    * $P(X = x) = (\frac{1}{2})^x(\frac{1}{2})^{1-x}$ for $x = 0, 1$
    * $x$ here represents a value we can plug into the PMF
    * general form $\rightarrow$ $p(x) = (\theta)^x(1-\theta)^{1-x}$
* `dbinom(k, n, p)` = returns the probability of getting `k` successes out of `n` trials, given probability of success `p`

### Probability Density Function (PDF)

* assigns a density (not a probability) to each value the **continuous random variable** can take
* always $\ge$ 0 everywhere
* total area under the curve must = 1
* **areas under PDFs** correspond to the probabilities of the random variable taking on a range of values

```{r echo = FALSE, fig.width = 2, fig.height = 2, fig.align = 'center'}
grid.raster(readPNG("figures/4-1.png"))
```

* but the probability of the variable taking any one specific value = 0 (the area of a line is 0)

```{r echo = FALSE, fig.width = 3, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/4-2.png"))
```

* ***Note**: the above is true because we model random variables as if they have infinite precision, when in reality they do not*
* `dnorm()`, `dgamma()`, `dpois()`, `dunif()` = return the value of the density function at a given point for the normal, Gamma, Poisson, and uniform distributions

### Cumulative Distribution Function (CDF)

* CDF of a random variable $X$ = probability that the random variable is $\le$ a value $x$
* $F(x) = P(X \le x)$ — applies whether $X$ is discrete or continuous
* PDF = derivative of the CDF
* integrate the PDF $\rightarrow$ CDF
* `integrate(function, lower = 0, upper = 1)` $\rightarrow$ can be used to evaluate integrals over a specified range
* `pbinom()`, `pnorm()`, `pgamma()`, `ppois()`, `punif()` = return the cumulative probability up to a specified value for the binomial, normal, Gamma, Poisson, and uniform distributions

### Survival Function

* survival function of a random variable $X$ = probability that the random variable $> x$; the complement of the CDF
* $S(x) = P(X > x) = 1 - F(x)$, where $F(x) =$ CDF
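* ***Note**: a minimal sketch (not in the original notes) tying the PDF, CDF, survival function, and `integrate()` together, using the density $f(x) = 2x$ on $[0, 1]$ — this is the Beta(2, 1) density that reappears in the quantile section below*

```{r}
# density f(x) = 2x on [0, 1], i.e. the Beta(2, 1) density
f <- function(x) 2 * x
# total area under the PDF should be 1
integrate(f, lower = 0, upper = 1)$value
# CDF at 0.5: P(X <= 0.5) = 0.5^2 = 0.25, matching pbeta
integrate(f, lower = 0, upper = 0.5)$value
pbeta(0.5, 2, 1)
# survival function at 0.5 is the complement of the CDF
1 - pbeta(0.5, 2, 1)
```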
### Quantile

* the $\alpha^{th}$ quantile of a distribution with distribution function $F$ = the point $x_{\alpha}$ such that $$F(x_{\alpha}) = \alpha$$
* percentile = quantile with $\alpha$ expressed as a percent
* median = 50^th^ percentile
* $100\alpha\%$ of the possible outcomes lie below $x_{\alpha}$

```{r echo = FALSE, fig.width = 3, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/5.png"))
```

* `qbeta(quantileInDecimals, 2, 1)` = returns quantiles for the beta distribution
    * works similarly for `qnorm()`, `qbinom()`, `qgamma()`, `qpois()`, etc.
* a median estimated in this fashion = a population median
    * the probability model connects the data to the population using assumptions
    * population median = ***estimand***, sample median = ***estimator***

### Independence

* two events $A$ and $B$ are ***independent*** if the following is true
    * $P(A\:\cap\:B) = P(A)P(B)$
    * $P(A\:|\:B) = P(A)$
* two random variables $X$ and $Y$ are ***independent*** if, for any two sets $A$ and $B$, the following is true
    * $P([X \in A]\cap[Y \in B]) = P(X \in A)P(Y \in B)$
* **independence** = statistically unrelated to one another
* if $A$ is ***independent*** of $B$, then the following are true
    * $A^c$ is independent of $B$
    * $A$ is independent of $B^c$
    * $A^c$ is independent of $B^c$

### IID Random Variables

* random variables are said to be **IID** if they are ***independent and identically distributed***
    * **independent** = statistically unrelated to each other
    * **identically distributed** = all having been drawn from the same population distribution
* IID random variables = default model for random samples = default starting point of inference

$\pagebreak$

## Diagnostic Test

* let $+$ and $-$ be the results, positive and negative respectively, of a diagnostic test
* let $D$ = the event that the subject of the test has the disease, $D^c$ = the subject does not
* **sensitivity** = $P(+\:|\:D)$ = probability that the test is positive given that the subject has the disease (the higher the better)
* **specificity** = $P(-\:|\:D^c)$ = probability that the test is negative given that the subject does not have the disease (the higher the better)
* **positive predictive value** = $P(D\:|\:+)$ = probability that the subject has the disease given that the test is positive
* **negative predictive value** = $P(D^c\:|\:-)$ = probability that the subject does not have the disease given that the test is negative
* **prevalence of disease** = $P(D)$ = marginal probability of disease

### Example

* specificity of 98.5%, sensitivity of 99.7%, prevalence of disease of 0.1% $$\begin{aligned} P(D ~|~ +) & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ & = \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ & = .062 \end{aligned}$$
* low positive predictive value $\rightarrow$ due to the low prevalence of disease and the somewhat modest specificity
* suppose it was known that the subject uses drugs and has regular intercourse with an HIV-infected partner $\rightarrow$ his pre-test probability of having the disease is higher than the general prevalence
    * the positive test result should then be weighed as evidence on top of that higher pre-test probability
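* ***Note**: a minimal sketch (not in the original notes) of the positive predictive value calculation above, using the same sensitivity, specificity, and prevalence*

```{r}
# positive predictive value from sensitivity, specificity, and prevalence
sens <- 0.997; spec <- 0.985; prev <- 0.001
(sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
```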
### Likelihood Ratios

* the **diagnostic likelihood ratio** of a **positive** test result is defined as $$DLR_+ = \frac{sensitivity}{1-specificity} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)}$$
* the **diagnostic likelihood ratio** of a **negative** test result is defined as $$DLR_- = \frac{1 - sensitivity}{specificity} = \frac{P(-\:|\:D)}{P(-\:|\:D^c)}$$
* from Bayes' rule, we can derive the *positive predictive value* and the *false positive value* $$P(D\:|\:+) = \frac{P(+\:|\:D)P(D)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(1)}$$ $$P(D^c\:|\:+) = \frac{P(+\:|\:D^c)P(D^c)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(2)}$$
* dividing equation $(1)$ by equation $(2)$, the common denominators cancel and we get the following $$\frac{P(D\:|\:+)}{P(D^c\:|\:+)} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)} \times \frac{P(D)}{P(D^c)}$$ which can also be written as $$\mbox{post-test odds of D} = DLR_+ \times \mbox{pre-test odds of D}$$
* **odds** = $p/(1-p)$
* $\frac{P(D)}{P(D^c)}$ = **pre-test odds**, or the odds of disease in the absence of the test
* $\frac{P(D\:|\:+)}{P(D^c\:|\:+)}$ = **post-test odds**, or the odds of disease given a positive test result
* $DLR_+$ = factor by which the pre-test odds are multiplied in the presence of a positive test to obtain the post-test odds
* $DLR_-$ = relates the decrease in the odds of disease after a negative result
* following the previous example, with sensitivity of 0.997 and specificity of 0.985, the diagnostic likelihood ratios are $$DLR_+ = .997/(1-.985) \approx 66 ~~~~~~ DLR_- = (1-.997)/.985 \approx 0.003$$
* this indicates that after a positive test result, the odds of disease are 66 times the pre-test odds

$\pagebreak$

## Expected Values/Mean

* useful for characterizing a distribution (properties of distributions)
* **mean** = characterization of the center of the distribution = *expected value*
    * the expected value operation is ***linear*** $\rightarrow$ $E(aX + bY) = aE(X) + bE(Y)$
* **variance/standard deviation** = characterization of how spread out the distribution is
* _sample_ expected values (sample mean and variance) estimate the _population_ counterparts
* **population mean**
    * expected value/mean of a random variable = center of its distribution (center of mass)
    * ***discrete variables***
        * for $X$ with PMF $p(x)$, the population mean is defined as $$E[X] = \sum_{x} xp(x)$$ where the sum is taken over ***all*** possible values of $x$
        * $E[X]$ = center of mass of a collection of locations and weights $\{x,~p(x)\}$
        * *coin flip example*: $E[X] = 0 \times (1-p) + 1 \times p = p$
    * ***continuous variables***
        * for $X$ with PDF $f(x)$, the expected value = the center of mass of the density
        * instead of summing over discrete values, the expectation ***integrates*** over the continuous density $$E[X] = \int_{-\infty}^{\infty} xf(x)\,dx$$
* **sample mean**
    * the sample mean estimates the population mean
    * sample mean = center of mass of the observed data = empirical mean $$\bar X = \sum_{i=1}^n x_i p(x_i)$$ where $p(x_i) = 1/n$

```{r fig.width = 5, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load relevant packages
library(UsingR); data(galton); library(ggplot2)
# plot galton data
g <- ggplot(galton, aes(x = child))
# add histogram for children data
g <- g + geom_histogram(fill = "salmon", binwidth = 1, aes(y = ..density..), colour = "black")
# add density smooth
g <- g + geom_density(size = 2)
# add vertical line
g <- g + geom_vline(xintercept = mean(galton$child), size = 2)
# print graph
g
```
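* ***Note**: a minimal sketch (not in the original notes) of both mean formulas — the discrete population mean for a die roll, and the sample mean of the Galton children's heights written as a weighted sum with weights $1/n$*

```{r}
# E[X] for a fair die roll: sum of each value times its probability
sum((1:6) * rep(1/6, 6))
# the sample mean is the same formula with weights p(x_i) = 1/n
x <- galton$child
sum(x * 1/length(x))
mean(x)
```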
* **average of random variables** = itself a random variable; its distribution has an expected value that is the **same** as that of the original distribution (the centers are the same)
    * the mean of the averages = average of the original data $\rightarrow$ estimates the average of the population
    * since $E[sample~mean]$ = population mean, the sample mean is an **unbiased** estimator (see the derivation below)
* [**derivation**] let $X_1$, $X_2$, $X_3$, ..., $X_n$ be a collection of $n$ samples from a population with mean $\mu$
    * the mean of this sample is $$\bar X = \frac{X_1 + X_2 + X_3 + \ldots + X_n}{n}$$
    * by the linearity of expectation, the expected value of the mean can be written as $$E\left[\frac{X_1 + X_2 + X_3 + ... + X_n}{n}\right] = \frac{1}{n} \times [E(X_1) + E(X_2) + E(X_3) + ... + E(X_n)]$$
    * since each $X_i$ is drawn from the population with mean $\mu$, the expected value of each sample is $$E(X_i) = \mu$$
    * therefore $$\begin{aligned} E\left[\frac{X_1 + X_2 + X_3 + ... + X_n}{n}\right] & = \frac{1}{n} \times [E(X_1) + E(X_2) + E(X_3) + ... + E(X_n)]\\ & = \frac{1}{n} \times [\mu + \mu + \mu + ... + \mu]\\ & = \frac{1}{n} \times n \times \mu\\ & = \mu\\ \end{aligned}$$
* ***Note**: the more data that goes into the sample mean, the more concentrated its density/mass function is around the population mean*

```{r fig.width = 6, fig.height = 3, fig.align = 'center', message = F, warning = F}
nosim <- 1000
# simulate data for sample size 1 to 4
dat <- data.frame(
    x = c(sample(1 : 6, nosim, replace = TRUE),
          apply(matrix(sample(1 : 6, nosim * 2, replace = TRUE), nosim), 1, mean),
          apply(matrix(sample(1 : 6, nosim * 3, replace = TRUE), nosim), 1, mean),
          apply(matrix(sample(1 : 6, nosim * 4, replace = TRUE), nosim), 1, mean)),
    size = factor(rep(1 : 4, rep(nosim, 4))))
# plot histograms of means by sample size
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth = .25, colour = "black")
g + facet_grid(. ~ size)
```

$\pagebreak$

## Variance

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# generate x value ranges
xvals <- seq(-10, 10, by = .01)
# generate data from normal distribution for sd of 1 to 4
dat <- data.frame(
    y = c(dnorm(xvals, mean = 0, sd = 1),
          dnorm(xvals, mean = 0, sd = 2),
          dnorm(xvals, mean = 0, sd = 3),
          dnorm(xvals, mean = 0, sd = 4)),
    x = rep(xvals, 4),
    factor = factor(rep(1 : 4, rep(length(xvals), 4)))
)
# plot 4 lines for the different standard deviations
ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)
```

* **variance** = measure of spread or dispersion: the expected squared distance of the variable from its mean (expressed in $X$'s units$^2$) - as we can see above, higher variance $\rightarrow$ more spread, lower variance $\rightarrow$ less spread
* $Var(X) = E[(X-\mu)^2] = E[X^2] - E[X]^2$
* **standard deviation** $= \sqrt{Var(X)}$ $\rightarrow$ has the same units as $X$
* ***example***
    * for a die roll, $E[X] = 3.5$
    * $E[X^2] = 1^2 \times 1/6 + 2^2 \times 1/6 + 3^2 \times 1/6 + 4^2 \times 1/6 + 5^2 \times 1/6 + 6^2 \times 1/6 = 15.17$
    * $Var(X) = E[X^2] - E[X]^2 \approx 2.92$
* ***example***
    * for a coin flip, $E[X] = p$
    * $E[X^2] = 0^2 \times (1 - p) + 1^2 \times p = p$
    * $Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1-p)$

### Sample Variance

* the **sample variance** is defined as $$S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}$$

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/6.png"))
```

* on the above line representing the population (in magenta), any subset of the data (3 of 14 points selected, marked in blue) will most likely have a variance that is ***lower than*** the population variance, because the sample values cluster around their own mean $\bar X$ rather than around $\mu$
* dividing by $n - 1$ makes the variance estimator ***larger*** to adjust for this fact $\rightarrow$ leads to more accurate estimation $\rightarrow$ $S^2$ = the so-called ***unbiased estimate of the population variance***
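* ***Note**: a minimal sketch (not in the original notes) confirming that R's `var()` uses the $n - 1$ denominator*

```{r}
# compare the n - 1 and n denominators against var()
set.seed(1)
x <- rnorm(10); n <- length(x)
sum((x - mean(x))^2) / (n - 1)  # matches var(x)
var(x)
sum((x - mean(x))^2) / n        # the biased version is smaller
```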
* $S^2$ is a random variable, and therefore has an associated population distribution
* $E[S^2]$ = population variance ($S$ = sample standard deviation)
* as we see from the simulation results below, with more data the distribution of $S^2$ gets more concentrated around the population variance

```{r fig.width = 4, fig.height = 3, fig.align = 'center'}
# specify number of simulations
nosim <- 10000
# simulate data for various sample sizes
dat <- data.frame(
    x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
          apply(matrix(rnorm(nosim * 20), nosim), 1, var),
          apply(matrix(rnorm(nosim * 30), nosim), 1, var)),
    n = factor(rep(c("10", "20", "30"), c(nosim, nosim, nosim)))
)
# plot density function for different sample size data
ggplot(dat, aes(x = x, fill = n)) + geom_density(size = 1, alpha = .2) + geom_vline(xintercept = 1, size = 1)
```

* ***Note**: for any variable, a property of the population = **parameter**; the estimate of that property from a sample = **statistic*** - below is a summary of the mean and variance for the population and the sample

```{r echo = FALSE, fig.width = 6, fig.height = 5, fig.align = 'center'}
grid.raster(readPNG("figures/8.png"))
```

* **distribution of the mean of random samples**
    * expected value of the distribution of sample means = expected value of the sample mean = population mean
        * $E[\bar X]=\mu$
    * variance of the distribution of sample means
        * $Var(\bar X) = \sigma^2/n$
        * as $n$ becomes larger, the mean of a random sample becomes more concentrated around the population mean $\rightarrow$ the variance approaches 0
        * this again confirms that the sample mean estimates the population mean
    * ***Note**: normally we only have 1 sample mean (from the collected sample) but can estimate the variance $\sigma^2$ $\rightarrow$ so we know a lot about the **distribution of the means** from the data observed*
* **standard error (SE)**
    * the standard error of the mean is defined as $$SE_{mean} = \sigma/\sqrt{n}$$
    * this quantity is effectively the standard deviation of the sampling distribution of a statistic (here, the mean)
    * represents the variability of sample means
### The Estimator-Estimand Relationship

* start with a sample
* $S^2$ = sample variance
    * estimates how variable the population is
    * estimates the population variance $\sigma^2$
    * $S^2$ = a random variable with its own distribution, centered around $\sigma^2$
    * more concentrated around $\sigma^2$ as $n$ increases
* $\bar X$ = sample mean
    * estimates the population mean $\mu$
    * $\bar X$ = a random variable with its own distribution, centered around $\mu$
    * more concentrated around $\mu$ as $n$ increases
    * variance of the distribution of $\bar X = \sigma^2/n$
        * estimate of that variance = $S^2/n$
        * estimate of the standard error = $S/\sqrt{n}$ $\rightarrow$ the "sample standard error of the mean" - estimates how variable means of samples of size $n$ from the population are

### Example - Standard Normal

* variance = 1
* means of $n$ standard normals (a sample) have standard deviation $1/\sqrt{n}$

```{r message = F, warning = F}
# specify number of simulations with 10 as number of observations per sample
nosim <- 1000; n <- 10
# estimated standard deviation of mean
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
# actual standard deviation of mean of standard normals
1 / sqrt(n)
```

* `rnorm()` = generates samples from the standard normal
* `matrix()` = puts the samples into a `nosim` $\times$ $n$ matrix, so that each row represents one simulated sample of $n$ observations
* `apply()` = calculates the mean of the $n$ observations in each row
* `sd()` = returns the standard deviation

### Example - Standard Uniform

* the standard uniform has a flat (rectangular) density on $[0, 1]$ $\rightarrow$ mean = 1/2 and variance = 1/12
* means of random samples of $n$ uniforms have standard deviation $1/\sqrt{12 \times n}$

```{r message = F, warning = F}
# estimated standard deviation of the sample means
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
# actual standard deviation of the means
1/sqrt(12*n)
```

### Example - Poisson

* $Poisson(\lambda)$ has variance $\lambda$, so $Poisson(4)$ has standard deviation 2
* means of random samples of $n$ $Poisson(4)$ variables have standard deviation $2/\sqrt{n}$

```{r message = F, warning = F}
# estimated standard deviation of the sample means
sd(apply(matrix(rpois(nosim * n, lambda = 4), nosim), 1, mean))
# actual standard deviation of the means
2/sqrt(n)
```

### Example - Bernoulli

* for $p = 0.5$, the Bernoulli distribution has variance 0.25
* means of random samples of $n$ coin flips have standard deviation $1 / (2 \sqrt{n})$

```{r message = F, warning = F}
# estimated standard deviation of the sample means
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE), nosim), 1, mean))
# actual standard deviation of the means
1/(2*sqrt(n))
```

### Example - Father/Son

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load data
library(UsingR); data(father.son)
# define son height as the x variable
x <- father.son$sheight
# n is the number of observations
n <- length(x)
# plot histogram for son's heights
g <- ggplot(data = father.son, aes(x = sheight))
g <- g + geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth = 1, colour = "black")
g <- g + geom_density(size = 2, colour = "black")
g
# we calculate the parameters for variance of distribution and sample mean,
round(c(sampleVar = var(x), sampleMeanVar = var(x) / n,
# as well as standard deviation of distribution and sample mean
        sampleSd = sd(x), sampleMeanSd = sd(x) / sqrt(n)), 2)
```

$\pagebreak$
## Binomial Distribution

* **binomial random variable** = sum of $n$ Bernoulli variables $$X = \sum_{i=1}^n X_i$$ where $X_1,\ldots,X_n$ are IID $Bernoulli(p)$
* the PMF is defined as $$P(X=x) = {n \choose x}p^x(1-p)^{n-x}$$ where ${n \choose x}$ = number of ways of selecting $x$ items out of $n$ options, without replacement and without regard to order, for $x=0,\ldots,n$
* the **combination** or "$n$ choose $x$" is defined as $${n \choose x} = \frac{n!}{x!(n-x)!}$$
* the base cases are $${n \choose n} = {n \choose 0} = 1$$
* **Bernoulli distribution** = binary outcome
    * only possible outcomes
        * 1 = "success" with probability $p$
        * 0 = "failure" with probability $1 - p$
    * the PMF is defined as $$P(X=x) = p^x(1 - p)^{1-x}$$
    * mean = $p$
    * variance = $p(1 - p)$

### Example

* of 8 children, what's the probability of 7 or more girls (50/50 chance of each)? $${8 \choose 7}.5^7(1-.5)^{1} + {8 \choose 8}.5^8(1-.5)^{0} \approx 0.04$$

```{r message = F, warning = F}
# calculate probability using PMF
choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8
# calculate probability using the CDF
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
```

* `choose(8, 7)` = R function to calculate $n$ choose $x$
* `pbinom(6, size = 8, prob = 0.5, lower.tail = TRUE)` = probability of 6 or fewer successes out of 8 trials with success probability 0.5 (CDF)
    * `lower.tail = FALSE` = returns the complement; in this case, the probability of more than 6 successes out of 8 trials with success probability 0.5

$\pagebreak$

## Normal Distribution

* normal/Gaussian distribution for a random variable $X$
    * notation = $X \sim N(\mu, \sigma^2)$
    * mean = $E[X] = \mu$
    * variance = $Var(X) = \sigma^2$
    * the PDF is defined as $$f(x)=(2\pi\sigma^2)^{-1/2}e^{-(x-\mu)^2/2\sigma^2}$$
* $X \sim N(0, 1)$ = **standard normal distribution** (standard normal random variables are often denoted $Z_1, Z_2, \ldots$)
* ***Note**: see the graph below for reference for the following observations*
    * ~68% of the normal density $\rightarrow$ within $\pm$ 1 standard deviation of $\mu$
    * ~95% of the normal density $\rightarrow$ within $\pm$ 2 standard deviations of $\mu$
    * ~99% of the normal density $\rightarrow$ within $\pm$ 3 standard deviations of $\mu$
    * $\pm$ 1.28 standard deviations from $\mu$ $\rightarrow$ 10$^{th}$ (-) and 90$^{th}$ (+) percentiles
    * $\pm$ 1.645 standard deviations from $\mu$ $\rightarrow$ 5$^{th}$ (-) and 95$^{th}$ (+) percentiles
    * $\pm$ 1.96 standard deviations from $\mu$ $\rightarrow$ 2.5$^{th}$ (-) and 97.5$^{th}$ (+) percentiles
    * $\pm$ 2.33 standard deviations from $\mu$ $\rightarrow$ 1$^{st}$ (-) and 99$^{th}$ (+) percentiles

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# plot standard normal
x <- seq(-3, 3, length = 1000)
g <- ggplot(data.frame(x = x, y = dnorm(x)), aes(x = x, y = y)) + geom_line(size = 2)
g <- g + geom_vline(xintercept = -3 : 3, size = 2)
g
```

* for any $X \sim N(\mu, \sigma^2)$, calculating the number of standard deviations each observation is from the mean ***converts*** the random variable to a ***standard normal*** (denoted $Z$ below) $$Z=\frac{X-\mu}{\sigma} \sim N(0,1)$$
* conversely, a standard normal can be converted to ***any normal distribution*** by multiplying by the standard deviation and adding the mean $$X = \mu + \sigma Z \sim N(\mu, \sigma^2)$$
* `qnorm(p, mean = mu, sd = sd)` = returns the $100p^{th}$ percentile of the given normal distribution
* `pnorm(x, mean = mu, sd = sd, lower.tail = F)` = returns the probability that an observation drawn from the given distribution is larger than the specified threshold $x$
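* ***Note**: a minimal sketch (not in the original notes) of the conversion in both directions, using IQ-style values of $\mu = 100$, $\sigma = 15$ chosen purely for illustration*

```{r}
# convert an observation to a standard normal and back
mu <- 100; sigma <- 15; x <- 130
z <- (x - mu) / sigma   # z = 2 SDs above the mean
z
mu + sigma * z          # back-transforming recovers x
# both parameterizations give the same tail probability
pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)
pnorm(z, lower.tail = FALSE)
```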
### Example

* the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50
* What's the probability of getting more than 1,160 clicks in a day?

```{r}
# calculate number of standard deviations from the mean
(1160 - 1020) / 50
# calculate probability using given distribution
pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
# calculate probability using standard normal
pnorm(2.8, lower.tail = FALSE)
```

* therefore, it is not very likely (`r pnorm(2.8, lower.tail = FALSE)*100`% chance), since 1,160 is `r (1160 - 1020) / 50` standard deviations from the mean
* What number of daily ad clicks would represent the point where 75% of days have fewer clicks (assuming days are independent and identically distributed)?

```{r}
qnorm(0.75, mean = 1020, sd = 50)
```

* therefore, `r qnorm(0.75, mean = 1020, sd = 50)` clicks represents the threshold that exceeds 75% of days

$\pagebreak$

## Poisson Distribution

* used to model counts
* mean = $\lambda$
* variance = $\lambda$
* the PMF is defined as $$P(X = x; \lambda)=\frac{\lambda^xe^{-\lambda}}{x!}$$ for $x = 0, 1, 2, \ldots$
* modeling uses for the Poisson distribution
    * count data
    * event-time/survival data $\rightarrow$ e.g. cancer trials, where some patients develop the event and some never do ("censoring")
    * contingency tables $\rightarrow$ recording results for different characteristic measurements
    * approximating binomials $\rightarrow$ instances where $n$ is large and $p$ is small (e.g. the effect of pollution on lung disease)
        * $X \sim Binomial(n, p)$
        * $\lambda = np$
    * rates $\rightarrow$ $X \sim Poisson(\lambda t)$
        * $\lambda = E[X/t]$ $\rightarrow$ expected count per unit of time
        * $t$ = total monitoring time
* `ppois(n, lambda = lambda*t)` = returns the probability of $n$ or fewer events happening given the rate $\lambda$ and time $t$

### Example

* the number of people that show up at a bus stop can be modeled with a Poisson distribution with a mean of 2.5 per hour
* after watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up in that time?

```{r}
# calculate using distribution
ppois(3, lambda = 2.5 * 4)
```

* as we can see from above, there is a `r ppois(3, lambda = 2.5 * 4)*100`% chance that 3 or fewer people in total show up at the bus stop during the 4 hours of monitoring

### Example - Approximating Binomial Distribution

* flip a coin with success probability 0.01 a total of 500 times (low $p$, large $n$)
* what's the probability of 2 or fewer successes?

```{r}
# calculate correct probability from Binomial distribution
pbinom(2, size = 500, prob = .01)
# estimate probability using Poisson distribution
ppois(2, lambda = 500 * .01)
```

* as we can see from above, the two probabilities (`r pbinom(2, size = 500, prob = .01)*100`% vs `r ppois(2, lambda = 500 * .01)*100`%) are extremely close

$\pagebreak$

## Asymptotics

* **asymptotics** = behavior of statistics as the sample size $\rightarrow$ $\infty$
    * useful for simple statistical inference/approximations
    * forms the basis for the frequentist interpretation of probabilities (the "Law of Large Numbers")

### Law of Large Numbers (LLN)

* a statistic of an IID sample (e.g. mean, variance) ***converges to*** the corresponding population quantity (e.g. population mean, population variance) as $n$ increases
* ***Note**: an estimator is **consistent** if it converges to what it is estimating*
* the sample mean/variance/standard deviation are all ***consistent estimators*** for their population counterparts
    - $\bar X_n$ is the average of the results of $n$ coin flips (i.e. the sample proportion of heads)
    - as we flip a fair coin over and over, the sample proportion ***eventually converges*** to the true probability of a head

### Example - LLN for Normal and Bernoulli Distribution

* for this example, we simulate 10,000 observations from the normal and Bernoulli distributions respectively
* we plot the cumulative sample mean as $n$ increases and compare it to the population mean

```{r fig.width = 6, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load library
library(gridExtra)
# specify number of trials
n <- 10000
# calculate sample (from normal distribution) means for different size of n
means <- cumsum(rnorm(n)) / (1 : n)
# plot sample size vs sample mean
g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g <- g + ggtitle("Normal Distribution")
# calculate sample (coin flips) means for different size of n
means <- cumsum(sample(0 : 1, n, replace = TRUE)) / (1 : n)
# plot sample size vs sample mean
p <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
p <- p + geom_hline(yintercept = 0.5) + geom_line(size = 2)
p <- p + labs(x = "Number of obs", y = "Cumulative mean")
p <- p + ggtitle("Bernoulli Distribution (Coin Flip)")
# combine plots
grid.arrange(g, p, ncol = 2)
```

* as we can see from above, for both distributions the sample means clearly approach the respective population means as $n$ increases

### Central Limit Theorem

* one of the most important theorems in statistics
* the distribution of standardized means of IID variables approaches the standard normal as the sample size $n$ increases
* in other words, for large values of $n$,
$$\frac{\mbox{Estimate} - \mbox{Mean of Estimate}}{\mbox{Std. Err. of Estimate}} = \frac{\bar X_n - \mu}{\sigma / \sqrt{n}}=\frac{\sqrt n (\bar X_n - \mu)}{\sigma} \longrightarrow N(0, 1)$$

* this translates to: the distribution of the sample mean $\bar X_n$ is approximately $N(\mu, \sigma^2/n)$
    - the distribution is centered at the population mean
    - with standard deviation = standard error of the mean
* typically the Central Limit Theorem can be applied when $n \geq 30$

### Example - CLT with Bernoulli Trials (Coin Flips)

- for this example, we simulate $n$ flips of a possibly unfair coin
- let $X_i$ be the 0 or 1 result of the $i^{th}$ flip of a possibly unfair coin
    + the sample proportion, $\hat p$, is the average of the coin flips
    + $E[X_i] = p$ and $Var(X_i) = p(1-p)$
    + the standard error of the mean is $SE = \sqrt{p(1-p)/n}$
    + in principle, normalizing the sample proportion should give an approximately standard normal distribution $$\frac{\hat p - p}{\sqrt{p(1-p)/n}} \sim N(0,~1)$$
- therefore, we flip a coin $n$ times, take the sample proportion of heads (successes with probability $p$), subtract off 0.5 (the true proportion for a fair coin), multiply the result by $2 \sqrt{n}$ (i.e. divide by the standard error $\frac{1}{2\sqrt{n}}$), and compare it to the standard normal

```{r, echo = FALSE, fig.width=6, fig.height = 3, fig.align='center'}
# specify number of simulations
nosim <- 1000
# convert to standard normal
cfunc <- function(x, n) 2 * sqrt(n) * (mean(x) - 0.5)
# simulate data for sample sizes 10, 20, and 30
dat <- data.frame(
    x = c(apply(matrix(sample(0:1, nosim*10, replace=TRUE), nosim), 1, cfunc, 10),
          apply(matrix(sample(0:1, nosim*20, replace=TRUE), nosim), 1, cfunc, 20),
          apply(matrix(sample(0:1, nosim*30, replace=TRUE), nosim), 1, cfunc, 30)),
    size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
# plot histograms for the trials
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..))
# plot standard normal distribution for reference
g <- g + stat_function(fun = dnorm, size = 2)
# plot panel plots by sample size
g + facet_grid(. ~ size)
```

* now, we can run the same simulation trials for an extremely unfair coin with $p$ = 0.9

```{r, echo = FALSE, fig.width=6, fig.height = 3, fig.align='center'}
# specify number of simulations
nosim <- 1000
# convert to standard normal
cfunc <- function(x, n) sqrt(n) * (mean(x) - 0.9) / sqrt(.1 * .9)
# simulate data for sample sizes 10, 20, and 30
dat <- data.frame(
    x = c(apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 10, replace = TRUE), nosim), 1, cfunc, 10),
          apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 20, replace = TRUE), nosim), 1, cfunc, 20),
          apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 30, replace = TRUE), nosim), 1, cfunc, 30)),
    size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
# plot histograms for the trials
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..))
# plot standard normal distribution for reference
g <- g + stat_function(fun = dnorm, size = 2)
# plot panel plots by sample size
g + facet_grid(. ~ size)
```
* as we can see from both simulations, the converted/standardized distributions of the samples converge to the standard normal distribution
* ***Note**: the speed at which the normalized coin flips converge to the normal distribution depends on how biased the coin is (the value of $p$)*
* ***Note**: the CLT does not guarantee that the normal distribution is a good approximation for any finite sample, just that it eventually becomes a good approximation as $n \rightarrow \infty$*

### Confidence Intervals - Normal Distribution/Z Intervals

* the **Z confidence interval** is defined as $$Estimate \pm ZQ \times SE_{Estimate}$$ where $ZQ$ = quantile from the standard normal distribution
* according to the CLT, the sample mean, $\bar X$, is approximately normal with mean $\mu$ and sd $\sigma / \sqrt{n}$
* the **95% confidence interval for the population mean $\mu$** is defined as $$\bar X \pm 2\sigma/\sqrt{n}$$ for the sample mean $\bar X \sim N(\mu, \sigma^2/n)$
    * you can choose to use 1.96 instead of 2 for a more accurate interval
    * $P(\bar{X} > \mu + 2\sigma/\sqrt{n}~\mbox{or}~\bar{X} < \mu - 2\sigma/\sqrt{n}) = 5\%$
* **interpretation**: if we were to repeatedly draw samples of size $n$ from the population and construct this confidence interval in each case, approximately 95% of the intervals would contain $\mu$
* confidence intervals get **narrower** with less variability or larger sample sizes
* ***Note**: the Poisson and binomial distributions have exact intervals that don't require the CLT*
* ***example***
    - for this example, we compute the 95% confidence interval for sons' heights in inches

```{r message = F, warning = F}
# load son height data
data(father.son); x <- father.son$sheight
# calculate confidence interval for sons height in inches
mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x))
```

### Confidence Interval - Bernoulli Distribution/Wald Interval

* for Bernoulli distributions, $X_i$ is 0 or 1 with success probability $p$, and the variance is $\sigma^2 = p(1 - p)$
* the confidence interval takes the form $$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
* since the population proportion $p$ is unknown, we can use the sample proportion of successes $\hat{p} = X/n$ as its estimate
* $p(1-p)$ is largest when $p = 1/2$, so a quick 95% confidence interval can be calculated by $$\begin{aligned} \hat{p} \pm z_{0.975} \sqrt{\frac{0.5(1-0.5)}{n}} & = \hat{p} \pm qnorm(.975) \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm 1.96 \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm \frac{1.96}{2} \sqrt{\frac{1}{n}}\\ & \approx \hat{p} \pm \frac{1}{\sqrt{n}}\\ \end{aligned}$$
* this is known as the **Wald Confidence Interval** and is useful for ***roughly estimating*** confidence intervals
    * generally need $n$ = 100 for 1 decimal place of accuracy, 10,000 for 2, and 1,000,000 for 3
* ***example***
    - suppose that in a random sample of 100 likely voters, 56 intend to vote for you; can you secure a victory?
    - we can use the Wald interval to quickly estimate the 95% confidence interval
    - as we can see below, because the interval [`r 0.56 + c(-1, 1) * 1/sqrt(100)`] contains values below 50%, victory is not guaranteed
    - `binom.test(k, n)$conf.int` = returns the confidence interval for a binomial proportion (a collection of Bernoulli trials) with `k` successes in `n` draws

```{r message = F, warning = F}
# define sample probability and size
p = 0.56; n = 100
# Wald interval
c("WaldInterval" = p + c(-1, 1) * 1/sqrt(n))
# 95% confidence interval
c("95CI" = p + c(-1, 1) * qnorm(.975) * sqrt(p * (1-p)/n))
# perform binomial test (56 successes out of 100 trials)
binom.test(p * n, n)$conf.int
```

### Confidence Interval - Binomial Distribution/Agresti-Coull Interval

* for a binomial distribution with smaller values of $n$ (when $n < 30$, thus not large enough for the CLT), the normal (Wald) confidence intervals, as defined by $$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}$$ often **do not** provide accurate estimates

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data
    phats <- rbinom(nosim, prob = p, size = n) / n
    # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(.75, 1.0)
```

* as we can see from above, the intervals do not provide adequate coverage as 95% confidence intervals (frequently only providing 80 to 90% coverage)
* we can instead construct the **Agresti-Coull Interval**, which uses the adjustment $$\hat{p} = \frac{X+2}{n+4}$$ where we effectively ***add 2*** to the number of successes, $X$, and ***add 2*** to the number of failures
* the interval therefore becomes $$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} ~~~~ \mbox{with} ~~~~ \hat{p} = \frac{X+2}{n+4}$$
* ***Note**: these intervals tend to be **conservative***
* ***example***

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data with Agresti/Coull Interval adjustment
    phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
    # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(.75, 1.0)
```

* as we can see from above, the coverage is much better for the 95% interval
* in fact, all of the estimates are more conservative, as previously discussed, indicating that the Agresti-Coull intervals are ***wider*** than the regular (Wald) confidence intervals
### Confidence Interval - Poisson Interval

* for $X \sim Poisson(\lambda t)$
    * estimated rate $\hat{\lambda} = X/t$
    * $Var(\hat{\lambda}) = \lambda/t$
    * variance estimate $= \hat{\lambda}/t$
* so the confidence interval is defined as $$\hat \lambda \pm z_{1-\alpha/2}\sqrt{\frac{\hat\lambda}{t}}$$
* however, for small values of $\lambda$ (few events over a large monitoring time), we **should not** use this asymptotic interval
* ***example***
    - for this example, we go through a specific scenario as well as a simulation exercise to demonstrate the ineffectiveness of asymptotic intervals for small values of $\lambda$
    - a nuclear pump failed 5 times in 94.32 days; give a 95% confidence interval for the failure rate per day
        + `poisson.test(x, T)$conf.int` = returns the exact Poisson 95% confidence interval for `x` events over a monitoring time `T`

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# define parameters
x <- 5; t <- 94.32; lambda <- x / t
# calculate confidence interval
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
# return accurate confidence interval from poisson.test
poisson.test(x, T = 94.32)$conf.int
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 100
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(0, 1.0)
```

* as we can see above, for small values of $\hat\lambda = X/t$, the confidence interval produced by the asymptotic formula is ***not*** an accurate estimate of an actual 95% interval (not enough coverage)
* however, as $t \to \infty$, the interval approaches the ***true 95% interval***

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 1000
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(0, 1.0)
```

* as we can see from above, as $t$ increases, the Poisson intervals get closer to actual 95% confidence intervals

### Confidence Intervals - T Distribution (Small Samples)

* the **t** confidence interval is defined as $$Estimate \pm TQ \times SE_{Estimate} = \bar X \pm \frac{t_{n-1} S}{\sqrt{n}}$$
    * $TQ$ = quantile from the t distribution
    * $t_{n-1}$ = relevant quantile
* the $t$ interval assumes the data are IID normal, so that $$\frac{\bar X - \mu}{S/\sqrt{n}}$$ follows Gosset's $t$ distribution with $n-1$ degrees of freedom
* works well for data distributions that are roughly symmetric and mound-shaped, and ***does not*** work for skewed distributions
    * for a skewed distribution, it is meaningless to center the interval around the mean $\bar X$
    * logs/medians can be used instead
* paired observations (multiple measurements on the same subjects) can be analyzed with a $t$ interval of the differences
* as more data are collected (large degrees of freedom), the t interval $\rightarrow$ z interval
* `qt(0.975, df = n - 1)` = calculates the relevant quantile using the t distribution

```{r fig.width = 6, fig.height = 3, fig.align = 'center', message = F, warning = F}
# Plot normal vs t distributions
k <- 1000; xvals <- seq(-5, 5, length = k); df <- 10
d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)), x = xvals,
    dist = factor(rep(c("Normal", "T"), c(k, k))))
g <- ggplot(d, aes(x = x, y = y))
g <- g + geom_line(size = 2, aes(colour = dist)) + ggtitle("Normal vs T Distribution")
# plot normal vs t quantiles
pvals <- seq(.5, .99, by = .01)
d <- data.frame(n = qnorm(pvals), t = qt(pvals, df), p = pvals)
h <- ggplot(d, aes(x = n, y = t))
h <- h + geom_abline(size = 2, col = "lightblue")
h <- h + geom_line(size = 2, col = "black")
h <- h + geom_vline(xintercept = qnorm(0.975))
h <- h + geom_hline(yintercept = qt(0.975, df)) + ggtitle("Normal vs T Quantiles")
# plot 2 graphs together
grid.arrange(g, h, ncol = 2)
```

* William Gosset's **t** distribution ("Student's t distribution")
    * "Student" = the pseudonym under which Gosset published
    * indexed/defined by its ***degrees of freedom***; becomes more like the standard normal as the degrees of freedom get larger
    * thicker tails than the normal, centered around 0, so the confidence interval is ***wider*** than the Z interval (more mass is concentrated away from the center)
    * for ***small*** sample sizes $n$, normalizing the sample mean by $\frac{\bar X - \mu}{S/\sqrt{n}}$ produces a t distribution, ***not*** the standard normal
        * for small $n$, the sample standard deviation $S$ may be an imprecise estimate of the population standard deviation
        * using the Z interval here may therefore produce an interval that is too ***narrow***
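* ***Note**: a minimal sketch (not in the original notes) of how much wider the t quantile makes a 95% interval for a small, arbitrarily chosen sample size*

```{r}
# width multiplier of a 95% interval: t quantile vs normal quantile
n <- 8
qt(.975, df = n - 1)    # larger for small n, so the t interval is wider
qnorm(.975)
qt(.975, df = 99)       # the gap closes as n grows
```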
### Confidence Interval - Paired T Tests

* compares observations for the same subjects across two different sets of data (i.e. different times, different treatments)
* the confidence interval is defined by $$\bar X_2 - \bar X_1 \pm \frac{t_{n-1} S_d}{\sqrt{n}}$$ where $\bar X_1$ is the mean of the first set of observations, $\bar X_2$ the mean of the second set, and $S_d$ the standard deviation of the paired differences
* `t.test(difference)` = performs a group-mean t test on the differences and returns metrics as results, which include the confidence interval
    - `t.test(g2, g1, paired = TRUE)` = performs the same paired t test with the data directly
* ***example***
    - the data used here are from a study of the effects of two soporific drugs (increase in hours of sleep compared to control) on 10 patients

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load data
data(sleep)
# plot the first and second observations
g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID)))
g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size = 10, pch = 21, fill = "salmon", alpha = .5)
g
# define groups
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
# define difference
difference <- g2 - g1
# calculate mean and sd of differences
mn <- mean(difference); s <- sd(difference); n <- 10
# calculate intervals manually
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
# perform the same test to get confidence intervals
t.test(difference)
t.test(g2, g1, paired = TRUE)
```

### Independent Group **t** Intervals - Same Variance

* compare two groups in a randomized trial ("A/B testing")
* cannot use the paired t test because the groups are independent and may have different sample sizes
* perform randomization to balance unobserved covariates that may otherwise affect the result
* the $t$ confidence interval for $\mu_y - \mu_x$ is defined as $$\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}$$
    * $t_{n_x + n_y - 2, 1 - \alpha/2}$ = relevant quantile
        * $n_x + n_y - 2$ = degrees of freedom
    * $S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}$ = standard error
    * $S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$ = pooled variance estimator
        * this is effectively a weighted average of the two variances, such that different sample sizes are taken into account
        * for equal sample sizes, $n_x = n_y$, $S_p^2 = \frac{S_x^2 + S_y^2}{2}$ (average of the variances of the two groups)
* ***Note:** this interval assumes **constant variance** across the two groups; if the variances differ, use the next interval*

### Independent Group t Intervals - Different Variance

* the confidence interval for $\mu_y - \mu_x$ is defined as $$\bar Y - \bar X \pm t_{df} \times \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}$$
    * $t_{df}$ = relevant quantile with $df$ as defined below
        * ***Note**: the normalized statistic does not exactly follow a t distribution, but can be approximated with the $df$ defined below* $$df = \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)}$$
    * $\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}$ = standard error
* comparing other kinds of data
    * binomial $\rightarrow$ relative risk, risk difference, odds ratio
    * binomial $\rightarrow$ Chi-squared test, normal approximations, exact tests
    * count $\rightarrow$ Chi-squared test, exact tests
* R commands (a worked sketch follows this list)
    * t confidence intervals
        * `mean + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)`
        * ***c(-1, 1)*** = plus and minus, $\pm$
    * difference intervals (all equivalent)
        * `mean2 - mean1 + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)`
            * ***n*** = number of paired observations
            * ***qt(0.975, n - 1)*** = relevant quantile for paired data
            * ***qt(0.975, n$_x$ + n$_y$ - 2)*** = relevant quantile for independent groups
        * `t.test(data2 - data1)`
        * `t.test(data2, data1, paired = TRUE, var.equal = TRUE)`
            * ***paired*** = whether or not the two sets of data are paired (same subjects, different observations per treatment) $\rightarrow$ `TRUE` for paired, `FALSE` for independent
            * ***var.equal*** = whether or not the variances of the datasets should be treated as equal $\rightarrow$ `TRUE` for same variance, `FALSE` for unequal variances
        * `t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)`
            * ***relevel(factor, ref)*** = reorders the levels in the factor so that "ref" becomes the first level $\rightarrow$ done here so that the second set of measurements comes first (1, 2 $\rightarrow$ 2, 1) in order to perform mean$_2$ - mean$_1$
            * ***I(object)*** = prepends the class "AsIs" to the object
            * ***Note**: I(relevel(group, 2)) = the explanatory variable; it must be a **factor** and have **two levels***
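* ***Note**: a minimal sketch (not in the original notes) of the independent-group interval, reusing the `sleep` groups **as if** they were independent — an assumption made purely for illustration, since these subjects are actually paired*

```{r}
# treat the two sleep groups as independent (illustration only)
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
n1 <- length(g1); n2 <- length(g2)
# pooled variance estimator for the same-variance interval
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
mean(g2) - mean(g1) +
    c(-1, 1) * qt(.975, n1 + n2 - 2) * sqrt(sp2 * (1/n1 + 1/n2))
# same interval via t.test; var.equal = FALSE gives the unequal-variance version
t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf.int
```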
$\pagebreak$

## Hypothesis Testing

* hypothesis testing = making decisions using data
* **null** hypothesis (**H$_0$**) = status quo
    * assumed to be ***true*** $\rightarrow$ statistical evidence is required to reject it in favor of the **alternative** or "research" hypothesis (**H$_a$**)
    * the alternative hypothesis typically takes the form of $>$, $<$, or $\ne$
* **results**

Truth | Decide | Result |
---|---|---|
$H_0$ | $H_0$ | Correctly accept null |
$H_0$ | $H_a$ | Type I error |
$H_a$ | $H_a$ | Correctly reject null |
$H_a$ | $H_0$ | Type II error |

* **$\alpha$** = Type I error rate
    * probability of ***rejecting*** the null hypothesis when the hypothesis is ***correct***
    * $\alpha$ = 0.05 $\rightarrow$ standard for hypothesis testing
    * ***Note**: as the Type I error rate increases, the Type II error rate decreases, and vice versa*
* for large samples (large $n$), use the **Z test** for $H_0:\mu = \mu_0$
    * **$H_a$:**
        * $H_1: \mu < \mu_0$
        * $H_2: \mu \neq \mu_0$
        * $H_3: \mu > \mu_0$
    * test statistic $TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
    * reject the null hypothesis $H_0$ when
        * $H_1: TS \leq Z_{\alpha} = -Z_{1 - \alpha}$
        * $H_2: |TS| \geq Z_{1 - \alpha / 2}$
        * $H_3: TS \geq Z_{1 - \alpha}$
    * ***Note**: in the case of $\alpha$ = 0.05 (most common), $Z_{1-\alpha}$ = 1.645 (the 95$^{th}$ percentile)*
    * $\alpha$ is set low so that, when $H_0$ is rejected, either the original model was wrong or we made an error with low probability
* for small samples (small $n$), use the **T test** for $H_0:\mu = \mu_0$
    * **$H_a$:**
        * $H_1: \mu < \mu_0$
        * $H_2: \mu \neq \mu_0$
        * $H_3: \mu > \mu_0$
    * test statistic $TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
    * reject the null hypothesis $H_0$ when
        * $H_1: TS \leq T_{\alpha} = -T_{1 - \alpha}$
        * $H_2: |TS| \geq T_{1 - \alpha / 2}$
        * $H_3: TS \geq T_{1 - \alpha}$
    * ***Note**: in the case of $\alpha$ = 0.05 (most common), $T_{1-\alpha}$ = `qt(.95, df = n - 1)`*
* R commands for the T test (a worked sketch follows this list):
    * `t.test(vector1 - vector2)`
    * `t.test(vector1, vector2, paired = TRUE)`
    * the `alternative` argument can be used to specify one-sided tests: `less` or `greater`
        * `alternative` default = `two.sided`
    * prints the test statistic (`t`), degrees of freedom (`df`), `p-value`, 95% confidence interval, and sample mean
    * the confidence interval is in the units of the data and can be used to interpret the practical significance of the results
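* ***Note**: a minimal sketch (not in the original notes) of a one-sample t test of $H_0: \mu = \mu_0$, reusing the `father.son` heights with an arbitrary null value of 68 inches chosen purely for illustration*

```{r}
# test H0: mu = 68 vs Ha: mu != 68 for son heights
x <- father.son$sheight; n <- length(x)
ts <- (mean(x) - 68) / (sd(x) / sqrt(n))  # test statistic computed by hand
ts
qt(.975, df = n - 1)                      # rejection threshold for alpha = 0.05
# equivalent built-in test
t.test(x, mu = 68)
```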
* **rejection region** = region of TS values for which $H_0$ is rejected
* **power** = probability of rejecting $H_0$
    * power is used to calculate sample sizes for experiments
* **two-sided tests** $\rightarrow$ $H_a: \mu \neq \mu_0$
    * reject $H_0$ only if the test statistic is too large or too small
    * for $\alpha$ = 0.05, split the error rate equally: 2.5% for the upper tail and 2.5% for the lower tail
    * equivalent to $|TS| \geq T_{1 - \alpha / 2}$
    * example: for the T test, `qt(.975, df)` and `qt(.025, df)`
    * ***Note**: failing to reject the one-sided test $\Rightarrow$ failing to reject the two-sided test*
* **tests vs confidence intervals**
    * the ($1-\alpha$)% confidence interval for $\mu$ = set of all values of $\mu_0$ for which we fail to reject $H_0$
    * if the ($1-\alpha$)% confidence interval contains $\mu_0$, fail to reject $H_0$
* **two-group intervals/tests**
    * rejection rules are the same
    * test $H_0$: $\mu_1 = \mu_2$ $\rightarrow$ $\mu_1 - \mu_2 = 0$
    * test statistic: $$\frac{Estimate - H_0~Value}{SE_{Estimate}} = \frac{\bar X_1 - \bar X_2 - 0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
    * R command
        * `t.test(values ~ factor, paired = FALSE, var.equal = TRUE, data = data)`
        * `paired = FALSE` = independent groups
        * the `factor` argument must have only two levels
* **p-values** (a worked sketch follows this list)
    * most common measure of statistical significance
    * **p-value** = probability, under the null hypothesis, of obtaining evidence as extreme as or more extreme than what was observed
        * given that $H_0$ is true, how likely is it to obtain the observed result (test statistic)?
    * **attained significance level** = smallest value of $\alpha$ for which $H_0$ is rejected $\rightarrow$ equivalent to the p-value
    * if p-value < $\alpha$, reject $H_0$
    * for two-sided tests, double the smaller of the two one-sided p-values
    * if the p-value is small, either $H_0$ is true AND the observed result is a rare event, **OR** $H_0$ is false
    * R command
        * p-value = `pt(statistic, df, lower.tail = FALSE)`
        * `lower.tail = FALSE` = returns the probability of getting a value from the t distribution that is larger than the test statistic
    * binomial (coin flips)
        * probability of getting more than `x` successes out of `n` trials with success probability `p` = `pbinom(x, size = n, prob = p, lower.tail = FALSE)`
        * two-sided test (testing for $\ne$): find the smaller of the two one-sided p-values ($P(X \le x)$, $P(X \ge x)$) and double the result
        * ***Note**: `lower.tail = FALSE` = strictly greater*
    * Poisson
        * probability of getting $x$ or more events given the expected count `r` = `ppois(x - 1, r, lower.tail = FALSE)`
        * `x - 1` is used here because `lower.tail = FALSE` is strictly greater: $P(X > x - 1) = P(X \geq x)$
        * `r` = the number of events expected over the monitoring period (the rate multiplied by the time monitored)
        * ***Note**: `lower.tail = FALSE` = strictly greater*
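* ***Note**: a minimal sketch (not in the original notes) of these p-value commands, reusing the earlier 8-children binomial example and arbitrary illustration values for the Poisson*

```{r}
# one-sided binomial p-value: P(X >= 7) out of 8 fair-coin trials
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
# two-sided p-value: double the smaller of the two one-sided p-values
2 * min(pbinom(6, size = 8, prob = .5, lower.tail = FALSE),
        pbinom(7, size = 8, prob = .5))
# Poisson p-value: P(X >= 10) when 5 events were expected
ppois(9, 5, lower.tail = FALSE)
```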
$\pagebreak$

## Power

* **Power** = probability of rejecting the null hypothesis when it is false (the more power the better)
    * most often used in designing studies to ensure a reasonable chance of detecting the alternative hypothesis if the alternative hypothesis is true
* $\beta$ = probability of a Type II error = failing to reject the null hypothesis when it is false
    * power = $1 - \beta$
* ***example***
    * $H_0: \mu = 30 \to \bar X \sim N(\mu_0, \sigma^2/n)$
    * $H_a: \mu > 30 \to \bar X \sim N(\mu_a, \sigma^2/n)$
    * Power: $$Power = P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a \right)$$
    * ***Note**: the above probability depends on the value of $\mu_a$ *
    * ***Note**: as $\mu_a$ approaches 30, power approaches $\alpha$ *
    * assuming the sample mean is normally distributed, $H_0$ is rejected when $\frac{\bar X - 30}{\sigma /\sqrt{n}} > Z_{1-\alpha}$
        * or, $\bar X > 30 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}$
    * R commands:
        * `alpha = 0.05; z = qnorm(1 - alpha)` $\rightarrow$ calculates $Z_{1-\alpha}$
        * `pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FALSE)` $\rightarrow$ calculates the probability of getting a sample mean larger than the rejection threshold $\mu_0 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}$, given that the population mean is $\mu_a$
        * ***Note**: using `mean = mu0` in the function would return $\alpha$ *
* Power curve behavior
    * power increases as $\mu_a$ increases $\rightarrow$ we are more likely to detect the difference between $\mu_a$ and $\mu_0$
    * power increases as **n** increases $\rightarrow$ with more data, we are more likely to detect any alternative $\mu_a$

```{r, fig.align='center', fig.height=4, fig.width=6}
library(ggplot2)
library(reshape2)

# parameters
mu0 = 30; sigma = 4; alpha = 0.05
z = qnorm(1 - alpha)
nseq = c(8, 16, 32, 64, 128)
mu_a = seq(30, 35, by = 0.1)

# power at every mu_a, for each candidate sample size
power = sapply(nseq, function(n)
    pnorm(mu0 + z * sigma / sqrt(n), mean = mu_a, sd = sigma / sqrt(n),
          lower.tail = FALSE))
colnames(power) <- paste("n", nseq, sep = "")

# reshape to long format for plotting
d <- data.frame(mu_a, power)
d2 <- melt(d, id.vars = "mu_a")
names(d2) <- c("mu_a", "n", "power")

g <- ggplot(d2, aes(x = mu_a, y = power, col = n)) + geom_line(size = 2)
g
```

* **Solving for Power**
    * when testing $H_a : \mu > \mu_0$ (or $<$ or $\neq$) $$Power = 1 - \beta = P\left(\bar X > \mu_0 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ~;~ \mu = \mu_a \right)$$ where $\bar X \sim N(\mu_a, \sigma^2 / n)$
    * Unknowns = $\mu_a$, $\sigma$, $n$, $\beta$
    * Knowns = $\mu_0$, $\alpha$
    * specify any 3 of the unknowns and you can solve for the remaining one; the two most common cases are
        1. given the desired power, the mean to detect, and the variance we can tolerate, find the **n** that produces the desired power (designing an experiment/trial)
        2. given the sample size **n**, find the power that is achievable (assessing the utility of an experiment)
    * ***Note**: for $H_a: \mu \neq \mu_0$, calculate one-sided power using $Z_{1-\alpha / 2}$; this calculation excludes the probability of getting a large TS in the opposite direction of the truth, but that only matters when $\mu_a$ and $\mu_0$ are close together *
* **Power Behavior**
    * power increases as $\alpha$ becomes larger
    * power of a one-sided test $>$ power of the associated two-sided test
    * power increases as $\mu_a$ gets further away from $\mu_0$
    * power increases as **n** increases (sample mean has less variability)
    * power increases as $\sigma$ decreases (again, less variability)
    * power usually depends only on $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$, and not on $\mu_a$, $\sigma$, and $n$ separately
        * **effect size** = $\frac{\mu_a - \mu_0}{\sigma}$ $\rightarrow$ unit free, can be interpreted across settings
* **T-test Power**
    * for Gosset's T test, $$Power = P\left(\frac{\bar X - \mu_0}{S/\sqrt{n}} > t_{1-\alpha, n-1} ~;~ \mu = \mu_a \right)$$
    * $\frac{\bar X - \mu_0}{S/\sqrt{n}}$ does not follow a t distribution if the true mean is $\mu_a$ and NOT $\mu_0$ $\rightarrow$ it follows a non-central t distribution instead
    * `power.t.test` = evaluates the non-central t distribution and solves for one parameter given that all the others are specified (see the sketch after this list)
        * `power.t.test(n = 16, delta = 0.5, sd = 1, type = "one.sample", alt = "one.sided")$power` = calculates power given n, the difference in means, and the standard deviation
        * `delta` = argument for the difference in means
        * ***Note**: since effect size = `delta/sd`, holding `n`, `type`, and `alt` constant, any distribution with the same effect size has the same power *
        * `power.t.test(power = 0.8, delta = 0.5, sd = 1, type = "one.sample", alt = "one.sided")$n` = calculates the sample size n given power, difference in means, and standard deviation
        * ***Note**: n should always be rounded up (`ceiling`) *
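* a quick sketch of both directions with `power.t.test`, using the same inputs as above:

```{r}
# power achievable with n = 16 at effect size delta/sd = 0.5
power.t.test(n = 16, delta = 0.5, sd = 1,
    type = "one.sample", alt = "one.sided")$power

# sample size needed for 80% power at the same effect size
n <- power.t.test(power = 0.8, delta = 0.5, sd = 1,
    type = "one.sample", alt = "one.sided")$n
ceiling(n)  # always round n up
```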
$\pagebreak$

## Multiple Testing

* hypothesis testing/significance analysis is commonly overused
* correct for multiple testing to avoid false positives/conclusions (two key components)
    1. error measure
    2. correction
* multiple testing is needed because of the rise of ubiquitous data collection and analysis technologies
    * DNA sequencing machines
    * imaging patients in clinical studies
    * electronic medical records
    * individualized movement data (Fitbit)

### Types of Errors

 | Actual $H_0$ = True | Actual $H_a$ = True | Total
----------------------|--------------|--------------|---------
Conclude $H_0$ = True (non-significant) | $U$ | $T$ | $m-R$
Conclude $H_a$ = True (significant) | $V$ | $S$ | $R$
**Total** | $m_0$ | $m-m_0$ | $m$

* **$m_0$** = number of true null hypotheses, or cases where $H_0$ = actually true (unknown)
* **$m - m_0$** = number of true alternative hypotheses, or cases where $H_a$ = actually true (unknown)
* **$R$** = number of null hypotheses rejected, or cases where $H_a$ = concluded to be true (measurable)
* **$m - R$** = number of null hypotheses that failed to be rejected, or cases where $H_0$ = concluded to be true (measurable)
* **$V$** = Type I errors / false positives, concluding $H_a$ = true when $H_0$ = actually true
* **$T$** = Type II errors / false negatives, concluding $H_0$ = true when $H_a$ = actually true
* **$S$** = true positives, concluding $H_a$ = true when $H_a$ = actually true
* **$U$** = true negatives, concluding $H_0$ = true when $H_0$ = actually true

### Error Rates

* ***false positive rate*** = rate at which false results are called significant, $E[\frac{V}{m_0}]$ $\rightarrow$ average fraction of times that $H_a$ is claimed to be true when $H_0$ is actually true
    * ***Note**: mathematically equal to the Type I error rate, but conceptually distinct $\rightarrow$ the false positive rate is an after-the-fact quantity, the expected number of false positives divided by the number of truly null hypotheses under the real mix of true and false nulls (disregarding the "global null" hypothesis); since it is not a parameter the researcher controls, it cannot be identified with the significance level, which is what determines the Type I error rate *
* ***family-wise error rate (FWER)*** = probability of at least one false positive, $Pr(V \ge 1)$
* ***false discovery rate (FDR)*** = rate at which claims of significance are false, $E[\frac{V}{R}]$
* **controlling error rates (adjusting $\alpha$)**
    * false positive rate
        * if we call all $P < \alpha$ significant (reject $H_0$), we expect $\alpha \times m$ false positives, where $m$ = total number of hypothesis tests performed
        * with large values of $m$, the expected number of false positives is very large as well
    * family-wise error rate (FWER)
        * controlling FWER = controlling the probability of even one false positive
        * *Bonferroni* correction (oldest multiple testing correction)
            * for $m$ tests, we want $Pr(V \ge 1) < \alpha$
            * calculate p-values normally, and deem them significant if and only if $P < \alpha_{fwer} = \alpha / m$
            * easy to calculate, but tends to be very ***conservative***
    * false discovery rate (FDR)
        * most popular correction = controlling FDR (the Benjamini-Hochberg procedure, sketched below)
        * for $m$ tests, we want $E[\frac{V}{R}] \le \alpha$
        * calculate p-values normally and sort them from smallest to largest $\rightarrow$ $P_{(1)}, P_{(2)}, \ldots, P_{(m)}$
        * deem $P_{(i)}$ significant if $P_{(i)} \le \alpha \times \frac{i}{m}$
        * easy to calculate and less conservative, but allows for more false positives and may behave strangely under dependence (related hypothesis tests/regression with different variables)
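* a minimal sketch of the Benjamini-Hochberg cutoff applied by hand, with ten made-up p-values (`p.adjust` arrives at the same decisions):

```{r}
# ten hypothetical p-values, alpha = 0.20
p <- sort(c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216))
alpha <- 0.20
m <- length(p)
i <- seq_len(m)

# compare the i-th smallest p-value against alpha * i / m
# (the full step-up procedure rejects everything up to the largest i that passes)
p <= alpha * i / m

# equivalent decisions via BH-adjusted p-values
p.adjust(p, method = "BH") <= alpha
```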
* ***example***
    * 10 p-values with $\alpha = 0.20$

![alt text](figures/9.png)

* **adjusting the p-values (instead of $\alpha$)**
    * ***Note**: adjusting the p-values fundamentally changes their properties, but the adjusted values can then be compared directly to $\alpha$ without modifying it *
    * _Bonferroni_ (FWER)
        * $P_i^{fwer} = \min(m P_i,~ 1)$ $\rightarrow$ since a p-value cannot exceed 1
        * deem p-values significant if $P_i^{fwer} < \alpha$
        * equivalent to controlling the FWER

### Example

```{r}
set.seed(1010093)
pValues <- rep(NA, 1000)
for(i in 1:1000){
    x <- rnorm(20)
    # first 500: beta = 0 (null true); last 500: beta = 2 (alternative true)
    if(i <= 500){y <- rnorm(20)} else {y <- rnorm(20, mean = 2 * x)}
    # p-value for the slope from the linear model (the [2, 4] entry of the coefficient table)
    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
trueStatus <- rep(c("zero", "not zero"), each = 500)

# no correction: controls the false positive rate
table(pValues < 0.05, trueStatus)

# Bonferroni: controls the FWER
table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus)

# Benjamini-Hochberg: controls the FDR
table(p.adjust(pValues, method = "BH") < 0.05, trueStatus)
```

$\pagebreak$

## Resampled Inference

* **Bootstrap** = useful tool for constructing confidence intervals and calculating standard errors for difficult statistics
    * ***principle*** = if a statistic's (e.g. the median's) sampling distribution is unknown, use the distribution defined by the data to approximate it
    * ***procedure***
        1. simulate $n$ observations **with replacement** from the observed data $\rightarrow$ results in $1$ simulated complete data set
        2. calculate the desired statistic (e.g. the median) for the simulated data set
        3. repeat the above steps $B$ times, resulting in $B$ simulated statistics
        4. these statistics are approximately drawn from the sampling distribution of the true statistic of $n$ observations
        5. perform one of the following
            * plot a histogram
            * calculate the standard deviation of the statistic to estimate its standard error
            * take quantiles (2.5^th^ and 97.5^th^) as a confidence interval for the statistic ("_bootstrap CI_")
* ***example***
    * bootstrap procedure for calculating the confidence interval for the median from a data set of $n$ observations $\rightarrow$ approximate the sampling distribution

```{r fig.align='center', fig.height=4, fig.width=6, message = F, warning = F}
# load data
library(UsingR); data(father.son)

# observed dataset
x <- father.son$sheight
# number of observations, and number of bootstrap resamples
n <- length(x)
B <- 1000

# generate samples: draw n * B values with replacement
# (cannot draw n * B elements from x, which only has n elements, without replacement)
# and arrange the results into a B x n matrix
# (every row = one bootstrap sample with n observations)
resamples <- matrix(sample(x, n * B, replace = TRUE), B, n)

# take the median of each row/generated sample
medians <- apply(resamples, 1, median)

# estimated standard error of the median
sd(medians)

# confidence interval for the median
quantile(medians, c(.025, .975))

# histogram of the bootstrapped medians
hist(medians)
```

* ***Note:** a better percentile bootstrap confidence interval = the "bias corrected and accelerated" interval in the `bootstrap` package*
* **Permutation Tests**
    * ***procedure***
        * compare groups of data, testing the null hypothesis that the distributions of the observations from each group are the same
            * ***Note**: if this is true, then the group labels/divisions are irrelevant *
        * permute the labels for the groups
        * recalculate the statistic
            * mean difference in counts
            * geometric means
            * T statistic
        * calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed one
    * ***variations***

Data type | Statistic | Test name
---|---|---|
Ranks | rank sum | rank sum test
Binary | hypergeometric prob | Fisher's exact test
Raw data | | ordinary permutation test

    * ***Note**: randomization tests are exactly permutation tests, with a different motivation *
    * for matched data, one can randomize the signs
        * for ranks, this results in the **signed rank test**
    * permutation strategies work for regression by permuting a regressor of interest
    * permutation tests work very well in multivariate settings
* ***example***
    * we will compare groups **B** and **C** in this dataset under the null hypothesis $H_0:$ there is no difference between the groups

```{r, fig.height=4, fig.width=6, echo=FALSE, fig.align='center'}
# load data
data(InsectSprays)
# boxplot of the dataset
ggplot(InsectSprays, aes(spray, count, fill = spray)) + geom_boxplot()
```

```{r}
# subset to only the "B" and "C" groups
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"), ]
# values
y <- subdata$count
# labels
group <- as.character(subdata$spray)
# mean difference between the groups
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
observedStat
```

* the observed difference between the groups is `r observedStat`
* now we permute the labels for groups **B** and **C** and recompute the statistic

```{r}
# create 10000 permutations of the data with the labels shuffled
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
# fraction of permutations whose mean difference is bigger than the observed one
mean(permutations > observedStat)
```

* we created 10,000 permutations
from the observed dataset, and found ***no permutations*** with a mean difference between groups **B** and **C** larger than the one in the original data
    * therefore, the p-value is very small, and we can ***reject the null hypothesis*** at any reasonable $\alpha$ level
* below is the plot of the null distribution formed by the permutations

```{r, echo= FALSE, fig.width=6, fig.height=4, fig.align='center'}
# plot the distribution of the permuted statistics
ggplot(data.frame(permutations = permutations), aes(permutations)) +
    geom_histogram(fill = "lightblue", color = "black", binwidth = 1) +
    geom_vline(xintercept = observedStat, size = 2)
```

* as we can see from the black line, the observed difference/statistic is very far from the bulk of the null distribution $\rightarrow$ 0 is likely ***not*** the true difference
* with this information, formal confidence intervals can be constructed and p-values can be calculated (see the sketch below)
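* as a closing sketch (reusing the `permutations` and `observedStat` objects from above), a two-sided permutation p-value and a simple permutation interval might look like:

```{r}
# two-sided permutation p-value: how often is a permuted statistic
# at least as extreme, in absolute value, as the observed one?
mean(abs(permutations) >= abs(observedStat))

# central 95% of the null distribution; an observed statistic outside
# this range corresponds to rejection at alpha = 0.05
quantile(permutations, c(0.025, 0.975))
```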