$\pagebreak$

## Overview

* **Statistical Inference** = generating conclusions about a population from a noisy sample
    * goal = extend beyond the data to the population
    * statistical inference = the only formal system of inference we have
* many different modes, but **two** broad flavors of inference (inferential paradigms): ***Bayesian*** vs ***Frequentist***
* **Frequentist** = uses the long-run proportion of times an event occurs in independent, identically distributed repetitions
    * frequentist inference is what this class focuses on
    * believes that if an experiment is repeated many, many times, the resulting percentage of successes/of something happening defines that population parameter
* **Bayesian** = probability estimate for a hypothesis is updated as additional evidence is acquired
* **statistic** = number computed from a sample of data
    * statistics are used to infer information about a population
* **random variable** = outcome from an experiment
    * deterministic processes (variances/means) applied to random variables produce additional random variables, which have their own distributions

## Probability

* **Probability** = the study of quantifying the likelihood of particular events occurring
* given a random experiment, ***probability*** = a population quantity that summarizes the randomness
    * not in the data at hand, but a conceptual quantity that exists in the population and that we want to estimate

### General Probability Rules

* formalized by the Russian mathematician Kolmogorov; also known as "Probability Calculus"
* probability = a function that takes any set of outcomes and assigns it a number between 0 and 1
    * $0 \le P(E) \le 1$, where $E$ = event
* probability that nothing occurs = 0 (impossible; you have to roll the dice to create an outcome); probability that something occurs = 1 (certain)
* probability of an outcome or event $E$, $P(E)$ = ratio of the number of ways $E$ can occur to the number of all possible outcomes or events
* probability of something = 1 - probability of the opposite occurring
* probability of the **union** of any two sets of outcomes that have nothing in common (mutually exclusive) = sum of their respective probabilities

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
library(grid); library(png)
grid.raster(readPNG("figures/1.png"))
```

* if $A$ implies the occurrence of $B$, then $P(A) \le P(B)$

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/2.png"))
```

* for any two events, the probability that at least one occurs = the sum of their probabilities minus the probability of their intersection (in other words, probabilities cannot simply be added when the events have a non-trivial intersection)

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/3.png"))
```

* for independent events $A$ and $B$, $P(A \:\cap\: B) = P(A) \times P(B)$
* for an outcome that can occur through different combinations of events, where the combinations are mutually exclusive, $P(E_{total}) = \sum P(E_{part})$

### Conditional Probability

* let $B$ = an event so that $P(B) > 0$
* **conditional probability** of an event $A$, given $B$, is defined as the probability that both $A$ and $B$ occur divided by the probability that $B$ occurs $$P(A\:|\:B) = \frac{P(A \:\cap\: B)}{P(B)}$$
* if $A$ and $B$ are ***independent***, then $$P(A\:|\:B) = \frac{P(A)P(B)}{P(B)} = P(A)$$
* ***example***
    * for a die roll, $A = \{1\}$, $B = \{1, 3, 5\}$, then $$P(1~|~Odd) = P(A\:|\:B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3}$$
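* ***Note**: a minimal simulation sketch (not part of the original notes) to check the die-roll result empirically — only base R `sample()` is used*

```{r}
# simulate many die rolls and estimate P(roll = 1 | roll is odd)
set.seed(42)
rolls <- sample(1:6, 100000, replace = TRUE)
odd <- rolls[rolls %% 2 == 1]
# proportion of odd rolls that are 1; should be close to 1/3
mean(odd == 1)
```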
### Bayes' Rule

* definition $$P(B\:|\:A) = \frac{P(A\:|\:B)P(B)}{P(A\:|\:B)P(B)+P(A\:|\:B^c)P(B^c)}$$ where $B^c$ = the complement of event $B$, with $P(B^c) = 1 - P(B)$

$\pagebreak$

## Random Variables

* **random variable** = numeric outcome of an experiment
* **discrete** (what you can count/categories) = assign probabilities to every number/value the variable can take
    * coin flip, rolling a die, web traffic in a day
* **continuous** (any number within a continuum) = assign probabilities to the ranges the variable can take
    * BMI index, intelligence quotients
    * ***Note**: limitations of precision in taking the measurements may imply that the values are discrete, but we in fact consider them continuous*
* `rbinom()`, `rnorm()`, `rgamma()`, `rpois()`, `runif()` = functions to generate random variables from the binomial, normal, Gamma, Poisson, and uniform distributions
* density and mass functions (population quantities, not what occurs in the data) for random variables = best starting point to model/think about probabilities for the numeric outcomes of experiments (variables)
    * use data to estimate properties of the population $\rightarrow$ linking sample to population

### Probability Mass Function (PMF)

* evaluates the probability that the **discrete random variable** takes on a specific value - measures the chance of a particular outcome happening
* always $\ge$ 0 for every possible outcome
* sum over all possible values that the variable can take = 1
* ***Bernoulli distribution example***
    * $X = 0$ $\rightarrow$ tails, $X = 1$ $\rightarrow$ heads
    * $X$ here represents a potential outcome
    * $P(X = x) = (\frac{1}{2})^x(\frac{1}{2})^{1-x}$ for $x = 0, 1$
    * $x$ here represents a value we can plug into the PMF
    * general form $\rightarrow$ $p(x) = (\theta)^x(1-\theta)^{1-x}$
* `dbinom(k, n, p)` = returns the probability of getting `k` successes out of `n` trials, given probability of success `p`

### Probability Density Function (PDF)

* assigns a density (not a probability) to each value the **continuous random variable** can take
* always $\ge$ 0 everywhere
* total area under the curve must = 1
* **areas under PDFs** correspond to the probabilities of the random variable taking on a range of values

```{r echo = FALSE, fig.width = 2, fig.height = 2, fig.align = 'center'}
grid.raster(readPNG("figures/4-1.png"))
```

* but the probability of the variable taking any one specific value = 0 (the area of a line is 0)

```{r echo = FALSE, fig.width = 3, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/4-2.png"))
```

* ***Note**: the above is true because we model random variables as if they have infinite precision, when in reality they do not*
* `dnorm()`, `dgamma()`, `dpois()`, `dunif()` = return the value of the density function at a given point for the normal, Gamma, Poisson, and uniform distributions

### Cumulative Distribution Function (CDF)

* CDF of a random variable $X$ = probability that the random variable is $\le$ a value $x$
* $F(x) = P(X \le x)$ — applies whether $X$ is discrete or continuous
* PDF = derivative of the CDF
* integrate the PDF $\rightarrow$ CDF
* `integrate(function, lower = 0, upper = 1)` $\rightarrow$ can be used to evaluate integrals over a specified range
* `pbinom()`, `pnorm()`, `pgamma()`, `ppois()`, `punif()` = return the cumulative probability up to a specified value for the binomial, normal, Gamma, Poisson, and uniform distributions

### Survival Function

* survival function of a random variable $X$ = probability that the random variable $> x$; the complement of the CDF
* $S(x) = P(X > x) = 1 - F(x)$, where $F(x) =$ CDF
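* ***Note**: a minimal sketch (not in the original notes) tying the PDF, CDF, survival function, and `integrate()` together, using the density $f(x) = 2x$ on $[0, 1]$ — this is the Beta(2, 1) density that reappears in the quantile section below*

```{r}
# density f(x) = 2x on [0, 1], i.e. the Beta(2, 1) density
f <- function(x) 2 * x
# total area under the PDF should be 1
integrate(f, lower = 0, upper = 1)$value
# CDF at 0.5: P(X <= 0.5) = 0.5^2 = 0.25, matching pbeta
integrate(f, lower = 0, upper = 0.5)$value
pbeta(0.5, 2, 1)
# survival function at 0.5 is the complement of the CDF
1 - pbeta(0.5, 2, 1)
```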
### Quantile

* the $\alpha^{th}$ quantile of a distribution with distribution function $F$ = the point $x_{\alpha}$ such that $$F(x_{\alpha}) = \alpha$$
* percentile = quantile with $\alpha$ expressed as a percent
* median = 50^th^ percentile
* $100\alpha\%$ of the possible outcomes lie below $x_{\alpha}$

```{r echo = FALSE, fig.width = 3, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/5.png"))
```

* `qbeta(quantileInDecimals, 2, 1)` = returns quantiles for the beta distribution
    * works similarly for `qnorm()`, `qbinom()`, `qgamma()`, `qpois()`, etc.
* a median estimated in this fashion = a population median
    * the probability model connects the data to the population using assumptions
    * population median = ***estimand***, sample median = ***estimator***

### Independence

* two events $A$ and $B$ are ***independent*** if the following is true
    * $P(A\:\cap\:B) = P(A)P(B)$
    * $P(A\:|\:B) = P(A)$
* two random variables $X$ and $Y$ are ***independent*** if, for any two sets $A$ and $B$, the following is true
    * $P([X \in A]\cap[Y \in B]) = P(X \in A)P(Y \in B)$
* **independence** = statistically unrelated to one another
* if $A$ is ***independent*** of $B$, then the following are true
    * $A^c$ is independent of $B$
    * $A$ is independent of $B^c$
    * $A^c$ is independent of $B^c$

### IID Random Variables

* random variables are said to be **IID** if they are ***independent and identically distributed***
    * **independent** = statistically unrelated to each other
    * **identically distributed** = all having been drawn from the same population distribution
* IID random variables = default model for random samples = default starting point of inference

$\pagebreak$

## Diagnostic Test

* let $+$ and $-$ be the results, positive and negative respectively, of a diagnostic test
* let $D$ = the event that the subject of the test has the disease, $D^c$ = the subject does not
* **sensitivity** = $P(+\:|\:D)$ = probability that the test is positive given that the subject has the disease (the higher the better)
* **specificity** = $P(-\:|\:D^c)$ = probability that the test is negative given that the subject does not have the disease (the higher the better)
* **positive predictive value** = $P(D\:|\:+)$ = probability that the subject has the disease given that the test is positive
* **negative predictive value** = $P(D^c\:|\:-)$ = probability that the subject does not have the disease given that the test is negative
* **prevalence of disease** = $P(D)$ = marginal probability of disease

### Example

* specificity of 98.5%, sensitivity of 99.7%, prevalence of disease of 0.1% $$\begin{aligned} P(D ~|~ +) & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ & = \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ & = .062 \end{aligned}$$
* low positive predictive value $\rightarrow$ due to the low prevalence of disease and the somewhat modest specificity
* suppose it was known that the subject uses drugs and has regular intercourse with an HIV-infected partner $\rightarrow$ his pre-test probability of having the disease is higher than the general prevalence
    * the positive test result should then be weighed as evidence on top of that higher pre-test probability
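* ***Note**: a minimal sketch (not in the original notes) of the positive predictive value calculation above, using the same sensitivity, specificity, and prevalence*

```{r}
# positive predictive value from sensitivity, specificity, and prevalence
sens <- 0.997; spec <- 0.985; prev <- 0.001
(sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
```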
### Likelihood Ratios

* the **diagnostic likelihood ratio** of a **positive** test result is defined as $$DLR_+ = \frac{sensitivity}{1-specificity} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)}$$
* the **diagnostic likelihood ratio** of a **negative** test result is defined as $$DLR_- = \frac{1 - sensitivity}{specificity} = \frac{P(-\:|\:D)}{P(-\:|\:D^c)}$$
* from Bayes' rule, we can derive the *positive predictive value* and the *false positive value* $$P(D\:|\:+) = \frac{P(+\:|\:D)P(D)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(1)}$$ $$P(D^c\:|\:+) = \frac{P(+\:|\:D^c)P(D^c)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(2)}$$
* dividing equation $(1)$ by equation $(2)$, the common denominators cancel and we get the following $$\frac{P(D\:|\:+)}{P(D^c\:|\:+)} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)} \times \frac{P(D)}{P(D^c)}$$ which can also be written as $$\mbox{post-test odds of D} = DLR_+ \times \mbox{pre-test odds of D}$$
* **odds** = $p/(1-p)$
* $\frac{P(D)}{P(D^c)}$ = **pre-test odds**, or the odds of disease in the absence of the test
* $\frac{P(D\:|\:+)}{P(D^c\:|\:+)}$ = **post-test odds**, or the odds of disease given a positive test result
* $DLR_+$ = factor by which the pre-test odds are multiplied in the presence of a positive test to obtain the post-test odds
* $DLR_-$ = relates the decrease in the odds of disease after a negative result
* following the previous example, with sensitivity of 0.997 and specificity of 0.985, the diagnostic likelihood ratios are $$DLR_+ = .997/(1-.985) \approx 66 ~~~~~~ DLR_- = (1-.997)/.985 \approx 0.003$$
* this indicates that after a positive test result, the odds of disease are 66 times the pre-test odds

$\pagebreak$

## Expected Values/Mean

* useful for characterizing a distribution (properties of distributions)
* **mean** = characterization of the center of the distribution = *expected value*
    * the expected value operation is ***linear*** $\rightarrow$ $E(aX + bY) = aE(X) + bE(Y)$
* **variance/standard deviation** = characterization of how spread out the distribution is
* _sample_ expected values (sample mean and variance) estimate the _population_ counterparts
* **population mean**
    * expected value/mean of a random variable = center of its distribution (center of mass)
    * ***discrete variables***
        * for $X$ with PMF $p(x)$, the population mean is defined as $$E[X] = \sum_{x} xp(x)$$ where the sum is taken over ***all*** possible values of $x$
        * $E[X]$ = center of mass of a collection of locations and weights $\{x,~p(x)\}$
        * *coin flip example*: $E[X] = 0 \times (1-p) + 1 \times p = p$
    * ***continuous variables***
        * for $X$ with PDF $f(x)$, the expected value = the center of mass of the density
        * instead of summing over discrete values, the expectation ***integrates*** over the continuous density $$E[X] = \int_{-\infty}^{\infty} xf(x)\,dx$$
* **sample mean**
    * the sample mean estimates the population mean
    * sample mean = center of mass of the observed data = empirical mean $$\bar X = \sum_{i=1}^n x_i p(x_i)$$ where $p(x_i) = 1/n$

```{r fig.width = 5, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load relevant packages
library(UsingR); data(galton); library(ggplot2)
# plot galton data
g <- ggplot(galton, aes(x = child))
# add histogram for children data
g <- g + geom_histogram(fill = "salmon", binwidth = 1, aes(y = ..density..), colour = "black")
# add density smooth
g <- g + geom_density(size = 2)
# add vertical line
g <- g + geom_vline(xintercept = mean(galton$child), size = 2)
# print graph
g
```
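* ***Note**: a minimal sketch (not in the original notes) of both mean formulas — the discrete population mean for a die roll, and the sample mean of the Galton children's heights written as a weighted sum with weights $1/n$*

```{r}
# E[X] for a fair die roll: sum of each value times its probability
sum((1:6) * rep(1/6, 6))
# the sample mean is the same formula with weights p(x_i) = 1/n
x <- galton$child
sum(x * 1/length(x))
mean(x)
```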
* **average of random variables** = itself a random variable; its distribution has an expected value that is the **same** as that of the original distribution (the centers are the same)
    * the mean of the averages = average of the original data $\rightarrow$ estimates the average of the population
    * since $E[sample~mean]$ = population mean, the sample mean is an **unbiased** estimator (see the derivation below)
* [**derivation**] let $X_1$, $X_2$, $X_3$, ..., $X_n$ be a collection of $n$ samples from a population with mean $\mu$
    * the mean of this sample is $$\bar X = \frac{X_1 + X_2 + X_3 + \ldots + X_n}{n}$$
    * by the linearity of expectation, the expected value of the mean can be written as $$E\left[\frac{X_1 + X_2 + X_3 + ... + X_n}{n}\right] = \frac{1}{n} \times [E(X_1) + E(X_2) + E(X_3) + ... + E(X_n)]$$
    * since each $X_i$ is drawn from the population with mean $\mu$, the expected value of each sample is $$E(X_i) = \mu$$
    * therefore $$\begin{aligned} E\left[\frac{X_1 + X_2 + X_3 + ... + X_n}{n}\right] & = \frac{1}{n} \times [E(X_1) + E(X_2) + E(X_3) + ... + E(X_n)]\\ & = \frac{1}{n} \times [\mu + \mu + \mu + ... + \mu]\\ & = \frac{1}{n} \times n \times \mu\\ & = \mu\\ \end{aligned}$$
* ***Note**: the more data that goes into the sample mean, the more concentrated its density/mass function is around the population mean*

```{r fig.width = 6, fig.height = 3, fig.align = 'center', message = F, warning = F}
nosim <- 1000
# simulate data for sample size 1 to 4
dat <- data.frame(
    x = c(sample(1 : 6, nosim, replace = TRUE),
          apply(matrix(sample(1 : 6, nosim * 2, replace = TRUE), nosim), 1, mean),
          apply(matrix(sample(1 : 6, nosim * 3, replace = TRUE), nosim), 1, mean),
          apply(matrix(sample(1 : 6, nosim * 4, replace = TRUE), nosim), 1, mean)),
    size = factor(rep(1 : 4, rep(nosim, 4))))
# plot histograms of means by sample size
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth = .25, colour = "black")
g + facet_grid(. ~ size)
```

$\pagebreak$

## Variance

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# generate x value ranges
xvals <- seq(-10, 10, by = .01)
# generate data from normal distribution for sd of 1 to 4
dat <- data.frame(
    y = c(dnorm(xvals, mean = 0, sd = 1),
          dnorm(xvals, mean = 0, sd = 2),
          dnorm(xvals, mean = 0, sd = 3),
          dnorm(xvals, mean = 0, sd = 4)),
    x = rep(xvals, 4),
    factor = factor(rep(1 : 4, rep(length(xvals), 4)))
)
# plot 4 lines for the different standard deviations
ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)
```

* **variance** = measure of spread or dispersion: the expected squared distance of the variable from its mean (expressed in $X$'s units$^2$) - as we can see above, higher variance $\rightarrow$ more spread, lower variance $\rightarrow$ less spread
* $Var(X) = E[(X-\mu)^2] = E[X^2] - E[X]^2$
* **standard deviation** $= \sqrt{Var(X)}$ $\rightarrow$ has the same units as $X$
* ***example***
    * for a die roll, $E[X] = 3.5$
    * $E[X^2] = 1^2 \times 1/6 + 2^2 \times 1/6 + 3^2 \times 1/6 + 4^2 \times 1/6 + 5^2 \times 1/6 + 6^2 \times 1/6 = 15.17$
    * $Var(X) = E[X^2] - E[X]^2 \approx 2.92$
* ***example***
    * for a coin flip, $E[X] = p$
    * $E[X^2] = 0^2 \times (1 - p) + 1^2 \times p = p$
    * $Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1-p)$

### Sample Variance

* the **sample variance** is defined as $$S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}$$

```{r echo = FALSE, fig.width = 4, fig.height = 3, fig.align = 'center'}
grid.raster(readPNG("figures/6.png"))
```

* on the above line representing the population (in magenta), any subset of the data (3 of 14 points selected, marked in blue) will most likely have a variance that is ***lower than*** the population variance, because the sample values cluster around their own mean $\bar X$ rather than around $\mu$
* dividing by $n - 1$ makes the variance estimator ***larger*** to adjust for this fact $\rightarrow$ leads to more accurate estimation $\rightarrow$ $S^2$ = the so-called ***unbiased estimate of the population variance***
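* ***Note**: a minimal sketch (not in the original notes) confirming that R's `var()` uses the $n - 1$ denominator*

```{r}
# compare the n - 1 and n denominators against var()
set.seed(1)
x <- rnorm(10); n <- length(x)
sum((x - mean(x))^2) / (n - 1)  # matches var(x)
var(x)
sum((x - mean(x))^2) / n        # the biased version is smaller
```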
* $S^2$ is a random variable, and therefore has an associated population distribution
* $E[S^2]$ = population variance ($S$ = sample standard deviation)
* as we see from the simulation results below, with more data the distribution of $S^2$ gets more concentrated around the population variance

```{r fig.width = 4, fig.height = 3, fig.align = 'center'}
# specify number of simulations
nosim <- 10000
# simulate data for various sample sizes
dat <- data.frame(
    x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
          apply(matrix(rnorm(nosim * 20), nosim), 1, var),
          apply(matrix(rnorm(nosim * 30), nosim), 1, var)),
    n = factor(rep(c("10", "20", "30"), c(nosim, nosim, nosim)))
)
# plot density function for different sample size data
ggplot(dat, aes(x = x, fill = n)) + geom_density(size = 1, alpha = .2) + geom_vline(xintercept = 1, size = 1)
```

* ***Note**: for any variable, a property of the population = **parameter**; the estimate of that property from a sample = **statistic*** - below is a summary of the mean and variance for the population and the sample

```{r echo = FALSE, fig.width = 6, fig.height = 5, fig.align = 'center'}
grid.raster(readPNG("figures/8.png"))
```

* **distribution of the mean of random samples**
    * expected value of the distribution of sample means = expected value of the sample mean = population mean
        * $E[\bar X]=\mu$
    * variance of the distribution of sample means
        * $Var(\bar X) = \sigma^2/n$
        * as $n$ becomes larger, the mean of a random sample becomes more concentrated around the population mean $\rightarrow$ the variance approaches 0
        * this again confirms that the sample mean estimates the population mean
    * ***Note**: normally we only have 1 sample mean (from the collected sample) but can estimate the variance $\sigma^2$ $\rightarrow$ so we know a lot about the **distribution of the means** from the data observed*
* **standard error (SE)**
    * the standard error of the mean is defined as $$SE_{mean} = \sigma/\sqrt{n}$$
    * this quantity is effectively the standard deviation of the sampling distribution of a statistic (here, the mean)
    * represents the variability of sample means
### The Estimator-Estimand Relationship

* start with a sample
* $S^2$ = sample variance
    * estimates how variable the population is
    * estimates the population variance $\sigma^2$
    * $S^2$ = a random variable with its own distribution, centered around $\sigma^2$
    * more concentrated around $\sigma^2$ as $n$ increases
* $\bar X$ = sample mean
    * estimates the population mean $\mu$
    * $\bar X$ = a random variable with its own distribution, centered around $\mu$
    * more concentrated around $\mu$ as $n$ increases
    * variance of the distribution of $\bar X = \sigma^2/n$
        * estimate of that variance = $S^2/n$
        * estimate of the standard error = $S/\sqrt{n}$ $\rightarrow$ the "sample standard error of the mean" - estimates how variable means of samples of size $n$ from the population are

### Example - Standard Normal

* variance = 1
* means of $n$ standard normals (a sample) have standard deviation $1/\sqrt{n}$

```{r message = F, warning = F}
# specify number of simulations with 10 as number of observations per sample
nosim <- 1000; n <- 10
# estimated standard deviation of mean
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
# actual standard deviation of mean of standard normals
1 / sqrt(n)
```

* `rnorm()` = generates samples from the standard normal
* `matrix()` = puts the samples into a `nosim` $\times$ $n$ matrix, so that each row represents one simulated sample of $n$ observations
* `apply()` = calculates the mean of the $n$ observations in each row
* `sd()` = returns the standard deviation

### Example - Standard Uniform

* the standard uniform has a flat (rectangular) density on $[0, 1]$ $\rightarrow$ mean = 1/2 and variance = 1/12
* means of random samples of $n$ uniforms have standard deviation $1/\sqrt{12 \times n}$

```{r message = F, warning = F}
# estimated standard deviation of the sample means
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
# actual standard deviation of the means
1/sqrt(12*n)
```

### Example - Poisson

* $Poisson(\lambda)$ has variance $\lambda$, so $Poisson(4)$ has standard deviation 2
* means of random samples of $n$ $Poisson(4)$ variables have standard deviation $2/\sqrt{n}$

```{r message = F, warning = F}
# estimated standard deviation of the sample means
sd(apply(matrix(rpois(nosim * n, lambda = 4), nosim), 1, mean))
# actual standard deviation of the means
2/sqrt(n)
```

### Example - Bernoulli

* for $p = 0.5$, the Bernoulli distribution has variance 0.25
* means of random samples of $n$ coin flips have standard deviation $1 / (2 \sqrt{n})$

```{r message = F, warning = F}
# estimated standard deviation of the sample means
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE), nosim), 1, mean))
# actual standard deviation of the means
1/(2*sqrt(n))
```

### Example - Father/Son

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load data
library(UsingR); data(father.son)
# define son height as the x variable
x <- father.son$sheight
# n is the number of observations
n <- length(x)
# plot histogram for son's heights
g <- ggplot(data = father.son, aes(x = sheight))
g <- g + geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth = 1, colour = "black")
g <- g + geom_density(size = 2, colour = "black")
g
# we calculate the parameters for variance of distribution and sample mean,
round(c(sampleVar = var(x), sampleMeanVar = var(x) / n,
# as well as standard deviation of distribution and sample mean
        sampleSd = sd(x), sampleMeanSd = sd(x) / sqrt(n)), 2)
```

$\pagebreak$
## Binomial Distribution

* **binomial random variable** = sum of $n$ Bernoulli variables $$X = \sum_{i=1}^n X_i$$ where $X_1,\ldots,X_n$ are IID $Bernoulli(p)$
* the PMF is defined as $$P(X=x) = {n \choose x}p^x(1-p)^{n-x}$$ where ${n \choose x}$ = number of ways of selecting $x$ items out of $n$ options, without replacement and without regard to order, for $x=0,\ldots,n$
* the **combination** or "$n$ choose $x$" is defined as $${n \choose x} = \frac{n!}{x!(n-x)!}$$
* the base cases are $${n \choose n} = {n \choose 0} = 1$$
* **Bernoulli distribution** = binary outcome
    * only possible outcomes
        * 1 = "success" with probability $p$
        * 0 = "failure" with probability $1 - p$
    * the PMF is defined as $$P(X=x) = p^x(1 - p)^{1-x}$$
    * mean = $p$
    * variance = $p(1 - p)$

### Example

* of 8 children, what's the probability of 7 or more girls (50/50 chance of each)? $${8 \choose 7}.5^7(1-.5)^{1} + {8 \choose 8}.5^8(1-.5)^{0} \approx 0.04$$

```{r message = F, warning = F}
# calculate probability using PMF
choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8
# calculate probability using the CDF
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
```

* `choose(8, 7)` = R function to calculate $n$ choose $x$
* `pbinom(6, size = 8, prob = 0.5, lower.tail = TRUE)` = probability of 6 or fewer successes out of 8 trials with success probability 0.5 (CDF)
    * `lower.tail = FALSE` = returns the complement; in this case, the probability of more than 6 successes out of 8 trials with success probability 0.5

$\pagebreak$

## Normal Distribution

* normal/Gaussian distribution for a random variable $X$
    * notation = $X \sim N(\mu, \sigma^2)$
    * mean = $E[X] = \mu$
    * variance = $Var(X) = \sigma^2$
    * the PDF is defined as $$f(x)=(2\pi\sigma^2)^{-1/2}e^{-(x-\mu)^2/2\sigma^2}$$
* $X \sim N(0, 1)$ = **standard normal distribution** (standard normal random variables are often denoted $Z_1, Z_2, \ldots$)
* ***Note**: see the graph below for reference for the following observations*
    * ~68% of the normal density $\rightarrow$ within $\pm$ 1 standard deviation of $\mu$
    * ~95% of the normal density $\rightarrow$ within $\pm$ 2 standard deviations of $\mu$
    * ~99% of the normal density $\rightarrow$ within $\pm$ 3 standard deviations of $\mu$
    * $\pm$ 1.28 standard deviations from $\mu$ $\rightarrow$ 10$^{th}$ (-) and 90$^{th}$ (+) percentiles
    * $\pm$ 1.645 standard deviations from $\mu$ $\rightarrow$ 5$^{th}$ (-) and 95$^{th}$ (+) percentiles
    * $\pm$ 1.96 standard deviations from $\mu$ $\rightarrow$ 2.5$^{th}$ (-) and 97.5$^{th}$ (+) percentiles
    * $\pm$ 2.33 standard deviations from $\mu$ $\rightarrow$ 1$^{st}$ (-) and 99$^{th}$ (+) percentiles

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# plot standard normal
x <- seq(-3, 3, length = 1000)
g <- ggplot(data.frame(x = x, y = dnorm(x)), aes(x = x, y = y)) + geom_line(size = 2)
g <- g + geom_vline(xintercept = -3 : 3, size = 2)
g
```

* for any $X \sim N(\mu, \sigma^2)$, calculating the number of standard deviations each observation is from the mean ***converts*** the random variable to a ***standard normal*** (denoted $Z$ below) $$Z=\frac{X-\mu}{\sigma} \sim N(0,1)$$
* conversely, a standard normal can be converted to ***any normal distribution*** by multiplying by the standard deviation and adding the mean $$X = \mu + \sigma Z \sim N(\mu, \sigma^2)$$
* `qnorm(p, mean = mu, sd = sd)` = returns the $100p^{th}$ percentile of the given normal distribution
* `pnorm(x, mean = mu, sd = sd, lower.tail = F)` = returns the probability that an observation drawn from the given distribution is larger than the specified threshold $x$
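* ***Note**: a minimal sketch (not in the original notes) of the conversion in both directions, using IQ-style values of $\mu = 100$, $\sigma = 15$ chosen purely for illustration*

```{r}
# convert an observation to a standard normal and back
mu <- 100; sigma <- 15; x <- 130
z <- (x - mu) / sigma   # z = 2 SDs above the mean
z
mu + sigma * z          # back-transforming recovers x
# both parameterizations give the same tail probability
pnorm(x, mean = mu, sd = sigma, lower.tail = FALSE)
pnorm(z, lower.tail = FALSE)
```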
### Example

* the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50
* What's the probability of getting more than 1,160 clicks in a day?

```{r}
# calculate number of standard deviations from the mean
(1160 - 1020) / 50
# calculate probability using given distribution
pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
# calculate probability using standard normal
pnorm(2.8, lower.tail = FALSE)
```

* therefore, it is not very likely (`r pnorm(2.8, lower.tail = FALSE)*100`% chance), since 1,160 is `r (1160 - 1020) / 50` standard deviations from the mean
* What number of daily ad clicks would represent the point where 75% of days have fewer clicks (assuming days are independent and identically distributed)?

```{r}
qnorm(0.75, mean = 1020, sd = 50)
```

* therefore, `r qnorm(0.75, mean = 1020, sd = 50)` clicks represents the threshold that exceeds 75% of days

$\pagebreak$

## Poisson Distribution

* used to model counts
* mean = $\lambda$
* variance = $\lambda$
* the PMF is defined as $$P(X = x; \lambda)=\frac{\lambda^xe^{-\lambda}}{x!}$$ for $x = 0, 1, 2, \ldots$
* modeling uses for the Poisson distribution
    * count data
    * event-time/survival data $\rightarrow$ e.g. cancer trials, where some patients develop the event and some never do ("censoring")
    * contingency tables $\rightarrow$ recording results for different characteristic measurements
    * approximating binomials $\rightarrow$ instances where $n$ is large and $p$ is small (e.g. the effect of pollution on lung disease)
        * $X \sim Binomial(n, p)$
        * $\lambda = np$
    * rates $\rightarrow$ $X \sim Poisson(\lambda t)$
        * $\lambda = E[X/t]$ $\rightarrow$ expected count per unit of time
        * $t$ = total monitoring time
* `ppois(n, lambda = lambda*t)` = returns the probability of $n$ or fewer events happening given the rate $\lambda$ and time $t$

### Example

* the number of people that show up at a bus stop can be modeled with a Poisson distribution with a mean of 2.5 per hour
* after watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up in that time?

```{r}
# calculate using distribution
ppois(3, lambda = 2.5 * 4)
```

* as we can see from above, there is a `r ppois(3, lambda = 2.5 * 4)*100`% chance that 3 or fewer people in total show up at the bus stop during the 4 hours of monitoring

### Example - Approximating Binomial Distribution

* flip a coin with success probability 0.01 a total of 500 times (low $p$, large $n$)
* what's the probability of 2 or fewer successes?

```{r}
# calculate correct probability from Binomial distribution
pbinom(2, size = 500, prob = .01)
# estimate probability using Poisson distribution
ppois(2, lambda = 500 * .01)
```

* as we can see from above, the two probabilities (`r pbinom(2, size = 500, prob = .01)*100`% vs `r ppois(2, lambda = 500 * .01)*100`%) are extremely close

$\pagebreak$

## Asymptotics

* **asymptotics** = behavior of statistics as the sample size $\rightarrow$ $\infty$
    * useful for simple statistical inference/approximations
    * forms the basis for the frequentist interpretation of probabilities (the "Law of Large Numbers")

### Law of Large Numbers (LLN)

* a statistic of an IID sample (e.g. mean, variance) ***converges to*** the corresponding population quantity (e.g. population mean, population variance) as $n$ increases
* ***Note**: an estimator is **consistent** if it converges to what it is estimating*
* the sample mean/variance/standard deviation are all ***consistent estimators*** for their population counterparts
    - $\bar X_n$ is the average of the results of $n$ coin flips (i.e. the sample proportion of heads)
    - as we flip a fair coin over and over, the sample proportion ***eventually converges*** to the true probability of a head

### Example - LLN for Normal and Bernoulli Distribution

* for this example, we simulate 10,000 observations from the normal and Bernoulli distributions respectively
* we plot the cumulative sample mean as $n$ increases and compare it to the population mean

```{r fig.width = 6, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load library
library(gridExtra)
# specify number of trials
n <- 10000
# calculate sample (from normal distribution) means for different size of n
means <- cumsum(rnorm(n)) / (1 : n)
# plot sample size vs sample mean
g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g <- g + ggtitle("Normal Distribution")
# calculate sample (coin flips) means for different size of n
means <- cumsum(sample(0 : 1, n, replace = TRUE)) / (1 : n)
# plot sample size vs sample mean
p <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
p <- p + geom_hline(yintercept = 0.5) + geom_line(size = 2)
p <- p + labs(x = "Number of obs", y = "Cumulative mean")
p <- p + ggtitle("Bernoulli Distribution (Coin Flip)")
# combine plots
grid.arrange(g, p, ncol = 2)
```

* as we can see from above, for both distributions the sample means clearly approach the respective population means as $n$ increases

### Central Limit Theorem

* one of the most important theorems in statistics
* the distribution of standardized means of IID variables approaches the standard normal as the sample size $n$ increases
* in other words, for large values of $n$,
$$\frac{\mbox{Estimate} - \mbox{Mean of Estimate}}{\mbox{Std. Err. of Estimate}} = \frac{\bar X_n - \mu}{\sigma / \sqrt{n}}=\frac{\sqrt n (\bar X_n - \mu)}{\sigma} \longrightarrow N(0, 1)$$

* this translates to: the distribution of the sample mean $\bar X_n$ is approximately $N(\mu, \sigma^2/n)$
    - the distribution is centered at the population mean
    - with standard deviation = standard error of the mean
* typically the Central Limit Theorem can be applied when $n \geq 30$

### Example - CLT with Bernoulli Trials (Coin Flips)

- for this example, we simulate $n$ flips of a possibly unfair coin
- let $X_i$ be the 0 or 1 result of the $i^{th}$ flip of a possibly unfair coin
    + the sample proportion, $\hat p$, is the average of the coin flips
    + $E[X_i] = p$ and $Var(X_i) = p(1-p)$
    + the standard error of the mean is $SE = \sqrt{p(1-p)/n}$
    + in principle, normalizing the sample proportion should give an approximately standard normal distribution $$\frac{\hat p - p}{\sqrt{p(1-p)/n}} \sim N(0,~1)$$
- therefore, we flip a coin $n$ times, take the sample proportion of heads (successes with probability $p$), subtract off 0.5 (the true proportion for a fair coin), multiply the result by $2 \sqrt{n}$ (i.e. divide by the standard error $\frac{1}{2\sqrt{n}}$), and compare it to the standard normal

```{r, echo = FALSE, fig.width=6, fig.height = 3, fig.align='center'}
# specify number of simulations
nosim <- 1000
# convert to standard normal
cfunc <- function(x, n) 2 * sqrt(n) * (mean(x) - 0.5)
# simulate data for sample sizes 10, 20, and 30
dat <- data.frame(
    x = c(apply(matrix(sample(0:1, nosim*10, replace=TRUE), nosim), 1, cfunc, 10),
          apply(matrix(sample(0:1, nosim*20, replace=TRUE), nosim), 1, cfunc, 20),
          apply(matrix(sample(0:1, nosim*30, replace=TRUE), nosim), 1, cfunc, 30)),
    size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
# plot histograms for the trials
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..))
# plot standard normal distribution for reference
g <- g + stat_function(fun = dnorm, size = 2)
# plot panel plots by sample size
g + facet_grid(. ~ size)
```

* now, we can run the same simulation trials for an extremely unfair coin with $p$ = 0.9

```{r, echo = FALSE, fig.width=6, fig.height = 3, fig.align='center'}
# specify number of simulations
nosim <- 1000
# convert to standard normal
cfunc <- function(x, n) sqrt(n) * (mean(x) - 0.9) / sqrt(.1 * .9)
# simulate data for sample sizes 10, 20, and 30
dat <- data.frame(
    x = c(apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 10, replace = TRUE), nosim), 1, cfunc, 10),
          apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 20, replace = TRUE), nosim), 1, cfunc, 20),
          apply(matrix(sample(0:1, prob = c(.1,.9), nosim * 30, replace = TRUE), nosim), 1, cfunc, 30)),
    size = factor(rep(c(10, 20, 30), rep(nosim, 3))))
# plot histograms for the trials
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..))
# plot standard normal distribution for reference
g <- g + stat_function(fun = dnorm, size = 2)
# plot panel plots by sample size
g + facet_grid(. ~ size)
```
* as we can see from both simulations, the converted/standardized distributions of the samples converge to the standard normal distribution
* ***Note**: the speed at which the normalized coin flips converge to the normal distribution depends on how biased the coin is (the value of $p$)*
* ***Note**: the CLT does not guarantee that the normal distribution is a good approximation for any finite sample, just that it eventually becomes a good approximation as $n \rightarrow \infty$*

### Confidence Intervals - Normal Distribution/Z Intervals

* the **Z confidence interval** is defined as $$Estimate \pm ZQ \times SE_{Estimate}$$ where $ZQ$ = quantile from the standard normal distribution
* according to the CLT, the sample mean, $\bar X$, is approximately normal with mean $\mu$ and sd $\sigma / \sqrt{n}$
* the **95% confidence interval for the population mean $\mu$** is defined as $$\bar X \pm 2\sigma/\sqrt{n}$$ for the sample mean $\bar X \sim N(\mu, \sigma^2/n)$
    * you can choose to use 1.96 instead of 2 for a more accurate interval
    * $P(\bar{X} > \mu + 2\sigma/\sqrt{n}~\mbox{or}~\bar{X} < \mu - 2\sigma/\sqrt{n}) = 5\%$
* **interpretation**: if we were to repeatedly draw samples of size $n$ from the population and construct this confidence interval in each case, approximately 95% of the intervals would contain $\mu$
* confidence intervals get **narrower** with less variability or larger sample sizes
* ***Note**: the Poisson and binomial distributions have exact intervals that don't require the CLT*
* ***example***
    - for this example, we compute the 95% confidence interval for sons' heights in inches

```{r message = F, warning = F}
# load son height data
data(father.son); x <- father.son$sheight
# calculate confidence interval for sons height in inches
mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x))
```

### Confidence Interval - Bernoulli Distribution/Wald Interval

* for Bernoulli distributions, $X_i$ is 0 or 1 with success probability $p$, and the variance is $\sigma^2 = p(1 - p)$
* the confidence interval takes the form $$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
* since the population proportion $p$ is unknown, we can use the sample proportion of successes $\hat{p} = X/n$ as its estimate
* $p(1-p)$ is largest when $p = 1/2$, so a quick 95% confidence interval can be calculated by $$\begin{aligned} \hat{p} \pm z_{0.975} \sqrt{\frac{0.5(1-0.5)}{n}} & = \hat{p} \pm qnorm(.975) \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm 1.96 \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm \frac{1.96}{2} \sqrt{\frac{1}{n}}\\ & \approx \hat{p} \pm \frac{1}{\sqrt{n}}\\ \end{aligned}$$
* this is known as the **Wald Confidence Interval** and is useful for ***roughly estimating*** confidence intervals
    * generally need $n$ = 100 for 1 decimal place of accuracy, 10,000 for 2, and 1,000,000 for 3
* ***example***
    - suppose that in a random sample of 100 likely voters, 56 intend to vote for you; can you secure a victory?
    - we can use the Wald interval to quickly estimate the 95% confidence interval
    - as we can see below, because the interval [`r 0.56 + c(-1, 1) * 1/sqrt(100)`] contains values below 50%, victory is not guaranteed
    - `binom.test(k, n)$conf.int` = returns the confidence interval for a binomial proportion (a collection of Bernoulli trials) with `k` successes in `n` draws

```{r message = F, warning = F}
# define sample probability and size
p = 0.56; n = 100
# Wald interval
c("WaldInterval" = p + c(-1, 1) * 1/sqrt(n))
# 95% confidence interval
c("95CI" = p + c(-1, 1) * qnorm(.975) * sqrt(p * (1-p)/n))
# perform binomial test (56 successes out of 100 trials)
binom.test(p * n, n)$conf.int
```

### Confidence Interval - Binomial Distribution/Agresti-Coull Interval

* for a binomial distribution with smaller values of $n$ (when $n < 30$, thus not large enough for the CLT), the normal (Wald) confidence intervals, as defined by $$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}$$ often **do not** provide accurate estimates

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data
    phats <- rbinom(nosim, prob = p, size = n) / n
    # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(.75, 1.0)
```

* as we can see from above, the intervals do not provide adequate coverage as 95% confidence intervals (frequently only providing 80 to 90% coverage)
* we can instead construct the **Agresti-Coull Interval**, which uses the adjustment $$\hat{p} = \frac{X+2}{n+4}$$ where we effectively ***add 2*** to the number of successes, $X$, and ***add 2*** to the number of failures
* the interval therefore becomes $$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} ~~~~ \mbox{with} ~~~~ \hat{p} = \frac{X+2}{n+4}$$
* ***Note**: these intervals tend to be **conservative***
* ***example***

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data with Agresti/Coull Interval adjustment
    phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
    # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(.75, 1.0)
```

* as we can see from above, the coverage is much better for the 95% interval
* in fact, all of the estimates are more conservative, as previously discussed, indicating that the Agresti-Coull intervals are ***wider*** than the regular (Wald) confidence intervals
### Confidence Interval - Poisson Interval

* for $X \sim Poisson(\lambda t)$
    * estimated rate $\hat{\lambda} = X/t$
    * $Var(\hat{\lambda}) = \lambda/t$
    * variance estimate $= \hat{\lambda}/t$
* so the confidence interval is defined as $$\hat \lambda \pm z_{1-\alpha/2}\sqrt{\frac{\hat\lambda}{t}}$$
* however, for small values of $\lambda$ (few events over a large monitoring time), we **should not** use this asymptotic interval
* ***example***
    - for this example, we go through a specific scenario as well as a simulation exercise to demonstrate the ineffectiveness of asymptotic intervals for small values of $\lambda$
    - a nuclear pump failed 5 times in 94.32 days; give a 95% confidence interval for the failure rate per day
        + `poisson.test(x, T)$conf.int` = returns the exact Poisson 95% confidence interval for `x` events over a monitoring time `T`

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# define parameters
x <- 5; t <- 94.32; lambda <- x / t
# calculate confidence interval
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
# return accurate confidence interval from poisson.test
poisson.test(x, T = 94.32)$conf.int
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 100
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(0, 1.0)
```

* as we can see above, for small values of $\hat\lambda = X/t$, the confidence interval produced by the asymptotic formula is ***not*** an accurate estimate of an actual 95% interval (not enough coverage)
* however, as $t \to \infty$, the interval approaches the ***true 95% interval***

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 1000
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) +
    geom_hline(yintercept = 0.95) + ylim(0, 1.0)
```

* as we can see from above, as $t$ increases, the Poisson intervals get closer to actual 95% confidence intervals

### Confidence Intervals - T Distribution (Small Samples)

* the **t** confidence interval is defined as $$Estimate \pm TQ \times SE_{Estimate} = \bar X \pm \frac{t_{n-1} S}{\sqrt{n}}$$
    * $TQ$ = quantile from the t distribution
    * $t_{n-1}$ = relevant quantile
* the $t$ interval assumes the data are IID normal, so that $$\frac{\bar X - \mu}{S/\sqrt{n}}$$ follows Gosset's $t$ distribution with $n-1$ degrees of freedom
* works well for data distributions that are roughly symmetric and mound-shaped, and ***does not*** work for skewed distributions
    * for a skewed distribution, it is meaningless to center the interval around the mean $\bar X$
    * logs/medians can be used instead
* paired observations (multiple measurements on the same subjects) can be analyzed with a $t$ interval of the differences
* as more data are collected (large degrees of freedom), the t interval $\rightarrow$ z interval
* `qt(0.975, df = n - 1)` = calculates the relevant quantile using the t distribution

```{r fig.width = 6, fig.height = 3, fig.align = 'center', message = F, warning = F}
# Plot normal vs t distributions
k <- 1000; xvals <- seq(-5, 5, length = k); df <- 10
d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)), x = xvals,
    dist = factor(rep(c("Normal", "T"), c(k, k))))
g <- ggplot(d, aes(x = x, y = y))
g <- g + geom_line(size = 2, aes(colour = dist)) + ggtitle("Normal vs T Distribution")
# plot normal vs t quantiles
pvals <- seq(.5, .99, by = .01)
d <- data.frame(n = qnorm(pvals), t = qt(pvals, df), p = pvals)
h <- ggplot(d, aes(x = n, y = t))
h <- h + geom_abline(size = 2, col = "lightblue")
h <- h + geom_line(size = 2, col = "black")
h <- h + geom_vline(xintercept = qnorm(0.975))
h <- h + geom_hline(yintercept = qt(0.975, df)) + ggtitle("Normal vs T Quantiles")
# plot 2 graphs together
grid.arrange(g, h, ncol = 2)
```

* William Gosset's **t** distribution ("Student's t distribution")
    * "Student" = the pseudonym under which Gosset published
    * indexed/defined by its ***degrees of freedom***; becomes more like the standard normal as the degrees of freedom get larger
    * thicker tails than the normal, centered around 0, so the confidence interval is ***wider*** than the Z interval (more mass is concentrated away from the center)
    * for ***small*** sample sizes $n$, normalizing the sample mean by $\frac{\bar X - \mu}{S/\sqrt{n}}$ produces a t distribution, ***not*** the standard normal
        * for small $n$, the sample standard deviation $S$ may be an imprecise estimate of the population standard deviation
        * using the Z interval here may therefore produce an interval that is too ***narrow***
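* ***Note**: a minimal sketch (not in the original notes) of how much wider the t quantile makes a 95% interval for a small, arbitrarily chosen sample size*

```{r}
# width multiplier of a 95% interval: t quantile vs normal quantile
n <- 8
qt(.975, df = n - 1)    # larger for small n, so the t interval is wider
qnorm(.975)
qt(.975, df = 99)       # the gap closes as n grows
```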
### Confidence Interval - Paired T Tests

* compares observations for the same subjects across two different sets of data (i.e. different times, different treatments)
* the confidence interval is defined by $$\bar X_2 - \bar X_1 \pm \frac{t_{n-1} S_d}{\sqrt{n}}$$ where $\bar X_1$ is the mean of the first set of observations, $\bar X_2$ the mean of the second set, and $S_d$ the standard deviation of the paired differences
* `t.test(difference)` = performs a group-mean t test on the differences and returns metrics as results, which include the confidence interval
    - `t.test(g2, g1, paired = TRUE)` = performs the same paired t test with the data directly
* ***example***
    - the data used here are from a study of the effects of two soporific drugs (increase in hours of sleep compared to control) on 10 patients

```{r fig.width = 4, fig.height = 3, fig.align = 'center', message = F, warning = F}
# load data
data(sleep)
# plot the first and second observations
g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID)))
g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size = 10, pch = 21, fill = "salmon", alpha = .5)
g
# define groups
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
# define difference
difference <- g2 - g1
# calculate mean and sd of differences
mn <- mean(difference); s <- sd(difference); n <- 10
# calculate intervals manually
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
# perform the same test to get confidence intervals
t.test(difference)
t.test(g2, g1, paired = TRUE)
```

### Independent Group **t** Intervals - Same Variance

* compare two groups in a randomized trial ("A/B testing")
* cannot use the paired t test because the groups are independent and may have different sample sizes
* perform randomization to balance unobserved covariates that may otherwise affect the result
* the $t$ confidence interval for $\mu_y - \mu_x$ is defined as $$\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}$$
    * $t_{n_x + n_y - 2, 1 - \alpha/2}$ = relevant quantile
        * $n_x + n_y - 2$ = degrees of freedom
    * $S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}$ = standard error
    * $S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$ = pooled variance estimator
        * this is effectively a weighted average of the two variances, such that different sample sizes are taken into account
        * for equal sample sizes, $n_x = n_y$, $S_p^2 = \frac{S_x^2 + S_y^2}{2}$ (average of the variances of the two groups)
* ***Note:** this interval assumes **constant variance** across the two groups; if the variances differ, use the next interval*

### Independent Group t Intervals - Different Variance

* the confidence interval for $\mu_y - \mu_x$ is defined as $$\bar Y - \bar X \pm t_{df} \times \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}$$
    * $t_{df}$ = relevant quantile with $df$ as defined below
        * ***Note**: the normalized statistic does not exactly follow a t distribution, but can be approximated with the $df$ defined below* $$df = \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)}$$
    * $\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}$ = standard error
* comparing other kinds of data
    * binomial $\rightarrow$ relative risk, risk difference, odds ratio
    * binomial $\rightarrow$ Chi-squared test, normal approximations, exact tests
    * count $\rightarrow$ Chi-squared test, exact tests
* R commands (a worked sketch follows this list)
    * t confidence intervals
        * `mean + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)`
        * ***c(-1, 1)*** = plus and minus, $\pm$
    * difference intervals (all equivalent)
        * `mean2 - mean1 + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)`
            * ***n*** = number of paired observations
            * ***qt(0.975, n - 1)*** = relevant quantile for paired data
            * ***qt(0.975, n$_x$ + n$_y$ - 2)*** = relevant quantile for independent groups
        * `t.test(data2 - data1)`
        * `t.test(data2, data1, paired = TRUE, var.equal = TRUE)`
            * ***paired*** = whether or not the two sets of data are paired (same subjects, different observations per treatment) $\rightarrow$ `TRUE` for paired, `FALSE` for independent
            * ***var.equal*** = whether or not the variances of the datasets should be treated as equal $\rightarrow$ `TRUE` for same variance, `FALSE` for unequal variances
        * `t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)`
            * ***relevel(factor, ref)*** = reorders the levels in the factor so that "ref" becomes the first level $\rightarrow$ done here so that the second set of measurements comes first (1, 2 $\rightarrow$ 2, 1) in order to perform mean$_2$ - mean$_1$
            * ***I(object)*** = prepends the class "AsIs" to the object
            * ***Note**: I(relevel(group, 2)) = the explanatory variable; it must be a **factor** and have **two levels***
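* ***Note**: a minimal sketch (not in the original notes) of the independent-group interval, reusing the `sleep` groups **as if** they were independent — an assumption made purely for illustration, since these subjects are actually paired*

```{r}
# treat the two sleep groups as independent (illustration only)
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
n1 <- length(g1); n2 <- length(g2)
# pooled variance estimator for the same-variance interval
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
mean(g2) - mean(g1) +
    c(-1, 1) * qt(.975, n1 + n2 - 2) * sqrt(sp2 * (1/n1 + 1/n2))
# same interval via t.test; var.equal = FALSE gives the unequal-variance version
t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf.int
```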
$\pagebreak$

## Hypothesis Testing

* hypothesis testing = making decisions using data
* **null** hypothesis (**H$_0$**) = status quo
    * assumed to be ***true*** $\rightarrow$ statistical evidence is required to reject it in favor of the **alternative** or "research" hypothesis (**H$_a$**)
    * the alternative hypothesis typically takes the form of $>$, $<$, or $\ne$
* **results**

Truth | Decide | Result |
---|---|---|
$H_0$ | $H_0$ | Correctly accept null |
$H_0$ | $H_a$ | Type I error |
$H_a$ | $H_a$ | Correctly reject null |
$H_a$ | $H_0$ | Type II error |

* **$\alpha$** = Type I error rate
    * probability of ***rejecting*** the null hypothesis when the hypothesis is ***correct***
    * $\alpha$ = 0.05 $\rightarrow$ standard for hypothesis testing
    * ***Note**: as the Type I error rate increases, the Type II error rate decreases, and vice versa*
* for large samples (large $n$), use the **Z test** for $H_0:\mu = \mu_0$
    * **$H_a$:**
        * $H_1: \mu < \mu_0$
        * $H_2: \mu \neq \mu_0$
        * $H_3: \mu > \mu_0$
    * test statistic $TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
    * reject the null hypothesis $H_0$ when
        * $H_1: TS \leq Z_{\alpha} = -Z_{1 - \alpha}$
        * $H_2: |TS| \geq Z_{1 - \alpha / 2}$
        * $H_3: TS \geq Z_{1 - \alpha}$
    * ***Note**: in the case of $\alpha$ = 0.05 (most common), $Z_{1-\alpha}$ = 1.645 (the 95$^{th}$ percentile)*
    * $\alpha$ is set low so that, when $H_0$ is rejected, either the original model was wrong or we made an error with low probability
* for small samples (small $n$), use the **T test** for $H_0:\mu = \mu_0$
    * **$H_a$:**
        * $H_1: \mu < \mu_0$
        * $H_2: \mu \neq \mu_0$
        * $H_3: \mu > \mu_0$
    * test statistic $TS = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
    * reject the null hypothesis $H_0$ when
        * $H_1: TS \leq T_{\alpha} = -T_{1 - \alpha}$
        * $H_2: |TS| \geq T_{1 - \alpha / 2}$
        * $H_3: TS \geq T_{1 - \alpha}$
    * ***Note**: in the case of $\alpha$ = 0.05 (most common), $T_{1-\alpha}$ = `qt(.95, df = n - 1)`*
* R commands for the T test (a worked sketch follows this list):
    * `t.test(vector1 - vector2)`
    * `t.test(vector1, vector2, paired = TRUE)`
    * the `alternative` argument can be used to specify one-sided tests: `less` or `greater`
        * `alternative` default = `two.sided`
    * prints the test statistic (`t`), degrees of freedom (`df`), `p-value`, 95% confidence interval, and sample mean
    * the confidence interval is in the units of the data and can be used to interpret the practical significance of the results
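* ***Note**: a minimal sketch (not in the original notes) of a one-sample t test of $H_0: \mu = \mu_0$, reusing the `father.son` heights with an arbitrary null value of 68 inches chosen purely for illustration*

```{r}
# test H0: mu = 68 vs Ha: mu != 68 for son heights
x <- father.son$sheight; n <- length(x)
ts <- (mean(x) - 68) / (sd(x) / sqrt(n))  # test statistic computed by hand
ts
qt(.975, df = n - 1)                      # rejection threshold for alpha = 0.05
# equivalent built-in test
t.test(x, mu = 68)
```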
* **rejection region** = region of TS values for which $H_0$ is rejected
* **power** = probability of rejecting $H_0$
    * power is used to calculate sample sizes for experiments
* **two-sided tests** $\rightarrow$ $H_a: \mu \neq \mu_0$
    * reject $H_0$ only if the test statistic is too large or too small
    * for $\alpha$ = 0.05, split the error rate equally: 2.5% for the upper tail and 2.5% for the lower tail
    * equivalent to $|TS| \geq T_{1 - \alpha / 2}$
    * example: for the T test, `qt(.975, df)` and `qt(.025, df)`
    * ***Note**: failing to reject the one-sided test $\Rightarrow$ failing to reject the two-sided test*
* **tests vs confidence intervals**
    * the ($1-\alpha$)% confidence interval for $\mu$ = set of all values of $\mu_0$ for which we fail to reject $H_0$
    * if the ($1-\alpha$)% confidence interval contains $\mu_0$, fail to reject $H_0$
* **two-group intervals/tests**
    * rejection rules are the same
    * test $H_0$: $\mu_1 = \mu_2$ $\rightarrow$ $\mu_1 - \mu_2 = 0$
    * test statistic: $$\frac{Estimate - H_0~Value}{SE_{Estimate}} = \frac{\bar X_1 - \bar X_2 - 0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
    * R command
        * `t.test(values ~ factor, paired = FALSE, var.equal = TRUE, data = data)`
        * `paired = FALSE` = independent groups
        * the `factor` argument must have only two levels
* **p-values** (a worked sketch follows this list)
    * most common measure of statistical significance
    * **p-value** = probability, under the null hypothesis, of obtaining evidence as extreme as or more extreme than what was observed
        * given that $H_0$ is true, how likely is it to obtain the observed result (test statistic)?
    * **attained significance level** = smallest value of $\alpha$ for which $H_0$ is rejected $\rightarrow$ equivalent to the p-value
    * if p-value < $\alpha$, reject $H_0$
    * for two-sided tests, double the smaller of the two one-sided p-values
    * if the p-value is small, either $H_0$ is true AND the observed result is a rare event, **OR** $H_0$ is false
    * R command
        * p-value = `pt(statistic, df, lower.tail = FALSE)`
        * `lower.tail = FALSE` = returns the probability of getting a value from the t distribution that is larger than the test statistic
    * binomial (coin flips)
        * probability of getting more than `x` successes out of `n` trials with success probability `p` = `pbinom(x, size = n, prob = p, lower.tail = FALSE)`
        * two-sided test (testing for $\ne$): find the smaller of the two one-sided p-values ($P(X \le x)$, $P(X \ge x)$) and double the result
        * ***Note**: `lower.tail = FALSE` = strictly greater*
    * Poisson
        * probability of getting $x$ or more events given the expected count `r` = `ppois(x - 1, r, lower.tail = FALSE)`
        * `x - 1` is used here because `lower.tail = FALSE` is strictly greater: $P(X > x - 1) = P(X \geq x)$
        * `r` = the number of events expected over the monitoring period (the rate multiplied by the time monitored)
        * ***Note**: `lower.tail = FALSE` = strictly greater*
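* ***Note**: a minimal sketch (not in the original notes) of these p-value commands, reusing the earlier 8-children binomial example and arbitrary illustration values for the Poisson*

```{r}
# one-sided binomial p-value: P(X >= 7) out of 8 fair-coin trials
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
# two-sided p-value: double the smaller of the two one-sided p-values
2 * min(pbinom(6, size = 8, prob = .5, lower.tail = FALSE),
        pbinom(7, size = 8, prob = .5))
# Poisson p-value: P(X >= 10) when 5 events were expected
ppois(9, 5, lower.tail = FALSE)
```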
$\pagebreak$

## Power

* **Power** = probability of rejecting the null hypothesis when it is false (the more power the better)
    * most often used in designing studies to ensure a reasonable chance of detecting the alternative hypothesis if the alternative hypothesis is true
* $\beta$ = probability of a Type II error = failing to reject the null hypothesis when it is false
    * power = $1 - \beta$
* ***example***
    * $H_0: \mu = 30 \to \bar X \sim N(\mu_0, \sigma^2/n)$
    * $H_a: \mu > 30 \to \bar X \sim N(\mu_a, \sigma^2/n)$
    * Power: $$Power = P\left(\frac{\bar X - 30}{s /\sqrt{n}} > t_{1-\alpha,n-1} ~;~ \mu = \mu_a \right)$$
    * ***Note**: the above probability depends on the value of $\mu_a$ *
    * ***Note**: as $\mu_a$ approaches 30, power approaches $\alpha$ *
    * assuming the sample mean is normally distributed, $H_0$ is rejected when $\frac{\bar X - 30}{\sigma /\sqrt{n}} > Z_{1-\alpha}$
        * or, $\bar X > 30 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}$
    * R commands:
        * `alpha = 0.05; z = qnorm(1 - alpha)` $\rightarrow$ calculates $Z_{1-\alpha}$
        * `pnorm(mu0 + z * sigma/sqrt(n), mean = mua, sd = sigma/sqrt(n), lower.tail = FALSE)` $\rightarrow$ calculates the probability of getting a sample mean larger than the rejection threshold $\mu_0 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}}$, given that the population mean is $\mu_a$
        * ***Note**: using `mean = mu0` in the function would return $\alpha$ *
* Power curve behavior
    * power increases as $\mu_a$ increases $\rightarrow$ we are more likely to detect the difference between $\mu_a$ and $\mu_0$
    * power increases as **n** increases $\rightarrow$ with more data, we are more likely to detect any alternative $\mu_a$

```{r, fig.align='center', fig.height=4, fig.width=6}
library(ggplot2)
library(reshape2)

# parameters
mu0 = 30; sigma = 4; alpha = 0.05
z = qnorm(1 - alpha)
nseq = c(8, 16, 32, 64, 128)
mu_a = seq(30, 35, by = 0.1)

# power at every mu_a, for each candidate sample size
power = sapply(nseq, function(n)
    pnorm(mu0 + z * sigma / sqrt(n), mean = mu_a, sd = sigma / sqrt(n),
          lower.tail = FALSE))
colnames(power) <- paste("n", nseq, sep = "")

# reshape to long format for plotting
d <- data.frame(mu_a, power)
d2 <- melt(d, id.vars = "mu_a")
names(d2) <- c("mu_a", "n", "power")

g <- ggplot(d2, aes(x = mu_a, y = power, col = n)) + geom_line(size = 2)
g
```

* **Solving for Power**
    * when testing $H_a : \mu > \mu_0$ (or $<$ or $\neq$) $$Power = 1 - \beta = P\left(\bar X > \mu_0 + Z_{1-\alpha} \frac{\sigma}{\sqrt{n}} ~;~ \mu = \mu_a \right)$$ where $\bar X \sim N(\mu_a, \sigma^2 / n)$
    * Unknowns = $\mu_a$, $\sigma$, $n$, $\beta$
    * Knowns = $\mu_0$, $\alpha$
    * specify any 3 of the unknowns and you can solve for the remaining one; the two most common cases are
        1. given the desired power, the mean to detect, and the variance we can tolerate, find the **n** that produces the desired power (designing an experiment/trial)
        2. given the sample size **n**, find the power that is achievable (assessing the utility of an experiment)
    * ***Note**: for $H_a: \mu \neq \mu_0$, calculate one-sided power using $Z_{1-\alpha / 2}$; this calculation excludes the probability of getting a large TS in the opposite direction of the truth, but that only matters when $\mu_a$ and $\mu_0$ are close together *
* **Power Behavior**
    * power increases as $\alpha$ becomes larger
    * power of a one-sided test $>$ power of the associated two-sided test
    * power increases as $\mu_a$ gets further away from $\mu_0$
    * power increases as **n** increases (sample mean has less variability)
    * power increases as $\sigma$ decreases (again, less variability)
    * power usually depends only on $\frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}$, and not on $\mu_a$, $\sigma$, and $n$ separately
        * **effect size** = $\frac{\mu_a - \mu_0}{\sigma}$ $\rightarrow$ unit free, can be interpreted across settings
* **T-test Power**
    * for Gosset's T test, $$Power = P\left(\frac{\bar X - \mu_0}{S/\sqrt{n}} > t_{1-\alpha, n-1} ~;~ \mu = \mu_a \right)$$
    * $\frac{\bar X - \mu_0}{S/\sqrt{n}}$ does not follow a t distribution if the true mean is $\mu_a$ and NOT $\mu_0$ $\rightarrow$ it follows a non-central t distribution instead
    * `power.t.test` = evaluates the non-central t distribution and solves for one parameter given that all the others are specified (see the sketch after this list)
        * `power.t.test(n = 16, delta = 0.5, sd = 1, type = "one.sample", alt = "one.sided")$power` = calculates power given n, the difference in means, and the standard deviation
        * `delta` = argument for the difference in means
        * ***Note**: since effect size = `delta/sd`, holding `n`, `type`, and `alt` constant, any distribution with the same effect size has the same power *
        * `power.t.test(power = 0.8, delta = 0.5, sd = 1, type = "one.sample", alt = "one.sided")$n` = calculates the sample size n given power, difference in means, and standard deviation
        * ***Note**: n should always be rounded up (`ceiling`) *
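* a quick sketch of both directions with `power.t.test`, using the same inputs as above:

```{r}
# power achievable with n = 16 at effect size delta/sd = 0.5
power.t.test(n = 16, delta = 0.5, sd = 1,
    type = "one.sample", alt = "one.sided")$power

# sample size needed for 80% power at the same effect size
n <- power.t.test(power = 0.8, delta = 0.5, sd = 1,
    type = "one.sample", alt = "one.sided")$n
ceiling(n)  # always round n up
```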
$\pagebreak$

## Multiple Testing

* hypothesis testing/significance analysis is commonly overused
* correct for multiple testing to avoid false positives/conclusions (two key components)
    1. error measure
    2. correction
* multiple testing is needed because of the rise of ubiquitous data collection and analysis technologies
    * DNA sequencing machines
    * imaging patients in clinical studies
    * electronic medical records
    * individualized movement data (Fitbit)

### Types of Errors

 | Actual $H_0$ = True | Actual $H_a$ = True | Total
----------------------|--------------|--------------|---------
Conclude $H_0$ = True (non-significant) | $U$ | $T$ | $m-R$
Conclude $H_a$ = True (significant) | $V$ | $S$ | $R$
**Total** | $m_0$ | $m-m_0$ | $m$

* **$m_0$** = number of true null hypotheses, or cases where $H_0$ = actually true (unknown)
* **$m - m_0$** = number of true alternative hypotheses, or cases where $H_a$ = actually true (unknown)
* **$R$** = number of null hypotheses rejected, or cases where $H_a$ = concluded to be true (measurable)
* **$m - R$** = number of null hypotheses that failed to be rejected, or cases where $H_0$ = concluded to be true (measurable)
* **$V$** = Type I errors / false positives, concluding $H_a$ = true when $H_0$ = actually true
* **$T$** = Type II errors / false negatives, concluding $H_0$ = true when $H_a$ = actually true
* **$S$** = true positives, concluding $H_a$ = true when $H_a$ = actually true
* **$U$** = true negatives, concluding $H_0$ = true when $H_0$ = actually true

### Error Rates

* ***false positive rate*** = rate at which false results are called significant, $E[\frac{V}{m_0}]$ $\rightarrow$ average fraction of times that $H_a$ is claimed to be true when $H_0$ is actually true
    * ***Note**: mathematically equal to the Type I error rate, but conceptually distinct $\rightarrow$ the false positive rate is an after-the-fact quantity, the expected number of false positives divided by the number of truly null hypotheses under the real mix of true and false nulls (disregarding the "global null" hypothesis); since it is not a parameter the researcher controls, it cannot be identified with the significance level, which is what determines the Type I error rate *
* ***family-wise error rate (FWER)*** = probability of at least one false positive, $Pr(V \ge 1)$
* ***false discovery rate (FDR)*** = rate at which claims of significance are false, $E[\frac{V}{R}]$
* **controlling error rates (adjusting $\alpha$)**
    * false positive rate
        * if we call all $P < \alpha$ significant (reject $H_0$), we expect $\alpha \times m$ false positives, where $m$ = total number of hypothesis tests performed
        * with large values of $m$, the expected number of false positives is very large as well
    * family-wise error rate (FWER)
        * controlling FWER = controlling the probability of even one false positive
        * *Bonferroni* correction (oldest multiple testing correction)
            * for $m$ tests, we want $Pr(V \ge 1) < \alpha$
            * calculate p-values normally, and deem them significant if and only if $P < \alpha_{fwer} = \alpha / m$
            * easy to calculate, but tends to be very ***conservative***
    * false discovery rate (FDR)
        * most popular correction = controlling FDR (the Benjamini-Hochberg procedure, sketched below)
        * for $m$ tests, we want $E[\frac{V}{R}] \le \alpha$
        * calculate p-values normally and sort them from smallest to largest $\rightarrow$ $P_{(1)}, P_{(2)}, \ldots, P_{(m)}$
        * deem $P_{(i)}$ significant if $P_{(i)} \le \alpha \times \frac{i}{m}$
        * easy to calculate and less conservative, but allows for more false positives and may behave strangely under dependence (related hypothesis tests/regression with different variables)
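* a minimal sketch of the Benjamini-Hochberg cutoff applied by hand, with ten made-up p-values (`p.adjust` arrives at the same decisions):

```{r}
# ten hypothetical p-values, alpha = 0.20
p <- sort(c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216))
alpha <- 0.20
m <- length(p)
i <- seq_len(m)

# compare the i-th smallest p-value against alpha * i / m
# (the full step-up procedure rejects everything up to the largest i that passes)
p <= alpha * i / m

# equivalent decisions via BH-adjusted p-values
p.adjust(p, method = "BH") <= alpha
```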
* ***example***
    * 10 p-values with $\alpha = 0.20$

![alt text](figures/9.png)

* **adjusting the p-values (instead of $\alpha$)**
    * ***Note**: adjusting the p-values fundamentally changes their properties, but the adjusted values can then be compared directly to $\alpha$ without modifying it *
    * _Bonferroni_ (FWER)
        * $P_i^{fwer} = \min(m P_i,~ 1)$ $\rightarrow$ since a p-value cannot exceed 1
        * deem p-values significant if $P_i^{fwer} < \alpha$
        * equivalent to controlling the FWER

### Example

```{r}
set.seed(1010093)
pValues <- rep(NA, 1000)
for(i in 1:1000){
    x <- rnorm(20)
    # first 500: beta = 0 (null true); last 500: beta = 2 (alternative true)
    if(i <= 500){y <- rnorm(20)} else {y <- rnorm(20, mean = 2 * x)}
    # p-value for the slope from the linear model (the [2, 4] entry of the coefficient table)
    pValues[i] <- summary(lm(y ~ x))$coeff[2, 4]
}
trueStatus <- rep(c("zero", "not zero"), each = 500)

# no correction: controls the false positive rate
table(pValues < 0.05, trueStatus)

# Bonferroni: controls the FWER
table(p.adjust(pValues, method = "bonferroni") < 0.05, trueStatus)

# Benjamini-Hochberg: controls the FDR
table(p.adjust(pValues, method = "BH") < 0.05, trueStatus)
```

$\pagebreak$

## Resampled Inference

* **Bootstrap** = useful tool for constructing confidence intervals and calculating standard errors for difficult statistics
    * ***principle*** = if a statistic's (e.g. the median's) sampling distribution is unknown, use the distribution defined by the data to approximate it
    * ***procedure***
        1. simulate $n$ observations **with replacement** from the observed data $\rightarrow$ results in $1$ simulated complete data set
        2. calculate the desired statistic (e.g. the median) for the simulated data set
        3. repeat the above steps $B$ times, resulting in $B$ simulated statistics
        4. these statistics are approximately drawn from the sampling distribution of the true statistic of $n$ observations
        5. perform one of the following
            * plot a histogram
            * calculate the standard deviation of the statistic to estimate its standard error
            * take quantiles (2.5^th^ and 97.5^th^) as a confidence interval for the statistic ("_bootstrap CI_")
* ***example***
    * bootstrap procedure for calculating the confidence interval for the median from a data set of $n$ observations $\rightarrow$ approximate the sampling distribution

```{r fig.align='center', fig.height=4, fig.width=6, message = F, warning = F}
# load data
library(UsingR); data(father.son)

# observed dataset
x <- father.son$sheight
# number of observations, and number of bootstrap resamples
n <- length(x)
B <- 1000

# generate samples: draw n * B values with replacement
# (cannot draw n * B elements from x, which only has n elements, without replacement)
# and arrange the results into a B x n matrix
# (every row = one bootstrap sample with n observations)
resamples <- matrix(sample(x, n * B, replace = TRUE), B, n)

# take the median of each row/generated sample
medians <- apply(resamples, 1, median)

# estimated standard error of the median
sd(medians)

# confidence interval for the median
quantile(medians, c(.025, .975))

# histogram of the bootstrapped medians
hist(medians)
```

* ***Note:** a better percentile bootstrap confidence interval = the "bias corrected and accelerated" interval in the `bootstrap` package*
* **Permutation Tests**
    * ***procedure***
        * compare groups of data, testing the null hypothesis that the distributions of the observations from each group are the same
            * ***Note**: if this is true, then the group labels/divisions are irrelevant *
        * permute the labels for the groups
        * recalculate the statistic
            * mean difference in counts
            * geometric means
            * T statistic
        * calculate the percentage of simulations where the simulated statistic was more extreme (toward the alternative) than the observed one
    * ***variations***

Data type | Statistic | Test name
---|---|---|
Ranks | rank sum | rank sum test
Binary | hypergeometric prob | Fisher's exact test
Raw data | | ordinary permutation test

    * ***Note**: randomization tests are exactly permutation tests, with a different motivation *
    * for matched data, one can randomize the signs
        * for ranks, this results in the **signed rank test**
    * permutation strategies work for regression by permuting a regressor of interest
    * permutation tests work very well in multivariate settings
* ***example***
    * we will compare groups **B** and **C** in this dataset under the null hypothesis $H_0:$ there is no difference between the groups

```{r, fig.height=4, fig.width=6, echo=FALSE, fig.align='center'}
# load data
data(InsectSprays)
# boxplot of the dataset
ggplot(InsectSprays, aes(spray, count, fill = spray)) + geom_boxplot()
```

```{r}
# subset to only the "B" and "C" groups
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"), ]
# values
y <- subdata$count
# labels
group <- as.character(subdata$spray)
# mean difference between the groups
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
observedStat
```

* the observed difference between the groups is `r observedStat`
* now we permute the labels for groups **B** and **C** and recompute the statistic

```{r}
# create 10000 permutations of the data with the labels shuffled
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
# fraction of permutations whose mean difference is bigger than the observed one
mean(permutations > observedStat)
```

* we created 10,000 permutations
from the observed dataset, and found ***no permutations*** with a mean difference between groups **B** and **C** larger than the one in the original data
    * therefore, the p-value is very small, and we can ***reject the null hypothesis*** at any reasonable $\alpha$ level
* below is the plot of the null distribution formed by the permutations

```{r, echo= FALSE, fig.width=6, fig.height=4, fig.align='center'}
# plot the distribution of the permuted statistics
ggplot(data.frame(permutations = permutations), aes(permutations)) +
    geom_histogram(fill = "lightblue", color = "black", binwidth = 1) +
    geom_vline(xintercept = observedStat, size = 2)
```

* as we can see from the black line, the observed difference/statistic is very far from the bulk of the null distribution $\rightarrow$ 0 is likely ***not*** the true difference
* with this information, formal confidence intervals can be constructed and p-values can be calculated (see the sketch below)
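* as a closing sketch (reusing the `permutations` and `observedStat` objects from above), a two-sided permutation p-value and a simple permutation interval might look like:

```{r}
# two-sided permutation p-value: how often is a permuted statistic
# at least as extreme, in absolute value, as the observed one?
mean(abs(permutations) >= abs(observedStat))

# central 95% of the null distribution; an observed statistic outside
# this range corresponds to rejection at alpha = 0.05
quantile(permutations, c(0.025, 0.975))
```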