Overview

Probability

General Probability Rules

  • formalized by the Russian mathematician Kolmogorov; also known as “probability calculus”
  • a probability is a function that takes any set of outcomes (an event) and assigns it a number between 0 and 1
    • \(0 \le P(E) \le 1\), where \(E\) = event
  • the probability that nothing occurs is 0 (impossible; e.g. a die roll must produce some outcome), and the probability that something occurs is 1 (certain)
  • for equally likely outcomes, the probability of an outcome or event \(E\), \(P(E)\), = ratio of the number of ways \(E\) can occur to the number of all possible outcomes
  • probability of something = 1 - probability of the opposite occurring
  • probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) = sum of respective probabilities

  • if \(A\) implies the occurrence of \(B\), then \(P(A) \le P(B)\)

  • for any two events, the probability that at least one occurs = the sum of their probabilities - the probability of their intersection (in other words, probabilities cannot simply be added if the events have a non-trivial intersection)

  • for independent events \(A\) and \(B\), \(P(A \:\cap\: B) = P(A) \times P(B)\)
  • for an outcome that can occur through several mutually exclusive combinations of events, \(P(E_{total}) = \sum P(E_{part})\)
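  • as a quick sanity check, these rules can be verified by simulation; below is a minimal sketch (the die events \(A\) = even roll and \(B\) = roll of 4 or higher are made-up examples)
# simulate many fair die rolls (illustrative sketch)
set.seed(1)
rolls <- sample(1:6, 100000, replace = TRUE)
# hypothetical events: A = even roll, B = roll of 4 or higher
A <- rolls %% 2 == 0; B <- rolls >= 4
# P(A or B) and P(A) + P(B) - P(A and B) agree (both near 4/6)
mean(A | B)
mean(A) + mean(B) - mean(A & B)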

Conditional Probability

  • let B = an event so that \(P(B) > 0\)
  • the conditional probability of an event A, given B, is defined as the probability that both A and B occur divided by the probability that B occurs \[P(A\:|\:B) = \frac{P(A \:\cap\: B)}{P(B)}\]
  • if A and B are independent, then \[P(A\:|\:B) = \frac{P(A)P(B)}{P(B)} = P(A)\]
  • example
    • for die roll, \(A = \{1\}\), \(B = \{1, 3, 5\}\), then \[P(1~|~Odd) = P(A\:|\:B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} = \frac{1/6}{3/6} = \frac{1}{3}\]
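    • the same conditional probability can be approximated by simulation; a minimal sketch (the 100,000-roll sample size is arbitrary)
# simulate die rolls and estimate P(roll = 1 | roll is odd)
set.seed(42)
rolls <- sample(1:6, 100000, replace = TRUE)
# keep only the odd rolls (conditioning on B), then check how often the roll is 1
mean(rolls[rolls %% 2 == 1] == 1)   # should be close to 1/3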

Bayes’ Rule

  • definition \[P(B\:|\:A) = \frac{P(A\:|\:B)P(B)}{P(A\:|\:B)P(B)+P(A\:|\:B^c)P(B^c)}\] where \(B^c\) = the complement of event \(B\), with \(P(B^c) = 1 - P(B)\)

Random Variables

Probability Mass Function (PMF)

  • evaluates the probability that the discrete random variable takes on a specific value
    • measures the chance of a particular outcome happening
    • always \(\ge\) 0 for every possible outcome
    • \(\sum\) of the PMF over all possible values of the variable = 1
  • Bernoulli distribution example
    • X = 0 \(\rightarrow\) tails, X = 1 \(\rightarrow\) heads
      • X here represents potential outcome
    • \(P(X = x) = (\frac{1}{2})^x(\frac{1}{2})^{1-x}\) for \(X = 0, 1\)
      • \(x\) here represents a value we can plug into the PMF
      • general form \(\rightarrow\) \(p(x) = (\theta)^x(1-\theta)^{1-x}\)
  • dbinom(k, n, p) = returns the probability of getting k successes out of n trials, given that the probability of success is p
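  • as a quick check, dbinom() reproduces the PMF values above
# P(X = 1) for a single fair coin flip (the Bernoulli PMF above with theta = 0.5)
dbinom(1, size = 1, prob = 0.5)
## [1] 0.5
# P(exactly 7 successes in 8 trials)
dbinom(7, size = 8, prob = 0.5)
## [1] 0.03125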

Probability Density Function (PDF)

  • evaluates the probability density of the continuous random variable at a specific value (a density, not a probability)
    • always \(\ge\) 0 everywhere
    • total area under curve must = 1
  • areas under the PDF correspond to the probabilities of the random variable taking on that range of values

  • but the probability of the variable taking a specific value = 0 (area of a line is 0)

  • Note: the above is true because it models random variables as if they have infinite precision, when in reality they do not

  • dnorm(), dgamma(), dpois(), dunif() = return the density (probability mass, in the discrete Poisson case) at a certain value for the normal, Gamma, Poisson, and uniform distributions

Cumulative Distribution Function (CDF)

  • CDF of a random variable \(X\) = probability that the random variable is \(\le\) value \(x\)
    • \(F(x) = P(X \le x)\) = holds whether \(X\) is discrete or continuous
  • PDF = derivative of CDF
    • integrate PDF \(\rightarrow\) CDF
      • integrate(function, lower=0, upper=1) \(\rightarrow\) can be used to evaluate integrals for a specified range
  • pbinom(), pnorm(), pgamma(), ppois(), punif() = returns the cumulative probabilities from 0 up to a specified value from the binomial, normal, Gamma, Poisson, and uniform distributions
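  • as a quick sketch, integrating the standard normal PDF from -1.96 to 1.96 recovers the corresponding CDF difference
# integrate the standard normal density over (-1.96, 1.96)
integrate(dnorm, lower = -1.96, upper = 1.96)$value
## [1] 0.9500042
# equivalently, take the difference of the CDF values
pnorm(1.96) - pnorm(-1.96)
## [1] 0.9500042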

Survival Function

  • survival function of a random variable \(X\) = probability the random variable \(> x\), complement of CDF
    • \(S(x) = P(X > x) = 1 - F(x)\), where \(F(x) =\) CDF
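  • quick sketch: for a standard normal, \(S(1)\) can be computed directly or with lower.tail = FALSE
# survival probability P(X > 1) for a standard normal
1 - pnorm(1)
## [1] 0.1586553
# equivalent built-in form
pnorm(1, lower.tail = FALSE)
## [1] 0.1586553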

Quantile

  • the \(\alpha^{th}\) quantile of a distribution with distribution function F = point \(x_{\alpha}\)
    • \(F(x_{\alpha}) = \alpha\)
    • percentile = quantile with \(\alpha\) expressed as a percent
    • median = 50th percentile
    • \(\alpha\)% of the possible outcomes lie below it

  • qbeta(quantileInDecimals, 2, 1) = returns quantiles for the Beta(2, 1) distribution (see the sketch after this list)
    • works for qnorm(), qbinom(), qgamma(), qpois(), etc.
  • median estimated in this fashion = a population median
  • probability model connects data to population using assumptions
    • population median = estimand, sample median = estimator
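  • sketch from above: for the Beta(2, 1) distribution, \(F(x) = x^2\) on \([0, 1]\), so the median solves \(x_{0.5}^2 = 0.5\)
# median (50th percentile) of the Beta(2, 1) distribution
qbeta(0.5, 2, 1)
## [1] 0.7071068
# agrees with the analytic solution
sqrt(0.5)
## [1] 0.7071068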

Independence

  • two events \(A\) and \(B\) are independent if the following is true
    • \(P(A\:\cap\:B) = P(A)P(B)\)
    • \(P(A\:|\:B) = P(A)\)
  • two random variables \(X\) and \(Y\) are independent, if for any two sets, A and B, the following is true
    • \(P([X \in A]\cap[Y \in B]) = P(X \in A)P(Y \in B)\)
  • independence = statistically unrelated from one another
  • if \(A\) is independent of \(B\), then the following are true
    • \(A^c\) is independent of \(B\)
    • \(A\) is independent of \(B^c\)
    • \(A^c\) is independent of \(B^c\)

IID Random Variables

  • random variables are said to be IID if they are independent and identically distributed
    • independent = statistically unrelated from each other
    • identically distributed = all having been drawn from the same population distribution
  • IID random variables = default model for random samples = default starting point of inference

Diagnostic Test

Example

  • specificity of 98.5%, sensitivity = 99.7%, prevalence of disease = .1% \[\begin{aligned} P(D ~|~ +) & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + P(+~|~D^c)P(D^c)}\\ & = \frac{P(+~|~D)P(D)}{P(+~|~D)P(D) + \{1-P(-~|~D^c)\}\{1 - P(D)\}} \\ & = \frac{.997\times .001}{.997 \times .001 + .015 \times .999}\\ & = .062 \end{aligned}\]
  • low positive predictive value \(\rightarrow\) due to low prevalence of disease and somewhat modest specificity
    • suppose it was known that the subject uses drugs and has regular intercourse with an HIV-infected partner \(\rightarrow\) his pre-test probability of being positive is higher than the general prevalence
    • the positive test result should then be weighed as evidence on top of that higher pre-test probability
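    • a minimal sketch of this calculation in R, plugging in the numbers from the example
# sensitivity, specificity, and prevalence from the example
sens <- 0.997; spec <- 0.985; prev <- 0.001
# positive predictive value P(D | +) via Bayes' rule
(sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
## [1] 0.06238268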

Likelihood Ratios

  • diagnostic likelihood ratio of a positive test result is defined as \[DLR_+ = \frac{sensitivity}{1-specificity} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)}\]
  • diagnostic likelihood ratio of a negative test result is defined as \[DLR_- = \frac{1 - sensitivity}{specificity} = \frac{P(-\:|\:D)}{P(-\:|\:D^c)}\]
  • from Bayes’ Rule, we can derive the positive predictive value, \(P(D\:|\:+)\), and its complement, \(P(D^c\:|\:+)\) \[P(D\:|\:+) = \frac{P(+\:|\:D)P(D)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(1)}\] \[P(D^c\:|\:+) = \frac{P(+\:|\:D^c)P(D^c)}{P(+\:|\:D)P(D)+P(+\:|\:D^c)P(D^c)}~~~~~~\mbox{(2)}\]
  • if we divide equation \((1)\) by \((2)\), the two quantities have the same denominator, so we get the following \[\frac{P(D\:|\:+)}{P(D^c\:|\:+)} = \frac{P(+\:|\:D)}{P(+\:|\:D^c)} \times \frac{P(D)}{P(D^c)}\] which can also be written as \[\mbox{post-test odds of D} = DLR_+ \times \mbox{pre-test odds of D}\]
    • odds = \(p/(1-p)\)
    • \(\frac{P(D)}{P(D^c)}\) = pre-test odds, or odds of disease in absence of test
    • \(\frac{P(D\:|\:+)}{P(D^c\:|\:+)}\) = post-test odds, or odds of disease given a positive test result
    • \(DLR_+\) = factor by which the pre-test odds are multiplied, in the presence of a positive test, to obtain the post-test odds
    • \(DLR_-\) = relates the decrease in odds of disease after a negative result
  • following the previous example, with sensitivity of 0.997 and specificity of 0.985, the diagnostic likelihood ratios are as follows \[DLR_+ = .997/(1-.985) = 66 ~~~~~~ DLR_- =(1-.997)/.985 = 0.003\]
    • this indicates that, given a positive test result, the odds of disease are 66 times the pre-test odds
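    • sketch: the same post-test probability falls out of the likelihood-ratio form
# post-test odds = DLR+ x pre-test odds (values from the example above)
dlrPos <- 0.997 / (1 - 0.985)
preTestOdds <- 0.001 / 0.999
postTestOdds <- dlrPos * preTestOdds
# convert the post-test odds back to the post-test probability P(D | +)
postTestOdds / (1 + postTestOdds)
## [1] 0.06238268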

Expected Values/Mean

  • the expected value or mean of a random variable = the center of mass of its distribution; the sample mean is the center of mass of the observed data and estimates the population mean, as illustrated with the Galton height data below
# load relevant packages
library(UsingR); data(galton); library(ggplot2)
# plot galton data
g <- ggplot(galton, aes(x = child))
# add histogram for children data
g <- g + geom_histogram(fill = "salmon", binwidth=1, aes(y=..density..), colour="black")
# add density smooth
g <- g + geom_density(size = 2)
# add vertical line
g <- g + geom_vline(xintercept = mean(galton$child), size = 2)
# print graph
g

nosim <- 1000
# simulate data for sample size 1 to 4
dat <- data.frame(
    x = c(sample(1 : 6, nosim, replace = TRUE),
        apply(matrix(sample(1 : 6, nosim * 2, replace = TRUE), nosim), 1, mean),
        apply(matrix(sample(1 : 6, nosim * 3, replace = TRUE), nosim), 1, mean),
        apply(matrix(sample(1 : 6, nosim * 4, replace = TRUE), nosim), 1, mean)),
    size = factor(rep(1 : 4, rep(nosim, 4))))
# plot histograms of means by sample size
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, binwidth=.25, colour = "black")
g + facet_grid(. ~ size)

Variance

  • the variance of a random variable = the expected squared deviation from its mean, \(Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2\); densities with larger standard deviations are more spread out, as plotted below
# generate x value ranges
xvals <- seq(-10, 10, by = .01)
# generate data from normal distribution for sd of 1 to 4
dat <- data.frame(
    y = c(dnorm(xvals, mean = 0, sd = 1),
        dnorm(xvals, mean = 0, sd = 2),
        dnorm(xvals, mean = 0, sd = 3),
        dnorm(xvals, mean = 0, sd = 4)),
    x = rep(xvals, 4),
    factor = factor(rep(1 : 4, rep(length(xvals), 4)))
)
# plot 4 lines for the different standard deviations
ggplot(dat, aes(x = x, y = y, color = factor)) + geom_line(size = 2)

Sample Variance

  • the sample variance is defined as \[S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar X)^2}{n-1}\]

  • imagine the population plotted as points on a line (in magenta) with a sampled subset (e.g. 3 of 14 points, marked in blue): the subset will most likely have a variance that is lower than the population variance
  • dividing by \(n - 1\) will make the variance estimator larger to adjust for this fact \(\rightarrow\) leads to more accurate estimation \(\rightarrow\) \(S^2\) = so called unbiased estimate of population variance
    • \(S^2\) is a random variable, and therefore has an associated population distribution
      • \(E[S^2]\) = population variance, where \(S\) = sample standard deviation
      • as we see from the simulation results below, with more data, the distribution for \(S^2\) gets more concentrated around population variance
# specify number of simulations
nosim <- 10000;
# simulate data for various sample sizes
dat <- data.frame(
    x = c(apply(matrix(rnorm(nosim * 10), nosim), 1, var),
          apply(matrix(rnorm(nosim * 20), nosim), 1, var),
          apply(matrix(rnorm(nosim * 30), nosim), 1, var)),
    n = factor(rep(c("10", "20", "30"), c(nosim, nosim, nosim))) )
# plot density function for different sample size data
ggplot(dat, aes(x = x, fill = n)) + geom_density(size = 1, alpha = .2) +
    geom_vline(xintercept = 1, size = 1)

  • Note: for any variable, a property of the population = a parameter, while the corresponding quantity computed from a sample = a statistic
    • summary for the mean and variance: the population mean \(\mu\) and variance \(\sigma^2\) are fixed parameters; the sample mean \(\bar X\) and sample variance \(S^2\) are statistics that estimate them

  • distribution for mean of random samples
    • expected value of the mean of distribution of means = expected value of the sample mean = population mean
      • \(E[\bar X]=\mu\)
    • variance of the distribution of means
      • \(Var(\bar X) = \sigma^2/n\)
      • as n becomes larger, the mean of random sample \(\rightarrow\) more concentrated around the population mean \(\rightarrow\) variance approaches 0
        • this again confirms that sample mean estimates population mean
    • Note: normally we only have 1 sample mean (from collected sample) and can estimate the variance \(\sigma^2\) \(\rightarrow\) so we know a lot about the distribution of the means from the data observed
  • standard error (SE)
    • the standard error of the mean is defined as \[SE_{mean} = \sigma/\sqrt{n}\]
    • this quantity is effectively the standard deviation of the distribution of a statistic (i.e. mean)
    • represents variability of means

Entire Estimator-Estimation Relationship

  • Start with a sample
  • \(S^2\) = sample variance
    • estimates how variable the population is
    • estimates population variance \(\sigma^2\)
    • \(S^2\) = a random variable and has its own distribution centered around \(\sigma^2\)
      • more concentrated around \(\sigma^2\) as \(n\) increases
  • \(\bar X\) = sample mean
    • estimates population mean \(\mu\)
    • \(\bar X\) = a random variable and has its own distribution centered around \(\mu\)
      • more concentrated around \(\mu\) as \(n\) increases
      • variance of distribution of \(\bar X = \sigma^2/n\)
      • estimate of variance = \(S^2/n\)
      • estimate of standard error = \(S/\sqrt{n}\) \(\rightarrow\) “sample standard error of the mean”
        • estimates how variable sample means (\(n\) size) from the population are

Example - Standard Normal

  • variance = 1
  • means of n standard normals (sample) have standard deviation = \(1/\sqrt{n}\)
# specify number of simulations with 10 as number of observations per sample
nosim <- 1000; n <-10
# estimated standard deviation of mean
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))
## [1] 0.3144349
# actual standard deviation of mean of standard normals
1 / sqrt(n)
## [1] 0.3162278
  • rnorm() = generate samples from the standard normal
  • matrix() = puts all samples into a nosim \(\times\) \(n\) matrix, so that each row represents one simulated sample of \(n\) observations
  • apply(..., 1, mean) = calculates the mean of each row (i.e. of each simulated sample of \(n\) observations)
  • sd() = returns standard deviation

Example - Standard Uniform

  • standard uniform \(\rightarrow\) flat (constant) density on \([0, 1]\) \(\rightarrow\) mean = 1/2 and variance = 1/12
  • means of random samples of \(n\) uniforms have standard deviation of \(1/\sqrt{12 \times n}\)
# estimated standard deviation of the sample means
sd(apply(matrix(runif(nosim * n), nosim), 1, mean))
## [1] 0.08643882
# actual standard deviation of the means
1/sqrt(12*n)
## [1] 0.09128709

Example - Poisson

  • a \(Poisson(\lambda)\) random variable has variance \(\lambda\)
  • means of random samples of \(n~Poisson(4)\) variables therefore have standard deviation of \(\sqrt{4}/\sqrt{n} = 2/\sqrt{n}\)
# estimated standard deviation of the sample means
sd(apply(matrix(rpois(nosim * n, lambda=4), nosim), 1, mean))
## [1] 0.6213867
# actual standard deviation of the means
2/sqrt(n)
## [1] 0.6324555

Example - Bernoulli

  • for \(p = 0.5\), the Bernoulli distribution has variance of 0.25
  • means of random samples of \(n\) coin flips have standard deviations of \(1 / (2 \sqrt{n})\)
# estimated standard deviation of the sample means
sd(apply(matrix(sample(0 : 1, nosim * n, replace = TRUE), nosim), 1, mean))
## [1] 0.1573651
# actual standard deviation of the means
1/(2*sqrt(n))
## [1] 0.1581139

Example - Father/Son

# load data
library(UsingR); data(father.son);
# define son height as the x variable
x <- father.son$sheight
# n is the length
n<-length(x)
# plot histogram for son's heights
g <- ggplot(data = father.son, aes(x = sheight))
g <- g + geom_histogram(aes(y = ..density..), fill = "lightblue", binwidth=1, colour = "black")
g <- g + geom_density(size = 2, colour = "black")
g

# we calculate the parameters for variance of distribution and sample mean,
round(c(sampleVar = var(x),
    sampleMeanVar = var(x) / n,
    # as well as standard deviation of distribution and sample mean
    sampleSd = sd(x),
    sampleMeanSd = sd(x) / sqrt(n)),2)
##     sampleVar sampleMeanVar      sampleSd  sampleMeanSd 
##          7.92          0.01          2.81          0.09

Binomial Distribution

Example

  • out of 8 children, what’s the probability of having 7 or more girls (50/50 chance for each)? \[{8 \choose 7}.5^7(1-.5)^{1} + {8 \choose 8}.5^8(1-.5)^{0} \approx 0.04\]
# calculate probability using PMF
choose(8, 7) * .5 ^ 8 + choose(8, 8) * .5 ^ 8
## [1] 0.03515625
# calculate probability using the CDF of the distribution
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)
## [1] 0.03515625
  • choose(8, 7) = R function to calculate n choose x
  • pbinom(6, size=8, prob=0.5, lower.tail=TRUE) = probability of 6 or fewer successes out of 8 trials with success probability 0.5 (CDF)
    • lower.tail=FALSE = returns the complement, in this case the probability of more than 6 successes out of 8 trials with success probability 0.5

Normal Distribution

# plot standard normal
x <- seq(-3, 3, length = 1000)
g <- ggplot(data.frame(x = x, y = dnorm(x)),
            aes(x = x, y = y)) + geom_line(size = 2)
g <- g + geom_vline(xintercept = -3 : 3, size = 2)
g

Example

  • the number of daily ad clicks for a company is (approximately) normally distributed with a mean of 1020 and a standard deviation of 50
  • What’s the probability of getting more than 1,160 clicks in a day?
# calculate number of standard deviations from the mean
(1160 - 1020) / 50
## [1] 2.8
# calculate probability using given distribution
pnorm(1160, mean = 1020, sd = 50, lower.tail = FALSE)
## [1] 0.00255513
# calculate probability using standard normal
pnorm(2.8, lower.tail = FALSE)
## [1] 0.00255513
  • therefore, it is not very likely (a 0.255513%, roughly 0.26%, chance), since 1,160 is 2.8 standard deviations above the mean
  • What number of daily ad clicks would represent the one where 75% of days have fewer clicks (assuming days are independent and identically distributed)?
qnorm(0.75, mean = 1020, sd = 50)
## [1] 1053.724
  • therefore, about 1053.72 daily clicks would represent the threshold with more clicks than 75% of days

Poisson Distribution

Example

  • number of people that show up at a bus stop can be modeled with Poisson distribution with a mean of 2.5 per hour
  • after watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole time?
# calculate using distribution
ppois(3, lambda = 2.5 * 4)
## [1] 0.01033605
  • as we can see from above, there is a 1.0336051% (roughly 1%) chance that 3 or fewer people show up in total at the bus stop during 4 hours of monitoring

Example - Approximating Binomial Distribution

  • flip a coin with success probability of 0.01 a total of 500 times (low \(p\), large \(n\))
  • what’s the probability of 2 or fewer successes?
# calculate correct probability from Binomial distribution
pbinom(2, size = 500, prob = .01)
## [1] 0.1233858
# estimate probability using Poisson distribution
ppois(2, lambda=500 * .01)
## [1] 0.124652
  • as we can see from above, the two probabilities (12.34% vs 12.47%) are extremely close

Asymptotics

Law of Large Numbers (LLN)

  • a statistic computed from an IID sample (e.g. the sample mean or variance) converges to the corresponding population quantity (e.g. the population mean or variance) as \(n\) increases
  • Note: an estimator is consistent if it converges to what it is estimating
  • sample mean/variance/standard deviation are all consistent estimators for their population counterparts
    • \(\bar X_n\) is average of the result of \(n\) coin flips (i.e. the sample proportion of heads)
    • as we flip a fair coin over and over, it eventually converges to the true probability of a head

Example - LLN for Normal and Bernoulli Distribution

  • for this example, we will simulate 10000 samples from the normal and Bernoulli distributions respectively
  • we will plot the distribution of sample means as \(n\) increases and compare it to the population means
# load library
library(gridExtra)
# specify number of trials
n <- 10000
# calculate sample (from normal distribution) means for different size of n
means <- cumsum(rnorm(n)) / (1  : n)
# plot sample size vs sample mean
g <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0) + geom_line(size = 2)
g <- g + labs(x = "Number of obs", y = "Cumulative mean")
g <- g + ggtitle("Normal Distribution")
# calculate sample (coin flips) means for different size of n
means <- cumsum(sample(0 : 1, n , replace = TRUE)) / (1  : n)
# plot sample size vs sample mean
p <- ggplot(data.frame(x = 1 : n, y = means), aes(x = x, y = y))
p <- p + geom_hline(yintercept = 0.5) + geom_line(size = 2)
p <- p + labs(x = "Number of obs", y = "Cumulative mean")
p <- p + ggtitle("Bernoulli Distribution (Coin Flip)")
# combine plots
grid.arrange(g, p, ncol = 2)

  • as we can see from above, for both distributions the sample means undeniably approach the respective population means as \(n\) increases

Central Limit Theorem

  • one of the most important theorems in statistics
  • the distribution of the standardized mean of IID variables approaches the standard normal as the sample size \(n\) increases
  • in other words, for large values of \(n\), \[\frac{\mbox{Estimate} - \mbox{Mean of Estimate}}{\mbox{Std. Err. of Estimate}} = \frac{\bar X_n - \mu}{\sigma / \sqrt{n}}=\frac{\sqrt n (\bar X_n - \mu)}{\sigma} \longrightarrow N(0, 1)\]
  • this translates to the distribution of the sample mean \(\bar X_n\) is approximately \(N(\mu, \sigma^2/n)\)
    • distribution is centered at the population mean
    • with standard deviation = standard error of the mean
  • typically the Central Limit Theorem can be applied when \(n \geq 30\)

Example - CLT with Bernoulli Trials (Coin Flips)

  • for this example, we will simulate \(n\) flips of a possibly unfair coin
    • let \(X_i\) be the 0 or 1 result of the \(i^{th}\) flip of a possibly unfair coin
    • the sample proportion, \(\hat p\), is the average of the coin flips
    • \(E[X_i] = p\) and \(Var(X_i) = p(1-p)\)
    • standard error of the mean is \(SE = \sqrt{p(1-p)/n}\)
  • in principle, normalizing the random variable \(X_i\), we should get an approximately standard normal distribution \[\frac{\hat p - p}{\sqrt{p(1-p)/n}} \sim N(0,~1)\]
  • therefore, we will flip a coin \(n\) times, take the sample proportion of heads (successes with probability \(p\)), subtract off 0.5 (the true proportion), divide the result by the standard error \(\frac{1}{2 \sqrt{n}}\), and compare the result to the standard normal (a sketch of this simulation follows below)
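  • below is a minimal sketch of the described simulation (nosim = 1000 experiments of n = 100 flips each are illustrative choices)
# simulate nosim experiments of n fair coin flips each
nosim <- 1000; n <- 100; p <- 0.5
phats <- apply(matrix(rbinom(nosim * n, size = 1, prob = p), nosim), 1, mean)
# standardize: subtract the true mean and divide by the standard error sqrt(p(1-p)/n)
z <- (phats - p) / sqrt(p * (1 - p) / n)
# compare the standardized sample proportions to the standard normal density
hist(z, freq = FALSE, breaks = 20)
curve(dnorm(x), from = -3, to = 3, add = TRUE, lwd = 2)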

  • now, we can run the same simulation trials for an extremely unfair coin with \(p\) = 0.9

  • as we can see from both simulations, the standardized distribution of the sample proportions converges to the standard normal distribution
  • Note: speed at which the normalized coin flips converge to normal distribution depends on how biased the coin is (value of \(p\))
  • Note: does not guarantee that the normal distribution will be a good approximation, but just that eventually it will be a good approximation as n \(\rightarrow \infty\)

Confidence Intervals - Normal Distribution/Z Intervals

  • Z confidence interval is defined as \[Estimate \pm ZQ \times SE_{Estimate}\] where \(ZQ\) = quantile from the standard normal distribution
  • according to CLT, the sample mean, \(\bar X\), is approximately normal with mean \(\mu\) and sd \(\sigma / \sqrt{n}\)
  • 95% confidence interval for the population mean \(\mu\) is defined as \[\bar X \pm 2\sigma/\sqrt{n}\] for the sample mean \(\bar X \sim N(\mu, \sigma^2/n)\)
    • 2 approximates the exact 97.5% standard normal quantile, 1.96, which can be used for a more accurate interval
    • \(P(\bar{X} > \mu + 2\sigma/\sqrt{n}~or~\bar{X} < \mu - 2\sigma/\sqrt{n}) = 5\%\)
    • interpretation: if we were to repeatedly draw samples of size \(n\) from the population and construct this confidence interval for each case, approximately 95% of the intervals will contain \(\mu\)
  • confidence intervals get narrower with less variability or larger sample sizes
  • Note: Poisson and binomial distributions have exact intervals that don’t require CLT
  • example
    • for this example, we will compute the 95% confidence interval for sons’ heights (in inches)
# load son height data
data(father.son); x <- father.son$sheight
# calculate confidence interval for sons height in inches
mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)/sqrt(length(x))
## [1] 68.51605 68.85209

Confidence Interval - Bernoulli Distribution/Wald Interval

  • for Bernoulli distributions, \(X_i\) is 0 or 1 with success probability \(p\) and the variance is \(\sigma^2 = p(1 - p)\)
  • the confidence interval takes the form of \[\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}\]
  • since the population proportion \(p\) is unknown, we can use the sampled proportion of success \(\hat{p} = X/n\) as estimate
  • \(p(1-p)\) is largest when \(p = 1/2\), so 95% confidence interval can be calculated by \[\begin{aligned} \hat{p} \pm Z_{0.95} \sqrt{\frac{0.5(1-0.5)}{n}} & = \hat{p} \pm qnorm(.975) \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm 1.96 \sqrt{\frac{1}{4n}}\\ & = \hat{p} \pm \frac{1.96}{2} \sqrt{\frac{1}{n}}\\ & \approx \hat{p} \pm \frac{1}{\sqrt{n}}\\ \end{aligned}\]
    • this is known as the Wald Confidence Interval and is useful in roughly estimating confidence intervals
    • generally need \(n\) = 100 for 1 decimal place, 10,000 for 2, and 1,000,000 for 3
  • example
    • suppose in a random sample of 100 likely voters, 56 intend to vote for you; can you secure a victory?
    • we can use the Wald interval to quickly estimate the 95% confidence interval
    • as we can see below, because the interval [0.46, 0.66] contains values below 50%, victory is not guaranteed
    • binom.test(k, n)$conf.int = returns the exact confidence interval for a binomial proportion (a collection of Bernoulli trials) with k successes in n draws
# define sample probability and size
p = 0.56; n = 100
# Wald interval
c("WaldInterval" = p + c(-1, 1) * 1/sqrt(n))
## WaldInterval1 WaldInterval2 
##          0.46          0.66
# 95% confidence interval
c("95CI" = p + c(-1, 1) * qnorm(.975) * sqrt(p * (1-p)/n))
##     95CI1     95CI2 
## 0.4627099 0.6572901
# perform exact binomial test (56 successes out of 100 trials)
binom.test(p * n, n)$conf.int
## [1] 0.4571875 0.6591640
## attr(,"conf.level")
## [1] 0.95

Confidence Interval - Binomial Distribution/Agresti-Coull Interval

  • for a binomial distribution with smaller values of \(n\) (when \(n\) < 30, thus not large enough for the CLT), the normal confidence interval, as defined by \[\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}\] often does not provide accurate estimates
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data
    phats <- rbinom(nosim, prob = p, size = n) / n
    # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(.75, 1.0)

  • as we can see from above, the intervals do not provide adequate coverage as 95% confidence intervals (frequently only 80 to 90% coverage)
  • we can construct the Agresti-Coull interval, which uses the adjusted estimate \[\tilde{p} = \frac{X+2}{n+4}\] where we effectively add 2 to the number of successes, \(X\), and 2 to the number of failures
  • therefore the interval becomes \[\tilde{p} \pm z_{1-\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}\] (the simulation below approximates the standard error with \(n\) in the denominator)
  • Note: the interval tends to be conservative
  • example
# simulate 1000 samples of size 20 each
n <- 20; nosim <- 1000
# simulate for p values from 0.1 to 0.9
pvals <- seq(.1, .9, by = .05)
# calculate the confidence intervals
coverage <- sapply(pvals, function(p){
    # simulate binomial data with Agresti/Coull Interval adjustment
    phats <- (rbinom(nosim, prob = p, size = n) + 2) / (n + 4)
        # calculate lower 95% CI bound
    ll <- phats - qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate upper 95% CI bound
    ul <- phats + qnorm(.975) * sqrt(phats * (1 - phats) / n)
    # calculate percent of intervals that contain p
    mean(ll < p & ul > p)
})
# plot CI results vs 95%
ggplot(data.frame(pvals, coverage), aes(x = pvals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(.75, 1.0)

  • as we can see from above, the coverage is much better for the 95% interval
  • in fact, all of the estimates are more conservative as we previously discussed, indicating the Agresti-Coull intervals are wider than the regular confidence intervals

Confidence Interval - Poisson Interval

  • for \(X \sim Poisson(\lambda t)\)
    • estimate rate \(\hat{\lambda} = X/t\)
    • \(var(\hat{\lambda}) = \lambda/t\)
    • variance estimate \(= \hat{\lambda}/t\)
  • so the confidence interval is defined as \[\hat \lambda \pm z_{1-\alpha/2}\sqrt{\frac{\hat \lambda}{t}}\]
    • however, for small values of \(\lambda\) (few events over a large time interval), we should not use this asymptotic interval
    • example
      • for this example, we will go through a specific scenario as well as a simulation exercise to demonstrate the ineffectiveness of asymptotic intervals for small values of \(\lambda\)
      • a nuclear pump failed 5 times in 94.32 days; give a 95% confidence interval for the failure rate per day
      • poisson.test(x, T)$conf.int = returns the exact Poisson 95% confidence interval for x occurrences over time period T
# define parameters
x <- 5; t <- 94.32; lambda <- x / t
# calculate confidence interval
round(lambda + c(-1, 1) * qnorm(.975) * sqrt(lambda / t), 3)
## [1] 0.007 0.099
# return accurate confidence interval from poisson.test
poisson.test(x, T = 94.32)$conf.int
## [1] 0.01721254 0.12371005
## attr(,"conf.level")
## [1] 0.95
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 100
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95)+ylim(0, 1.0)

  • as we can see above, for small values of \(\lambda = X/t\), the confidence interval produced by the asymptotic interval is not an accurate estimate of the actual 95% interval (not enough coverage)
  • however, as \(t \to \infty\), the interval becomes the true 95% interval
# small lambda simulations
lambdavals <- seq(0.005, 0.10, by = .01); nosim <- 1000; t <- 1000
# calculate coverage using Poisson intervals
coverage <- sapply(lambdavals, function(lambda){
    # calculate Poisson rates
    lhats <- rpois(nosim, lambda = lambda * t) / t
    # lower bound of 95% CI
    ll <- lhats - qnorm(.975) * sqrt(lhats / t)
    # upper bound of 95% CI
    ul <- lhats + qnorm(.975) * sqrt(lhats / t)
    # calculate percent of intervals that contain lambda
    mean(ll < lambda & ul > lambda)
})
# plot CI results vs 95%
ggplot(data.frame(lambdavals, coverage), aes(x = lambdavals, y = coverage)) + geom_line(size = 2) + geom_hline(yintercept = 0.95) + ylim(0, 1.0)

  • as we can see from above, as \(t\) increases, the Poisson intervals become closer to the actual 95% confidence intervals

Confidence Intervals - T Distribution(Small Samples)

  • t confidence interval is defined as \[Estimate \pm TQ \times SE_{Estimate} = \bar X \pm \frac{t_{n-1} S}{\sqrt{n}}\]
    • \(TQ\) = quantile from T distribution
    • \(t_{n-1}\) = relevant quantile
    • \(t\) interval assumes data is IID normal so that \[\frac{\bar X - \mu}{S/\sqrt{n}}\] follows Gosset’s \(t\) distribution with \(n-1\) degrees of freedom
    • works well with data distributions that are roughly symmetric/mound shaped, and does not work with skewed distributions
      • skewed distribution \(\rightarrow\) meaningless to center interval around the mean \(\bar X\)
      • logs/median can be used instead
    • paired observations (multiple measurements from same subjects) can be analyzed by t interval of differences
    • as more data collected (large degrees of freedom), t interval \(\rightarrow\) z interval
    • qt(0.975, df=n-1) = calculate the relevant quantile using t distribution
# Plot normal vs t distributions
k <- 1000; xvals <- seq(-5, 5, length = k); df <- 10
d <- data.frame(y = c(dnorm(xvals), dt(xvals, df)),x = xvals,
              dist = factor(rep(c("Normal", "T"), c(k,k))))
g <- ggplot(d, aes(x = x, y = y))
g <- g + geom_line(size = 2, aes(colour = dist)) + ggtitle("Normal vs T Distribution")
# plot normal vs t quantiles
pvals <- seq(0.5, 0.99, by = 0.01)   # quantile probabilities to compare
d <- data.frame(n = qnorm(pvals), t = qt(pvals, df), p = pvals)
h <- ggplot(d, aes(x= n, y = t))
h <- h + geom_abline(size = 2, col = "lightblue")
h <- h + geom_line(size = 2, col = "black")
h <- h + geom_vline(xintercept = qnorm(0.975))
h <- h + geom_hline(yintercept = qt(0.975, df)) + ggtitle("Normal vs T Quantiles")
# plot 2 graphs together
grid.arrange(g, h, ncol = 2)

  • William Gosset’s t Distribution (“Student’s t distribution”)
    • “Student” = the pseudonym under which Gosset published
    • indexed/defined by degrees of freedom, and becomes more like standard normal as degrees of freedom gets larger
    • thicker tails centered around 0, thus confidence interval = wider than Z interval (more mass concentrated away from the center)
    • for small sample sizes (small \(n\)), normalizing the sample mean by \(\frac{\bar X - \mu}{S/\sqrt{n}}\) \(\rightarrow\) t distribution, not the standard normal distribution
      • the sample standard deviation \(S\) may be inaccurate, as it may not be truly representative of the population standard deviation
      • using the Z interval here may therefore produce an interval that is too narrow

Confidence Interval - Paired T Tests

  • compare observations for the same subjects over two different sets of data (i.e. different times, different treatments)
  • the confidence interval is defined by \[ \bar X_2 - \bar X_1 \pm \frac{t_{n-1} S_d}{\sqrt{n}}\] where \(\bar X_1\) and \(\bar X_2\) are the means of the first and second sets of observations, and \(S_d\) = the standard deviation of the paired differences
  • t.test(difference) = performs group mean t test and returns metrics as results, which includes the confidence intervals
    • t.test(g2, g1, paired = TRUE) = performs the same paired t test with data directly
  • example
    • the data used here is for a study of the effects of two soporific drugs (increase in hours of sleep compared to control) on 10 patients
# load data
data(sleep)
# plot the first and second observations
g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID)))
g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size =10, pch = 21, fill = "salmon", alpha = .5)
g

# define groups
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
# define difference
difference <- g2 - g1
# calculate mean and sd of differences
mn <- mean(difference); s <- sd(difference); n <- 10
# calculate intervals manually
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
## [1] 0.7001142 2.4598858
# perform the same test to get confidence intervals
t.test(difference)
## 
##  One Sample t-test
## 
## data:  difference
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of x 
##      1.58
t.test(g2, g1, paired = TRUE)
## 
##  Paired t-test
## 
## data:  g2 and g1
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of the differences 
##                    1.58

Independent Group t Intervals - Same Variance

  • compare two groups in randomized trial (“A/B Testing”)
  • cannot use the paired t test because the groups are independent and may have different sample sizes
  • perform randomization to balance unobserved covariates that may otherwise affect the results
  • \(t\) confidence interval for \(\mu_y - \mu_x\) is defined as \[\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}\]
    • \(t_{n_x + n_y - 2, 1 - \alpha/2}\) = relevant quantile
    • \(n_x + n_y - 2\) = degrees of freedom
    • \(S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}\) = standard error
    • \(S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)\) = pooled variance estimator
      • this is effectively a weighted average between the two variances, such that different sample sizes are taken in to account
      • For equal sample sizes, \(n_x = n_y\), \(S_p^2 = \frac{S_x^2 + S_y^2}{2}\) (average of variance of two groups)
    • Note: this interval assumes constant variance across two groups; if variance is different, use the next interval
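  • a minimal sketch, reusing the sleep data and treating the two groups as independent purely for illustration (the actual study design is paired)
# pooled-variance interval, treating the sleep groups as independent (illustration only)
data(sleep)
g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
n1 <- length(g1); n2 <- length(g2)
# pooled variance estimator (weighted average of the group variances)
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
# manual interval for the difference in means
mean(g2) - mean(g1) + c(-1, 1) * qt(.975, n1 + n2 - 2) * sqrt(sp2 * (1/n1 + 1/n2))
# equivalent built-in computation
t.test(g2, g1, paired = FALSE, var.equal = TRUE)$conf.int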

Independent Group t Intervals - Different Variance

  • confidence interval for \(\mu_y - \mu_x\) is defined as \[\bar Y - \bar X \pm t_{df} \times \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}\]
    • \(t_{df}\) = relevant quantile with df as defined below
    • Note: normalized statistic does not follow t distribution but can be approximated through the formula with df defined below \[df = \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)}\]
      • \(\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2}\) = standard error
  • Comparing other kinds of data
    • binomial \(\rightarrow\) relative risk, risk difference, odds ratio
    • binomial \(\rightarrow\) Chi-squared test, normal approximations, exact tests
    • count \(\rightarrow\) Chi-squared test, exact tests
  • R commands
    • t Confidence Intervals
      • mean + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)
        • c(-1, 1) = plus and minus, \(\pm\)
    • Difference Intervals (all equivalent)
      • mean2 - mean1 + c(-1, 1) * qt(0.975, n - 1) * std / sqrt(n)
        • n = number of paired observations
        • qt(0.975, n - 1) = relevant quantile for paired
        • qt(0.975, n\(_x\) + n\(_y\) - 2) = relevant quantile for independent
      • t.test(mean2 - mean1)
      • t.test(data2, data1, paired = TRUE, var.equal = TRUE)
        • paired = whether or not the two sets of data are paired (same subjects different observations for treatment) \(\rightarrow\) TRUE for paired, FALSE for independent
        • var.equal = whether or not the variance of the datasets should be treated as equal \(\rightarrow\) TRUE for same variance, FALSE for unequal variances
      • t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
        • relevel(factor, ref) = reorders the levels in the factor so that “ref” is changed to the first level \(\rightarrow\) doing this here is so that the second set of measurements come first (1, 2 \(\rightarrow\) 2, 1) in order to perform mean\(_2\) - mean\(_1\)
        • I(object) = prepend the class “AsIs” to the object
        • Note: I(relevel(group, 2)) = explanatory variable, must be factor and have two levels
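  • a minimal sketch of the unequal-variance (Welch) interval with the same two groups (again, illustration only)
# Welch degrees of freedom computed manually
sx2 <- var(g1); sy2 <- var(g2); nx <- length(g1); ny <- length(g2)
df <- (sx2/nx + sy2/ny)^2 / ((sx2/nx)^2/(nx - 1) + (sy2/ny)^2/(ny - 1))
# manual interval for the difference in means
mean(g2) - mean(g1) + c(-1, 1) * qt(.975, df) * sqrt(sx2/nx + sy2/ny)
# equivalent built-in computation (var.equal = FALSE is the default)
t.test(g2, g1, paired = FALSE, var.equal = FALSE)$conf.int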

Hypothesis Testing

Truth | Decide | Result
----- | ------ | ------
\(H_0\) | \(H_0\) | Correctly accept null
\(H_0\) | \(H_a\) | Type I error
\(H_a\) | \(H_a\) | Correctly reject null
\(H_a\) | \(H_0\) | Type II error

Power

  • power = the probability of rejecting the null hypothesis when the alternative hypothesis is true; below we compute the power of a one-sided Z test of \(H_0: \mu = 30\) as a function of the true mean \(\mu_a\) for several sample sizes
library(ggplot2)
# null mean, example alternative mean, standard deviation, and sample size
mu0 = 30; mua = 32; sigma = 4; n = 16
# one-sided type I error rate and its standard normal quantile
alpha = 0.05
z = qnorm(1 - alpha)
# sample sizes and true means to evaluate power over
nseq = c(8, 16, 32, 64, 128)
mu_a = seq(30, 35, by = 0.1)
# power = P(sample mean exceeds the rejection threshold | true mean = mu_a)
power = sapply(nseq, function(n)
    pnorm(mu0 + z * sigma / sqrt(n), mean = mu_a, sd = sigma / sqrt(n),
          lower.tail = FALSE)
    )
# label columns by sample size and reshape to long format for plotting
colnames(power) <- paste("n", nseq, sep = "")
d <- data.frame(mu_a, power)
library(reshape2)
d2 <- melt(d, id.vars = "mu_a")
names(d2) <- c("mu_a", "n", "power")
# plot power curves, one line per sample size
g <- ggplot(d2,
            aes(x = mu_a, y = power, col = n)) + geom_line(size = 2)
g

Multiple Testing

Type of Errors

  | Actual \(H_0\) = True | Actual \(H_a\) = True | Total
--- | --- | --- | ---
Conclude \(H_0\) = True (non-significant) | \(U\) | \(T\) | \(m-R\)
Conclude \(H_a\) = True (significant) | \(V\) | \(S\) | \(R\)
Total | \(m_0\) | \(m-m_0\) | \(m\)
  • \(m_0\) = number of true null hypotheses, or cases where \(H_0\) = actually true (unknown)
  • \(m - m_0\) = number of true alternative hypotheses, or cases where \(H_a\) = actually true (unknown)
  • \(R\) = number of null hypotheses rejected, or cases where \(H_a\) = concluded to be true (measurable)
  • \(m - R\) = number of null hypotheses that failed to be rejected, or cases where \(H_0\) = concluded to be true (measurable)
  • \(V\) = Type I Error / false positives, concludes \(H_a\) = True when \(H_0\) = actually True
  • \(T\) = Type II Error / false negatives, concludes \(H_0\) = True when \(H_a\) = actually True
  • \(S\) = true positives, concludes \(H_a\) = True when \(H_a\) = actually True
  • \(U\) = true negatives, concludes \(H_0\) = True when \(H_0\) = actually True

Error Rates

  • false positive rate = rate at which false results are called significant \(E[\frac{V}{m_0}]\) \(\rightarrow\) average fraction of times that \(H_a\) is claimed to be true when \(H_0\) is actually true
    • Note: mathematically equal to the type I error rate; however, the false positive rate describes an after-the-fact quantity (the expected fraction of false positives among the truly null hypotheses, under the actual mix of true and false nulls), while the type I error rate is fixed in advance by the researcher through the choice of significance level
  • family wise error rate (FWER) = probability of at least one false positive \(Pr(V \ge 1)\)
  • false discovery rate (FDR) = rate at which claims of significance are false \(E[\frac{V}{R}]\)

  • controlling error rates (adjusting \(\alpha\))
    • false positive rate
      • if we call all \(P<\alpha\) significant (reject \(H_0\)), we are expected to get \(\alpha \times m\) false positives, where \(m\) = total number of hypothesis test performed
      • with high values of \(m\), false positive rate is very large as well
    • family-wise error rate (FWER)
      • controlling FWER = controlling the probability of even one false positive
      • bonferroni correction (oldest multiple testing correction)
        • for \(m\) tests, we want \(Pr(V \ge 1) < \alpha\)
        • calculate P-values normally, and deem them significant if and only if \(P < \alpha_{fwer} = \alpha / m\)
      • easy to calculate, but tends to be very conservative
    • false discovery rate (FDR)
      • most popular correction = controlling FDR
      • for \(m\) tests, we want \(E[\frac{V}{R}] \le \alpha\)
      • calculate P-values normally and sort them from smallest to largest \(\rightarrow\) \(P_{(1)}, P_{(2)}, ... , P_{(m)}\)
      • deem the P-values significant if \(P_{(i)} \le \alpha \times \frac{i}{m}\)
      • easy to calculate, less conservative, but allows for more false positives and may behave strangely under dependence (related hypothesis tests/regression with different variables)
    • example
      • 10 P-values with \(\alpha = 0.20\)
  • adjusting for p-values
    • Note: adjusting the P-values will fundamentally change their properties, but adjusted P-values can then be used directly without adjusting \(\alpha\)
    • bonferroni (FWER)
      • \(P_i^{fwer} = \min(m P_i,~1)\) \(\rightarrow\) since a P-value cannot exceed a value of 1
      • deem P-values significant if \(P_i^{fwer} < \alpha\)
      • similar to controlling FWER

Example

set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  x <- rnorm(20)
  # First 500 beta=0, last 500 beta=2
  if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)}
  # calculating p-values by using linear model; the [2, 4] coeff in result = pvalue
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
trueStatus <- rep(c("zero","not zero"),each=500)
table(pValues < 0.05, trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  476
##   TRUE       500   24
# Controls FWER
table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
##        trueStatus
##         not zero zero
##   FALSE       23  500
##   TRUE       477    0
# Controls FDR (Benjamini-Hochberg)
table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
##        trueStatus
##         not zero zero
##   FALSE        0  487
##   TRUE       500   13

Resampled Inference

  • bootstrap: resample the observed data with replacement to approximate the sampling distribution of a statistic (here, the median of sons’ heights)
# load data
library(UsingR); data(father.son)
# observed dataset
x <- father.son$sheight
# number of observations per resampled dataset and number of bootstrap resamples
n <- length(x); B <- 1000
# generate samples
resamples <- matrix(
    sample(x,               # sample to draw from
           n * B,           # draw B datasets with n observations each
           replace = TRUE), # must sample with replacement to draw n*B elements from the n in x
    B, n)                   # arrange results into a B x n matrix
                            # (every row = bootstrap sample with n observations)
# take median for each row/generated sample
medians <- apply(resamples, 1, median)
# estimated standard error of the median
sd(medians)
# bootstrap 95% confidence interval for the median
quantile(medians, c(.025, .975))
# histogram of bootstrapped medians
hist(medians)

# subset to only "B" and "C" groups
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
# values
y <- subdata$count
# labels
group <- as.character(subdata$spray)
# find mean difference between the groups
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
observedStat
## [1] 13.25
# create 10000 permutations of the data with the labels' changed
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
# find the proportion of permutations whose statistic is larger than the observed statistic (permutation p-value)
mean(permutations > observedStat)
## [1] 0