Introduction to Regression

Notation

Empirical/Sample Mean

  • empirical mean is defined as \[\bar X = \frac{1}{n}\sum_{i=1}^n X_i\]
  • centering the random variable is defined as \[\tilde X_i = X_i - \bar X\]
    • mean of \(\tilde X_i\) = 0
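
As a quick numerical check of these definitions, here is a minimal base-R sketch using a small made-up vector (the values are arbitrary and chosen only for illustration):

# small made-up sample, for illustration only
x <- c(61, 64, 66, 69, 70)
xbar <- sum(x) / length(x)   # empirical mean, identical to mean(x)
xtilde <- x - xbar           # centered data
mean(xtilde)                 # 0 (up to floating-point error)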

Empirical/Sample Standard Deviation & Variance

  • empirical variance is defined as \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n \bar X ^ 2 \right) \Leftarrow \mbox{shortcut for calculation}\]
  • empirical standard deviation is defined as \(S = \sqrt{S^2}\)
    • the variance is the average of the squared distances between the observations and the mean; the standard deviation is its square root
    • has the same units as the data
  • scaling the random variables is defined as \(X_i / S\)
    • standard deviation of \(X_i / S\) = 1
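
A short R sketch of these formulas, reusing the same made-up vector as above:

x <- c(61, 64, 66, 69, 70)
n <- length(x)
s2 <- sum((x - mean(x))^2) / (n - 1)                 # empirical variance
s2_shortcut <- (sum(x^2) - n * mean(x)^2) / (n - 1)  # shortcut formula, same value
s <- sqrt(s2)                                        # empirical standard deviation, same as sd(x)
sd(x / s)                                            # scaled data have standard deviation 1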

Normalization

  • normalizing the data/random variable is defined as \[Z_i = \frac{X_i - \bar X}{S}\]
    • empirical mean = 0, empirical standard deviation = 1
    • the distribution is centered around 0, and the units of the data become the number of standard deviations away from the original mean
      • example: \(Z_1 = 2\) means that the data point is 2 standard deviations larger than the original mean
  • normalization makes non-comparable data comparable
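
In R this is a one-liner, and the built-in scale() function produces the same result; a minimal sketch with the same made-up vector:

x <- c(61, 64, 66, 69, 70)
z <- (x - mean(x)) / sd(x)   # normalized data
mean(z); sd(z)               # 0 and 1 (up to floating-point error)
as.vector(scale(x))          # built-in equivalent of z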

Empirical Covariance & Correlation

  • Let \((X_i, Y_i)\) = pairs of data
  • empirical covariance is defined as \[ Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X) (Y_i - \bar Y) = \frac{1}{n-1}\left( \sum_{i=1}^n X_i Y_i - n \bar X \bar Y\right) \]
    • has units of \(X \times\) units of \(Y\)
  • correlation is defined as \[Cor(X, Y) = \frac{Cov(X, Y)}{S_x S_y}\] where \(S_x\) and \(S_y\) are the estimates of standard deviations for the \(X\) observations and \(Y\) observations, respectively
    • the value is effectively the covariance standardized into a unit-less quantity
    • \(Cor(X, Y) = Cor(Y, X)\)
    • \(-1 \leq Cor(X, Y) \leq 1\)
    • \(Cor(X,Y) = 1\) and \(Cor(X, Y) = -1\) only when the \((X, Y)\) observations fall perfectly on a positively or negatively sloped line, respectively
    • \(Cor(X, Y)\) measures the strength of the linear relationship between the \(X\) and \(Y\) data, with stronger relationships as \(Cor(X,Y)\) heads towards -1 or 1
    • \(Cor(X, Y) = 0\) implies no linear relationship
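
These quantities correspond directly to R's cov() and cor(); a minimal sketch with made-up paired data:

# made-up pairs, for illustration only
x <- c(61, 64, 66, 69, 70)
y <- c(63, 65, 66, 70, 72)
n <- length(x)
cxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # empirical covariance, same as cov(x, y)
cxy / (sd(x) * sd(y))                                # correlation, same as cor(x, y)
cor(x, y) == cor(y, x)                               # symmetry, should be TRUE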

Galton’s Data and Least Squares

# load necessary packages/install if needed
library(ggplot2); library(UsingR); data(galton)
# function to plot the histograms
myHist <- function(mu){
    # calculate the mean squared error (MSE) around the candidate center mu
    mse <- mean((galton$child - mu)^2)
    # plot histogram
    g <- ggplot(galton, aes(x = child)) + geom_histogram(fill = "salmon",
        colour = "black", binwidth=1)
    # add vertical line marking the center value mu
    g <- g + geom_vline(xintercept = mu, size = 2)
    g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse, 2), sep = ""))
    g
}
# manipulate allows the user to vary mu interactively and see how the MSE changes
#   library(manipulate); manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
# plot the histogram centered at the empirical mean, which minimizes the MSE
myHist(mean(galton$child))

library(dplyr)
# construct a frequency table for each parent-child height combination
freqData <- as.data.frame(table(galton$child, galton$parent))
# heights are in inches
names(freqData) <- c("child", "parent", "freq")
# convert to numeric values
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
# filter to only meaningful combinations
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 20), guide = "none" )
# plot grey circles slightly larger than data as base (achieve an outline effect)
g <- g + geom_point(colour = "grey50", aes(size = freq + 10), show.legend = FALSE)
# plot the accurate data points
g <- g + geom_point(aes(colour=freq, size = freq))
# change the color gradient from the default to lightblue -> white
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g

Derivation for Least Squares = Empirical Mean (Finding the Minimum)

  • Let \(X_i =\) regressor/predictor, and \(Y_i =\) outcome/result, so we want to find the value of \(\mu\) that minimizes the sum of the squares: \[\sum_{i=1}^n (Y_i - \mu)^2\]
  • Proof is as follows \[ \begin{aligned} \sum_{i=1}^n (Y_i - \mu)^2 & = \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \Leftarrow \mbox{added} \pm \bar Y \mbox{which is adding 0 to the original equation}\\ (expanding~the~terms)& = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (Y_i - \bar Y), (\bar Y - \mu) \mbox{ are the terms}\\ (simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (\bar Y - \mu) \mbox{ does not depend on } i \\ (simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n \bar Y \mbox{ is equivalent to }n \bar Y\\ (simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n Y_i - n \bar Y = 0 \mbox{ since } \sum_{i=1}^n Y_i = n \bar Y \\ \sum_{i=1}^n (Y_i - \mu)^2 & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \Leftarrow \sum_{i=1}^n (\bar Y - \mu)^2 \mbox{ is always} \geq 0 \mbox{ so we can take it out to form the inequality} \end{aligned} \]
    • because of the inequality above, the sum of the squares \(\sum_{i=1}^n (Y_i - \mu)^2\) is minimized by setting \(\mu = \bar Y\)
  • An alternative approach to finding the minimum is taking the derivative with respect to \(\mu\) and setting it equal to 0 \[ \begin{aligned} \frac{d}{d\mu}\sum_{i=1}^n (Y_i - \mu)^2 & = 0 \Leftarrow \mbox{setting the derivative equal to 0 to find the critical point}\\ -2\sum_{i=1}^n (Y_i - \mu) & = 0 \Leftarrow \mbox{divide both sides by -2 and split the sum}\\ \sum_{i=1}^n Y_i & = \sum_{i=1}^n \mu = n \mu \Leftarrow \mbox{the summand } \mu \mbox{ does not depend on } i\\ \mu & = \frac{1}{n}\sum_{i=1}^n Y_i = \bar Y \\ \end{aligned} \]
    • the second derivative is \(2n > 0\), so this critical point is indeed a minimum
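
The result can also be checked numerically; the sketch below minimizes the sum of squares over the Galton child heights with optimize() (it assumes the UsingR package and the galton data loaded earlier are available):

# numerical check: the minimizing mu is (essentially) the empirical mean
library(UsingR); data(galton)
sse <- function(mu) sum((galton$child - mu)^2)
optimize(sse, interval = range(galton$child))$minimum
mean(galton$child)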

Regression through the Origin