## Introduction to Regression

• linear regression/linear models $$\rightarrow$$ the go-to procedure for analyzing data
• Francis Galton invented the term and concepts of regression and correlation
• he predicted children’s heights from their parents’ heights
• questions that regression can help answer
• prediction of one thing from another
• find simple, interpretable, meaningful model to predict the data
• quantify and investigate variations that are unexplained or unrelated to the predictor $$\rightarrow$$ residual variation
• quantify the effects that other factors may have on the outcome
• assumptions to generalize findings beyond data we have $$\rightarrow$$ statistical inference
• regression to the mean (see below)

## Notation

• regular letters (e.g. $$X$$, $$Y$$) = generally used to denote observed variables
• Greek letters (e.g. $$\mu$$, $$\sigma$$) = generally used to denote unknown parameters that we are trying to estimate
• $$X_1, X_2, \ldots, X_n$$ describes $$n$$ data points
• $$\bar X$$, $$\bar Y$$ = observed means for random variables $$X$$ and $$Y$$
• $$\hat \beta_0$$, $$\hat \beta_1$$ = estimators for true values of $$\beta_0$$ and $$\beta_1$$

### Empirical/Sample Mean

• empirical mean is defined as $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$
• centering the random variable is defined as $\tilde X_i = X_i - \bar X$
• mean of $$\tilde X_i$$ = 0
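• a quick numerical check of centering in R (a minimal sketch; the vector x is made-up illustration data):

```r
# made-up illustration data (heights in inches)
x <- c(61, 64, 67, 70, 73)
xbar <- mean(x)   # empirical mean
xc <- x - xbar    # centered variable
mean(xc)          # 0 (up to floating-point error)
```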

### Empirical/Sample Standard Deviation & Variance

• empirical variance is defined as $S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n \bar X ^ 2 \right) \Leftarrow \mbox{shortcut for calculation}$
• the variance is (roughly) the average squared distance between the observations and the mean
• empirical standard deviation is defined as $$S = \sqrt{S^2}$$
• the standard deviation has the same units as the data
• scaling the random variables is defined as $$X_i / S$$
• standard deviation of $$X_i / S$$ = 1
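• the shortcut formula and the scaling property can be checked directly in R (same made-up illustration data; R’s var() and sd() already use the $$n - 1$$ denominator):

```r
# made-up illustration data (heights in inches)
x <- c(61, 64, 67, 70, 73)
n <- length(x)
# empirical variance: definition vs. shortcut vs. built-in
var_def   <- sum((x - mean(x))^2) / (n - 1)
var_short <- (sum(x^2) - n * mean(x)^2) / (n - 1)
c(var_def, var_short, var(x))   # all three agree
# a scaled variable has empirical standard deviation 1
sd(x / sd(x))                   # 1
```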

### Normalization

• normalizing the data/random variable is defined as $Z_i = \frac{X_i - \bar X}{S}$
• empirical mean = 0, empirical standard deviation = 1
• the resulting distribution is centered around 0, and the data have units of standard deviations away from the original mean
• example: $$Z_1 = 2$$ means that the data point is 2 standard deviations larger than the original mean
• normalization makes non-comparable data comparable
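• normalization in R (same illustration data); the built-in scale() performs the same centering and scaling:

```r
# made-up illustration data (heights in inches)
x <- c(61, 64, 67, 70, 73)
z <- (x - mean(x)) / sd(x)   # normalized variable
c(mean(z), sd(z))            # 0 and 1
as.vector(scale(x))          # equivalent built-in
```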

### Empirical Covariance & Correlation

• Let $$(X_i, Y_i)$$ = pairs of data
• empirical covariance is defined as $Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X) (Y_i - \bar Y) = \frac{1}{n-1}\left( \sum_{i=1}^n X_i Y_i - n \bar X \bar Y\right)$
• has units of $$X \times$$ units of $$Y$$
• correlation is defined as $Cor(X, Y) = \frac{Cov(X, Y)}{S_x S_y}$ where $$S_x$$ and $$S_y$$ are the estimates of standard deviations for the $$X$$ observations and $$Y$$ observations, respectively
• the value is effectively the covariance standardized into a unit-less quantity
• $$Cor(X, Y) = Cor(Y, X)$$
• $$-1 \leq Cor(X, Y) \leq 1$$
• $$Cor(X,Y) = 1$$ and $$Cor(X, Y) = -1$$ occur only when the $$(X, Y)$$ observations fall perfectly on a positively or negatively sloped line, respectively
• $$Cor(X, Y)$$ measures the strength of the linear relationship between the $$X$$ and $$Y$$ data, with stronger relationships as $$Cor(X,Y)$$ heads towards -1 or 1
• $$Cor(X, Y) = 0$$ implies no linear relationship
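• both quantities can be computed by hand and checked against R’s built-ins, here sketched on the galton parent/child data introduced in the next section:

```r
library(UsingR); data(galton)
x <- galton$parent; y <- galton$child; n <- length(x)
# empirical covariance: definition vs. shortcut vs. built-in
cov_def   <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cov_short <- (sum(x * y) - n * mean(x) * mean(y)) / (n - 1)
c(cov_def, cov_short, cov(x, y))
# correlation = standardized covariance; symmetric in its arguments
c(cov(x, y) / (sd(x) * sd(y)), cor(x, y), cor(y, x))
```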

## Galton’s Data and Least Squares

• Galton’s 1885 parent/child height data is available as the galton dataset in the UsingR package
• goal: predict children’s heights from their parents’ heights
• we can first examine the marginal (individual) distributions of the parents’ and the children’s heights
• looking only at the children’s dataset to find the best predictor
• “middle” of children’s dataset $$\rightarrow$$ best predictor
• “middle” $$\rightarrow$$ center of mass $$\rightarrow$$ mean of the dataset
• Let $$Y_i$$ = height of child $$i$$ for $$i = 1, \ldots, n = 928$$; the “middle” is the value $$\mu$$ that minimizes $\sum_{i=1}^n (Y_i - \mu)^2$
• the sum is smallest when $$\mu = \bar Y$$ $$\rightarrow$$ the least squares estimate is the empirical mean
• Note: the manipulate function can help to show this, as in the code below
```r
# load necessary packages/install if needed
library(ggplot2); library(UsingR); data(galton)
# function to plot the histogram of child heights, centered at mu
myHist <- function(mu){
    # calculate the mean squared error for the candidate center mu
    mse <- mean((galton$child - mu)^2)
    # plot histogram
    g <- ggplot(galton, aes(x = child)) +
        geom_histogram(fill = "salmon", colour = "black", binwidth = 1)
    # add vertical line marking the center value mu
    g <- g + geom_vline(xintercept = mu, size = 2)
    g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse, 2), sep = ""))
    g
}
# manipulate allows the user to change mu to see how the MSE changes
# library(manipulate); manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
# plot the histogram centered at the empirical mean
myHist(mean(galton$child))
```

• in order to visualize the parent-child height relationship, a scatter plot can be used

• Note: because there are multiple data points for the same parent/child combination, a third dimension (size of point) should be used when constructing the scatter plot

```r
library(dplyr)
# construct a frequency table for each parent/child height combination
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
# convert the factor levels back to numeric height values (in inches)
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
# keep only the combinations that actually occur
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 20), guide = "none")
# plot grey circles slightly larger than the data as a base (outline effect)
g <- g + geom_point(aes(size = freq + 10), colour = "grey50", show.legend = FALSE)
# plot the actual data points
g <- g + geom_point(aes(colour = freq, size = freq))
# change the colour gradient from the default to lightblue -> white
g <- g + scale_colour_gradient(low = "lightblue", high = "white")
g
```

### Derivation for Least Squares = Empirical Mean (Finding the Minimum)

• Let $$X_i =$$ regressor/predictor, and $$Y_i =$$ outcome/result, so we want to minimize the sum of squares: $\sum_{i=1}^n (Y_i - \mu)^2$
• Proof is as follows:

$$\begin{aligned}
\sum_{i=1}^n (Y_i - \mu)^2 & = \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \Leftarrow \mbox{adding } \pm \bar Y \mbox{, i.e. adding 0 to the original equation}\\
(\mbox{expanding the square}) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 \sum_{i=1}^n (Y_i - \bar Y)(\bar Y - \mu) + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (Y_i - \bar Y) \mbox{ and } (\bar Y - \mu) \mbox{ are the terms}\\
(\mbox{simplifying}) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (\bar Y - \mu) \mbox{ does not depend on } i\\
(\mbox{simplifying}) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \left( \sum_{i=1}^n Y_i - n \bar Y \right) + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n \bar Y = n \bar Y\\
(\mbox{simplifying}) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n Y_i - n \bar Y = 0 \mbox{ since } \sum_{i=1}^n Y_i = n \bar Y\\
\sum_{i=1}^n (Y_i - \mu)^2 & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \Leftarrow \sum_{i=1}^n (\bar Y - \mu)^2 \geq 0 \mbox{ always, so dropping it yields the inequality}
\end{aligned}$$
• because of the inequality above, to minimize the sum of the squares $$\sum_{i=1}^n (Y_i - \mu)^2$$, $$\bar Y$$ must be equal to $$\mu$$
• An alternative approach to finding the minimum is taking the derivative with respect to $$\mu$$ and setting it to 0:

$$\begin{aligned}
\frac{d}{d\mu} \sum_{i=1}^n (Y_i - \mu)^2 & = 0 \Leftarrow \mbox{setting the derivative equal to 0 to find the minimum}\\
-2 \sum_{i=1}^n (Y_i - \mu) & = 0 \Leftarrow \mbox{divide both sides by } -2\\
\sum_{i=1}^n Y_i & = n \mu \Leftarrow \mbox{move the } \mu \mbox{ term to the right; } \sum_{i=1}^n \mu = n \mu\\
\mu & = \frac{1}{n} \sum_{i=1}^n Y_i = \bar Y\\
\end{aligned}$$
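• both derivations can be verified numerically: the MSE, viewed as a function of $$\mu$$, should attain its minimum at the empirical mean (a minimal sketch using optimize(); the search interval is just a plausible range of heights):

```r
library(UsingR); data(galton)
mse <- function(mu) mean((galton$child - mu)^2)
# numerically minimize the MSE over a plausible range of heights
opt <- optimize(mse, interval = c(60, 75))
c(opt$minimum, mean(galton$child))   # the two agree
# the MSE at the mean is no larger than at any other candidate value
mse(mean(galton$child)) <= mse(65)   # TRUE
```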

## Regression through the Origin

• Let $$X_i =$$ parents’ heights (regressor) and $$Y_i =$$ children’s heights (outcome)
• find the line through the origin (0, 0), $Y_i = X_i \beta$, whose slope $$\beta$$ minimizes $\sum_{i=1}^n (Y_i - X_i \beta)^2$
• Note: it is generally bad practice to force the line through (0, 0) unless the regressor and the outcome have been centered first, so that the origin corresponds to the means of the data; see the sketch below
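• in R, a through-the-origin fit is requested by adding - 1 to the model formula; a sketch on the galton data showing that centering both variables first makes the through-the-origin slope match the slope of the usual model with an intercept:

```r
library(UsingR); data(galton)
# regression through the origin on the raw data (usually a bad idea)
coef(lm(child ~ parent - 1, data = galton))
# center both variables so the origin sits at the means of the data
yc <- galton$child - mean(galton$child)
xc <- galton$parent - mean(galton$parent)
# now the through-the-origin slope equals the usual fitted slope
coef(lm(yc ~ xc - 1))
coef(lm(child ~ parent, data = galton))["parent"]
```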