- Introduction to Regression
- Notation
- Galton’s Data and Least Squares
- Regression through the Origin
- Finding the Best Fit Line (Ordinary Least Squares)
- Regression to the Mean
- Statistical Linear Regression Models
- Residuals
- Inference in Regression
- Multivariate Regression
- Dummy Variables
- Interactions
- Multivariable Simulation
- Residuals and Diagnostics
- Model Selection
- Rumsfeldian Triplet
- General Rules
- Example - \(R^2\) v \(n\)
- Adjusted \(R^2\)
- Example - Unrelated Regressors
- Example - Highly Correlated Regressors / Variance Inflation
- Example: Variance Inflation Factors
- Residual Variance Estimates
- Covariate Model Selection
- Example: ANOVA
- Example: Step-wise Model Search

- General Linear Models Overview
- General Linear Models - Binary Models
- General Linear Models - Poisson Models
- Fitting Functions

- linear regression/linear models \(\rightarrow\) the go-to procedure for analyzing data
- *Francis Galton* invented the terms and concepts of regression and correlation; he predicted children’s heights from their parents’ heights

- questions that regression can help answer:
- prediction of one thing from another
- finding a simple, interpretable, meaningful model to predict the data
- quantifying and investigating variation that is unexplained by, or unrelated to, the predictor \(\rightarrow\) **residual variation**, which captures the effects other factors may have on the outcome
- stating the assumptions needed to generalize findings beyond the data we have \(\rightarrow\) **statistical inference**
- **regression to the mean** (see below)

- regular letters (e.g. \(X\), \(Y\)) are generally used to denote **observed** variables
- Greek letters (e.g. \(\mu\), \(\sigma\)) are generally used to denote **unknown** quantities that we are trying to estimate
- \(X_1, X_2, \ldots, X_n\) describes \(n\) data points
- \(\bar X\), \(\bar Y\) = observed means for random variables \(X\) and \(Y\)
- \(\hat \beta_0\), \(\hat \beta_1\) = estimators for the true values of \(\beta_0\) and \(\beta_1\)

- the **empirical mean** is defined as \[\bar X = \frac{1}{n}\sum_{i=1}^n X_i\]
- **centering** the random variable is defined as \[\tilde X_i = X_i - \bar X\]
- the mean of \(\tilde X_i\) is 0

- the **empirical variance** is defined as \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n \bar X ^ 2 \right) \Leftarrow \mbox{shortcut for calculation}\] and measures the average squared distance between the observations and the mean
- the **empirical standard deviation** is defined as \(S = \sqrt{S^2}\)
- has the same units as the data

- **scaling** the random variable is defined as \(X_i / S\)
- the standard deviation of \(X_i / S\) is 1

- **normalizing** the data/random variable is defined as \[Z_i = \frac{X_i - \bar X}{S}\]
- empirical mean = 0, empirical standard deviation = 1
- the distribution is centered around 0, and the data have units of standard deviations away from the original mean
- *example*: \(Z_1 = 2\) means that the data point is 2 standard deviations larger than the original mean

- normalization makes non-comparable data **comparable**
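As a quick sketch of the definitions above (using a small made-up vector, not the Galton data), centering, scaling, and normalizing can be checked directly in R:

```
# illustrative sketch on a small made-up vector (not from the notes)
x <- c(1, 2, 3, 4, 10)
x_centered <- x - mean(x)            # centering: empirical mean becomes 0
x_scaled   <- x / sd(x)              # scaling: empirical sd becomes 1
z <- (x - mean(x)) / sd(x)           # normalizing: mean 0, sd 1
c(mean(x_centered), sd(x_scaled), mean(z), sd(z))
```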

- Let \((X_i, Y_i)\) = pairs of data
- the **empirical covariance** is defined as \[ Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X) (Y_i - \bar Y) = \frac{1}{n-1}\left( \sum_{i=1}^n X_i Y_i - n \bar X \bar Y\right) \]
- has units of \(X \times\) units of \(Y\)

- the **correlation** is defined as \[Cor(X, Y) = \frac{Cov(X, Y)}{S_x S_y}\] where \(S_x\) and \(S_y\) are the estimated standard deviations of the \(X\) and \(Y\) observations, respectively
- the value is effectively the covariance standardized into a unit-less quantity
- \(Cor(X, Y) = Cor(Y, X)\)
- \(-1 \leq Cor(X, Y) \leq 1\)
- \(Cor(X,Y) = 1\) and \(Cor(X, Y) = -1\) only when the \(X\) or \(Y\) observations fall perfectly on a positive or negative sloped line, respectively
- \(Cor(X, Y)\) measures the strength of the linear relationship between the \(X\) and \(Y\) data, with stronger relationships as \(Cor(X,Y)\) heads towards -1 or 1
- \(Cor(X, Y) = 0\) implies no linear relationship
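To tie the formulas above together, here is a small sketch (on simulated data, used purely for illustration) verifying the definitions against R’s built-in `cov()` and `cor()`:

```
# simulated data for illustration only
set.seed(1)
x <- rnorm(20); y <- x + rnorm(20)
n <- length(x)
# empirical covariance computed from the definition
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
# correlation = covariance standardized by the two standard deviations
cor_manual <- cov_manual / (sd(x) * sd(y))
all.equal(cov_manual, cov(x, y))   # TRUE
all.equal(cor_manual, cor(x, y))   # TRUE
```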

- Galton’s data, collected in 1885, is available in the `UsingR` package
- goal: predicting children’s heights from their parents’ heights
- the observations come from the marginal/individual parent and children distributions
- looking only at the children’s dataset to find the best predictor
- “middle” of children’s dataset \(\rightarrow\) best predictor
- “middle” \(\rightarrow\) center of mass \(\rightarrow\) mean of the dataset
- Let \(Y_i\) = height of child \(i\) for \(i = 1, \ldots, n = 928\); the “middle” is the value \(\mu\) that minimizes \[\sum_{i=1}^n (Y_i - \mu)^2\]
- the sum above is smallest when \(\mu = \bar Y\) \(\rightarrow\) **least squares = empirical mean**

**Note**: the `manipulate` function can help to show this interactively

```
# load necessary packages/install if needed
library(ggplot2); library(UsingR); data(galton)
# function to plot the histogram of child heights with a candidate center mu
myHist <- function(mu){
    # calculate the mean squared error for this choice of mu
    mse <- mean((galton$child - mu)^2)
    # plot histogram
    g <- ggplot(galton, aes(x = child)) + geom_histogram(fill = "salmon",
        colour = "black", binwidth = 1)
    # add vertical line marking the center value mu
    g <- g + geom_vline(xintercept = mu, size = 2)
    g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse, 2), sep = ""))
    g
}
# manipulate allows the user to change mu to see how the mean squared error changes
# library(manipulate); manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
# plot the graph at the minimizing value, mean(galton$child)
myHist(mean(galton$child))
```

In order to visualize the parent-child height relationship, a scatter plot can be used.

**Note**: because there are multiple data points for the same parent/child combination, a third dimension (size of point) should be used when constructing the scatter plot

```
library(ggplot2); library(dplyr); library(UsingR); data(galton)
# construct a frequency table for each parent-child height combination
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
# convert the factor levels back to numeric heights (in inches)
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
# keep only combinations that actually occur in the data
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 20), guide = "none")
# plot grey circles slightly larger than the data as a base (outline effect)
g <- g + geom_point(aes(size = freq + 10), colour = "grey50", show.legend = FALSE)
# plot the actual data points
g <- g + geom_point(aes(colour = freq, size = freq))
# change the colour gradient from the default to lightblue -> white
g <- g + scale_colour_gradient(low = "lightblue", high = "white")
g
```

- Let \(X_i =\) the **regressor**/**predictor** and \(Y_i =\) the **outcome**/**result**; we want to minimize the sum of squares \[\sum_{i=1}^n (Y_i - \mu)^2\]
- Proof is as follows \[
\begin{aligned}
\sum_{i=1}^n (Y_i - \mu)^2 & = \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \Leftarrow \mbox{added} \pm \bar Y \mbox{which is adding 0 to the original equation}\\
(expanding~the~terms)& = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (Y_i - \bar Y), (\bar Y - \mu) \mbox{ are the terms}\\
(simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (\bar Y - \mu) \mbox{ does not depend on } i \\
(simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n \bar Y \mbox{ is equivalent to }n \bar Y\\
(simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n Y_i - n \bar Y = 0 \mbox{ since } \sum_{i=1}^n Y_i = n \bar Y \\
\sum_{i=1}^n (Y_i - \mu)^2 & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \Leftarrow \sum_{i=1}^n (\bar Y - \mu)^2 \mbox{ is always} \geq 0 \mbox{ so we can take it out to form the inequality}
\end{aligned}
\]
- because of the inequality above, to minimize the sum of the squares \(\sum_{i=1}^n (Y_i - \mu)^2\), **\(\bar Y\) must be equal to \(\mu\)**

- An alternative approach to finding the minimum is taking the *derivative* with respect to \(\mu\) \[ \begin{aligned} \frac{d}{d\mu}\sum_{i=1}^n (Y_i - \mu)^2 & = 0 \Leftarrow \mbox{setting the derivative equal to 0 to find the minimum}\\ -2\sum_{i=1}^n (Y_i - \mu) & = 0 \Leftarrow \mbox{divide by -2 on both sides and move the } \mu \mbox{ term to the right}\\ \sum_{i=1}^n Y_i & = n \mu \Leftarrow \mbox{the sum of } n \mbox{ copies of } \mu \mbox{ is } n\mu\\ \mu & = \frac{1}{n}\sum_{i=1}^n Y_i = \bar Y \\ \end{aligned} \]
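A quick numeric check (on a made-up vector of heights) that the minimizer found by calculus is indeed the empirical mean:

```
# made-up heights for illustration
y <- c(64, 66, 68, 70, 72)
# numerically minimize the sum of squares over a bracketing interval
opt <- optimize(function(mu) sum((y - mu)^2), interval = range(y))
opt$minimum   # agrees with mean(y) = 68 up to optimize()'s tolerance
```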

- Let \(X_i =\) parents’ heights (**regressor**) and \(Y_i =\) children’s heights (**outcome**)
- find a line with slope \(\beta\) that passes through the origin \((0,0)\), \[Y_i = X_i \beta\] such that it minimizes \[\sum_{i=1}^n (Y_i - X_i \beta)^2\]
- **Note**: it is generally bad practice to force the line through \((0, 0)\) unless the data have been centered first
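As a sketch of the mechanics in R (using simulated stand-ins for the heights, not the Galton data): the `- 1` term in an `lm()` formula removes the intercept, and the slope that minimizes the sum of squares above has the closed form \(\hat\beta = \sum_{i=1}^n X_i Y_i / \sum_{i=1}^n X_i^2\).

```
# simulated stand-ins for parents' (x) and children's (y) heights
set.seed(42)
x <- rnorm(50, mean = 68, sd = 2)
y <- 0.65 * x + rnorm(50)
# center first so that forcing the line through (0, 0) is meaningful
xc <- x - mean(x); yc <- y - mean(y)
fit <- lm(yc ~ xc - 1)                      # "- 1" drops the intercept
# closed-form slope for regression through the origin
beta_manual <- sum(xc * yc) / sum(xc^2)
all.equal(unname(coef(fit)), beta_manual)   # TRUE
```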