# Correlation and Regression

## Feb 6, 2018 00:00 · 6259 words · 30 minute read

*This is a note from the DataCamp Course on Correlation and Regression. All examples and materials come from DataCamp.*

Data analysis is about understanding relationships among variables. Exploring data with multiple variables requires new, more complex tools, but enables a richer set of comparisons. Here we learn how to describe relationships between two numerical quantities. We will characterize these relationships graphically, in the form of summary statistics, and through simple linear regression models.

# Visualizing two variables

## Bivariate relationships

- Both variables are numerical

- Response variable

  - a.k.a. y, dependent

- Explanatory variable

  - Something you think might be related to the response

  - a.k.a. x, independent, predictor

## Scatterplots

Scatterplots are the most common and effective tools for visualizing the relationship between two numeric variables.

The `ncbirths` dataset is a random sample of 1,000 cases taken from a larger dataset collected in 2004. Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits) and the child’s father (e.g. age).

```
# Load the tidyverse
library(tidyverse)
# Load openintro, which provides the ncbirths data
library(openintro)
data(ncbirths)
glimpse(ncbirths)
```

```
## Observations: 1,000
## Variables: 13
## $ fage <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA,...
## $ mage <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16,...
## $ mature <fct> younger mom, younger mom, younger mom, younger ...
## $ weeks <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42,...
## $ premie <fct> full term, full term, full term, full term, ful...
## $ visits <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12...
## $ marital <fct> married, married, married, married, married, ma...
## $ gained <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34,...
## $ weight <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69,...
## $ lowbirthweight <fct> not low, not low, not low, not low, not low, lo...
## $ gender <fct> male, male, female, male, female, male, male, m...
## $ habit <fct> nonsmoker, nonsmoker, nonsmoker, nonsmoker, non...
## $ whitemom <fct> not white, not white, white, white, not white, ...
```

Using the `ncbirths` dataset, make a scatterplot using `ggplot()` to illustrate how the birth weight of these babies varies according to the number of weeks of gestation.

## Boxplots as discretized/conditioned scatterplots

If it is helpful, you can think of boxplots as scatterplots for which the variable on the x-axis has been discretized.

The `cut()` function takes two arguments: the continuous variable you want to discretize and `breaks`, which, when given as a single number, sets the number of intervals to cut that variable into.

Using the `ncbirths` dataset again, make a boxplot illustrating how the birth weight of these babies varies according to the number of weeks of gestation. This time, use the `cut()` function to discretize the x-variable into five intervals (`breaks = 5`).

```
# Boxplot of weight vs. weeks
ggplot(data = ncbirths,
aes(x = cut(weeks, breaks = 5), y = weight)) +
geom_boxplot()
```
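As a quick check on what `cut()` returns, the sketch below bins a small, made-up vector of gestation lengths (the values are illustrative and not taken from `ncbirths`). When `breaks` is a single number, base R treats it as the number of intervals to create.

```r
# Hypothetical gestation lengths in weeks (illustrative values only)
weeks <- c(20, 25, 30, 33, 36, 38, 39, 40, 42, 45)

# A single number for `breaks` asks for that many equal-width intervals
bins <- cut(weeks, breaks = 5)

nlevels(bins)  # number of intervals produced
table(bins)    # how many observations fall into each interval
```

Each level of the resulting factor becomes one box in the boxplot.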

Note how the relationship no longer seems linear.

## Characterizing bivariate relationships

- Form (e.g. linear, quadratic, non-linear)

- Direction (e.g. positive, negative)

- Strength (how much scatter/noise?)

- Outliers

## Creating scatterplots

Creating scatterplots is simple, and they are so useful that it is worthwhile to expose yourself to many examples. Over time, you will gain familiarity with the types of patterns that you see. You will begin to recognize how scatterplots can reveal the nature of the relationship between two variables.

In this exercise, and throughout this chapter, we will be using several datasets listed below. These data are available through the openintro package. Briefly:

- The `mammals` dataset contains information about 62 different species of mammals, including their body weight, brain weight, gestation time, and a few other variables.

- The `mlbBat10` dataset contains batting statistics for 1,199 Major League Baseball players during the 2010 season.

- The `bdims` dataset contains body girth and skeletal diameter measurements for 507 physically active individuals.

- The `smoking` dataset contains information on the smoking habits of 1,691 citizens of the United Kingdom.

- Using the `mammals` dataset, create a scatterplot illustrating how the brain weight of a mammal varies as a function of its body weight.

```
ggplot(mammals, aes(y= BrainWt, x=BodyWt)) +
geom_point() +
ylab("Brain Weight") +
xlab("Body Weight")
```

- Using the `mlbBat10` dataset, create a scatterplot illustrating how the slugging percentage (`SLG`) of a player varies as a function of his on-base percentage (`OBP`).

```
ggplot(mlbBat10, aes(y=SLG, x=OBP)) +
geom_point() +
ylab("Slugging percentage") +
xlab("On-base percentage")
```

- Using the `bdims` dataset, create a scatterplot illustrating how a person’s weight varies as a function of their height. Use color to separate by sex, which you’ll need to coerce to a factor with `factor()`.

```
ggplot(bdims,
aes(y=wgt, x=hgt, col=factor(sex, labels = c("Female", "Male")))) +
geom_point() +
ylab("Person's weight") +
xlab("Person's height")
```

- Using the `smoking` dataset, create a scatterplot illustrating how the amount that a person smokes on weekdays varies as a function of their age.

## Transformations

The relationship between two variables may not be linear. In these cases we can sometimes see strange and even inscrutable patterns in a scatterplot of the data. Sometimes there really is no meaningful relationship between the two variables. Other times, a careful transformation of one or both of the variables can reveal a clear relationship.

Recall the bizarre pattern that you saw in the scatterplot between brain weight and body weight among mammals in a previous exercise. Can we use transformations to clarify this relationship?

`ggplot2` provides several different mechanisms for viewing transformed relationships. The `coord_trans()` function transforms the coordinates of the plot. Alternatively, the `scale_x_log10()` and `scale_y_log10()` functions perform a base-10 log transformation of each axis. Note the differences in the appearance of the axes.
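The two mechanisms can be compared side by side. The sketch below uses a simulated, right-skewed data frame (`sim`, with hypothetical columns `body` and `brain`) as a stand-in for the `mammals` data, so it runs without the course datasets. Only the `ggplot2` calls named above are used.

```r
library(ggplot2)

# Hypothetical right-skewed data standing in for body and brain weights
set.seed(1)
sim <- data.frame(body = exp(rnorm(60, mean = 2, sd = 2)))
sim$brain <- sim$body^0.75 * exp(rnorm(60, mean = 0, sd = 0.4))

base_plot <- ggplot(sim, aes(x = body, y = brain)) +
  geom_point()

# Option 1: transform the coordinate system.
# Tick marks are chosen on the original scale, so they look unevenly spaced.
p_coord <- base_plot + coord_trans(x = "log10", y = "log10")

# Option 2: transform the scales.
# Tick marks are chosen on the log scale, so they look evenly spaced.
p_scale <- base_plot + scale_x_log10() + scale_y_log10()
```

Printing `p_coord` and `p_scale` shows the same point cloud with different axis tick placement.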

## Identifying outliers

In Chapter 5, we will discuss how outliers can affect the results of a linear regression model and how we can deal with them. For now, it is enough to simply identify them and note how the relationship between two variables may change as a result of removing outliers.

Recall that in the baseball example earlier in the chapter, most of the points were clustered in the lower left corner of the plot, making it difficult to see the general pattern of the majority of the data. This difficulty was caused by a few outlying players whose on-base percentages (OBPs) were exceptionally high. These values are present in our dataset only because these players had very few batting opportunities.

Both OBP and SLG are known as rate statistics, since they measure the frequency of certain events (as opposed to their count). In order to compare these rates sensibly, it makes sense to include only players with a reasonable number of opportunities, so that these observed rates have the chance to approach their long-run frequencies.

In Major League Baseball, batters qualify for the batting title only if they average at least 3.1 plate appearances per game. This translates into roughly 502 plate appearances in a 162-game season. The `mlbBat10` dataset does not include plate appearances as a variable, but we can use at-bats (`AB`), which constitute a subset of plate appearances, as a proxy.

```
# Scatterplot of SLG vs. OBP
mlbBat10 %>%
filter(AB >= 200) %>%
ggplot(aes(x = OBP, y = SLG)) +
geom_point()
```

```
## name team position G AB R H 2B 3B HR RBI TB BB SO SB CS OBP
## 1 B Wood LAA 3B 81 226 20 33 2 0 4 14 47 6 71 1 0 0.174
## SLG AVG
## 1 0.208 0.146
```

# Correlation

## Computing correlation

The `cor(x, y)` function will compute the Pearson product-moment correlation between variables `x` and `y`. Since this quantity is symmetric with respect to x and y, it doesn’t matter in which order you put the variables.

At the same time, the `cor()` function is very conservative when it encounters missing data (e.g. `NA`s). The `use` argument allows you to override the default behavior of returning `NA` whenever any of the values encountered is `NA`. Setting the `use` argument to `"pairwise.complete.obs"` allows `cor()` to compute the correlation coefficient for those observations where the values of x and y are both not missing.
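To see the effect of the `use` argument in isolation, here is a minimal sketch on two made-up vectors (not course data). The four complete pairs lie exactly on the line y = 2x, so the pairwise-complete correlation is 1, while the default call returns `NA`.

```r
# Hypothetical vectors with one missing value each, in different positions
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2, 4, NA, 8, 10, 12)

# Default behavior: any NA in the inputs makes the result NA
r_default <- cor(x, y)

# Only pairs where both x and y are observed are used
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

r_default
r_pairwise
```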

- Use `cor()` to compute the correlation between the birthweight of babies in the `ncbirths` dataset and their mother’s age. There is no missing data in either variable.

```
## N r
## 1 1000 0.05506589
```

- Compute the correlation between the birthweight and the number of weeks of gestation for all non-missing pairs.

```
# Compute correlation for all non-missing pairs
ncbirths %>%
summarize(N = n(),
r = cor(weeks, weight, use = "pairwise.complete.obs"))
```

```
## N r
## 1 1000 0.6701013
```

## Exploring Anscombe

In 1973, Francis Anscombe famously created four datasets with remarkably similar numerical properties, but obviously different graphic relationships. The Anscombe dataset contains the x and y coordinates for these four datasets, along with a grouping variable, set, that distinguishes the quartet.

It may be helpful to remind yourself of the graphic relationship by viewing the four scatterplots:

```
# Compute properties of Anscombe
Anscombe %>%
group_by(set) %>%
summarize(N = n(), mean(x), sd(x), mean(y), sd(y), cor(x, y))
```

```
## # A tibble: 4 x 7
## set N `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 11 9.00 3.32 7.50 2.03 0.816
## 2 2 11 9.00 3.32 7.50 2.03 0.816
## 3 3 11 9.00 3.32 7.50 2.03 0.816
## 4 4 11 9.00 3.32 7.50 2.03 0.817
```

Note that all of the measures are identical (ignoring rounding error) across the four different sets.
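The `Anscombe` data frame used above is supplied by the course. As a sketch for readers without it, base R ships the same quartet in wide form as `anscombe` (columns `x1` to `x4` and `y1` to `y4`); the name `Anscombe_long` below is a hypothetical stand-in built from it.

```r
# Reshape the built-in wide `anscombe` data into one row per (set, x, y)
Anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = as.character(i),
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

# Correlation within each of the four sets
quartet_cor <- sapply(split(Anscombe_long, Anscombe_long$set),
                      function(d) cor(d$x, d$y))
round(quartet_cor, 3)
```

Plotting `y` against `x` separately for each `set` is what reveals how different the four relationships are despite the near-identical summaries.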

## Perception of correlation

Estimating the value of the correlation coefficient between two quantities from their scatterplot can be tricky. Statisticians have shown that people’s perception of the strength of these relationships can be influenced by design choices like the x and y scales.

Nevertheless, with some practice your perception of correlation will improve. Look back at the four scatterplots from the previous exercises and jot down your best estimate of the value of the correlation coefficient between each pair of variables. Then, compare these values to the actual values you compute in this exercise.

Compute the correlation between…

- OBP and SLG for all players in the mlbBat10 dataset.

```
## N r
## 1 1199 0.8145628
```

- OBP and SLG for all players in the mlbBat10 dataset with at least 200 at-bats.

```
# Correlation for all players with at least 200 ABs
mlbBat10 %>%
filter(AB >= 200) %>%
summarize(N = n(), r = cor(OBP, SLG))
```

```
## N r
## 1 329 0.6855364
```

- Height and weight for each sex in the bdims dataset.

```
## # A tibble: 2 x 3
## sex N r
## <int> <int> <dbl>
## 1 0 260 0.431
## 2 1 247 0.535
```

- Body weight and brain weight for all species of mammals. Alongside this computation, compute the correlation between the same two quantities after taking their natural logarithms.

```
# Correlation among mammals, with and without log
mammals %>%
summarize(N = n(),
r = cor(BodyWt, BrainWt),
r_log = cor(log(BodyWt), log(BrainWt)))
```

```
## N r r_log
## 1 62 0.9341638 0.9595748
```

## Spurious correlation in random data

Statisticians must always be skeptical of potentially spurious correlations. Human beings are very good at seeing patterns in data, sometimes when the patterns themselves are actually just random noise. To illustrate how easy it can be to fall into this trap, we will look for patterns in truly random data.

The `noise` dataset contains 20 sets of `x` and `y` variables drawn at random from a standard normal distribution. Each set, denoted as `z`, has 50 observations of `x`, `y` pairs. Do you see any pairs of variables that might be meaningfully correlated? Are all of the correlation coefficients close to zero?

```
# Compute correlations for each dataset
noise_summary <- noise %>%
group_by(z) %>%
summarize(N = n(), spurious_cor = cor(x, y))
# Isolate sets with correlations above 0.2 in absolute strength
noise_summary %>%
filter(abs(spurious_cor) > 0.2)
```

```
## # A tibble: 3 x 3
## z N spurious_cor
## <int> <int> <dbl>
## 1 2 50 0.239
## 2 8 50 0.221
## 3 9 50 0.272
```
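The `noise` data frame is supplied by the course. A comparable one can be simulated, as in this sketch; `noise_sim` and the seed are assumptions made for illustration, so the specific sets flagged will differ from the output above.

```r
set.seed(2018)  # arbitrary seed, used only for reproducibility

# 20 sets (z) of 50 independent standard normal (x, y) pairs
noise_sim <- data.frame(z = rep(1:20, each = 50),
                        x = rnorm(1000),
                        y = rnorm(1000))

# Sample correlation within each set
spurious_cor <- sapply(split(noise_sim, noise_sim$z),
                       function(d) cor(d$x, d$y))

# Sets whose sample correlation exceeds 0.2 in absolute value
spurious_cor[abs(spurious_cor) > 0.2]
```

With only 50 observations per set, the sample correlation of truly independent variables has a standard error of roughly 1/sqrt(50), or about 0.14, so a few values beyond 0.2 are unsurprising.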

# Simple linear regression

## The “best fit” line

The simple linear regression model for a numeric response as a function of a numeric explanatory variable can be visualized on the corresponding scatterplot by a straight line. This is a “best fit” line that cuts through the data in a way that minimizes the sum of squared vertical distances between the line and the data points.

We might consider linear regression to be a specific example of a larger class of smooth models. The `geom_smooth()` function allows you to draw such models over a scatterplot of the data itself. This technique is known as visualizing the model in the data space. The `method` argument to `geom_smooth()` allows you to specify what class of smooth model you want to see. Since we are exploring linear models, we’ll set this argument to the value `"lm"`.

Note that `geom_smooth()` also takes an `se` argument that controls whether the standard-error band around the line is drawn, which we will ignore for now.
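A minimal sketch of this pattern follows. It uses the built-in `cars` data (columns `speed` and `dist`) as a stand-in so that it runs without the course datasets; the same two layers apply to any pair of numeric variables.

```r
library(ggplot2)

# Scatterplot with a least squares line drawn in the data space
p_fit <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # se = FALSE hides the error band

p_fit
```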

## Fitting a linear model “by hand”

Recall the simple linear regression model:

\[Y = b_0 + b_1 \cdot X\]

Two facts enable you to compute the slope \(b_1\) and intercept \(b_0\) of a simple linear regression model from some basic summary statistics.

First, the slope can be defined as:

\[b_1 = r_{X,Y} \cdot \frac{s_Y}{s_X}\]

where \(r_{X,Y}\) represents the correlation (`cor()`) of \(X\) and \(Y\) and \(s_X\) and \(s_Y\) represent the standard deviation (`sd()`) of \(X\) and \(Y\), respectively.

Second, the point \((\bar{x}, \bar{y})\) is always on the least squares regression line, where \(\bar{x}\) and \(\bar{y}\) denote the average of \(x\) and \(y\), respectively.

```
bdims_summary <- structure(list(N = 507L, r = 0.717301078724164, mean_hgt = 171.143786982249,
sd_hgt = 9.40720520351795, mean_wgt = 69.1475345167653, sd_wgt = 13.34576248554), class = "data.frame", row.names = c(NA,
-1L), .Names = c("N", "r", "mean_hgt", "sd_hgt", "mean_wgt",
"sd_wgt"))
# Print bdims_summary
bdims_summary
```

```
## N r mean_hgt sd_hgt mean_wgt sd_wgt
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576
```

```
# Add slope and intercept
bdims_summary %>%
mutate(slope = r * sd_wgt / sd_hgt,
intercept = mean_wgt - slope * mean_hgt)
```

```
## N r mean_hgt sd_hgt mean_wgt sd_wgt slope intercept
## 1 507 0.7173011 171.1438 9.407205 69.14753 13.34576 1.017617 -105.0113
```
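The same two facts can be checked against `lm()` directly. This sketch uses the built-in `cars` data as a stand-in (an assumption made so the code runs without `bdims`): the slope is \(r \cdot s_Y / s_X\), and the intercept follows from the line passing through \((\bar{x}, \bar{y})\).

```r
x <- cars$speed
y <- cars$dist

# Slope and intercept from summary statistics only
slope <- cor(x, y) * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)

# Coefficients from the least squares fit, in the order (intercept, slope)
fit <- coef(lm(dist ~ speed, data = cars))

all.equal(unname(fit), c(intercept, slope))
```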

## Regression to the mean

Regression to the mean is a concept attributed to Sir Francis Galton. The basic idea is that extreme random observations will tend to be less extreme upon a second trial. This is simply due to chance alone. While “regression to the mean” and “linear regression” are not the same thing, we will examine them together in this exercise.

One way to see the effects of regression to the mean is to compare the heights of parents to their children’s heights. While it is true that tall mothers and fathers tend to have tall children, those children tend to be less tall than their parents, relative to average. That is, fathers who are 3 inches taller than the average father tend to have children who may be taller than average, but by less than 3 inches.

The `Galton_men` and `Galton_women` datasets contain data originally collected by Galton himself in the 1880s on the heights of men and women, respectively, along with their parents’ heights.

Compare the slope of the regression line to the slope of the diagonal line. What does this tell you?

```
# Recreate the DataCamp datasets from HistData::GaltonFamilies
# install.packages("HistData")
library(HistData)
library(dplyr)
data(GaltonFamilies)
Galton_men <- GaltonFamilies %>%
filter(gender == "male") %>%
rename(sex=gender, height=childHeight, nkids=children)
Galton_women <- GaltonFamilies %>%
filter(gender == "female") %>%
rename(sex=gender, height=childHeight, nkids=children)
```

```
# Height of children vs. height of father
ggplot(data = Galton_men, aes(x = father, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)
```

```
# Height of children vs. height of mother
ggplot(data = Galton_women, aes(x = mother, y = height)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
geom_smooth(method = "lm", se = FALSE)
```

Because the slope of the regression line is smaller than 1 (the slope of the diagonal line) for both males and females, we can verify Sir Francis Galton’s regression to the mean concept!

In an opinion piece about nepotism published in The New York Times in 2015, economist Seth Stephens-Davidowitz wrote that:

“Regression to the mean is so powerful that once-in-a-generation talent basically never sires once-in-a-generation talent. It explains why Michael Jordan’s sons were middling college basketball players and Jakob Dylan wrote two good songs. It is why there are no American parent-child pairs among Hall of Fame players in any major professional sports league.”

# Interpreting regression models

## Fitting simple linear models

While the `geom_smooth(method = "lm")` function is useful for drawing linear models on a scatterplot, it doesn’t actually return the characteristics of the model. As suggested by that syntax, however, the function that creates linear models is `lm()`. This function generally takes two arguments:

- A formula that specifies the model

- A data argument for the data frame that contains the data you want to use to fit the model

The `lm()` function returns a model object having class `"lm"`. This object contains lots of information about your regression model, including the data used to fit the model, the specification of the model, the fitted values and residuals, etc.

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Coefficients:
## (Intercept) hgt
## -105.011 1.018
```

```
##
## Call:
## lm(formula = SLG ~ OBP, data = mlbBat10)
##
## Coefficients:
## (Intercept) OBP
## 0.009407 1.110323
```

```
# Log-log model for body weight as a function of brain weight
lm(log(BodyWt) ~ log(BrainWt), data=mammals)
```

```
##
## Call:
## lm(formula = log(BodyWt) ~ log(BrainWt), data = mammals)
##
## Coefficients:
## (Intercept) log(BrainWt)
## -2.509 1.225
```

## The lm summary output

An `"lm"` object contains a host of information about the regression model that you fit. There are various ways of extracting different pieces of information.

The `coef()` function displays only the values of the coefficients. By contrast, the `summary()` function displays not only that information, but a bunch of other information, including the associated standard error and p-value for each coefficient, the \(R^2\), adjusted \(R^2\), and the residual standard error. The summary of an `"lm"` object in R is very similar to the output you would see in other statistical computing environments (e.g. Stata, SPSS, etc.).

We have already created the `mod` object, a linear model for the weight of individuals as a function of their height, using the `bdims` dataset and the code

`mod <- lm(wgt ~ hgt, data = bdims)`

```
## (Intercept) hgt
## -105.011254 1.017617
```

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
```
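Beyond printing, individual pieces of the summary can be pulled out by name. The sketch below uses the built-in `cars` data and a hypothetical model name `mod_cars` so that it runs on its own; the same accessors apply to `mod`.

```r
mod_cars <- lm(dist ~ speed, data = cars)

# Coefficients only
coef(mod_cars)

# The summary object stores its parts as list elements
mod_summary <- summary(mod_cars)
mod_summary$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
mod_summary$r.squared      # R^2
mod_summary$adj.r.squared  # adjusted R^2
```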

## Fitted values and residuals

Once you have fit a regression model, you are often interested in the fitted values (\(\hat{y}_i\)) and the residuals (\(e_i\)), where \(i\) indexes the observations. Recall that:

\[e_i = y_i - \hat{y}_i\]

The least squares fitting procedure guarantees that the mean of the residuals is zero (n.b., numerical instability may result in the computed values not being exactly zero). At the same time, the mean of the fitted values must equal the mean of the response variable.

In this exercise, we will confirm these two mathematical facts by accessing the fitted values and residuals with the `fitted.values()` and `residuals()` functions, respectively, for the following model:

`mod <- lm(wgt ~ hgt, data = bdims)`

`## [1] TRUE`

`## [1] -1.266971e-15`
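The two values above are consistent with checks of the following form. This is a sketch on the built-in `cars` data with a hypothetical model `mod_cars`, so its printed residual mean will differ from the one shown for `bdims`, though it should also be zero up to floating-point error.

```r
mod_cars <- lm(dist ~ speed, data = cars)

# The mean of the fitted values equals the mean of the response
means_match <- all.equal(mean(cars$dist), mean(fitted.values(mod_cars)))
means_match

# The mean of the residuals is zero up to numerical error
resid_mean <- mean(residuals(mod_cars))
resid_mean
```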

## Tidying your linear model

As you fit a regression model, there are some quantities (e.g. \(R^2\)) that apply to the model as a whole, while others apply to each observation (e.g. \(\hat{y}_i\)). If there are several of these per-observation quantities, it is sometimes convenient to attach them to the original data as new variables.

The `augment()` function from the `broom` package does exactly this. It takes a model object as an argument and returns a data frame that contains the data on which the model was fit, along with several quantities specific to the regression model, including the fitted values, residuals, leverage scores, and standardized residuals.

```
# Load broom
library(broom)
# Create bdims_tidy
bdims_tidy <- augment(mod)
# Glimpse the resulting data frame
glimpse(bdims_tidy)
```

```
## Observations: 507
## Variables: 9
## $ wgt <dbl> 65.6, 71.8, 80.7, 72.6, 78.8, 74.8, 86.4, 78.4, 62....
## $ hgt <dbl> 174.0, 175.3, 193.5, 186.5, 187.2, 181.5, 184.0, 18...
## $ .fitted <dbl> 72.05406, 73.37697, 91.89759, 84.77427, 85.48661, 7...
## $ .se.fit <dbl> 0.4320546, 0.4520060, 1.0667332, 0.7919264, 0.81834...
## $ .resid <dbl> -6.4540648, -1.5769666, -11.1975919, -12.1742745, -...
## $ .hat <dbl> 0.002154570, 0.002358152, 0.013133942, 0.007238576,...
## $ .sigma <dbl> 9.312824, 9.317005, 9.303732, 9.301360, 9.312471, 9...
## $ .cooksd <dbl> 5.201807e-04, 3.400330e-05, 9.758463e-03, 6.282074e...
## $ .std.resid <dbl> -0.69413418, -0.16961994, -1.21098084, -1.31269063,...
```

## Making predictions

The `fitted.values()` function or the `augment()`-ed data frame provides us with the fitted values for the observations that were in the original data. However, once we have fit the model, we may want to compute expected values for observations that were not present in the data on which the model was fit. These types of predictions are called out-of-sample.

The `ben` data frame contains a height and weight observation for one person. The `mod` object contains the fitted model for weight as a function of height for the observations in the `bdims` dataset. We can use the `predict()` function to generate expected values for the weight of new individuals. We must pass the data frame of new observations through the `newdata` argument.

```
# Generate ben
ben <- structure(list(wgt = 74.8, hgt = 182.8), .Names = c("wgt", "hgt"
), row.names = c(NA, -1L), class = "data.frame")
# Print ben
ben
```

```
## wgt hgt
## 1 74.8 182.8
```

```
## 1
## 81.00909
```

Note that the data frame `ben` has variables with the exact same names as those in the fitted model.
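The prediction shown above is consistent with a call of the form `predict(mod, newdata = ben)`. The sketch below shows the same pattern on the built-in `cars` data, with a hypothetical model `mod_cars` and a made-up new observation, and confirms that the prediction is just the fitted line evaluated at the new value.

```r
mod_cars <- lm(dist ~ speed, data = cars)

# A hypothetical new observation; the column name must match the model's
new_obs <- data.frame(speed = 21)

pred <- predict(mod_cars, newdata = new_obs)
pred

# Same value computed by hand: intercept + slope * 21
by_hand <- sum(coef(mod_cars) * c(1, 21))
all.equal(unname(pred), by_hand)
```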

## Adding a regression line to a plot manually

The `geom_smooth()` function makes it easy to add a simple linear regression line to a scatterplot of the corresponding variables. And in fact, there are more complicated regression models that can be visualized in the data space with `geom_smooth()`. However, there may still be times when we will want to add regression lines to our scatterplot manually. To do this, we will use the `geom_abline()` function, which takes slope and intercept arguments. Naturally, we have to compute those values ahead of time, but we already saw how to do this (e.g. using `coef()`).

The `coefs` data frame contains the model estimates retrieved from `coef()`. Passing this to `geom_abline()` as the data argument will enable you to draw a straight line on your scatterplot.

```
# Make coefs df
coefs <- structure(list(`(Intercept)` = -105.011254168143, hgt = 1.01761677567046), .Names = c("(Intercept)",
"hgt"), row.names = c(NA, -1L), class = "data.frame")
# Add the line to the scatterplot
ggplot(data = bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(data = coefs,
aes(intercept = `(Intercept)`, slope = hgt),
color = "dodgerblue")
```

# Model Fit

## Standard error of residuals

One way to assess strength of fit is to consider how far off the model is for a typical case. That is, for some observations, the fitted value will be very close to the actual value, while for others it will not. The magnitude of a typical residual can give us a sense of generally how close our estimates are.

Consider the following examples:

```
ggplot(data = textbooks, aes(x = amazNew, y = uclaNew)) +
geom_point() + geom_smooth(method = "lm", se = FALSE)
```

```
ggplot(data = possum, aes(y = totalL, x = tailL)) +
geom_point() + geom_smooth(method = "lm", se = FALSE)
```

Some of the residuals are positive, while others are negative. In fact, it is guaranteed by the least squares fitting procedure that the mean of the residuals is zero. Thus, it makes more sense to compute the square root of the mean squared residual, or *root mean squared error* (\(RMSE\)). R calls this quantity the *residual standard error*.

```
library(broom)
mod_possum <- lm(totalL ~ tailL, data = possum)
mod_possum %>%
augment() %>%
summarize(SSE = sum(.resid^2),
SSE_also = (n() - 1) * var(.resid))
```

```
## SSE SSE_also
## 1 1301.488 1301.488
```

To make this estimate unbiased, you have to divide the sum of the squared residuals by the degrees of freedom in the model. Thus,

\[RMSE= \sqrt{ \frac{\sum_i{e_i^2}}{d.f.} } = \sqrt{ \frac{SSE}{d.f.} }\]

In fact, \(RMSE\) is reported as the residual standard error whenever `summary()` is called on a model. You can also recover the residuals from `mod` with `residuals()`, and the degrees of freedom with `df.residual()`.

Going back to `mod`:

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
```

`## [1] -1.266971e-15`

`## [1] 9.30804`
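The \(RMSE\) formula can be checked against the residual standard error that `summary()` reports. This is a sketch on the built-in `cars` data with a hypothetical model `mod_cars`; the same steps apply to `mod`.

```r
mod_cars <- lm(dist ~ speed, data = cars)

# Sum of squared residuals divided by the residual degrees of freedom
sse <- sum(residuals(mod_cars)^2)
rmse <- sqrt(sse / df.residual(mod_cars))
rmse

# Matches the residual standard error stored in the summary
all.equal(rmse, summary(mod_cars)$sigma)
```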

## Assessing simple linear model fit

Recall that the coefficient of determination (\(R^2\)), can be computed as \[R^2= 1 - \frac{SSE}{SST} = 1 - \frac{Var(e)}{Var(y)} \,\] where \(e\) is the vector of residuals and \(y\) is the response variable. This gives us the interpretation of \(R^2\) as the percentage of the variability in the response that is explained by the model, since the residuals are the part of that variability that remains unexplained by the model.

```
##
## Call:
## lm(formula = wgt ~ hgt, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.743 -6.402 -1.231 5.059 41.103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105.01125 7.53941 -13.93 <2e-16 ***
## hgt 1.01762 0.04399 23.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.308 on 505 degrees of freedom
## Multiple R-squared: 0.5145, Adjusted R-squared: 0.5136
## F-statistic: 535.2 on 1 and 505 DF, p-value: < 2.2e-16
```

```
# Compute R-squared
bdims_tidy %>%
summarize(var_y = var(wgt), var_e = var(residuals(mod))) %>%
mutate(R_squared = 1- (var_e/var_y))
```

```
## var_y var_e R_squared
## 1 178.1094 86.46839 0.5145208
```

This means that about 51.5% of the variability in weight is explained by height.

## Linear vs. average

The \(R^2\) gives us a numerical measurement of the strength of fit relative to a null model based on the average of the response variable: \[\hat{y}_{null} = \bar{y}\]

This model has an \(R^2\) of zero because \(SSE = SST\). That is, since the fitted values (\(\hat{y}_{null}\)) are all equal to the average (\(\bar{y}\)), the residual for each observation is the distance between that observation and the mean of the response. Since we can always fit the null model, it serves as a baseline against which all other models will be compared.

In the graphic, we visualize the residuals for the null model (`mod_null` at left) vs. the simple linear regression model (`mod_hgt` at right) with height as a single explanatory variable. Try to convince yourself that, if you squared the lengths of the grey arrows on the left and summed them up, you would get a larger value than if you performed the same operation on the grey arrows on the right.
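The comparison can also be made numerically. In this sketch the built-in `cars` data stands in for `bdims`, and `mod_lin` is a hypothetical name for the one-variable model; the null model is fit with the intercept-only formula `y ~ 1`.

```r
# Null model: every fitted value is the mean of the response
mod_null <- lm(dist ~ 1, data = cars)
# Simple linear regression with one explanatory variable
mod_lin <- lm(dist ~ speed, data = cars)

sse_null <- sum(residuals(mod_null)^2)  # equals SST
sse_lin <- sum(residuals(mod_lin)^2)

# R^2 measures the reduction in squared error relative to the null model
r_squared <- 1 - sse_lin / sse_null
c(sse_null = sse_null, sse_lin = sse_lin, r_squared = r_squared)
```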

## Leverage

The leverage of an observation in a regression model is defined entirely in terms of the distance of that observation from the mean of the explanatory variable. That is, observations close to the mean of the explanatory variable have low leverage, while observations far from the mean of the explanatory variable have high leverage. Points of high leverage may or may not be influential.

The `augment()` function from the `broom` package will add the leverage scores (`.hat`) to a model data frame.

```
mod <- lm(formula = SLG ~ OBP, data = filter(mlbBat10, AB >= 10))
ggplot(data = filter(mlbBat10, AB >= 10),
aes (x = OBP, y = SLG)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
```

```
## SLG OBP .fitted .se.fit .resid .hat .sigma
## 1 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 2 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 3 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 4 0.308 0.550 0.69049108 0.009158810 -0.38249108 0.01641049 0.07011360
## 5 0.000 0.037 0.01152451 0.008770891 -0.01152451 0.01504981 0.07154283
## 6 0.038 0.038 0.01284803 0.008739031 0.02515197 0.01494067 0.07153800
## .cooksd .std.resid
## 1 0.0027664282 0.5289049
## 2 0.0027664282 0.5289049
## 3 0.0027664282 0.5289049
## 4 0.2427446800 -5.3943121
## 5 0.0002015398 -0.1624191
## 6 0.0009528017 0.3544561
```
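For simple linear regression, leverage has a closed form that depends only on the explanatory variable: \(h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2\). The sketch below checks this against base R's `hatvalues()`, using the built-in `cars` data and a hypothetical model `mod_cars` as stand-ins.

```r
mod_cars <- lm(dist ~ speed, data = cars)
x <- cars$speed

# Leverage computed from the distance of each x from the mean of x
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# Agrees with the leverage scores R computes from the fitted model
all.equal(unname(hatvalues(mod_cars)), h_manual)

# Observations with the highest leverage
head(sort(hatvalues(mod_cars), decreasing = TRUE))
```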

## Influence

As noted previously, observations of high leverage may or may not be *influential*. The influence of an observation depends not only on its leverage, but also on the magnitude of its residual. Recall that while leverage only takes into account the explanatory variable (\(x\)), the residual depends on the response variable (\(y\)) and the fitted value (\(\hat{y}\)).

Influential points are likely to have high leverage and deviate from the general relationship between the two variables. We measure influence using Cook’s distance, which incorporates both the leverage and residual of each observation.

```
## SLG OBP .fitted .se.fit .resid .hat .sigma
## 1 0.308 0.550 0.69049108 0.009158810 -0.3824911 0.016410487 0.07011360
## 2 0.833 0.385 0.47211002 0.004190644 0.3608900 0.003435619 0.07028875
## 3 0.800 0.455 0.56475653 0.006186785 0.2352435 0.007488132 0.07101125
## 4 0.379 0.133 0.13858258 0.005792344 0.2404174 0.006563752 0.07098798
## 5 0.786 0.438 0.54225666 0.005678026 0.2437433 0.006307223 0.07097257
## 6 0.231 0.077 0.06446537 0.007506974 0.1665346 0.011024863 0.07127661
## .cooksd .std.resid
## 1 0.24274468 -5.394312
## 2 0.04407145 5.056428
## 3 0.04114818 3.302718
## 4 0.03760256 3.373787
## 5 0.03712042 3.420018
## 6 0.03057912 2.342252
```
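To make the role of leverage and residual explicit, Cook’s distance can be written as \(D_i = \frac{e_i^2}{p \, s^2} \cdot \frac{h_i}{(1 - h_i)^2}\), where \(p\) is the number of coefficients and \(s^2\) is the squared residual standard error. The sketch below checks this against base R's `cooks.distance()`, using the built-in `cars` data and a hypothetical model `mod_cars` as stand-ins.

```r
mod_cars <- lm(dist ~ speed, data = cars)

e <- residuals(mod_cars)                 # residuals
h <- hatvalues(mod_cars)                 # leverage scores
p <- length(coef(mod_cars))              # number of coefficients (2 here)
s2 <- sum(e^2) / df.residual(mod_cars)   # squared residual standard error

# Large residual and high leverage together give a large Cook's distance
cooks_manual <- e^2 / (p * s2) * h / (1 - h)^2

all.equal(unname(cooks_manual), unname(cooks.distance(mod_cars)))
```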

## Removing outliers

Observations can be outliers for a number of different reasons. Statisticians must always be careful, and more importantly transparent, when dealing with outliers. Sometimes, a better model fit can be achieved by simply removing outliers and re-fitting the model. However, one must have strong justification for doing this. A desire to have a higher \(R^2\) is not a good enough reason!

In the `mlbBat10` data, the outlier with an OBP of 0.550 is Bobby Scales, an infielder who had four hits in 13 at-bats for the Chicago Cubs. Scales also walked seven times, resulting in his unusually high OBP. The justification for removing Scales here is weak. While his performance was unusual, there is nothing to suggest that it is not a valid data point, nor is there a good reason to think that somehow we will learn more about Major League Baseball players by excluding him.

Nevertheless, we can demonstrate how removing him will affect our model.

```
# Create nontrivial_players
nontrivial_players <- filter(mlbBat10, AB>9, OBP<0.5 )
# Fit model to new data
mod_cleaner <- lm(SLG ~ OBP, data = nontrivial_players)
# View model summary
summary(mod_cleaner)
```

```
##
## Call:
## lm(formula = SLG ~ OBP, data = nontrivial_players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31383 -0.04165 -0.00261 0.03992 0.35819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043326 0.009823 -4.411 1.18e-05 ***
## OBP 1.345816 0.033012 40.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07011 on 734 degrees of freedom
## Multiple R-squared: 0.6937, Adjusted R-squared: 0.6932
## F-statistic: 1662 on 1 and 734 DF, p-value: < 2.2e-16
```

## High leverage points

Not all points of high leverage are influential. While the high leverage observation corresponding to Bobby Scales in the previous exercise is influential, the three observations for players with OBP and SLG values of 0 are not influential.

This is because they happen to lie right near the regression line anyway. Thus, while their extremely low OBP gives them the power to exert influence over the slope of the regression line, their low SLG prevents them from using it.

```
## SLG OBP .fitted .se.fit .resid .hat .sigma
## 1 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 2 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 3 0.000 0.000 -0.03744579 0.009956861 0.03744579 0.01939493 0.07153050
## 4 0.308 0.550 0.69049108 0.009158810 -0.38249108 0.01641049 0.07011360
## 5 0.000 0.037 0.01152451 0.008770891 -0.01152451 0.01504981 0.07154283
## 6 0.038 0.038 0.01284803 0.008739031 0.02515197 0.01494067 0.07153800
## .cooksd .std.resid
## 1 0.0027664282 0.5289049
## 2 0.0027664282 0.5289049
## 3 0.0027664282 0.5289049
## 4 0.2427446800 -5.3943121
## 5 0.0002015398 -0.1624191
## 6 0.0009528017 0.3544561
```