Understanding lm() output

Scott Creel
31 Aug 14

Conservation Biology
BIOE 440R & BIOE 521

You must be online to view the equations in this presentation

Ordinary Least Squares (OLS) Regression

Regression is a tool to test for a relationship between continuous variables. Starting with a straight-line relationship between two variables:

\[ \widehat{Y_{i}} = B_{0} + B_{1}*X_{i} \]

\[ Y_{i} = \widehat{Y_{i}} + \epsilon_{i} \]

\[ Y_{i} = B_{0} + B_{1}*X_{i} +\epsilon_{i} \]

OLS Regression

The estimated regression coefficients are

  • y-intercept = \( B_{0} \) or \( \widehat{\beta_{0} } \),
  • slope = \( B_{1} \) or \( \widehat{\beta_{1} } \)

for the best fitting line that relates Y to X.

\( \beta_{0} \) and \( \beta_{1} \) are the true intercept and slope for the entire population.

\( B_{0} \) and \( B_{1} \) are the estimated intercept and slope from a sample of that population.

OLS Regression

The least squares estimates of the regression coefficients yield

\[ \min(\sum_{i=1}^{N}\epsilon_i^{2}) \]

That is, with the OLS estimates of \( \widehat{\beta_{0}} \) and \( \widehat{\beta_{1}} \), the sum of the squared residuals, \( \sum_{i=1}^{N}\epsilon_{i}^{2} \), is as small as possible.

Understanding Residuals

(Figure: home range size vs. pack size with the fitted regression line; the vertical distance from each point to the line is its residual)

For each point, the residual error ('residual') \( \epsilon_{i} \) is the difference between the home range size predicted by the regression and the actual home range size observed.

If all of the points fell right on the line, \( \sum_{i=1}^{N}\epsilon_i^{2} \) would be zero.
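As a quick numerical illustration (a sketch using made-up x and y values, not the data in the worked example below), the intercept and slope that minimize the sum of squared residuals can be found directly with optim(), and they match what lm() returns:

x <- c(1, 2, 3, 4, 5)            # hypothetical values for illustration only
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# sum of squared residuals for a candidate intercept b[1] and slope b[2]
ssr <- function(b) sum((y - (b[1] + b[2] * x))^2)

optim(c(0, 0), ssr)$par          # coefficients found by direct minimization
coef(lm(y ~ x))                  # essentially the same coefficients from lm()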

A worked example with R code.

First, input some simple data with two continuous variables.

packsize <- c(9,8,10,6,13,10,6,15,18,17)
#pack size as number of adults
homerange <- c(25,20,28,36,22,46,52,41,59,59)  
#home range size in km2

Hypothesize that home range size depends on pack size.

Plot the data for an initial evaluation

plot(y = homerange, x = packsize, xlab = "Pack Size (adults)", ylab = "Home Range (km2)", 
     col = 'red', pch = 19, cex = 2.5, cex.axis = 1.3, cex.lab = 1.3)

# col = 'red' sets the point colour, pch sets the point type
# cex sets the point size, cex.axis and cex.lab set the size of axis tickmarks and labels

(the plot from this code is on the next slide)

(Figure: scatterplot of home range size against pack size)

There is a positive relationship, but is it strong enough to be meaningful?

OLS Regression tests how likely it is to get data with the observed relationship, if pack size does not actually affect home range size in a causal manner.

Formally, the OLS regression tests the hypothesis:

\[ H_{0}: \beta_{1} = 0 \]

\[ H_{A}: \beta_{1} \neq 0 \]

Using a t-test:

\[ t = \frac{B_{1}}{SE_{B_{1}}} \]

Fit a linear OLS regression to the data and display the coefficients:

mod1 <- lm(homerange ~ packsize)
mod1

Call:
lm(formula = homerange ~ packsize)

Coefficients:
(Intercept)     packsize  
      20.75         1.61  

homerange = 20.75 + 1.61 * pack size
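For example, the fitted equation predicts the home range of a pack of 10 adults (a quick check with predict(); the value 10 is just an illustration):

predict(mod1, newdata = data.frame(packsize = 10))
# 20.75 + 1.61 * 10, or about 36.9 km2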

Is an increase of 1.61 km2 per pack member significant?

Use the summary() function to test 'statistical significance'

summary(mod1)

Call:
lm(formula = homerange ~ packsize)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.702  -9.907   0.828   9.212  21.583 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    20.75      12.78    1.62     0.14
packsize        1.61       1.07    1.50     0.17

Residual standard error: 14 on 8 degrees of freedom
Multiple R-squared:  0.221, Adjusted R-squared:  0.123 
F-statistic: 2.26 on 1 and 8 DF,  p-value: 0.171

In this example, the estimated slope is 1.612 and its standard error is 1.071, so the t-statistic is 1.612/1.071 = 1.505.

There are 8 degrees of freedom (10 points - 2 parameters estimated = 8). With the t-statistic and df, we can determine the probability of getting a slope this steep by chance (if \( H_{0} \) is true), which is 0.171 or 17.1%.

In other words, the 'P-value' is 0.171. This is labelled Pr(>|t|) in the summary table, because it is the probability of getting an absolute value of t larger than the calculated value, by chance, if the null hypothesis were actually correct.
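Both numbers can be reproduced from the summary table (a quick check, using the estimates reported above):

t.stat <- 1.612 / 1.071          # slope divided by its standard error
t.stat                           # about 1.505
2 * pt(-abs(t.stat), df = 8)     # two-tailed P-value, about 0.171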

This is a two-tailed or nondirectional test, because either a positive or a negative slope would be considered reason to reject the null hypothesis \( H_{0} \) and accept the alternate hypothesis \( H_{A} \).

In our example, the logic of the hypothesis is inherently directional.

\[ H_{0}: \beta_{1} \leq 0 \]

\[ H_{A}: \beta_{1} > 0 \]

So the proper P-value is the probability, under \( H_{0} \), of obtaining a positive slope at least as large as the one seen in the data. For a one-tailed test like this, simply divide the reported P-value by 2.

\[ P_{1 tail} = \frac{0.171}{2} = 0.085 \]
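The same one-tailed value can be computed directly from the t-distribution (a quick check, using the t-statistic and df from above):

pt(1.505, df = 8, lower.tail = FALSE)   # about 0.085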

library(ggplot2)

data1 <- data.frame(packsize, homerange)

ggplot(data1, aes(x = packsize, y = homerange)) +
  geom_point(colour = 'red', size = 6) +
  geom_smooth(method = lm) +
  ylab("Home Range (km2)") +
  xlab("Pack Size (adults)")

The ggplot() function in the ggplot2 package provides an easy way to plot a regression with its 95% confidence limits.

The true regression line has a 95% chance of lying within the estimated 95% confidence interval.

(Figure: ggplot of home range vs. pack size with the fitted regression line and its 95% confidence band)

In this case, the 95% CI (grey) for the regression line (blue) includes slopes of zero (horizontal), so the slope does not differ from zero with \( \geq \) 95% confidence.

In other words, we can't have 95% confidence that home range size is related to pack size, although there is reasonable evidence that it is.

(Figure: the same regression plotted with a 90% confidence band)

Don't be a slave to the view that P = 0.049 is fundamentally different than P = 0.051.
The 90% confidence interval is plotted here. As you accept lower confidence, the interval gets narrower.

To alter the confidence level shown by ggplot() with geom_smooth(), set the level argument (full call below):

  • geom_smooth(method=lm, level = 0.90)
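For example, the earlier ggplot() call with a 90% interval (only the level argument changes):

ggplot(data1, aes(x = packsize, y = homerange)) +
  geom_point(colour = 'red', size = 6) +
  geom_smooth(method = lm, level = 0.90) +
  ylab("Home Range (km2)") +
  xlab("Pack Size (adults)")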

'Statistical' and 'Biological' Significance

The one-tailed P-value of 0.085 from the OLS regression gives the probability of obtaining a slope as steep as 1.61 by chance (if \( H_{0} \) were correct).

The P-value is from a t-test that compares the slope to its standard error. If the slope is small or its standard error is big, the t-statistic will be small, and the P value will indicate that such a result could easily occur by chance.

Assessing 'Biological' Significance

So a P-value inherently says something about how 'big' the slope is (relative to its standard error), but it does not directly say anything about the biological significance of the effect.

Even with a small P-value, the effect size (the magnitude of the slope) should be evaluated for ecological or biological importance. There are no hard and fast rules to evaluate biological significance. This requires an understanding of the species or system you are studying.

Plotting the regression line with the raw data is a good first step to evaluate biological significance.

Assessing Biological Significance

R code to plot the data and add the OLS regression line

plot(y = homerange, x = packsize, xlab = "Pack Size (adults)", ylab = "Home Range (km2)", col = 'red', pch = 19, cex = 2.5, cex.axis = 1.3, cex.lab = 1.3)

abline(mod1, col = 'red') # abline() plots the regression line using the output from lm() saved in mod1

(output plot is on the next slide)

Assessing Biological Significance

(Figure: home range vs. pack size with the OLS regression line added by abline())

The regression line spans a broad range of home range sizes. The largest \( \widehat{Y} \) values are about 50 \( km^{2} \), compared to the smallest \( \widehat{Y} \) values of about 30 \( km^{2} \).

An effect size this large seems biologically significant. Big packs are covering an area almost 20 \( km^{2} \) larger than small packs, roughly 65% larger.
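Both predicted values can be pulled directly from the fitted model (a quick check with predict() at the smallest and largest observed pack sizes):

predict(mod1, newdata = data.frame(packsize = c(6, 18)))
# about 30.4 and 49.8 km2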

Multiple Regression

What if we are concerned with the effect of more than one independent variable (X1 and X2) on Y?

Modify the data to include a new variable, percent vegetation cover within each home range:

vegcover <- c(40,31,44,52,46,60,71,83,83,86)
data2 <- data.frame(packsize, homerange, vegcover)

Multiple Regression

Fit a multiple regression by OLS:

\[ \widehat{Y_{i}} = B_{0} + B_{1}*X1_{i} + B_{2}*X2_{i} \]

\[ Y_{i} = \widehat{Y_{i}} + \epsilon_{i} \]

\[ Y_{i} = B_{0} + B_{1}*X1_{i} + B_{2}*X2_{i} + \epsilon_{i} \]

We're now fitting a plane rather than a line.

We could include any number of independent variables, although the fit becomes hard to visualize graphically.

The OLS regression uses the lm() function in the same way as before:

mod2 <- lm(homerange ~ packsize + vegcover)
mod2

Call:
lm(formula = homerange ~ packsize + vegcover)

Coefficients:
(Intercept)     packsize     vegcover  
      0.585       -0.725        0.777  

Note that the estimated effect of pack size on home range size has changed, even though the data for pack size and home range area did not change! The estimated effect of each predictor often depends on the other predictors that are/are not in the regression model.
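A quick way to see the change is to compare the fitted coefficient for pack size in the two models:

coef(mod1)["packsize"]   # about 1.61 in the simple regression
coef(mod2)["packsize"]   # about -0.73 once vegcover is included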

The output of summary(mod2) on the next slide can be interpreted the same way as before.

The adjusted coefficient of determination, listed as 'Adjusted R-squared', indicates that about 80.6% of the variation in home range size can be explained by the two predictors, pack size and vegetation cover.

The F-statistic at the bottom tests whether the combination of pack size and vegetation cover explains variation in home range size in a manner that is unlikely to occur by chance if the null hypothesis \( H_{0} \) is true. The P-value on the bottom line is for this F-test.


Call:
lm(formula = homerange ~ packsize + vegcover)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.237  -0.535   0.513   3.189   6.937 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    0.585      7.074    0.08    0.936   
packsize      -0.725      0.664   -1.09    0.311   
vegcover       0.777      0.144    5.40    0.001 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.56 on 7 degrees of freedom
Multiple R-squared:  0.849, Adjusted R-squared:  0.806 
F-statistic: 19.7 on 2 and 7 DF,  p-value: 0.00133

Plot the results - see next slide for some help with interpretation.

The approach shown here requires the package car to be installed and loaded.

library(car)
scatterplotMatrix(data2[, 1:3], diagonal = "boxplot", smooth = FALSE)

(Figure: scatterplot matrix of packsize, homerange, and vegcover, with boxplots on the diagonal)

The main diagonal shows a boxplot with the median, the 25th and 75th percentiles, and the range for each variable.

The other panels show the relationships between each pair of variables. Pack size is on the x-axis in the left three panels and on the y-axis in the top three panels. Home range is on the x-axis in the middle column and on the y-axis in the middle row. Vegetation cover is on the y-axis in the bottom three panels and on the x-axis in the right three panels.


This data set has a strong collinearity problem. The two independent variables (packsize and vegcover) are correlated with one another, which makes it much harder to evaluate their effects on home range size.

The regression coefficient for pack size was 1.61 in the simple regression above. In this multiple regression the coefficient for pack size is -0.725. Not only has the estimate changed, but the sign has switched.

There is no really good statistical solution to problems of collinearity. The problem is fundamentally with the data itself. If the two predictors are not independent of one another, you can't estimate their effects very well.

You should:

  1. Keep a close eye on the stability of the coefficient for a variable as other variables are added to the regression model

  2. Examine the correlations between the independent variables (see the code sketch after this list).

  3. Make cautious inferences when using data with obvious collinearities.
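For point 2, a minimal sketch of two standard checks: the correlation between the predictors, and the variance inflation factors from the car package (already loaded above for scatterplotMatrix()):

cor(packsize, vegcover)   # about 0.65 for these data
vif(mod2)                 # variance inflation factors quantify how much collinearity inflates each standard error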

Another way to visualize the results, using ggplot()

(Figure: ggplot of home range vs. pack size, with point size proportional to vegetation cover)

The value of vegetation cover determines the size of the points, so that all three variables can be considered at once.

The collinearity between pack size and vegetation cover results in big points tending to the right and small points tending to the left.
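The exact plotting code is not shown here, but a call along these lines (a sketch, not the original code, mapping vegcover to point size with aes(size = ...)) produces this kind of plot:

ggplot(data2, aes(x = packsize, y = homerange, size = vegcover)) +
  geom_point(colour = 'red') +
  ylab("Home Range (km2)") +
  xlab("Pack Size (adults)")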

What if we want to test for relationships other than straight lines?

Just alter the formula in the lm() function. To test a second-order polynomial:

mod.poly2 <- lm(homerange ~ poly(packsize, 2))
summary(mod.poly2)

(output on next slide)


Call:
lm(formula = homerange ~ poly(packsize, 2))

Residuals:
   Min     1Q Median     3Q    Max 
-12.23  -5.49  -1.70   4.34  18.16 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)            38.8        3.3   11.77  7.2e-06 ***
poly(packsize, 2)1     21.0       10.4    2.01    0.084 .  
poly(packsize, 2)2     28.2       10.4    2.71    0.030 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.4 on 7 degrees of freedom
Multiple R-squared:  0.619, Adjusted R-squared:  0.51 
F-statistic: 5.69 on 2 and 7 DF,  p-value: 0.0341

R Exercise Three extends this basic review of regression models by considering Generalized Linear Models or GLM. The glm() function accomplishes most of the same basic tasks as lm(), but it is more flexible.

With glm(family = gaussian) you will get exactly the same regression coefficients as lm(). This is worth doing at least once, to compare the presentation of output for lm() and glm().
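For example (a minimal sketch, refitting mod1 with glm()):

mod1.glm <- glm(homerange ~ packsize, family = gaussian)
coef(mod1)       # coefficients from lm()
coef(mod1.glm)   # the same coefficients from glm() with a gaussian family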

The lm() function assumes that the errors are normally distributed and that there is a linear 'link' between Y and X. glm() allows for other distributions and links. Three of the most important distributions (and their default link functions) are:

family = gaussian(link = "identity") - Same as OLS regression. Appropriate for normally distributed dependent variables

family = binomial(link = "logit") - Appropriate for dependent variables that are binomial such as survival (lived vs died) or occupancy (present vs absent)

family = poisson(link = "log") - Appropriate for dependent variables that are counts (integers only) such as group size

Brief glm() example to compare with lm()

mod4 <- glm(packsize ~ vegcover, family = poisson)
summary(mod4)

Call:
glm(formula = packsize ~ vegcover, family = poisson)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.064  -0.191   0.160   0.430   1.190  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.63533    0.33509    4.88  1.1e-06 ***
vegcover     0.01261    0.00501    2.52    0.012 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 15.0749  on 9  degrees of freedom
Residual deviance:  8.6606  on 8  degrees of freedom
AIC: 54.67

Number of Fisher Scoring iterations: 4
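Because the poisson family uses a log link, the coefficients are on the log scale. Exponentiating them gives multiplicative effects on the expected count (a quick follow-up sketch using the estimates above):

exp(coef(mod4))
# exp(0.01261) is about 1.013, so each additional percent of vegetation cover is
# associated with roughly a 1.3% increase in expected pack size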