Scott Creel

31 Aug 14

Conservation Biology

BIOE 440R & BIOE 521


Regression is a tool to test for a relationship between continuous variables. Starting with a straight-line relationship between two variables:

\[ \widehat{Y_{i}} = B_{0} + B_{1}*X_{i} \]

\[ Y_{i} = \widehat{Y_{i}} + \epsilon_{i} \]

\[ Y_{i} = B_{0} + B_{1}*X_{i} +\epsilon_{i} \]

The estimated **regression coefficients** are

**y-intercept = \( B_{0} \) or \( \widehat{\beta_{0}} \)**, **slope = \( B_{1} \) or \( \widehat{\beta_{1}} \)**

for the best fitting line that relates Y to X.

\( \beta_{0} \) and \( \beta_{1} \) are the true intercept and slope for the entire population.

\( B_{0} \) and \( B_{1} \) are the estimated intercept and slope from a sample of that population.

The least squares estimates of the regression coefficients yield

\[ \min(\sum_{i=1}^{N}\epsilon_i^{2}) \]

That is, with the OLS estimates \( \widehat{\beta_{0}} \) and \( \widehat{\beta_{1}} \), the sum of the squared residuals, \( \sum_{i=1}^{N}\epsilon_i^{2} \), is as small as possible.

For each point, the residual error ('residual') \( \epsilon_{i} \) is the difference between the observed value of Y and the value \( \widehat{Y_{i}} \) predicted by the regression.

If all of the points fell right on the line, \( \sum_{i=1}^{N}\epsilon_i^{2} \) would be zero.
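As a quick sketch of this property (using the pack size and home range data from the worked example that follows), the residual sum of squares at the OLS estimates can be compared with the sum at perturbed coefficients; any perturbation makes it larger:

```r
# Sketch: the OLS estimates minimize the residual sum of squares (RSS).
# Data are from the worked example below (pack size; home range in km2).
x <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
y <- c(25, 20, 28, 36, 22, 46, 52, 41, 59, 59)

rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

b <- coef(lm(y ~ x))       # OLS estimates of the intercept and slope
rss(b[1], b[2])            # RSS at the OLS estimates
rss(b[1], b[2] + 0.5)      # perturbing the slope increases the RSS
rss(b[1] + 1, b[2])        # so does perturbing the intercept
```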

A worked example with R code.

First, input some simple data with two continuous variables.

```
packsize <- c(9,8,10,6,13,10,6,15,18,17)
#pack size as number of adults
homerange <- c(25,20,28,36,22,46,52,41,59,59)
#home range size in km2
```

Hypothesize that home range size depends on pack size.

Plot the data for an initial evaluation

```
plot(y = homerange, x = packsize, xlab = "Pack Size (adults)", ylab = "Home Range (km2)",
col = 'red', pch = 19, cex = 2.5, cex.axis = 1.3, cex.lab = 1.3)
# col = 'red' sets the point color, pch sets the point type
# cex sets the point size, cex.axis and cex.lab set the size of axis tickmarks and labels
```

(*the plot from this code is on the next slide*)

There is a positive relationship, but is it strong enough to be meaningful?

OLS Regression tests how likely it is to get data with the observed relationship, if pack size does not actually affect home range size in a causal manner.

Formally, the OLS regression tests the hypothesis:

\[ H_{0}: \beta_{1} = 0 \]

\[ H_{A}: \beta_{1} \neq 0 \]

Using a t-test:

\[ t = \frac{B_{1}}{SE_{B_{1}}} \]

Fit a linear OLS regression to the data and display the coefficients:

```
mod1 <- lm(homerange ~ packsize)
mod1
```

```
Call:
lm(formula = homerange ~ packsize)
Coefficients:
(Intercept) packsize
20.75 1.61
```

homerange = 20.75 + 1.61 * packsize

Is an increase of 1.61 km2 per pack member significant?

Use the summary() function to test 'statistical significance'

```
summary(mod1)
```

```
Call:
lm(formula = homerange ~ packsize)
Residuals:
Min 1Q Median 3Q Max
-19.702 -9.907 0.828 9.212 21.583
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.75 12.78 1.62 0.14
packsize 1.61 1.07 1.50 0.17
Residual standard error: 14 on 8 degrees of freedom
Multiple R-squared: 0.221, Adjusted R-squared: 0.123
F-statistic: 2.26 on 1 and 8 DF, p-value: 0.171
```

In this example, the estimated slope is 1.612 and its standard error is 1.071, so the t-statistic is 1.612/1.071 = 1.505.

There are 8 degrees of freedom (10 points - 2 parameters estimated = 8). With the t-statistic and df, we can determine the probability of getting a slope this steep by chance (if \( H_{0} \) is true), which is 0.171 or 17.1%.

In other words, the 'P-value' is 0.171. This is labelled Pr(>|t|) in the summary table, because it is the probability of getting an absolute value of t larger than the calculated value, by chance, if the null hypothesis were actually correct.
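The arithmetic behind the summary table can be reproduced directly from the coefficient table and the t distribution. A minimal sketch, refitting the model from the data above:

```r
# Reproduce the t-statistic and two-tailed P-value for the slope by hand.
packsize  <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
homerange <- c(25, 20, 28, 36, 22, 46, 52, 41, 59, 59)
mod1 <- lm(homerange ~ packsize)

ctab <- coef(summary(mod1))            # the coefficient table
tval <- ctab["packsize", "Estimate"] / ctab["packsize", "Std. Error"]
pval <- 2 * pt(-abs(tval), df = df.residual(mod1))
round(c(t = tval, p = pval), 3)        # about t = 1.505, p = 0.171
```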

This is a **two-tailed** or **nondirectional** test, because either a positive or a negative slope would be considered reason to reject the null hypothesis \( H_{0} \) and accept the alternate hypothesis \( H_{A} \).

In our example, the logic of the hypothesis is inherently **directional**.

\[ H_{0}: \beta_{1} \leq 0 \]

\[ H_{A}: \beta_{1} > 0 \]

So the proper P-value is the probability, under \( H_{0} \), of a **positive** slope as steep as the one seen in the data. For a **one-tailed** test like this, simply divide the reported P-value by 2.

\[ P_{1 tail} = \frac{0.171}{2} = 0.085 \]
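The same one-tailed probability can be obtained directly from the upper tail of the t distribution, using the t-statistic and df reported above:

```r
# One-tailed P-value for a positive slope: upper tail of t with 8 df.
p_one <- pt(1.505, df = 8, lower.tail = FALSE)
round(p_one, 3)   # about 0.085, half the two-tailed P-value
```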

```
library(ggplot2)
data1 <- data.frame(packsize, homerange)
ggplot(data1, aes(x=packsize, y=homerange)) +
geom_point(colour = 'red', size = 6) +
geom_smooth(method=lm) +
ylab("Home Range (km2)") +
xlab("Pack Size (adults)")
```

The ggplot() function in the ggplot2 package provides an easy way to plot a regression with its 95% confidence limits.

The true regression line has a 95% chance of lying within the estimated 95% confidence interval.

In this case, the 95% CI (grey) for the regression line (blue) includes slopes of zero (horizontal) so the slope does not differ from zero with \( \geq \) 95% confidence.

In other words, we can't have 95% confidence that home range size is related to pack size, although there is reasonable evidence that it is.

Don't be a slave to the view that P = 0.049 is **fundamentally different** than P = 0.051.

The 90% confidence interval is plotted here. As you accept lower confidence, the interval gets narrower.

To alter the confidence level shown by the *ggplot()* with *geom_smooth()* –

- geom_smooth(method=lm, level = 0.90)

The P-value from OLS regression of 0.085 gives the probability of obtaining a slope as steep as 1.61 by chance (if \( H_{0} \) was correct).

The P-value is from a t-test that compares the slope to its standard error. If the **slope is small** or its **standard error is big**, the **t-statistic will be small**, and the P value will indicate that such a result could easily occur by chance.

So a P-value inherently says something about how 'big' the slope is (relative to its standard error), but it does not directly say anything about the **biological significance** of the effect.

Even with a small P-value, the **effect size** (the magnitude of the slope) should be evaluated for ecological or biological importance. There are no hard and fast rules to evaluate biological significance. This requires an understanding of the species or system you are studying.

Plotting the regression line with the raw data is a good first step to evaluate biological significance.

R code to plot the data and add the OLS regression line

```
plot(y = homerange, x = packsize, xlab = "Pack Size (adults)", ylab = "Home Range (km2)", col = 'red', pch = 19, cex = 2.5, cex.axis = 1.3, cex.lab = 1.3)
abline(mod1, col = 'red') # abline() plots the regression line using the output from lm() saved in mod1
```

(output plot is on the next slide)

The regression line includes a broad span of home range sizes. The biggest \( \widehat{Y} \) values are about 50 \( km^{2} \), compared to the smallest \( \widehat{Y} \) values of about 30 \( km^{2} \).

An **effect size** this large seems biologically significant. Big packs are covering an area almost 20 \( km^{2} \) larger than small packs, or roughly 67% larger.
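One way to put numbers on this effect size is to use predict() at the smallest and largest observed pack sizes (a sketch, refitting the model from the data above):

```r
# Fitted home range at the smallest (6) and largest (18) observed pack sizes.
packsize  <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
homerange <- c(25, 20, 28, 36, 22, 46, 52, 41, 59, 59)
mod1 <- lm(homerange ~ packsize)
predict(mod1, newdata = data.frame(packsize = c(6, 18)))
# roughly 30.4 km2 at 6 adults and 49.8 km2 at 18 adults
```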

What if we are concerned with the effect of more than one independent variable (X1 and X2) on Y?

Modify the data to include a new variable, percent vegetation cover within each home range:

```
vegcover <- c(40,31,44,52,46,60,71,83,83,86)
data2 <- data.frame(packsize, homerange, vegcover)
```

Fit a multiple regression by OLS:

\[ \widehat{Y_{i}} = B_{0} + B_{1}*X1_{i} + B_{2}*X2_{i} \]

\[ Y_{i} = \widehat{Y_{i}} + \epsilon_{i} \]

\[ Y_{i} = B_{0} + B_{1}*X1_{i} + B_{2}*X2_{i} + \epsilon_{i} \]

We're now fitting a plane rather than a line.

We could include any number of independent variables, although the fit becomes hard to visualize graphically.

The OLS regression uses the *lm()* function in the same way as before:

```
mod2 <- lm(homerange ~ packsize + vegcover)
mod2
```

```
Call:
lm(formula = homerange ~ packsize + vegcover)
Coefficients:
(Intercept) packsize vegcover
0.585 -0.725 0.777
```

Note that the estimated effect of pack size on home range size has changed, even though the data for pack size and home range area did not change! **The estimated effect of each predictor often depends on the other predictors that are/are not in the regression model.**

The output of summary(mod2) on the next slide can be interpreted the same way as before.

The **coefficient of determination** is listed as 'adjusted R-squared' and indicates that 80.6% of the variation in home range size can be explained by the two predictors, pack size and vegetation cover.

The **F-statistic** at the bottom tests whether the combination of pack size and vegetation cover explains variation in home range size in a manner that is unlikely to occur by chance if the null hypothesis \( H_{0} \) is true. The P-value on the bottom line is for this F-test.
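The F-statistic on the bottom line can also be reproduced from the multiple R-squared, using \( F = \frac{R^2/k}{(1-R^2)/(n-k-1)} \) with \( k = 2 \) predictors and \( n = 10 \) points (a sketch, refitting mod2 from the data above):

```r
# Recover the reported F-statistic from the multiple R-squared.
packsize  <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
homerange <- c(25, 20, 28, 36, 22, 46, 52, 41, 59, 59)
vegcover  <- c(40, 31, 44, 52, 46, 60, 71, 83, 83, 86)
mod2 <- lm(homerange ~ packsize + vegcover)
s <- summary(mod2)
f_manual <- (s$r.squared / 2) / ((1 - s$r.squared) / 7)   # k = 2, n - k - 1 = 7
c(manual = f_manual, reported = unname(s$fstatistic["value"]))  # both about 19.7
```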

```
Call:
lm(formula = homerange ~ packsize + vegcover)
Residuals:
Min 1Q Median 3Q Max
-13.237 -0.535 0.513 3.189 6.937
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.585 7.074 0.08 0.936
packsize -0.725 0.664 -1.09 0.311
vegcover 0.777 0.144 5.40 0.001 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.56 on 7 degrees of freedom
Multiple R-squared: 0.849, Adjusted R-squared: 0.806
F-statistic: 19.7 on 2 and 7 DF, p-value: 0.00133
```

Plot the results - see next slide for some help with interpretation.

The approach shown here requires the package **car** to be installed and loaded.

```
library(car)
scatterplotMatrix(data2[,1:3], diagonal="boxplot",smooth=FALSE)
```

The main diagonal shows a boxplot with the median, 25th and 75th percentiles, and range for each variable.

The other panels show the relationships between each pair of variables. Pack size is on the x-axis for the **left 3 panels** and on the y-axis for the **top 3 panels**. Home range is on the middle 3 panels each way. Vegetation cover is on the y-axis for the **bottom 3 panels** and the x-axis for the **right 3 panels**.

This data set has a strong **collinearity** problem. The two independent variables (packsize and vegcover) are correlated with one another, which makes it much harder to evaluate their effects on home range size.

The regression coefficient for pack size was 1.61 in the simple regression above. In this multiple regression the coefficient for pack size is -0.725. Not only has the estimate changed, but the sign has switched.

There is no really good statistical solution to problems of collinearity. The problem is fundamentally with the data itself. If the two predictors are not independent of one another, you can't estimate their effects very well.

You should:

- Keep a close eye on the stability of the coefficient for a variable as other variables are added to the regression model.
- Examine the correlations between the independent variables.
- Make cautious inferences when using data with obvious collinearities.
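Checking the correlation between the two predictors takes one line (a sketch with the example data):

```r
# Correlation between the two predictors in the example data.
packsize <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
vegcover <- c(40, 31, 44, 52, 46, 60, 71, 83, 83, 86)
cor(packsize, vegcover)   # about 0.65, a substantial collinearity
```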

Another way to visualize the results, using *ggplot()*

The value of vegetation cover determines the size of the points, so that all three variables can be considered at once.

The collinearity between pack size and vegetation cover results in big points tending to the right and small points tending to the left.
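The plot described above was presumably built along these lines (a sketch, assuming ggplot2 is available; the exact aesthetics used in the slides may have differed):

```r
library(ggplot2)
packsize  <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
homerange <- c(25, 20, 28, 36, 22, 46, 52, 41, 59, 59)
vegcover  <- c(40, 31, 44, 52, 46, 60, 71, 83, 83, 86)
data2 <- data.frame(packsize, homerange, vegcover)
# Map vegetation cover to point size so all three variables show at once.
ggplot(data2, aes(x = packsize, y = homerange, size = vegcover)) +
  geom_point(colour = 'red') +
  xlab("Pack Size (adults)") +
  ylab("Home Range (km2)")
```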

What if we want to test for relationships other than straight lines?

Just alter the equation in the lm() function. To test a second-order polynomial

```
mod.poly2 <- lm(homerange ~ poly(packsize, 2))
summary(mod.poly2)
```

(output on next slide)

```
Call:
lm(formula = homerange ~ poly(packsize, 2))
Residuals:
Min 1Q Median 3Q Max
-12.23 -5.49 -1.70 4.34 18.16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.8 3.3 11.77 7.2e-06 ***
poly(packsize, 2)1 21.0 10.4 2.01 0.084 .
poly(packsize, 2)2 28.2 10.4 2.71 0.030 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.4 on 7 degrees of freedom
Multiple R-squared: 0.619, Adjusted R-squared: 0.51
F-statistic: 5.69 on 2 and 7 DF, p-value: 0.0341
```
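Note that abline() only draws straight lines, so plotting this fitted curve requires predict() over a grid of pack sizes (a sketch, refitting the model from the example data):

```r
# Plot the data and overlay the fitted second-order polynomial.
packsize  <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
homerange <- c(25, 20, 28, 36, 22, 46, 52, 41, 59, 59)
mod.poly2 <- lm(homerange ~ poly(packsize, 2))

plot(homerange ~ packsize, col = 'red', pch = 19,
     xlab = "Pack Size (adults)", ylab = "Home Range (km2)")
# Predict over a fine grid of pack sizes and draw the fitted curve.
xx <- seq(min(packsize), max(packsize), length.out = 100)
lines(xx, predict(mod.poly2, newdata = data.frame(packsize = xx)), col = 'red')
```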

R Exercise Three extends this basic review of regression models by considering **Generalized Linear Models** or **GLM**. The **glm()** function accomplishes most of the same basic tasks as lm(), but it is more flexible.

With glm(family = gaussian) you will get exactly the same regression coefficients as lm(). This is worth doing at least once, to compare the presentation of output for lm() and glm().
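A sketch of that comparison with the example data:

```r
# lm() and glm(family = gaussian) give identical coefficient estimates.
packsize  <- c(9, 8, 10, 6, 13, 10, 6, 15, 18, 17)
homerange <- c(25, 20, 28, 36, 22, 46, 52, 41, 59, 59)
m_lm  <- lm(homerange ~ packsize)
m_glm <- glm(homerange ~ packsize, family = gaussian)
all.equal(coef(m_lm), coef(m_glm))   # TRUE
```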

The **lm()** function assumes that the data are normally distributed and there is a linear 'link' between Y and X. **glm()** allows for other distributions and links. Three of the most important distributions (and their default link functions) are:

**family = gaussian(link = “identity”)** - Same as OLS regression. Appropriate for normally distributed dependent variables

**family = binomial(link = “logit”)** - Appropriate for dependent variables that are binomial such as survival (lived vs died) or occupancy (present vs absent)

**family = poisson(link = “log”)** - Appropriate for dependent variables that are counts (integers only) such as group size

Brief glm() example to compare with lm()

```
mod4 <- glm(packsize ~ vegcover, family = poisson)
summary(mod4)
```

```
Call:
glm(formula = packsize ~ vegcover, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.064 -0.191 0.160 0.430 1.190
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.63533 0.33509 4.88 1.1e-06 ***
vegcover 0.01261 0.00501 2.52 0.012 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 15.0749 on 9 degrees of freedom
Residual deviance: 8.6606 on 8 degrees of freedom
AIC: 54.67
Number of Fisher Scoring iterations: 4
```