---
title: "R Notebook 4 - linear models and logit link"
output:
  html_document:
    df_print: paged
---
```{r setup, include=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(knitr)
require(tidyverse)
require(ggpubr)
```

This notebook has you work with design matrices and $\hat\beta_i$ for a modest number of categorical covariates and 1 annual, continuous covariate. The goal is to review some key ideas of linear models and working with the logit link.  

### Create a simple data set

First, we'll create the underlying covariate values for a 7-year live-recaptures study where the covariates are: (1) year (indicator variables for each year: t1, t2, ..., t7), (2) sex (indicator: female = 0 and male = 1), and (3) river flow (a continuous variable). Here, covariate values are shared by many animals, such that we can present all covariate combinations in a data.frame with just 12 rows and 3 columns: rows 1-6 = values for males and rows 7-12 = values for females. 

```{r generate data, echo=FALSE}
# year is represented as continuous (T) & as a factor (t)
dat <- expand.grid(T = 1:4, male = factor(c(1, 0)))
dat$t <- factor(dat$T)
dat$flow <- rep(c(17.5, 10.7, 5.9, 11.0), 2)
dat
```

In the work below, we'll pretend that a 7-year live-recaptures study was done on many animals with those covariate combinations, which provided you with information on how many animals had each of the possible encounter histories. We'll also pretend that you ran the competing models below on the data and obtained the $\hat\beta_i$ provided below to you for each model. You don't need to worry about whether the model structures are for $\phi$ or for $p$: we'll just call it $p$ in a generic way, i.e., probability. The key here is to be sure you understand how to use covariates in a given model structure along with the $\hat\beta_i$ to (1) obtain estimates of (a) the log-odds and (b) probability for each covariate combination, and (2) plot the relationships between covariates and the estimates. 

#### Intercept-only model (complete pooling)

```{r, echo=FALSE}
model.matrix(~ 1, data = dat)
```

1. We'll refer this to model as "m0". Using $\beta_1 = 1.25$, calculate the $ln\frac{p}{(1-p)}$ and $p$ for model "m0" using the code block below. Look at the estimates.

```{r, echo=FALSE}
# name the output 'm0' and store the data
m0 <- dat
# add log-odds: '%*%' does matrix multiplication
# matrix multiply the design matrix and the beta-hat values to get log-odds(p)
m0$lo.p <- model.matrix(~ 1, data = dat) %*% 1.25
# back tranform with the inverse logit to obtain 'p-hats'
m0$p <- plogis(m0$lo.p)
m0
```

2. Plot the point estimates by year 

```{r, echo=FALSE}
# store a plot the log odds by year
pl.lo <- ggplot(m0, aes(x = T, y = lo.p)) + 
  geom_point() + geom_line(linetype = 2) + 
  ylim(-5, 5) + 
  xlab("year")

# store a plot the real parameter, i.e., the probability scale
pl.p <- ggplot(m0, aes(x = T, y = p)) + 
  geom_point() + geom_line(linetype = 2) +
  xlab("year") +
  ylim(0, 1)

# plot the 2 plots side by side using 'ggarange' from ggpubr pkg
ggarrange(pl.lo, pl.p, 
          ncol = 2, nrow = 1)
```

#### Years different (year as factor, i.e., no pattern imposed; groups pooled)

```{r, echo=FALSE}
model.matrix(~ t, data = dat)
```

3. We'll refer this to model as "mt". Using $\beta_i = 0.94, 0.71, -0.10, 0.50$, calculate the $ln\frac{p}{(1-p)}$ and $p$ for model "mt" using the code block below. Look at the estimates.

4. Use the design matrix and the $\beta_i$ to calculate each of the 4 values for $ln\frac{p_i}{(1-p_i)}$ on your own (R console, calculator, excel, etc). 

5. Backtransform your values of $ln\frac{p_i}{(1-p_i)}$ on your own to obtain estimates of $p$.

```{r, echo=FALSE}
# name the output 'mt' and store the data
mt <- dat
# add log-odds: '%*%' does matrix multiplication
# matrix multiply the design matrix and the beta-hat values to get log-odds(p)
mt$lo.p <- model.matrix(~ t, data = dat) %*% c(0.94, 0.71, -0.10, 0.50)
# back tranform with the inverse logit to obtain 'p-hats'
mt$p <- plogis(mt$lo.p)
mt
```

6. Plot the point estimates by year 

```{r, echo=FALSE}
# store a plot the log odds by year
pl.lo <- ggplot(mt, aes(x = T, y = lo.p)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(-5, 5) + 
  xlab("year")

# store a plot the real parameter, i.e., the probability scale
pl.p <- ggplot(mt, aes(x = T, y = p)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(0, 1) + 
  xlab("year")

# plot the 2 plots side by side using 'ggarange' from ggpubr pkg
ggarrange(pl.lo, pl.p, 
          ncol = 2, nrow = 1)
```

#### Year + Group (year as a factor; same group adjustment added each year)

```{r, echo=FALSE}
model.matrix(~ t + male, data = dat)
```

7. We'll refer this to model as "mt.pl.g". Using $\beta_i = 1.3, 0.71, -0.10, 0.50, -0.66$, calculate the $ln\frac{p}{(1-p)}$ and $p$ using the code block below. Look at the estimates. 

```{r, echo=FALSE}
# name the output 'mt' and store the data
mt.pl.g <- dat
# add log-odds: '%*%' does matrix multiplication
# matrix multiply the design matrix and the beta-hat values to get log-odds(p)
mt.pl.g$lo.p <- model.matrix(~ t + male, data = dat) %*% c(1.3, 0.71, -0.10, 0.50, -0.66)
# back tranform with the inverse logit to obtain 'p-hats'
mt.pl.g$p <- plogis(mt.pl.g$lo.p)
mt.pl.g
```

8. Try working out the values for $ln\frac{p}{(1-p)}$ and $p$ for year 2 for males and for year 4 for females without the code block to be sure you understand what's happening in the calculations. 

9. Plot the point estimates by year and group. An important thing to notice here is that the difference between the $ln\frac{p}{(1-p)}$ for females and males is the same in every year (additive effect), whereas the differences in annual estimates of $p$ are not identical due to back-transforming with the inverse-logit.

```{r, echo=FALSE}
# store a plot the log odds by year
pl.lo <- ggplot(mt.pl.g, aes(x = T, y = lo.p, color = male)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(-5, 5) + 
  xlab("year")

# store a plot the real parameter, i.e., the probability scale
pl.p <- ggplot(mt.pl.g, aes(x = T, y = p, color = male)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(0, 1) + 
  xlab("year")

# plot the 2 plots side by side using 'ggarange' from ggpubr pkg
ggarrange(pl.lo, pl.p, 
          ncol = 2, nrow = 1)
```

#### Year x Group model (year as a factor)

```{r, echo=FALSE}
model.matrix(~ t * male, data = dat)
```

```{r, echo=FALSE}
# name the output 'mt.x.g' and store the data
mt.x.g <- dat
# add log-odds: '%*%' does matrix multiplication
# matrix multiply the design matrix and the beta-hat values to get log-odds(p)
mt.x.g$lo.p <- model.matrix(~ t * male, data = dat) %*% c(1.3, 0.71, -0.10, 0.50, -0.66, -1.2, 2.4, -0.9)
# back tranform with the inverse logit to obtain 'p-hats'
mt.x.g$p <- plogis(mt.x.g$lo.p)
mt.x.g
```

10. Try working out the values for $ln\frac{p}{(1-p)}$ and $p$ for year 2 for males and for females without the code block to be sure you understand what's happening in the calculations.

11. Plot the point estimates by year and group. 

```{r, echo=FALSE}
# store a plot the log odds by year
pl.lo <- ggplot(mt.x.g, aes(x = T, y = lo.p, color = male)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(-5, 5) + 
  xlab("year")

# store a plot the real parameter, i.e., the probability scale
pl.p <- ggplot(mt.x.g, aes(x = T, y = p, color = male)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(0, 1) + 
  xlab("year")

# plot the 2 plots side by side using 'ggarange' from ggpubr pkg
ggarrange(pl.lo, pl.p, 
          ncol = 2, nrow = 1)
```

#### Trend across Years (year as numeric, i.e., pattern imposed; groups pooled)

```{r, echo=FALSE}
model.matrix(~ T, data = dat)
```

12. We'll refer this to model as "mT" as we're using "T", which is a continuous variable (1:4). Using $\beta_i = -0.2, 0.5$, calculate the $ln\frac{p}{(1-p)}$ and $p$ for model "mT" using the code block below. Look at the estimates. 

13. Be sure you know how to use the design matrix and the $\beta_i$ to calculate each of the 4 values for $ln\frac{p_i}{(1-p_i)}$ on your own. 

```{r, echo=FALSE}
# name the output 'mT' and store the data
mT <- dat
# add log-odds: '%*%' does matrix multiplication
# matrix multiply the design matrix and the beta-hat values to get log-odds(p)
mT$lo.p <- model.matrix(~ T, data = dat) %*% c(-0.2, 0.5)
# back tranform with the inverse logit to obtain 'p-hats'
mT$p <- plogis(mT$lo.p)
mT
```

14. Plot the point estimates by year and note that the linear trend (which we imposed using the continous version of year, "T" in the design matrix). Also, note that there is a slight curve to the values in the back-transformed values of 'p'.

```{r, echo=FALSE}
# store a plot the log odds by year
pl.lo <- ggplot(mT, aes(x = T, y = lo.p)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(-5, 5) + 
  xlab("year")

# store a plot the real parameter, i.e., the probability scale
pl.p <- ggplot(mT, aes(x = T, y = p)) + 
  geom_point() + geom_line(linetype = 2) +
  ylim(0, 1) + 
  xlab("year")

# plot the 2 plots side by side using 'ggarange' from ggpubr pkg
ggarrange(pl.lo, pl.p, 
          ncol = 2, nrow = 1)
```

15. Consider which of the models above has the most constraints and which is the most flexible, i.e., most general or unconstrained, and how that relates to which model might fit best and to model complexity versus simplicity. 

16. Finally, can you imagine a model that's even more flexible than any of the models above?