You have manipulated the Parameter Index Matrices (PIMs) to set up models of interest based on material in Chapters 3 and 4 of CW. Now, you’ll move on to chapter 6 of CW to gain experience with slightly more complex model structures and with the logit link.
When you start a new analysis project in Program MARK, you indicate what type of data you’ll be working with and how many groups you’ll be entering data for. You will have 1 PIM for each parameter type for each group. For live-recaptures data, if you enter data for 2 groups (male and female dippers) for a 7-year study, you will have a PIM for \(\phi_{male}\), \(\phi_{female}\), \(p_{male}\), and \(p_{female}\). Each PIM for this data type will be triangular and, by default, initially be numbered as shown below, where each column of each PIM has its own unique number.
The values in the PIMs establish how many rows are in the design matrix: there will be 1 row for each unique value. With the PIMs numbered as above, you will have 24 rows in your design matrix: rows 1-6 for \(\phi_{male}\), rows 7-12 for \(\phi_{female}\), rows 13-18 for \(p_{male}\), and rows 19-24 for \(p_{female}\). You can now work with the design matrix to set up a linear model for each parameter as you wish. We’ve already worked through the identity matrix and how it can be used to run an intercept-only model for each parameter, which for the PIMs above would yield model \(\phi_{sex \cdot time}, p_{sex \cdot time}\). You can set up other models by altering how many columns are in the design matrix and what is entered into the various cells of the design matrix.
We will encounter covariates that have a variety of features: (1) categorical vs. continuous, (2) time varying vs. constant through time, (3) shared by many individuals vs. unique to an individual, etc. Here, we’ll begin with categorical variables that are treated as factors. Covariates that indicate which group an individual is in can be encoded in various ways. First, the input file might contain encounter histories for multiple groups such that a PIM for every parameter for the data type will be created for each group, as described above and in our previous day’s notes. An example of grouped data for males and females for a 5-year study appears below; each row lists the capture history first, then the number of animals in group 1 with that history, then the number of animals in group 2 with that history.
11001 14 20 ;
10110 12 43 ;
…
00010 15 32 ;
Grouping variables can also be entered as individual covariates as discussed in chapter 11 of CW and as you’ll see later in the semester. Continuous covariate values, e.g., body mass, can be entered as individual covariates in the input file. Continuous covariate values that provide information on annual conditions, like our snow pack example from the previous set of notes, can be entered directly into the design matrix as in the example below for a 5-year study where the 1st 4 rows of the design matrix are for \(\phi_1, \phi_2, \ldots, \phi_4\) and the last row is for \(p_.\).
\[ \left[\begin{array}{cc|c} 1 & 125 & 0\\ 1 & 254 & 0\\ 1 & 87 & 0\\ 1 & 180 & 0\\ \hline 0 & 0 & 1 \end{array}\right] \]
Continuous covariates on annual conditions can also be entered as individual covariates. For our snow pack example above in a 5-year study, you’d need to have 4 individual covariates in the input file: sp_yr1, sp_yr2, sp_yr3, and sp_yr4. Chapter 11 of CW presents helpful details on why/when you might prefer doing so; the bottom line for now is that the estimates come out the same either way, so there’s no need to worry about the choice yet. If you include individual covariates that take on continuous values in the input file, it’s typical to present data for just 1 animal on a given row because it’s unlikely that multiple animals have the exact same values, especially if there are multiple individual covariates involved. Below, you can see an example for the same scenario as the one above but with body mass data added as a new column and only 1 animal per row.
11001 1 0 18.25 ;
10110 0 1 13.76 ;
…
00010 0 1 20.42 ;
We’ll use R’s model.matrix function to quickly review some basic linear model structures that are important to understand. To keep things simple, we’ll just work with the portion of the design matrix that’s relevant for \(\phi_i\), and we’ll always assume that the PIM numbering has unique numbers in each column (see below). In each design matrix, the first 4 rows are for males and the last 4 rows are for females.
\[ {PIM}_{\phi_{male}} = \left[\begin{array}{cccc} 1 & 2 & 3 & 4\\ & 2 & 3 & 4\\ & & 3 & 4\\ & & & 4 \end{array}\right]\qquad {PIM}_{\phi_{female}} = \left[\begin{array}{cccc} 5 & 6 & 7 & 8\\ & 6 & 7 & 8\\ & & 7 & 8\\ & & & 8 \end{array}\right] \]
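For reference, these basic structures can be generated with model.matrix. Below is a minimal sketch; the data frame d is hypothetical, with one row per \(\phi\) parameter (4 years for males, then 4 years for females), and factor levels chosen so that year 1 and females are the reference values.

```r
# Hypothetical data frame: one row per phi parameter, males (rows 1-4)
# then females (rows 5-8); levels chosen so female and year 1 are references
d <- data.frame(
  year = factor(rep(1:4, times = 2)),
  sex  = factor(rep(c("male", "female"), each = 4),
                levels = c("female", "male"))
)
model.matrix(~ 1, d)           # phi(.)      : intercept-only
model.matrix(~ year, d)        # phi(t)      : year 1 is the reference
model.matrix(~ sex, d)         # phi(sex)    : "sexmale" plays the role of "male1"
model.matrix(~ year + sex, d)  # phi(t + sex): additive year and sex effects
```

R’s default treatment contrasts use the first factor level as the reference, which is why the levels argument is set explicitly; note that MARK’s default reference for time effects is the last year.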
In this model \(\phi\) is constrained to be the same value regardless of the year or group, i.e., all the data for \(\phi\) are pooled.
(Intercept) |
---|
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
Notice that year 1 is the reference value (intercept alone is used for the estimate) and that the values are the same for the male rows (1-4) and the female rows (5-8). In MARK, for time effects in the design matrix, the last year is the default reference value, which you can change under the File menu’s “Preferences” option if you like.
(Intercept) | t2 | t3 | t4 |
---|---|---|---|
1 | 0 | 0 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 |
1 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
1 | 0 | 0 | 1 |
Notice that females are the reference value (intercept alone is used for their estimate) and that all rows for a group are identical (years are pooled). Here, rather than use a variable name like “sex”, I use “male1”, which serves as a reminder as to which group is being coded (indicated) with a 1.
(Intercept) | male1 |
---|---|
1 | 1 |
1 | 1 |
1 | 1 |
1 | 1 |
1 | 0 |
1 | 0 |
1 | 0 |
1 | 0 |
Here, \(\phi\) is allowed to vary by year and there is a common adjustment for \(\phi_{male}\) across all 4 years. Females in year 1 are the reference group.
(Intercept) | t2 | t3 | t4 | male1 |
---|---|---|---|---|
1 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 1 |
1 | 0 | 0 | 1 | 1 |
1 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 |
Here, \(\phi\) is allowed to vary freely by year and to do so differently for each group. This model could be achieved with an 8 x 8 identity matrix (which is what you get if you just work with the PIMs and agree to use the identity matrix). Here, the model is set up in a way that you might have seen in a statistics class in an ANOVA presentation. Females in year 1 are the reference group. Notice that the “t2:male1” column is the product of the “t2” column and the “male1” column: the product takes on a value of 1 in the row where “t2” = 1 and “male1” = 1 and takes on a value of 0 in all the others.
(Intercept) | t2 | t3 | t4 | male1 | t2:male1 | t3:male1 | t4:male1 |
---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
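The same interaction coding comes out of model.matrix when you request year * sex; a sketch with a hypothetical data frame d:

```r
# Hypothetical data frame: rows 1-4 are males, rows 5-8 are females
d <- data.frame(
  year = factor(rep(1:4, times = 2)),
  sex  = factor(rep(c("male", "female"), each = 4),
                levels = c("female", "male"))
)
# phi(t * sex): main effects plus the year2:sexmale, year3:sexmale, and
# year4:sexmale product columns, matching the t2:male1, etc. columns above
X <- model.matrix(~ year * sex, d)
dim(X)  # 8 x 8: same dimension (and column space) as the identity matrix
```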
Here, \(\phi\) is allowed to vary by year but is constrained to follow a trend related to annual snow pack values, i.e., model \(\phi_{sp}\), across all 4 years. The intercept value is for a year with 0 snow. The rows for males and females are identical.
(Intercept) | sp |
---|---|
1 | 125 |
1 | 254 |
1 | 87 |
1 | 180 |
1 | 125 |
1 | 254 |
1 | 87 |
1 | 180 |
Here, \(\phi\) is allowed to vary by year but is constrained to follow a trend that’s related to annual snow pack values and snow pack values squared, i.e., a quadratic relationship.
(Intercept) | sp | I(sp^2) |
---|---|---|
1 | 125 | 15625 |
1 | 254 | 64516 |
1 | 87 | 7569 |
1 | 180 | 32400 |
1 | 125 | 15625 |
1 | 254 | 64516 |
1 | 87 | 7569 |
1 | 180 | 32400 |
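That quadratic design matrix can be generated with model.matrix and I(); a minimal sketch using the snow pack values shown:

```r
# Same snow pack value for males (rows 1-4) and females (rows 5-8)
d <- data.frame(sp = rep(c(125, 254, 87, 180), times = 2))
model.matrix(~ sp + I(sp^2), d)  # columns: (Intercept), sp, I(sp^2)
```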
Here, \(\phi\) is allowed to vary by year but is constrained to follow a trend related to annual snow pack values, i.e., model \(\phi_{sp}\), across all 4 years. The intercept value is for females in a year with a snow pack value of 0. The males have an intercept adjustment. The slope term associated with “sp” is constrained to be the same for males and females.
(Intercept) | male1 | sp |
---|---|---|
1 | 1 | 125 |
1 | 1 | 254 |
1 | 1 | 87 |
1 | 1 | 180 |
1 | 0 | 125 |
1 | 0 | 254 |
1 | 0 | 87 |
1 | 0 | 180 |
Here, \(\phi\) within each group is allowed to vary by year but is constrained to follow a trend that’s related to annual snow pack values. The intercept value is for females in a year with a snow pack value of 0. The males have an intercept adjustment and a slope adjustment such that the 2 groups are allowed to have different intercepts and slopes.
(Intercept) | male1 | sp | male1:sp |
---|---|---|---|
1 | 1 | 125 | 125 |
1 | 1 | 254 | 254 |
1 | 1 | 87 | 87 |
1 | 1 | 180 | 180 |
1 | 0 | 125 | 0 |
1 | 0 | 254 | 0 |
1 | 0 | 87 | 0 |
1 | 0 | 180 | 0 |
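model.matrix constructs the product column automatically when you request an interaction between the group factor and the covariate; a sketch with a hypothetical data frame d:

```r
# Hypothetical data frame: males (rows 1-4) then females (rows 5-8)
d <- data.frame(
  sex = factor(rep(c("male", "female"), each = 4),
               levels = c("female", "male")),
  sp  = rep(c(125, 254, 87, 180), times = 2)
)
model.matrix(~ sex + sp, d)       # common slope, separate intercepts
X <- model.matrix(~ sex * sp, d)  # separate intercepts and slopes
X[, "sexmale:sp"]                 # equals sp for male rows, 0 for female rows
```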
Quoting from page 6-11 of CW:
…we now need to consider … how can we use ‘regression models’ for analysis of mark-recapture data, since … survival and recapture are not ‘normal’ response variables – normal in the sense that they are both constrained to be values from 0→1? If you simply regressed ‘live = 1 / dead = 0’ or ‘seen = 1, not seen = 0’ on some set of explanatory variables, \(x\), it is quite conceivable that for some values of \(x\) the estimated value of survival or recapture would be >1 or <0 … the way around this problem is to transform the probability of survival or recapture, such that the transformed probabilities have been mapped from [0, 1] to [−∞, +∞].
The logit link function
As mentioned in the previous set of notes, we work with the logit link when we can’t use an identity matrix for a given model of interest. If we want to model \(\phi_t\) as a function of snow pack as \(\beta_0 + \beta_1 \cdot sp_t\), we use the following logistic regression equation.
\[\hat\phi_t = \frac{e^{(\hat\beta_0 + \hat\beta_1 \cdot {sp}_t)}}{1 + e^{(\hat\beta_0 + \hat\beta_1 \cdot {sp}_t)}} = \frac{1}{1 + e^{-(\hat\beta_0 + \hat\beta_1 \cdot {sp}_t)}} = \frac{1}{1 + e^{(-\hat\beta_0 - \hat\beta_1 \cdot {sp}_t)}}\]
The logit transformation of that regression equation is as follows.
\[logit(\phi_t) = ln \left(\frac{\phi_t}{(1-\phi_t)}\right) = \beta_0 + \beta_1 \cdot sp_t\]
The logit transformation allows us to model our real parameters of interest, e.g., \(\phi_t\) and \(p_t\), using linear models that include covariates of interest while also ensuring that the estimates of the real parameters remain within the [0, 1] bounds. It’s important to realize and understand that it’s \(logit(\phi_t)\), or equivalently the log-odds of \(\phi_t\), that’s being modeled as having a linear relationship with snow pack in the model above. The relationship between \(\phi_t\) and snow pack is not necessarily linear, as the graphics below illustrate. But first, a bit of background on what odds are and how one can transform from a survival rate to odds to log-odds and back.
If \(\phi_t = 0.8\), the odds of surviving = 4: \(\frac{0.8}{(1-0.8)} = \frac{0.8}{0.2}\). The log-odds of surviving is \(ln \left(\frac{0.8}{0.2}\right)\), which equals 1.3862944.
The inverse-logit function can be used to back transform from the log-odds to the real parameter of interest, \(\phi_t\) as follows.
\[\hat\phi_t = \frac{e^{1.386}}{1 + e^{1.386}} = 0.8\]
In R, you can get the log-odds for a given real parameter value using the qlogis function, e.g., qlogis(0.8), which yields 1.3862944. And, if you have the log-odds, e.g., from working with model output (the \(\hat\beta_i\)) and the model’s design matrix, you can obtain estimates of the real parameters using the plogis function, e.g., plogis(1.3862944), which yields 0.8.
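The probability-to-odds-to-log-odds round trip is easy to verify by hand; a quick sketch:

```r
phi <- 0.8
odds <- phi / (1 - phi)                      # 0.8 / 0.2 = 4
log_odds <- log(odds)                        # 1.3862944, same as qlogis(0.8)
back <- exp(log_odds) / (1 + exp(log_odds))  # inverse-logit done by hand
c(qlogis(phi), log_odds)                     # both 1.3862944
c(plogis(log_odds), back)                    # both 0.8
```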
The relationship between values on the log-odds scale and the back-transformed values on the real parameter scale is sigmoid, as shown below. Notice that the values -5, 0, and +5 on the log-odds axis translate to ~0, 0.5, and ~1, respectively, on the probability scale. Also, notice the symmetry whereby the - and + versions of a value on the log-odds scale translate to \(p\) and \(1-p\), respectively, on the probability scale, e.g., log-odds values of -2 and +2 yield probabilities of 0.1192029 and 0.8807971, respectively.
Your logistic regression equation, e.g., \(\hat{\beta}_0 + \hat{\beta}_{1}\cdot{sp_t}\), yields the log-odds of \(\hat\phi_t\), or \(logit(\hat\phi_t)\), and for this model that relationship is linear. When that equation is back-transformed to obtain \(\hat\phi_t\) using \(\hat\phi_t = \frac{e^{(\hat\beta_0 + \hat\beta_1 \cdot sp_t)}}{1 + e^{(\hat\beta_0 + \hat\beta_1 \cdot sp_t)}}\), that relationship is certainly non-linear over the range of covariate values that would cause \(\hat\phi_t\) to range from 0 to 1. But, the relationship might be fairly linear (or not) over the range of observed covariate values: it all depends on the specific problem. Fortunately, as you’re learning, MARK has useful tools for plotting the estimated relationship between covariates and the real parameters. And, it also has tools for working with the \(\hat\beta_i\) and the variance-covariance matrix for the \(\hat\beta_i\) to estimate the uncertainty for estimated values of the real parameters for different covariate combinations.
In the plots below, \(\hat\phi_t = \frac{e^{(\hat\beta_0 + \hat\beta_1 \cdot {sp_t})}}{1 + e^{(\hat\beta_0 + \hat\beta_1 \cdot {sp_t})}}\), where \(\hat\beta_0 = -10\) and \(\hat\beta_1 = 0.05\). Notice the following:
- the intercept value is for a snow pack value of 0, which is well below any of the observed values, which range from 87 to 254 (shaded in light blue),
- on the log-odds scale the relationship is linear, whereas the relationship between \(\hat\phi\) and snow pack is not (although it’s quite linear over some snow pack values, e.g., 175 to 225),
- snow pack values that yield a log-odds value of approximately -5 yield \(\hat\phi \approx 0\), those that yield a log-odds value of 0 yield \(\hat\phi = 0.5\), and those that yield a log-odds value of approximately +5 yield \(\hat\phi \approx 1\),
- values of \(\hat\phi\) range from 0 to 1 only if you move well beyond the range of observed snow pack values, i.e., in many logistic regression problems, you’ll only see part of the full sigmoid curve when plotting over the range of observed covariate values, which is not a problem, and
- you would typically only use the left-hand plots, i.e., plot over the range of observed covariate values, as that’s the inference space.
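The essentials of those plots can be reproduced with plogis, using the \(\hat\beta\) values given above; a sketch:

```r
b0 <- -10   # intercept: log-odds of survival at a snow pack value of 0
b1 <- 0.05  # slope: change in log-odds per unit of snow pack
sp <- 87:254                  # observed range of snow pack values
log_odds <- b0 + b1 * sp      # linear on the logit scale
phi_hat  <- plogis(log_odds)  # sigmoid on the probability scale
plogis(b0 + b1 * 200)         # a log-odds of 0 back-transforms to 0.5
range(phi_hat)                # only part of the 0-to-1 sigmoid is visited
```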
What you’re learning about using various covariate types brings much useful flexibility to how you can model data from mark-recapture studies. In particular, we started with the CJS model with time-varying real parameters with a likelihood such as the one below.
\[\mathcal{L}(\phi_1, p_2, \phi_2, p_3|R_1, {x_\omega})= \frac{R_1!}{\prod\limits_{\omega}x_\omega!}(\phi_1 p_2 \phi_2 p_3)^{x_{111}} (\phi_1 p_2 (1-\phi_2 p_3))^{x_{110}}(\phi_1(1-p_2)\phi_2 p_3)^{x_{101}} (\chi_1)^{x_{100}}\]
You then learned to run models that held either or both of the real parameters constant and to evaluate which of the 4 models was best supported by the data: (1) \(\phi_t,p_t\), (2) \(\phi_.,p_t\), (3) \(\phi_t,p_.\), and (4) \(\phi_.,p_.\).
Now, you can use logistic regression to evaluate support from the data for other models, e.g., \(\phi_{snow}, p_{sex}\). To do so, you conceive of the linear model for each of the real parameters, number the PIMs accordingly, call up a design matrix with the right number of columns, fill in the appropriate values in the design matrix, and run the model. Repeat for other models and compare the levels of support for the various models. For the model \(\phi_{snow}, p_{sex}\), you’d want to achieve the following.
\[\hat\phi_t = \frac{e^{(\hat\beta_1 + \hat\beta_2 \cdot {sp_t})}}{1 + e^{(\hat\beta_1 + \hat\beta_2 \cdot {sp_t})}}; \qquad \hat p_{sex} = \frac{e^{(\hat\beta_3 + \hat\beta_4 \cdot {male1})}}{1 + e^{(\hat\beta_3 + \hat\beta_4 \cdot {male1})}}\]
For a 5-year study, you could number the PIM for \(\phi_{male}\) as
\[ \begin{matrix} 1 & 2 & 3 & 4\\ & 2 & 3 & 4\\ & & 3 & 4\\ & & & 4\\ \end{matrix} \] The PIM for \(\phi_{female}\) would be
\[ \begin{matrix} 5 & 6 & 7 & 8\\ & 6 & 7 & 8\\ & & 7 & 8\\ & & & 8\\ \end{matrix} \] And, because you have no time variation in the model for \(p\), you could simplify the PIMs such that the males would have
\[ \begin{matrix} 9 & 9 & 9 & 9\\ & 9 & 9 & 9\\ & & 9 & 9\\ & & & 9\\ \end{matrix} \]
and the female PIM for \(p\) would have
\[ \begin{matrix} 10 & 10 & 10 & 10\\ & 10 & 10 & 10\\ & & 10 & 10\\ & & & 10\\ \end{matrix} \]
From there, you’d need a design matrix with 10 rows, where the rows 1-4 are for \(\phi_{t_{male}}\), rows 5-8 are for \(\phi_{t_{female}}\), row 9 is for \(p_{male}\), and row 10 is for \(p_{female}\). There are 4 columns, which means 4 \(\beta_i\) will be estimated. The design matrix would be filled in as follows.
\[ \left[\begin{array}{cc|cc} 1 & 125 & 0 & 0\\ 1 & 254 & 0 & 0\\ 1 & 87 & 0 & 0\\ 1 & 180 & 0 & 0\\ 1 & 125 & 0 & 0\\ 1 & 254 & 0 & 0\\ 1 & 87 & 0 & 0\\ 1 & 180 & 0 & 0\\\hline 0 & 0 & 1 & 1\\ 0 & 0 & 1 & 0 \end{array}\right] \]
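The same 10 x 4 design matrix can be assembled in R as a check on what you enter in MARK; a sketch using the snow pack values from the notes (column names are just labels for readability):

```r
sp <- c(125, 254, 87, 180)
# rows 1-8: phi rows (males then females) share one intercept and one sp slope
phi_rows <- cbind(phi_int = 1, sp = rep(sp, times = 2), p_int = 0, male1 = 0)
# rows 9-10: p for males (male1 = 1) and for females (male1 = 0)
p_rows <- rbind(c(0, 0, 1, 1),
                c(0, 0, 1, 0))
dm <- rbind(phi_rows, p_rows)
dm  # 10 rows, 4 columns, so 4 betas are estimated
```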
The R Notebook for today has you work with the linear models to make sure you see what the relationships look like between the response variable and the covariates in each model. The homework will give you experience implementing the ideas for this week in Program MARK.