---
title: "R Notebook 2 - binomial distribution"
output:
  html_document:
    df_print: paged
---
```{r setup, include=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(ggplot2)
```

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

You can execute a code chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

The activities below have you use the functions `dbinom` and `rbinom`. You should look at the help for `dbinom` in RStudio (either by submitting `?dbinom` at the cursor in the console window or by searching on `dbinom` in the Help panel). There you'll see the list of arguments for each function. Be sure you understand what the arguments $x$, $p$, $size$, and $prob$ are in the context of the 2 functions.

#### Generate random values from the binomial distribution

You will begin working with the probability question: given a probability distribution with known parameters, how likely are various outcomes. Work with the code block below to explore how the distribution of outcomes changes as you change $p$ from 0.1 to 0.5 to 0.9, while $x = 0, 1, \ldots, 30$ and $size = 30$. 

1. How does the distribution of the pdf values vs $p$ change in terms of the following features?
  + central tendency 
  + spread

```{r binomial pdf, message=FALSE, warning=FALSE, out.width = "50%"}
x.vals <- 0:30
size <- 30
prob.succ = 0.1
df <- data.frame(x = x.vals, y = dbinom(x = x.vals, size = size, p = prob.succ))
ggplot(df, aes(x = x, y = y)) + 
  geom_bar(stat = "identity", col = "gray", fill = "gray") + 
  scale_y_continuous(expand = c(0.01, 0)) + 
  xlab("") + 
  ylab("Probability Density Function") + 
  scale_x_continuous("", limits = c(0, size), breaks = seq(0, size, by = 1)) +
  theme_bw()
```

#### Generate random values from the binomial distribution

In the code block below, repeatedly execute `rbinom(1, 25, 0.65)`. 

```{r}
rbinom(1, 25, 0.65)
```


2. What are the most typical values? 
3. How variable are the outcomes?

Alter the code to repeatedly execute `rbinom(1, 250, 0.65)`. 

4. What are the most typical values? 
5. How variable are the outcomes? 

It is more efficient to do many trials all at once and summarize their outputs to explore questions 2-5. Try working with the code block below to answer the questions.

```{r}
outcomes <- rbinom(100, 25, 0.65)
table(outcomes)
mean(outcomes)
```

#### Estimate the parameters of the binomial distribution

Here, we'll obtain data on outcomes (y: 1 = survive, 0 = died) for 40 individuals using `rbinom(40, 1, 0.7)` and then analyze them using the `glm` function. For many of you this might be new. Know that we will work more with this in the coming weeks and fill in missing details. For now, notice that we can 

* obtain the $ln\mathcal{L}$ score for the model (stored as `logLik` in the list of model outputs)
* Estimate the probability of success from a model that assumes that $p$ is the same for all individuals
* Estimate our uncertainty for $\hat{p}$

In the code block, we obtain the probability of success using the `predict` function using `type = "response"` and `se.fit=TRUE`. Note that the predictions are the same for all individuals in this intercept-only model. Thus, we only choose to look at the first few predicted values and their associated SE's.

```{r}
n = 40
y = rbinom(n, 1, 0.7)
y
# check the proportion of successes
sum(y) / n
m0 <- glm(y ~ 1, family = "binomial")
# view the log-likelihood score for the model
logLik(m0)
# request the predicted probability of success and the SEs for those predictions
#  and just look at the values for the 1st few observations as they're all the same
#  in this intercept-only model
predicted <- predict(m0, type = "response", se.fit = TRUE)
predicted$fit[1:4]
predicted$se.fit[1:4]
```
 
6. How closely does the value of $\hat{p}$ match the true parameter value you used to generate the data?
7. How closely does the value of $SE\hat{p}$ match the square-root of the value you obtain from the closed-form equation for the variance ( see notes on binomial distribution and assigned reading on likelihood if you don't remember that equation).
8. How repeatable are the values in the output as you repeat the study?
9. What happens to the precision of the estimates, i.e., $SE(\hat{p})$, if you decrease the sample size to 15?
10. What happens to the precision of the estimates, i.e., $SE(\hat{p})$, if you increase the sample size to 100?