---
title: "R Notebook 1 - variance-covariance"
output:
  html_document:
    df_print: paged
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

You can execute a code chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

We'll use the `mvtnorm` package that provides tools for working with multivariate normal distributions to generate random data for 2 variables. We'll do this using the `rmvnorm` function, which requires us to minimally tell the software: 

1. how many observations of each variable we want to create (*n*), 
2. the vector of means we want to use (as many means as we have variables), and
3. the covariance matrix for the variables (a square matrix with as many rows and columns as we have variables). Be sure to understand what the values in the matrix are (variances on the diagonal, covariances on the off-diagonal)

```{r}
require(mvtnorm)
true.means = c(0, 2.5)
true.sigma = matrix(c(1.5, -1.1, 
                     -1.1,  2), 
                    nrow = 2, 
                    byrow = TRUE)

# generate 100 random values for each of the 2 variables
data.rand <- rmvnorm(100, mean = true.means, sigma = true.sigma)
# store it as a dataframe with simple names "X1" and "X2" by default
data.rand <- data.frame(data.rand)

head(data.rand)
```
#### Check the means

```{r}
# calculate the means for the 2 variables
colMeans(data.rand)
```

#### Check the variance-covariance for the 2 variables

```{r}
var(data.rand)
```

#### Plot the 2 variables
```{r}
require(GGally) # for the ggpairs function
require(ggplot2) # GGally uses the 'ggplot2' plotting package
ggpairs(data.rand, )
```

#### Calculate the correlation between the 2 variables

First, let's do the calculation using the `cov2cor` function.
```{r}
cov2cor(var(data.rand))
```

Second, make the calculation using the formula: $corr(x_1,x_2) = \frac{cov(x_1,x_2)}{\sigma_{x_1}\cdot\sigma{_{x_2}}}$

```{r}
cov(data.rand$X1, data.rand$X2) / (sd(data.rand$X1) * sd(data.rand$X2))
```

### Questions to answer with the code above 

1. How well do the estimated quantities for the means, variances, and covariances match the true values? 
2. How many different versions of the randomly generated dataset do you think you'd need to look at to be sure? 
3. How does your answer change to question 1 change if you make the sample size 10 or 1000? 
4. How does the plot of "X1" vs "X2" change if you change the value in 'true.sigma' for the covariance between the 2 variables to be -1.5 (be sure to change it in both places and to re-execute the code)?
5. How did the correlation between "X1" and "X2" change when you changed the value in 'true.sigma' for the covariance between the 2 variables to be -1.5? 
6. Re-do questions 4 and 5, using a true covariance value of 0? 
7. Re-do questions 4 and 5, using a true covariance value of 1.5?
8. Use the code block below to explore situations where you have >2 variables. Notice that some can covary more strongly than others and that some might not covary at all. Be sure to think about what you're seeing in the plots. If you change any covariance values in 'true.3sigma', be sure to change it in both places where it occurs (mirror-imaging, e.g., if you change the value in row 2, column 1, you need to also change the value in row 1, column 2).

```{r}
require(mvtnorm)
true.3means = c(0, 2.5, -2)
true.3sigma = matrix(c(1.5, -1.1, 0, 
                      -1.1,  2,   0.5,
                       0,    0.5, 3), 
                    nrow = 3, 
                    byrow = TRUE)

# generate 100 random values for each of the 3 variables
data.3rand <- rmvnorm(100, mean = true.3means, sigma = true.3sigma)
# store it as a dataframe with simple names "X1", "X2", and "X3" by default
data.3rand <- data.frame(data.3rand)

var(data.3rand)
```

```{r}
ggpairs(data.3rand, )
```