R and RStudio - Getting Started

Ecology and conservation biology are highly quantitative fields of science, and to be a good ecologist you must be skilled at working with data, conducting statistical analyses and presenting the results in an appropriate manner. The good news is that statistical methods and the associated software to do this are well-developed. The bad news is … the statistical methods and associated software are well developed, so the learning curve is steep.

Most ecologists now use the free software package R for data manipulation, statistical analysis, graphics and simulation modelling. R has several advantages:

1. It is free. Until R emerged, it was common to spend thousands of dollars on statistical software.

2. It is platform-independent, meaning that it will run in the same manner on a Mac, a PC running Windows, or a PC running Linux. This makes it simple to share files for analysis with colleagues (and students): if you email a file to another person, they can run it and replicate EXACTLY what you did.

3. It is not just a statistical/graphics package, it is a programming language, based on the S language (sold commercially as S-PLUS), which is partly where R got its name. Consequently, there is very little that you cannot do with R once you know the language. For example, in this course you’ll use R to manipulate and plot data, conduct statistical tests, implement highly specific models developed for ecologists (such as mark-recapture models) and construct and run fairly complex stochastic simulations of population dynamics.

4. It is code-driven rather than menu-driven. Many of you will be familiar with using menu-clicks to work with data (either to produce graphs or run statistical tests), for example in MS Excel. There is nothing wrong with this, but once a complicated, menu-driven analysis with many steps has been completed, it can be virtually impossible to replicate without detailed notes, even for the person who did it. In R, you write lines of code (a “script”) to open a data file, process it, conduct analysis and make plots. Once the script is saved, it is a permanent record of EXACTLY how the analysis was done. In practice, this is very useful because it allows (or perhaps forces!) you to:

write methods up for publication with no ambiguity
modify analyses
recognize and consider the assumptions of the analysis you’ve conducted, and better understand the results
replicate analyses with other data sets

5. Because of points 1 - 4, there is a lot of pre-existing R code available in two forms:

Packages are self-contained extensions of R that you install once and then load when you want to make use of the functions that they provide. (As an example, I used the R package markdown to create the formatted html files with R examples for this class, including the one you’re reading now.)
Example Code can often be found – more on this below. Starting from a working example and modifying the code is often the easiest way to solve a problem that is new to you, but not to others. Once you have a set of R scripts of your own, you’ll often cut and paste code then make a few modifications, rather than starting from scratch.

6. For ecologists in particular, most newly-developed methods are implemented in R and provided as a package with documentation. In recent years, using these packages has usually been the most efficient and effective way to apply recently-developed ecological methods. This class will require the use of several such packages:

unmarked to estimate population density with distance sampling models that account for factors that affect the density of a species and the probability of detecting it.
RMark to estimate survival rates and the factors that affect them, using capture-mark-recapture methods. Using RMark is a very time-efficient way to implement models that are otherwise most readily available in the FORTRAN program MARK, which can be quite difficult to master.
popbio to derive information on population growth and the factors that affect it, and to build simulation models that estimate extinction risk (population viability analysis)
more general tools such as the glm() function (part of base R) or the lme4 package, which can accomplish complicated but appropriate statistical analyses such as generalized linear mixed-effects models. This is not a STAT class but it’s impossible to conduct research in ecology without getting into statistics.
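To make point 4 a little more concrete, here is a rough sketch of what a very short script looks like. The counts and the variable names (counts.2019, counts.2020, change) are invented for illustration and are not course data.

```r
# Step 1: enter the data (invented counts from four hypothetical sites)
counts.2019 <- c(12, 15, 9, 14)
counts.2020 <- c(10, 11, 8, 12)

# Step 2: process the data (change in the count at each site)
change <- counts.2020 - counts.2019

# Step 3: analyze (average change across the four sites)
mean.change <- mean(change)
mean.change

# Step 4: plot one year against the other
plot(counts.2019, counts.2020)
```

Every time the saved script is run, it repeats the same four steps in the same order, which is what makes the analysis easy to replicate or modify.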

Along with these advantages are two principal disadvantages:

1. R has a steeper learning curve than most menu-driven software, especially at the outset, and particularly if you’ve never written computer code before.

2. As with any computer language, R does EXACTLY what you tell it to do, so minor errors of logic, syntax, spelling and capitalization will keep an otherwise functional script from running. Yuo can raed this but R cant.
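As a small illustration of the second point, the sketch below stores a value under one name and then asks for it with a capitalization error. The variable name lion.count is made up, and tryCatch() is used here only so that the error message is captured and the rest of the script keeps running.

```r
# store a value in a variable with an all-lowercase name
lion.count <- 12

# Same name with a capital L: R treats this as a different, undefined object.
# tryCatch() captures the error message instead of stopping the script.
typo.message <- tryCatch(
  Lion.count + 1,
  error = function(e) conditionMessage(e)
)
typo.message

# the correctly spelled name works as expected
lion.count + 1
```

The message stored in typo.message should name Lion.count as the object R could not find, which is usually the quickest clue that a name has been mistyped.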

Download and Install R

All computers in MSU labs have R (and R Studio) installed. Most of you will want to use these on your own computer, and as mentioned above, R will run on a Mac (running OS X) or a PC (running either Windows or Linux). If you haven’t already, you’ll have to download and install R. R is available from CRAN, the Comprehensive R Archive Network. Click the link, then at CRAN select the appropriate version for your computer from the top box, and follow instructions to download and install it.

Download and Install R Studio

I use R Studio to run R, and this class will assume that you’re using it, because the authors of RFDS are two of the main developers of R Studio (and the ‘tidyverse’ R packages). Click the link to go to the R Studio site, then click the download link and follow instructions to install it. Once installed, you just start R Studio to run R. I prefer R Studio because it provides four windows that help you organize your work, avoid errors, and work efficiently.

Top left window - source code editor, where you write a script and save it. Unlike the graphical user interface (or GUI) in R, this editing window:

uses color coding to distinguish numbers (blue), executable code (black), and comments (green - comments are text that is not executable code, and is identified by putting a # at the start of a line).
has auto-completion of inherently paired items like parentheses, square brackets and quotes, so it is less likely that you will forget to ‘close’ these properly, which is a common error. It also shows you (with gray highlighting) the pairing of these items (incorrect use of parentheses is also a common error in R scripts). Even with auto-completion, you have to be careful about pairing.
has a tab-completion tool. One of the most common errors in an R script is a typo in the name of a variable. R is completely literal, so a typo in the name of a variable or a function means it is simply not recognized. Tab-completion helps to avoid this. For any variable (or other item) that has been stored in memory by prior lines of code, you can begin typing the name and then press the tab key. If the characters you’ve typed identify the variable uniquely, it will be inserted without you having to type it out. If there is more than one item in memory that starts with the letters you’ve typed, a little window will pop up and you can click the one you want for auto-completion. For various reasons, variable names are often long (e.g. “lion.surv.p.dot.phi.time”), so typing them out can be slow and error prone.
F1 context-sensitive help. Put the cursor within any function in the editing window and press F1, and help for that function (including example code) pops up in the bottom right window. This is VERY useful when learning R. I use it often when dealing with a new package or function.

Bottom right window has several tabs, including:

As mentioned, there is a Help window. Use the F1 key in the top left editing window to get help on a specific function, or type the name in the search bar of the help window. This is very valuable even for functions you understand well, to refresh your memory of the syntax required, to learn the arguments that you can specify within a function, and to learn the default values for arguments.
Plots stores and displays graphics as they are made. Scroll through them with the right and left arrows, and delete them individually or all at once.
A Packages window that is helpful to see what packages you’ve downloaded and installed, and to load them, though you should generally do this using the library() function within your script.
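The same help can also be reached by typing commands, which is handy when you’re working in the console. The sketch below uses the base R function sd() purely as an example; the variable name sd.arguments is my own.

```r
# From the console, either of these opens the same help page that F1 shows:
# ?sd
# help("sd")

# args() prints a function's arguments and their default values
args(sd)

# formals() returns the same information as a list that can be inspected
sd.arguments <- formals(sd)
names(sd.arguments)        # the argument names
sd.arguments[["na.rm"]]    # the default value of the na.rm argument
```

This is a quick way to check argument names and defaults without leaving the script; the full help page is still the place to look for explanations and example code.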

Bottom left window is what would be called the console in basic R. This is where lines of code are actually executed. In R Studio, you can:

position the cursor on a line of code in the editor (top left window), then press CTRL-R to execute just that line. (In recent versions of R Studio the standard shortcut is CTRL-Enter, or Cmd-Return on a Mac; use that if CTRL-R does nothing on your computer.) Executing code one line at a time can be very helpful in identifying and solving problems that keep an entire script from running properly (or at all).
highlight a block of code in the editor (top left window) and press CTRL-R to execute the block.
put the cursor anywhere in the editor, press CTRL-A to select the entire script, then press CTRL-R to run it.

As the script runs, you will see each line of code (in blue) echoed on the console as it is executed, output from the script (in black) and error messages (in red). Plots will pop up in the bottom right window, and you can scroll through them once the script is done running (with the right and left arrows).

Top right window has two tabs:

Environment shows the variables and values that have been stored in memory by the script. This is helpful when debugging errors. When you’re running lines from a script, it is often true that one line depends on variables created by prior lines, so the original source of errors can be a little tricky to diagnose. A good plan for debugging a script with problems is to start at the first line of code with nothing in memory (use the little broom icon labelled ‘Clear’ in the top right window) and nothing in the console (put the cursor in the console window [bottom left] and press CTRL-L), then execute single lines or small blocks of code in order from the start of the script, checking the console and the Environment tab to see if you have error messages and if the variables have values that make sense.
History shows the lines of code that have been run. I don’t use this tab much.
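What the Environment tab shows can also be checked and cleared with code. This is a minimal sketch with two throwaway variables (demo.a and demo.b are hypothetical names):

```r
# create two throwaway variables
demo.a <- 1
demo.b <- "two"

# ls() lists the names of everything currently stored in memory
ls()

# rm() removes a single variable
rm(demo.a)
exists("demo.a")   # FALSE: demo.a is gone
exists("demo.b")   # TRUE: demo.b is still in memory

# The next line is commented out so that it does not wipe your workspace.
# It removes everything in memory, like the broom icon.
# rm(list = ls())
```

Putting rm(list = ls()) at the very top of a script is one common way to start from a clean slate each time, but be sure that is what you want before running it.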

Sources of Help for R

The Cookbook for R website has a nice index of example scripts, with explanations, for all basic operations like importing data, manipulating it, making graphs, etc. This is a very useful resource for new users.
The Quick-R website is similar to the cookbook. I prefer the cookbook site but they are both good.
The R homepage has links to manuals, reference cards, webpages and user-groups. Authoritative, but not as user-friendly as the first two.
As an MSU student you can download a free pdf copy of the complete book “A Beginner’s Guide to R” using MSU’s SpringerLink connection, as long as you are on an MSU-domain computer. (Click the ‘download book’ link to get the entire thing as a pdf and save it on a jumpstick.) The Introduction (pages 1-27) covers the material discussed here. Anyway, it is an alternative to RFDS that relies only on base R functions rather than the tidyverse packages… which is getting rarer all the time.
For example code you can just use google, searching for something like “r change axis range”. Just include R and the thing you want to do and you’ll often find a good example that solves your problem, with code provided. Go ahead and google the example above and follow a few of the links. As you become more proficient, you’ll rely heavily on this.
More specifically, you can go to stackoverflow.com and search the site with R included in your search term. Any number of serious computer programmers post on that forum and it can be very useful. Don’t use the ‘ask question’ button unless you’ve already searched for an existing answer. Virtually EVERYTHING has already been asked and answered more than once. Stackoverflow is great but they do not believe the saying “there’s no such thing as a stupid question”.
Specifically for statistical analysis in R, the UCLA statistical consulting department has truly excellent explanations with great example code for many common analyses.

First Steps

The html files that explain R scripts in this class will always have the same formatting for three things.

Explanatory text in these files is not boxed.
R code appears in grey blocks. You can cut and paste the code into the R Studio script editor to build your own scripts and run them. Comments (which are not executable code) are identified by a # at the start of the comment and appear in green. Lines of functional code use blue for numeric and logical (TRUE/FALSE) values, green for character strings, and black for everything else.
Output from R appears in a box with no background color, with ## at the start of the line. These boxes show you the output that R sends to the console window when you run the code box that precedes it.

#This is a block of R code. It has the same color coding that you'd see in R Studio's script editor.  

x <- c(0,1,2,3)   #this assigns some numeric values to a variable named x.
mean(x)           #this calculates the mean of the values in x, and will cause the mean to be displayed in console output.

## [1] 1.5

Installing and loading packages

You have to download and install a package before you can load it so that the functions within the package are available. On your computer, now install a few packages that you’ll be using later for the course.

In the bottom right window, click the Packages tab. The window will show a list of the packages that are installed. You can use the Update icon to ensure you have the most recent versions from CRAN.
Click the Install icon. A window pops up with three boxes. Leave the first (repository) on the default. Leave the last (directory path) on the default unless you installed R somewhere other than the default location. In the middle box, begin typing the name of the package you’d like to install, and a list of packages will appear. Select the one you want and click install, leaving the ‘install dependencies’ box checked (this ensures that packages required by the selected package will all be installed together – many R packages require functions that are defined in some other package).
Use this process to install the packages popbio, unmarked, RMark, ggplot2, msm, lme4, MASS, MuMIn, car, class and gplots. (Capitalization matters when typing these package names.)
Once a package has been installed, you have to load it to make use of the functions in the package. You can do this in R Studio by clicking the box next to the name of an installed package, but it’s better to use the library() function within the script that will use the package.
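If you’d rather do this with code than with the Packages tab, something like the following sketch should work. The variable names (course.packages, already.installed, missing.packages) are my own, and the interactive() check is only there so that nothing is downloaded when the code is run unattended.

```r
# the packages listed above for this course
course.packages <- c("popbio", "unmarked", "RMark", "ggplot2", "msm", "lme4",
                     "MASS", "MuMIn", "car", "class", "gplots")

# names of the packages that are already installed on this computer
already.installed <- rownames(installed.packages())

# packages from the course list that still need to be installed
missing.packages <- setdiff(course.packages, already.installed)
missing.packages

# install whatever is missing, along with the packages they depend on
# (only when working interactively, e.g. in R Studio)
if (interactive() && length(missing.packages) > 0) {
  install.packages(missing.packages, dependencies = TRUE)
}
```

Installing only needs to happen once per computer. Loading with library() is still needed in every script that uses a package.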

Load the ggplot2 package, which provides functions for pretty graphics, and then use the qplot function within that package to make a graph of two variables named length and height.

#load the ggplot2 package
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.2

#create some variables and assign values
length <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
height <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)
species <- c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')

#make a graph with the qplot function within the ggplot2 package
qplot(length,height, colour = species)

Often in R people use shortcuts that make writing the script faster, when there are equivalent methods that make the logic more clear. For example, the code below produces an identical graph. By default [as in the example above] the first variable name put into the qplot function is the abscissa (x) and the second variable name specified is the ordinate (y), but you can also choose to spell this out [as in the example below]. Remember that you can copy and paste code into the script editor and run it from there with CTRL-R, and that you can put the cursor on a function in the script editor and press F1 to get context-specific help that explains the arguments (the terms within the function’s parentheses) of the function. This is a good way to learn the conventions (like the first variable being assumed to be x, in the case of qplot) and the default values that will be used if you don’t specify otherwise.

qplot(x=length,y=height,colour = species)

Some General Points About R Scripts

There is often more than one way to accomplish the same thing in R. Instead of loading the ggplot2 package and using its qplot() function, I could have just used the plot() function that is already provided in the base R package.

plot(length,height)
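The base plot() call above does not color the points by species. One possible way to get that, sketched here with colors I picked arbitrarily (orange for cats, blue for dogs), is to build a vector of colors and pass it to the col argument:

```r
# same values as in the qplot example
length <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
height <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)
species <- c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')

# one color per point: orange if the species is cat, blue otherwise
point.colours <- ifelse(species == "cat", "orange", "blue")

# col sets the color of each point, pch = 19 draws filled circles
plot(x = length, y = height, col = point.colours, pch = 19)

# add a legend so the colors can be interpreted
legend("topleft", legend = c("cat", "dog"), col = c("orange", "blue"), pch = 19)
```

This takes a few more lines than qplot(), which is a typical trade-off: base R needs no extra packages, while packages often do more of the work for you.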

R is sensitive to capitalization, so the package name MuMIn is not the same thing as mumin or MUMIN. This is also true for variable names: var1 and VAR1 are two different variables as far as R is concerned.

var1 <- 2
VAR1 <- 2000
var1

## [1] 2

VAR1

## [1] 2000

In the above code block, the <- assigns a value to a variable, and then typing the name of the variable causes its value to be displayed on the console. Alternatively, you can use = instead of <-.

var2 = 50
var2

## [1] 50

I’d recommend using <- to assign values, and thinking of it as assign and not equals. This will prevent any possible confusion with the R code for “is equal to”, which is “==”. This will be important later.
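A short sketch of the difference, using the made-up variable names var3 and is.fifty:

```r
# <- assigns: var3 now holds the value 50
var3 <- 50

# == compares: it asks a question and returns TRUE or FALSE
is.fifty <- var3 == 50
is.fifty

var3 == 51

# the comparisons did not change the value stored in var3
var3
```

The two comparisons return TRUE and FALSE respectively, and var3 still holds 50 afterwards.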

R ignores spaces between the items within a line, so it is not a problem to have extra spaces between items. You cannot have spaces within an item, however. For example, if you put a space between the < and the -, R will not recognize it as an assignment statement (it reads < - as “less than negative” instead).

var2 <-              3
var2

## [1] 3
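To see why a space inside the assignment arrow matters, here is a small sketch (var4 and comparison are hypothetical names). With the space, R does not report an error; it quietly does a comparison instead, which can be harder to spot.

```r
# a normal assignment: var4 holds 10, then 3
var4 <- 10
var4 <- 3
var4

# with a space between < and -, this asks "is var4 less than -3?"
comparison <- var4 < - 3
comparison

# var4 was not changed by the line above
var4
```

Here comparison is FALSE (3 is not less than -3) and var4 is still 3.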

R also allows you to type a single line of code on multiple lines without creating a problem. If you run the code below (cut and paste it into the editor, then use CTRL-A & CTRL-R), you’ll see that both assignment statements work fine. If you look at the console after it assigns the values to vector2, you’ll notice that the R console displays a > at the beginning of each new line, but displays a + instead of > when it is continuing a line of code rather than starting a new one. When you are debugging code that won’t run, it’s useful to check whether the console is displaying a plus sign: it means that R considers the previous command incomplete (often because of an unclosed parenthesis, bracket or quote) and is waiting for the rest of it.

vector1 <- c(1,2,3,4,5)
vector2 <-
          
      c(6,7,8,9,10)

vector1

## [1] 1 2 3 4 5

vector2

## [1]  6  7  8  9 10

R cannot deal with spaces within a variable name: it will treat the two parts as separate entities. A common convention is to use periods (dots) as a spacer within a variable name.

var.named.joe <- "Hi I'm Joe" 
var.named.joe

## [1] "Hi I'm Joe"

The above command successfully assigns the character string “Hi I’m Joe” to the variable var.named.joe.

#var named joe <- "oops"

would give an error message because of the spaces within the variable name. (Here, I have this line ‘commented out’ so that this script will run without errors.) Note that spaces within the character string stored in the variable var.named.joe are treated like any other character, unlike the spaces within the variable name itself. We’ll discuss the differences between text strings and numerical values below.

Variable names:

cannot have spaces within them.
cannot start with a number.
cannot contain a $ (because $ is used to separate the name of a data frame and a variable within that data frame… more on this next session.)
cannot contain any symbols used for mathematical operations in R.
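If you’re unsure whether a name is acceptable, the base R function make.names() can be used as a rough check: it converts any text into a syntactically valid name, so a name that comes back unchanged was valid to begin with. The candidate names below are made up for illustration.

```r
# four candidate names: three break a rule above, the last one is fine
candidate.names <- c("var named joe", "2nd.sample", "cost$", "var.named.joe")

# make.names() returns a valid version of each name
fixed.names <- make.names(candidate.names)
fixed.names

# a name is valid as typed only if make.names() left it unchanged
is.valid <- fixed.names == candidate.names
is.valid
```

Spaces and the $ are replaced with dots, and an X is added in front of the name that starts with a number, so only the last candidate is valid as typed.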

Putting a # at the start of a line turns it into a comment. Anything following the # will be displayed but will not be executed as code. This provides a way to annotate code or to disable a line of code without deleting it, which can be useful when debugging.

# this is a comment, explaining that the next line is a functional assignment statement that uses the function
# seq() to assign a sequence of values from 0 to 1 by units 0.1 to a variable named vector1.

vector1 <- seq(0,1,0.1)

#this is a comment noting that the next line of code is commented out and therefore doesn't run

#vector2 <- seq(0,1,0.1)

vector1

##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

vector2

## [1]  6  7  8  9 10

You may have noticed that this code block re-used the variable names vector1 and vector2. vector1 used to hold the values 1,2,3,4,5, but now it holds the values 0, 0.1, 0.2 … 0.9, 1. vector2 still holds its original values (6,7,8,9,10) because the new assignment is commented out. When writing longer scripts, it’s important to remember that R stores only the last assignment of a variable. The Environment tab in the top right window is useful when you have any confusion about what values are currently stored in a variable.

In general, avoid reusing variable names within a script unless doing so for some intentional reason.

Also, do not have multiple scripts open at once unless it is for a good reason; if they have variable names in common it can create unanticipated problems. If you have two or more scripts open (perhaps to copy an example code block), recall that you can use the clear button (broom icon) in the Environment tab of the top right window to clear all memory out and work with a clean slate.

Quotes are used to specify text, or character strings as text variables are called in R. It does not matter if they are single or double quotes.

# this stores the value 12 in a numeric variable 
var.num <- 12      
summary(var.num)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      12      12      12      12      12      12

# this stores the character string "12", essentially as a word rather than a number.  You can't do math on it.
var.char = "12"     
summary(var.char)

##    Length     Class      Mode 
##         1 character character
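The practical consequence of the difference can be sketched as follows. The names char.error and converted are my own, and tryCatch() is used only to capture the error message so that the script keeps running.

```r
var.num <- 12
var.char <- "12"

# class() reports how R is storing each value
class(var.num)
class(var.char)

# math works on the numeric variable
var.num + 1

# math on the character string fails; tryCatch() stores the error message
char.error <- tryCatch(
  var.char + 1,
  error = function(e) conditionMessage(e)
)
char.error

# as.numeric() converts the string "12" to the number 12, so math works again
converted <- as.numeric(var.char) + 1
converted
```

Data read from a file sometimes arrives as character strings when you expected numbers, so checking class() is a useful early step when results look odd.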