R and RStudio - Getting Started

Ecology and conservation biology are highly quantitative fields of science, and to be a good ecologist you must be skilled at working with data, conducting statistical analyses and presenting the results in an appropriate manner. The good news is that statistical methods and the associated software to do this are well-developed. The bad news is … the statistical methods and associated software are well developed, so the learning curve is steep.

Most ecologists now use the free software package R for data manipulation, statistical analysis, graphics and simulation modelling. R has several advantages:

1. It is free. Until R emerged, it was common to spend thousands of dollars on statistical software.

2. It is platform-independent, meaning that it will run in the same manner on a Mac, a PC running Windows, or a PC running Linux. This makes it simple to share files for analysis with colleagues (and students): if you email a file to another person, they can run it and replicate EXACTLY what you did.

3. It is not just a statistical/graphics package; it is a programming language, based on the commercial language S, which is partly where it got its name. Consequently, there is very little that you cannot do with R once you know the language. For example, in this course you’ll use R to manipulate and plot data, conduct statistical tests, implement highly specific models developed for ecologists (such as mark-recapture models) and construct and run fairly complex stochastic simulations of population dynamics.

4. It is code-driven rather than menu-driven. Many of you will be familiar with using menu-clicks to work with data (either to produce graphs or run statistical tests), for example in MS Excel. There is nothing wrong with this, but once a complicated, menu-driven analysis with many steps has been completed, it can be virtually impossible to replicate without detailed notes, even for the person who did it. In R, you write lines of code (a “script”) to open a data file, process it, conduct analysis and make plots. Once the script is saved, it is a permanent record of EXACTLY how the analysis was done. In practice, this is very useful because it allows (or perhaps forces!) you to:

5. Because of points 1 - 4, there is a lot of pre-existing R code available in two forms:

6. For ecologists in particular, most newly developed methods are implemented in R and provided as a package with documentation. In recent years, using these packages has usually been the most efficient and effective way to apply recently developed ecological methods. This class will require the use of several such packages:

Along with these advantages are two principal disadvantages:

1. R has a steeper learning curve than most menu-driven software, especially at the outset, and particularly if you’ve never written computer code before.

2. As with any computer language, R does EXACTLY what you tell it to do, so minor errors of logic, syntax, spelling and capitalization will keep an otherwise functional script from running. Yuo can raed this but R cant.

Download and Install R

All computers in MSU labs have R (and R Studio) installed. Most of you will want to use these on your own computer, and as mentioned above, R will run on a Mac (running OS X) or a PC (running either Windows or Linux). If you haven’t already, you’ll have to download and install R. R is available from CRAN, the Comprehensive R Archive Network. Click the link, then at CRAN select the appropriate version for your computer from the top box, and follow the instructions to download and install it.

Download and Install R Studio

I use R Studio to run R, and this class will assume that you’re using it, because the authors of RFDS are two of the main developers of R Studio (and the ‘tidyverse’ R packages). Click the link to go to the R Studio site, then click the download link and follow the instructions to install it. Once installed, you just start R Studio to run R. I prefer R Studio because it provides four windows that help you organize your work, avoid errors, and work efficiently.

Top left window - source code editor, where you write a script and save it. Unlike the graphical user interface (or GUI) in R, this editing window:

Bottom right window has several tabs, including:

Bottom left window is what would be called the console in basic R. This is where lines of code are actually executed. In R Studio:

  * Position the cursor on a line of code in the editor (top left window), then press CTRL-R to execute just that line. Executing code one line at a time can be very helpful in identifying and solving problems that keep an entire script from running properly (or at all).
  * Highlight a block of code in the editor (top left window) and press CTRL-R to execute the block.
  * Put the cursor anywhere in the editor, press CTRL-A to select the entire script, then press CTRL-R to run it.

If CTRL-R does nothing in your version of R Studio, CTRL-ENTER (CMD-RETURN on a Mac) runs the current line or selection.

As the script runs, you will see each line of code (in blue) echoed on the console as it is executed, output from the script (in black) and error messages (in red). Plots will pop up in the bottom right window, and you can scroll through them once the script is done running (with the right and left arrows).

Top right window has two tabs:

Sources of Help for R

  1. The Cookbook for R website has a nice index of example scripts, with explanations, for all basic operations like importing data, manipulating it, making graphs, etc. This is a very useful resource for new users.
  2. The Quick-R website is similar to the cookbook. I prefer the cookbook site but they are both good.
  3. The R homepage has links to manuals, reference cards, webpages and user-groups. Authoritative, but not as user-friendly as the first two.
  4. As an MSU student you can download a free pdf copy of the complete book “A Beginner’s Guide to R” using MSU’s SpringerLink connection, as long as you are on an MSU-domain computer (click the ‘download book’ link to get the entire thing as a pdf and save it on a USB drive). The Introduction (pages 1-27) covers the material discussed here. It is an alternative to RFDS that relies only on base R functions rather than the tidyverse packages, an approach that is getting rarer all the time.
  5. For example code you can just use google, searching for something like “r change axis range”. Just include R and the thing you want to do and you’ll often find a good example that solves your problem, with code provided. Go ahead and google the example above and follow a few of the links. As you become more proficient, you’ll rely heavily on this.
  6. More specifically, you can go to stackoverflow.com and search the site with R included in your search term. Any number of serious computer programmers post on that forum and it can be very useful. Don’t use the ‘ask question’ button unless you’ve already searched for an existing answer. Virtually EVERYTHING has already been asked and answered more than once. Stackoverflow is great, but they do not believe the saying “there’s no such thing as a stupid question”.
  7. Specifically for statistical analysis in R, the UCLA statistical consulting department has truly excellent explanations with great example code for many common analyses.
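
R also has a built-in help system that works without an internet connection. A minimal sketch of the most common calls is below; the base function sd() is used purely as an illustration, and the object name `found` is made up for this example.

```r
# Open the help page for a function whose name you already know
help("sd")        # typing ?sd does the same thing

# Show a function's arguments and default values without opening the help page
args(sd)

# List the functions whose names match a pattern (a regular expression)
found <- apropos("^sd$")
found

# Run the examples printed at the bottom of the help page
example("sd")
```

When you only know a topic rather than a function name, the web sources above are usually the faster route; help() is most useful once you know which function you need.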

First Steps

The html files that explain R scripts in this class will always have the same formatting for three things.

  1. Explanatory text in these files is not boxed.
  2. R code appears in grey blocks. You can copy and paste the code into the R Studio script editor to build your own scripts and run them. Comments (which are not executable code) are identified by a # at the start of the comment and appear in green. Lines of functional code use blue for numeric and logical (TRUE/FALSE) values, green for character strings, and black for everything else.
  3. Output from R appears in a box with no background color, with ## at the start of the line. These boxes show you the output that R sends to the console window when you run the code box that precedes it.
#This is a block of R code. It has the same color coding that you'd see in R Studio's script editor.  

x <- c(0,1,2,3)   #this assigns some numeric values to a variable named x.
mean(x)           #this calculates the mean of the values in x, and will cause the mean to be displayed in console output.
## [1] 1.5

Installing and loading packages

You have to download and install a package before you can load it so that the functions within the package are available. On your computer, now install a few packages that you’ll be using later in the course.

  1. In the bottom right window, click the Packages tab. The window will show a list of the packages that are installed. You can use the Update icon to ensure you have the most recent versions from CRAN.
  2. Click the Install icon. A window pops up with three boxes. Leave the first (repository) on the default. Leave the last (directory path) on the default unless you installed R somewhere other than the default location. In the middle box, begin typing the name of the package you’d like to install, and a list of packages will appear. Select the one you want and click install, leaving the ‘install dependencies’ box checked (this ensures that packages required by the selected package will all be installed together; many R packages require functions that are defined in some other package).
  3. Use this process to install the packages popbio, unmarked, RMark, ggplot2, msm, lme4, MASS, MuMIn, car, class and gplots. (Capitalization matters when typing these package names.)
  4. Once a package has been installed, you have to load it to make use of the functions in the package. You can do this in R Studio by clicking the box next to the name of an installed package, but it’s better to use the library() function within the script that will use the package.
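
The same installation can be done in code rather than through the Packages tab. The sketch below assumes an internet connection and a working default CRAN mirror; the object names `wanted` and `to_install` are just illustrative. It installs only the packages that are not already present, and only in an interactive session, so that running a saved script never triggers a surprise download.

```r
# The packages named in step 3
wanted <- c("popbio", "unmarked", "RMark", "ggplot2", "msm", "lme4",
            "MASS", "MuMIn", "car", "class", "gplots")

# Which of these are not yet installed on this computer?
to_install <- setdiff(wanted, rownames(installed.packages()))
to_install

# Install the missing ones, along with the packages they depend on
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Installation is needed only once per computer, whereas library() has to be called in every new R session that uses the package.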

Load the ggplot2 package, which provides functions for pretty graphics, and then use the qplot function within that package to make a graph of two variables named length and height.

#load the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
#create some variables and assign values
length <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
height <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)
species <- c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')

#make a graph with the qplot function within the ggplot2 package
qplot(length,height, colour = species)

Often in R people use shortcuts that make writing the script faster, when there are equivalent methods that make the logic more clear. For example, the code below produces an identical graph. By default [as in the example above] the first variable name put into the qplot function is the abscissa (x) and the second variable name specified is the ordinate (y), but you can also choose to spell this out [as in the example below]. Remember that you can copy and paste code into the script editor and run it from there with CTRL-R, and that you can put the cursor on a function in the script editor and press F1 to get context-specific help that explains the arguments (the terms within the function’s parentheses) of the function. This is a good way to learn the conventions (like the first variable being assumed to be x, in the case of qplot) and the default values that will be used if you don’t specify otherwise.

qplot(x=length,y=height,colour = species)
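
Recent releases of ggplot2 (3.4.0 and later) mark qplot() as deprecated, although it still works. Below is a sketch of the same graph written with the package's main function, ggplot(), assuming ggplot2 is installed. The data frame name `animals` and the object name `p` are made up for this example.

```r
library(ggplot2)

# ggplot() expects the variables to be columns of a data frame
animals <- data.frame(
  length  = c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
  height  = c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7),
  species = c('cat','cat','cat','cat','dog','cat','dog','cat',
              'dog','dog','dog','cat','dog','dog','dog')
)

# aes() maps columns to x, y and colour; geom_point() draws the points
p <- ggplot(animals, aes(x = length, y = height, colour = species)) +
  geom_point()
p
```

This is longer than the qplot() call, but every choice (which data, which axes, which kind of plot) is spelled out, in the same spirit as naming x and y explicitly.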

Some General Points About R Scripts

There is often more than one way to accomplish the same thing in R. Instead of loading the ggplot2 package and using its qplot() function, I could have just used the plot() function that is already provided in the base R package.

plot(length,height)
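
The basic plot() call does not distinguish the two species. Here is a sketch of one way to add colour in base R; the colour choices and the name `point.colours` are arbitrary.

```r
length  <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
height  <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)
species <- c('cat','cat','cat','cat','dog','cat','dog','cat',
             'dog','dog','dog','cat','dog','dog','dog')

# one colour per point: red for cats, blue for dogs
point.colours <- ifelse(species == 'cat', 'red', 'blue')

# pch = 19 draws filled circles
plot(length, height, col = point.colours, pch = 19)

# add a legend so the colours can be interpreted
legend("topleft", legend = c("cat", "dog"), col = c("red", "blue"), pch = 19)
```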

R is sensitive to capitalization, so the package name MuMIn is not the same thing as mumin or MUMIN. This is also true for variable names: var1 and VAR1 are two different variables as far as R is concerned.

var1 <- 2
VAR1 <- 2000
var1
## [1] 2
VAR1
## [1] 2000

In the above code block, the <- assigns a value to a variable, and then typing the name of the variable causes its value to be displayed on the console. Alternatively, you can use = instead of <-.

var2 = 50
var2
## [1] 50

I’d recommend using <- to assign values, and thinking of it as assign and not equals. This will prevent any possible confusion with the R code for “is equal to”, which is “==”. This will be important later.
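
A short sketch of the difference between assigning and comparing (the variable names `x` and `is.five` are just for illustration):

```r
x <- 5               # assignment: store the value 5 in x
x == 5               # comparison: is x equal to 5?
## [1] TRUE
x == 6
## [1] FALSE
is.five <- x == 5    # the result of a comparison can itself be stored
is.five
## [1] TRUE
```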

R ignores spaces between the items within a line, so it is not a problem to have extra spaces between items. You cannot have spaces within an item, however. For example, < - (with a space between the < and the -) will not be recognized as an assignment; R reads it as a “less than” comparison followed by a minus sign.

var2 <-              3
var2
## [1] 3
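
A space inside the arrow is easy to miss because it does not always produce an error. A short sketch, with a made-up variable name `var3`:

```r
var3 <- 10       # assignment: var3 now holds 10
var3 < - 3       # NOT an assignment: asks whether var3 is less than -3
## [1] FALSE
var3             # var3 is unchanged
## [1] 10
```

If the variable did not already exist, the second line would instead stop with an “object not found” error.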

R also allows you to type a single line of code on multiple lines without creating a problem. If you run the code below (copy and paste it into the editor, then use CTRL-A & CTRL-R), you’ll see that both assignment statements work fine. If you look at the console after it assigns the values to vector2, you’ll notice that the R console displays a > at the beginning of each new line, but displays a + instead of > when it is continuing a line of code rather than starting a new one. When you are debugging code that won’t run, it’s useful to check whether the console is displaying a plus sign: that means R considers the previous command incomplete (often because of a missing closing parenthesis or quote) and is waiting for the rest of it. Press ESC in the console to cancel the unfinished command.

vector1 <- c(1,2,3,4,5)
vector2 <-
          
      c(6,7,8,9,10)

vector1
## [1] 1 2 3 4 5
vector2
## [1]  6  7  8  9 10

R cannot deal with spaces within a variable name: it will treat the two parts as separate entities. A common convention is to use periods (dots) as a spacer within a variable name.

var.named.joe <- "Hi I'm Joe" 
var.named.joe
## [1] "Hi I'm Joe"

The above command successfully assigns the character string “Hi I’m Joe” to the variable var.named.joe.

#var named joe <- "oops"

would give an error message because of the spaces within the variable name. (Here, I have this line ‘commented out’ so that this script will run without errors.) Note that spaces within the character string stored in the variable var.named.joe are treated like any other character, unlike the spaces within the variable name itself. We’ll discuss the differences between text strings and numerical values below.

Variable names:

  * cannot have spaces within them.
  * cannot start with a number.
  * cannot contain a $ (because $ is used to separate the name of a data frame and a variable within that data frame… more on this next session).
  * cannot contain any symbols used for mathematical operations in R.
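
Base R's make.names() function applies these rules, so it offers a quick way to check whether a name is legal. A small sketch; the candidate names and the object names `candidates` and `fixed` are made up for this example.

```r
candidates <- c("var.named.joe", "var named joe", "2nd.var", "cost$")

# make.names() returns a legal version of each name
fixed <- make.names(candidates)
fixed
## [1] "var.named.joe" "var.named.joe" "X2nd.var"      "cost."

# a name was already legal if make.names() left it unchanged
candidates == fixed
## [1]  TRUE FALSE FALSE FALSE
```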

Putting a # at the start of a line turns it into a comment. Anything following the # will be displayed but will not be executed as code. This provides a way to annotate code or to disable a line of code without deleting it, which can be useful when debugging.

# this is a comment, explaining that the next line is a functional assignment statement that uses the function
# seq() to assign a sequence of values from 0 to 1 by units 0.1 to a variable named vector1.

vector1 <- seq(0,1,0.1)

#this is a comment noting that the next line of code is commented out and therefore doesn't run

#vector2 <- seq(0,1,0.1)

vector1
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
vector2
## [1]  6  7  8  9 10

You may have noticed that this code block re-used the variable names vector1 and vector2. vector1 used to hold the values 1,2,3,4,5, but now it holds the values 0, 0.1, 0.2 … 0.9, 1. vector2 still holds its original values (6,7,8,9,10) because the new assignment is commented out. When writing longer scripts, it’s important to remember that R stores only the last assignment of a variable. The Environment tab in the top right window is useful when you have any confusion about what values are currently stored in a variable.

In general, avoid reusing variable names within a script unless you are doing so intentionally.

Also, do not have multiple scripts open at once unless it is for a good reason; if they have variable names in common it can create unanticipated problems. If you have two or more scripts open (perhaps to copy an example code block), recall that you can use the clear button (broom icon) in the Environment tab of the top right window to clear all memory out and work with a clean slate.
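
What the Environment tab shows can also be checked from code. A sketch using base R functions; the variable name `temp.var` is hypothetical.

```r
temp.var <- seq(0, 1, 0.1)

exists("temp.var")          # is there currently an object with this name?
## [1] TRUE

temp.var <- "overwritten"   # a later assignment replaces the earlier value
temp.var
## [1] "overwritten"

rm(temp.var)                # remove just this one object from memory
exists("temp.var")
## [1] FALSE

# rm(list = ls())           # would remove everything, like the broom icon
```

The last line is commented out on purpose: it wipes every object in memory, so run it only when you really want a clean slate.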

Quotes are used to specify text, or character strings as text variables are called in R. It does not matter if they are single or double quotes.

# this stores the value 12 in a numeric variable 
var.num <- 12      
summary(var.num)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      12      12      12      12      12      12
# this stores the character string "12", essentially as a word rather than a number.  You can't do math on it.
var.char <- "12"
summary(var.char)
##    Length     Class      Mode 
##         1 character character
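
Two quick checks follow from this: single and double quotes produce the same string, and a character string that looks like a number has to be converted before it can be used in arithmetic. A sketch; the name `var.converted` is made up for this example.

```r
identical('cat', "cat")    # single and double quotes are interchangeable
## [1] TRUE

var.char <- "12"
class(var.char)
## [1] "character"

# var.char + 1 would stop with an error, because var.char is text

var.converted <- as.numeric(var.char)   # convert the text to a number
class(var.converted)
## [1] "numeric"
var.converted + 1
## [1] 13
```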