Reading and inspecting data files, simple data manipulation and graphing

Human Population Growth

The US Census Bureau is a definitive source of comprehensive data on human population growth and the factors that affect it. The USCB has estimated near-term future trends (2015 - 2060) in national population growth, including birth rates, death rates, immigration, emigration, ‘natural increase’ (which is growth due to births and deaths in place) and population growth (accounting for immigration and emigration). These estimates are tabulated in a 2012 report. This lab will use these data to:

1. Explore projected patterns of human population growth and carrying capacity that were discussed in class, and

2. Use R for some basic data processing and graphical analysis.

Import and inspect the data

Download and save the data file US_pop_projections.txt

First clear the memory of all objects and read in the data. You will need to revise the setwd() line to set the working directory to the location where you saved the data. In the read.table() function, the sep = "" argument specifies that the variables in this input file are delimited by white space (one or more spaces or tabs) rather than by commas. The argument header = TRUE specifies that the first row contains variable names.

rm(list = ls())
setwd("C:/Users/g23b661/Desktop/BIOE 440 2016/BIOE_440_R_Markdown/2.Manipulating_and_using_data")

us.pop.projection <- read.table("US_pop_projections.txt", sep = "", header = TRUE)
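To see how sep = "" treats white space, the sketch below reads a few rows typed inline through the text argument of read.table() rather than from the file. The names txt and demo are made up for this illustration, and only four of the columns are included; the values are the first three rows of the projection table.

```r
# Hypothetical inline copy of three rows and four columns of the table.
# Fields are separated by a mix of tabs (\t) and runs of spaces on purpose.
txt <- "year\tUS_population   births deaths
2015\t321363 4290   2613
2016 323849\t4312 2643
2017   326348 4333\t2673"

# sep = "" treats any run of spaces or tabs as a single delimiter.
demo <- read.table(text = txt, sep = "", header = TRUE)

# Each variable should come back as its own numeric column.
str(demo)
```

If a file were comma-delimited instead, sep = "," would be the argument to use.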

Always inspect the data to be sure it was read in correctly and that you know what the file holds.

head(us.pop.projection)
##   year US_population annual_change percent_change natural_increase births
## 1 2015        321363          2471           0.77             1677   4290
## 2 2016        323849          2486           0.77             1669   4312
## 3 2017        326348          2499           0.77             1659   4333
## 4 2018        328857          2510           0.77             1647   4351
## 5 2019        331375          2517           0.77             1631   4367
## 6 2020        333896          2521           0.76             1612   4380
##   deaths net_immigration
## 1   2613             794
## 2   2643             817
## 3   2673             840
## 4   2704             863
## 5   2736             886
## 6   2768             909
tail(us.pop.projection)
##    year US_population annual_change percent_change natural_increase births
## 41 2055        409873          2037            0.5              825   4879
## 42 2056        411923          2051            0.5              838   4889
## 43 2057        413989          2065            0.5              852   4899
## 44 2058        416068          2079            0.5              865   4909
## 45 2059        418161          2093            0.5              879   4920
## 46 2060        420268          2106            0.5              891   4930
##    deaths net_immigration
## 41   4054            1212
## 42   4051            1213
## 43   4048            1214
## 44   4044            1214
## 45   4041            1215
## 46   4039            1215
summary(us.pop.projection)
##       year      US_population    annual_change  percent_change  
##  Min.   :2015   Min.   :321363   Min.   :1967   Min.   :0.5000  
##  1st Qu.:2026   1st Qu.:349476   1st Qu.:2007   1st Qu.:0.5000  
##  Median :2038   Median :374917   Median :2100   Median :0.5550  
##  Mean   :2038   Mean   :373172   Mean   :2204   Mean   :0.6017  
##  3rd Qu.:2049   3rd Qu.:397324   3rd Qu.:2454   3rd Qu.:0.7075  
##  Max.   :2060   Max.   :420268   Max.   :2521   Max.   :0.7700  
##  natural_increase     births         deaths     net_immigration
##  Min.   : 770.0   Min.   :4290   Min.   :2613   Min.   : 794   
##  1st Qu.: 810.2   1st Qu.:4417   1st Qu.:3016   1st Qu.:1053   
##  Median : 916.0   Median :4556   Median :3640   Median :1165   
##  Mean   :1093.2   Mean   :4599   Mean   :3506   Mean   :1110   
##  3rd Qu.:1400.8   3rd Qu.:4800   3rd Qu.:4026   3rd Qu.:1201   
##  Max.   :1677.0   Max.   :4930   Max.   :4055   Max.   :1215
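Beyond head(), tail() and summary(), a few other base R functions are handy for a quick inspection. The sketch below applies them to a small stand-in data frame (the name demo and the choice of three rows are assumptions for illustration); the same calls would work on us.pop.projection.

```r
# Small stand-in for the imported data: first three rows, four columns.
demo <- data.frame(
  year          = 2015:2017,
  US_population = c(321363, 323849, 326348),
  births        = c(4290, 4312, 4333),
  deaths        = c(2613, 2643, 2673)
)

dim(demo)                  # number of rows, then number of columns
names(demo)                # variable names taken from the header row
str(demo)                  # type of each variable plus the first few values
any(is.na(demo))           # TRUE if any value is missing
all(diff(demo$year) == 1)  # TRUE if the years are consecutive, with no gaps
```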

Some data processing

Looking at the data, you probably noticed that all of the values for population size, numbers of births and deaths, etc. are too small (the US had about 320,000,000 of the world’s 7.2 billion people when I wrote this). This is because the US Census Bureau provides these raw data in units of 1,000 people. (The logic is that when numbers become too large, we lose our numerical intuition.) Convert by multiplying by 1000, so that the units are individuals.

year <- us.pop.projection$year
pop <- us.pop.projection$US_population * 1000
annual.change <- us.pop.projection$annual_change * 1000
natural.increase <- us.pop.projection$natural_increase * 1000
births <- us.pop.projection$births * 1000
deaths <- us.pop.projection$deaths * 1000
net.immigration <- us.pop.projection$net_immigration * 1000
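As an alternative sketch, several columns can be rescaled in one step by selecting them by name. The data frame demo and the vector count.cols below are hypothetical names used only for this illustration, with values copied from the first three rows of the table.

```r
# Stand-in data in units of 1,000 people (year is not a count).
demo <- data.frame(
  year          = c(2015, 2016, 2017),
  US_population = c(321363, 323849, 326348),
  births        = c(4290, 4312, 4333),
  deaths        = c(2613, 2643, 2673)
)

# Names of the columns that hold counts and need rescaling.
count.cols <- c("US_population", "births", "deaths")

# Multiply every selected column by 1000 at once; year is left untouched.
demo[count.cols] <- demo[count.cols] * 1000

demo
```

Listing the columns explicitly guards against rescaling variables such as year or percent_change by mistake.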

Now, re-assemble the newly created variables into a single object. This can be done with the cbind() function, which binds variables together into a single object with one column per variable. (The function rbind() would bind the variables as rows. There are also functions for joining tables, such as merge() in base R and the join functions in the dplyr package, that can handle more complicated tasks.)

pop.data.new <- cbind(year, pop, annual.change, natural.increase, births, deaths, net.immigration)

View the new object to be sure it’s OK.

head(pop.data.new, 5)
##      year       pop annual.change natural.increase  births  deaths
## [1,] 2015 321363000       2471000          1677000 4290000 2613000
## [2,] 2016 323849000       2486000          1669000 4312000 2643000
## [3,] 2017 326348000       2499000          1659000 4333000 2673000
## [4,] 2018 328857000       2510000          1647000 4351000 2704000
## [5,] 2019 331375000       2517000          1631000 4367000 2736000
##      net.immigration
## [1,]          794000
## [2,]          817000
## [3,]          840000
## [4,]          863000
## [5,]          886000
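The difference between cbind() and rbind() is easiest to see from the dimensions of the result. This is a minimal sketch with two short vectors; the object names by.col and by.row are made up for the example.

```r
# Two short vectors of equal length (first three years of the projection).
year <- c(2015, 2016, 2017)
pop  <- c(321363, 323849, 326348) * 1000

by.col <- cbind(year, pop)  # one column per variable
by.row <- rbind(year, pop)  # one row per variable

dim(by.col)        # 3 rows, 2 columns
dim(by.row)        # 2 rows, 3 columns
is.matrix(by.col)  # binding plain vectors produces a matrix
```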

There is more than one way to accomplish this task (or any task) in R. You can also use the data.frame() function to assemble the variables into a single data frame, organized by columns.

pop.data.new.2 <- data.frame(year, pop, annual.change, natural.increase, births, deaths, net.immigration)

View the new dataframe.

head(pop.data.new.2, 5)
##   year       pop annual.change natural.increase  births  deaths net.immigration
## 1 2015 321363000       2471000          1677000 4290000 2613000          794000
## 2 2016 323849000       2486000          1669000 4312000 2643000          817000
## 3 2017 326348000       2499000          1659000 4333000 2673000          840000
## 4 2018 328857000       2510000          1647000 4351000 2704000          863000
## 5 2019 331375000       2517000          1631000 4367000 2736000          886000

Look at the top right panel in R Studio and note a subtle but important difference between pop.data.new and pop.data.new.2. The first is a matrix, and the second is a data frame. A matrix in R is just like a matrix in linear algebra (and a vector in R behaves much like an N x 1 matrix). A matrix in R can contain data of only one type (numeric, character or logical (true/false), but usually numeric).

A data frame can hold variables of more than one type. In general, it is best to have your data in a data frame, though there are some functions in R that require the data to be in a matrix. Often, the difference does not matter, but you should be aware of it. There is also a function as.data.frame() that can be used to convert a matrix to a data frame, like so:

converted <- as.data.frame(pop.data.new)

Because a matrix holds data of only one type, every variable in the object being converted here is numeric, and so is every column of the resulting data frame.
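The one-type rule for matrices can be checked directly. In the sketch below the character vector label is a hypothetical variable invented for the illustration; binding it to a numeric vector with cbind() forces everything to character, whereas data.frame() keeps each column's own type.

```r
year  <- c(2015, 2016, 2017)
pop   <- c(321363, 323849, 326348) * 1000
label <- c("first", "second", "third")  # hypothetical character variable

m.mixed <- cbind(year, label)       # matrix: year is coerced to character
d.mixed <- data.frame(year, label)  # data frame: year stays numeric

is.character(m.mixed)    # TRUE, the numbers are now text
is.numeric(d.mixed$year) # TRUE

# Converting an all-numeric matrix keeps the columns numeric.
m.num     <- cbind(year, pop)
converted <- as.data.frame(m.num)
is.data.frame(converted)
sapply(converted, is.numeric)
```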

The US Census Bureau uses age-structured population models (Leslie Matrix models, which you will learn in detail later in the course) to make projections of future population trends with defined assumptions. For this dataset the projections go out to 2060. The further into the future a projection is made, the less certainty we have about the predictions. Suppose we want to restrict our predictions to only 25 years beyond the present. To do this, we can subset the data. Two common ways to subset data are:

  1. Using row and column index values for a vector or matrix

  2. Using the subset() function

Here’s an example of the index method - selecting the first 25 years of the variables for year and population size.

  • The index values for a vector or matrix are identified by square brackets [].
  • The first index (before the comma) identifies the row or rows.
  • The second index (after the comma) identifies the column or columns.
  • A colon (:) can be used to specify a range of values for either rows or columns.

So the code below selects the first 25 rows of the first column of pop.data.new and assigns it to a variable years, and the first 25 rows of the second column of pop.data.new and assigns it to a variable psizes.

years <- pop.data.new[1:25, 1]
psizes <- pop.data.new[1:25, 2]

#view the new variables
years
##  [1] 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029
## [16] 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039
psizes
##  [1] 321363000 323849000 326348000 328857000 331375000 333896000 336416000
##  [8] 338930000 341436000 343929000 346407000 348867000 351304000 353718000
## [15] 356107000 358471000 360792000 363070000 365307000 367503000 369662000
## [22] 371788000 373883000 375950000 377993000
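The indexing rules can be tried on a small matrix where the results are easy to check by eye. The matrix m below is a made-up stand-in with only four rows and two columns; the values are the first four years of the projection.

```r
# Small stand-in matrix: four rows, columns named year and pop.
year <- c(2015, 2016, 2017, 2018)
pop  <- c(321363, 323849, 326348, 328857) * 1000
m    <- cbind(year, pop)

m[1:2, 1]      # rows 1 to 2 of column 1 (the first two years)
m[3, ]         # all columns of row 3 (nothing after the comma)
m[, 2]         # all rows of column 2 (nothing before the comma)
m[1:2, "pop"]  # columns can also be selected by name
```

Selecting a column by name is a little more robust than selecting it by position, because it still works if the column order changes.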

Here’s an example of the subset() method. subset() works well on dataframes but will also work on a matrix. For a dataframe, the first argument is the dataframe to be subsetted, the second argument identifies what subset of the original data to retain, and the third argument specifies which variables to retain.

short.term <- subset(pop.data.new.2, year < 2040, select = c(year, pop))
short.term
##    year       pop
## 1  2015 321363000
## 2  2016 323849000
## 3  2017 326348000
## 4  2018 328857000
## 5  2019 331375000
## 6  2020 333896000
## 7  2021 336416000
## 8  2022 338930000
## 9  2023 341436000
## 10 2024 343929000
## 11 2025 346407000
## 12 2026 348867000
## 13 2027 351304000
## 14 2028 353718000
## 15 2029 356107000
## 16 2030 358471000
## 17 2031 360792000
## 18 2032 363070000
## 19 2033 365307000
## 20 2034 367503000
## 21 2035 369662000
## 22 2036 371788000
## 23 2037 373883000
## 24 2038 375950000
## 25 2039 377993000
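For comparison, here is a sketch of the same kind of selection done both ways on a small stand-in data frame. The names d, by.subset and by.index are invented for the example, and the cutoff year 2018 is arbitrary.

```r
# Stand-in data frame with six years of the projection.
d <- data.frame(
  year   = 2015:2020,
  pop    = c(321363, 323849, 326348, 328857, 331375, 333896) * 1000,
  births = c(4290, 4312, 4333, 4351, 4367, 4380) * 1000
)

# subset(): a condition on the rows, then the variables to keep.
by.subset <- subset(d, year < 2018, select = c(year, pop))

# Index method: a logical condition before the comma, names after it.
by.index <- d[d$year < 2018, c("year", "pop")]

by.subset
by.index
```

Both objects should hold the same three rows and two columns; subset() simply lets you write the condition without repeating d$.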

USING THE subset() FUNCTION IS USUALLY LESS PRONE TO SIMPLE BUT FRUSTRATING ERRORS. The index method is trickier and more error-prone. (For example, it is very easy to be one row off by forgetting that the range 1000 - 1009 has ten entries, not nine, particularly when you’re doing the indexing as part of some more complex task.) On the other hand, the index method is very flexible, because R allows logical operations in subscripts. That is, you can use subscripts to select subsets of the data that meet a certain criterion, rather than just identifying subsets of the data by row and column position. Selecting by criterion usually makes it easier to do what you intended.
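The off-by-one trap, and the way a logical condition avoids it, can be checked in a few lines. The vector all.years below is a stand-in for the year column of the data.

```r
# A range includes both endpoints, so 1000 to 1009 has ten entries.
length(1000:1009)

# Stand-in for the year column: 2015 through 2060.
all.years <- 2015:2060
length(all.years)

# Selecting by criterion: R works out which positions qualify,
# so there is no need to count rows by hand.
sum(all.years < 2040)           # how many years fall before 2040
range(which(all.years < 2040))  # first and last qualifying positions
```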

The code below identifies the maximum value in column 2 of pop.data.new.2 (this column holds population sizes) and assigns it to variable x.

x <- max(pop.data.new.2[,2])
x
## [1] 420268000

Statements like this can be used to perform a wide range of data manipulation or selection. For example, the code below sums column 7 (net immigration) across all rows (years) and assigns it to variable y. It then does the same for just the first 10 years, and assigns the sum to variable z. z/y is the proportion of immigrants for the whole period that arrive in the first 10 years.

y <- sum(pop.data.new[, 7])
y
## [1] 51083000
z <- sum(pop.data.new[pop.data.new[,1]<2025, 7])  #this will take a little inspection to understand! See below.
z
## [1] 8975000
z/y
## [1] 0.1756945

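In the assignment statement for variable z, we used a logical subscript. The row index (before the comma) is itself an expression, pop.data.new[,1]<2025. The inner part, pop.data.new[,1], selects all rows of column 1 (year) because there is no entry for the row index before the comma. Comparing those values with 2025 gives a TRUE or FALSE for every row. Used as the row index, these values keep only the rows in which year is less than 2025, so the sum() function adds up column 7 (net immigration) for those rows only.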
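The logical subscript in the statement for z can be taken apart one step at a time. The sketch below uses a made-up two-column matrix m (so net immigration is column 2 here, not column 7), six years of data, and an arbitrary cutoff of 2018; the intermediate name keep is also invented for the example.

```r
# Stand-in matrix: year in column 1, net immigration in column 2.
year            <- 2015:2020
net.immigration <- c(794, 817, 840, 863, 886, 909) * 1000
m               <- cbind(year, net.immigration)

# Step 1: all rows of column 1, compared with the cutoff.
keep <- m[, 1] < 2018
keep            # one TRUE or FALSE per row

# Step 2: use the logical vector as the row index.
m[keep, 2]      # net immigration for the qualifying years only

# Step 3: sum those values; this matches the one-line version.
sum(m[keep, 2])
sum(m[m[, 1] < 2018, 2])
```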

Remember that in R Studio you can put the cursor in any function and press F1 to get context-specific help about that function.

R has the following logical operators, which can be used in subscripts or in other ways (notably in if() statements, which you will see later).

R logical operators:

>    for "greater than"
>=   for "greater than or equal to"
<    for "less than"
<=   for "less than or equal to"
==   for "equal to"  (this one is a common source of confusion. Not the same as =, which is equivalent to <-)
!=   for "not equal to"
&    for "and"
|    for "or" 
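A short sketch of these operators applied to a vector follows. The vector yrs is made up for the illustration; each comparison returns one TRUE or FALSE per element, and & and | combine those results element by element.

```r
yrs <- c(2015, 2020, 2025, 2030)

yrs > 2020                   # strictly greater than
yrs >= 2020                  # greater than or equal to
yrs == 2025                  # equal to (two equals signs)
yrs != 2025                  # not equal to
yrs >= 2020 & yrs < 2030     # both conditions must hold
yrs == 2015 | yrs == 2030    # either condition may hold

# Logical results can be used directly as subscripts.
yrs[yrs >= 2020 & yrs < 2030]
```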

Homework

Read RFDS 1, 2, 3.1-3.6 to get started with using the ggplot2 package to make plots.

Make a series of plots to better understand projected US human population growth. Some of these will require you to calculate a new variable before plotting it, using the simple data manipulation methods from this exercise.

  1. Population size (Y-axis) vs year (X-axis)
  2. Population growth rate vs year
  3. Population growth rate vs population size
  4. Births per individual vs year
  5. Deaths per individual vs year
  6. Births per individual vs population size
  7. Deaths per individual vs population size

Write a one paragraph summary of what these plots reveal about the pattern of growth predicted for the US for this period, and why the growth rate is expected to change. Birth rates? Death rates? Both?
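As a starting point only, here is one possible sketch of the calculate-then-plot pattern, using a three-row stand-in data frame rather than the full data. The names demo and births.per.capita are invented for the example, the plotting step assumes the ggplot2 package is installed, and the choice of a line plot is just one option.

```r
# Stand-in data: first three years, already converted to individuals.
demo <- data.frame(
  year   = c(2015, 2016, 2017),
  pop    = c(321363, 323849, 326348) * 1000,
  births = c(4290, 4312, 4333) * 1000
)

# New variable: births per individual in each year.
demo$births.per.capita <- demo$births / demo$pop

# Plot only if ggplot2 is available (assumption: it has been installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(demo, ggplot2::aes(x = year, y = births.per.capita)) +
    ggplot2::geom_line() +
    ggplot2::labs(x = "Year", y = "Births per individual")
  print(p)
}
```

The same pattern (divide one column by another, then map the new column to the Y-axis) applies to the other per-individual plots.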