Tutorial 1: Getting Started

DPI R Bootcamp

Jared Knowles

Overview

R

What Does it Look Like?

The R workspace in RStudio

A Bit of HistoRy

The Philosophy

John Chambers, describing the logic behind the S language, said:

[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.

R is Born

Why Use R

Thoughts on Free

  1. R can be run and used for any purpose, commercial or non-commercial, profit or not-for-profit.
  2. R’s source code is freely available, so you can study how it works and adapt it to your needs.
  3. R is free to redistribute, so you can share it with your friends.
  4. R is free to modify, and those modifications are free to redistribute and may be adopted by the rest of the community!

R Advantages Continued

R Can Complement Other Tools

R’s Drawbacks

R’s Popularity

R has recently passed Stata in Google Scholar hits and is catching up to the two major players, SPSS and SAS.

R Has an Active Web Presence

R is linked to from more and more sites

R Extensions

These links come from the explosion of add-on packages to R

R Has an Active Community

Usage of the R listserv for help has really exploded recently

Data from Bob Muenchen available online

R Vocabulary

Components of an R Setup

Advanced R Setup

Open Source Toolchain

Some Notes about Maintaining R

Self-help

Self-help (2)

foo <- c(1, "b", 5, 7, 0)
bar <- c(1, 2, 3, 4, 5)
foo + bar
Error: non-numeric argument to binary operator
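
The error above arises because c() coerces all of its elements to a single type, so one quoted value turns the whole vector into character. Below is a minimal sketch of how one might diagnose this, reusing foo and bar from the example; the name foo_num is introduced here only for illustration.

```r
foo <- c(1, "b", 5, 7, 0)
bar <- c(1, 2, 3, 4, 5)

# A single quoted element makes the whole vector character
class(foo)

# Converting back to numeric turns the unparseable "b" into NA
# (as.numeric() also emits a coercion warning, silenced here)
foo_num <- suppressWarnings(as.numeric(foo))
foo_num

# Arithmetic now works; the NA propagates to the result
foo_num + bar
```

Checking class() first is usually the quickest way to see why an arithmetic operator refuses its arguments.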

Let’s Open RStudio

The Data Frame

data(mtcars)
mtcars
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

A Data Frame

R As A Calculator

2 + 2  # add numbers
[1] 4
2 * pi  #multiply by a constant
[1] 6.283
7 + runif(1, min = 0, max = 1)  #add a random variable
[1] 7.679
4^4  # powers
[1] 256
sqrt(4^4)  # functions
[1] 16

Arithmetic Operators

2 + 2
[1] 4
2/2
[1] 1
2 * 2
[1] 4
2^2
[1] 4
2 == 2
[1] TRUE
23%/%2
[1] 11
23%%2
[1] 1

Other Key Symbols

foo <- 3
foo
[1] 3
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
# it increments by one
a <- 100:120
a
 [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
[18] 117 118 119 120

Comments in R

# Something I want to keep from R
# Like my secret from the R engine
# Maybe intended for a human and not the computer
# Like: Look at this cool plot!

myplot(readSS,mathSS,data=df)

R Advanced Math

Using the Workspace

Using the Workspace (2)

x <- 5  #store a variable with <-
x  #print the variable
[1] 5
z <- 3
ls()  #list all variables
[1] "x" "z"
ls.str()  #list and describe variables
x :  num 5
z :  num 3
rm(x)  # delete a variable
ls()
[1] "z"

R as a Language

  1. Case sensitivity matters
a <- 3
A <- 4
print(c(a, A))
[1] 3 4
  2. What happens if I type print(a, A)?

c is our friend

A <- c(3, 4)
print(A)
[1] 3 4

Language

a <- runif(100)  # Generate 100 random numbers
b <- runif(100)  # 100 more
c <- NULL  # Setup for loop (declare variables)
for (i in 1:100) {
    # Loop just like in Java or C
    c[i] <- a[i] * b[i]
}
d <- a * b
identical(c, d)  # Test equality
[1] TRUE

More Language Features

Objects

summary(df[, 28:31])  #summary look at df object
   schoollow         readSS        mathSS           proflvl    
 Min.   :0.000   Min.   :252   Min.   :210   advanced   : 788  
 1st Qu.:0.000   1st Qu.:430   1st Qu.:418   basic      : 523  
 Median :0.000   Median :495   Median :480   below basic: 210  
 Mean   :0.242   Mean   :496   Mean   :483   proficient :1179  
 3rd Qu.:0.000   3rd Qu.:562   3rd Qu.:543                     
 Max.   :1.000   Max.   :833   Max.   :828                     
summary(df$readSS)  #summary of a single column
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    252     430     495     496     562     833 

The $ operator tells R to look for the object readSS inside the object df.
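
As a rough illustration of column access, the sketch below builds a small stand-in data frame (the name scores and its values are made up for this example, not taken from df) and shows that $ is one of several equivalent ways to extract a column.

```r
# Hypothetical data frame standing in for df (values are made up)
scores <- data.frame(readSS = c(357, 264, 370),
                     mathSS = c(387, 303, 365))

# Three ways to pull out one column as a vector
by_dollar  <- scores$readSS
by_name    <- scores[["readSS"]]
by_bracket <- scores[, "readSS"]

# All three return the same numeric vector
identical(by_dollar, by_name)
identical(by_dollar, by_bracket)
```

The $ form is the most convenient at the console; the [[ ]] form is handy when the column name is stored in a variable.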

Graphics too

library(ggplot2)  # Load graphics package
# opts() has been removed from ggplot2 (ggtitle() replaces it here), and
# theme_dpi() is not part of ggplot2, so the default theme is used
qplot(readSS, mathSS, data = df, geom = 'point', alpha = I(0.3)) +
  ggtitle('Test Score Relationship') +
  geom_smooth()

Handling Data in R

length(unique(df$school))
[1] 173
length(unique(df$stuid))
[1] 1200
uniqstu <- length(unique(df$stuid))
uniqstu
[1] 1200

Special Operators

big <- c(9, 12, 15, 25)
small <- c(9, 3, 4, 2)
# Give us a nice vector of logical values
big > small
[1] FALSE  TRUE  TRUE  TRUE
big = small
# Oops--don't do this, reassigns big to small
print(big)
[1] 9 3 4 2
print(small)
[1] 9 3 4 2

Special Operators II

big <- c(9, 12, 15, 25)
big[big == small]
[1] 9
# Returns values where the logical vector is true
big[big > small]
[1] 12 15 25
big[big < small]  # Returns an empty set
numeric(0)

Special operators (III)

big <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)
big[small %in% big]
[1]  9 12 15 25 NA
big[big %in% small]
[1]  9 12 15 25
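
The NA in the first result appears because big (length 4) is indexed with a logical vector built from small (length 7). A short sketch of why, and of the safer pattern of indexing a vector with a test derived from that same vector; the name matched is introduced here for illustration.

```r
big <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)

# x %in% y returns one logical value per element of x
small %in% big  # length 7, same as small
big %in% small  # length 4, same as big

# Index a vector with a test built from that same vector
matched <- small[small %in% big]
matched
```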

Special operators (IV)

foo <- c("a", NA, 4, 9, 8.7)
!is.na(foo)  # Returns TRUE for non-NA
[1]  TRUE FALSE  TRUE  TRUE  TRUE
class(foo)
[1] "character"
a <- foo[!is.na(foo)]
a
[1] "a"   "4"   "9"   "8.7"
class(a)
[1] "character"

Special operators (V)

zap <- c(1, 4, 8, 2, 9, 11)
zap[zap > 2 | zap < 8]
[1]  1  4  8  2  9 11
zap[zap > 2 & zap < 8]
[1] 4

Regular Expressions

R Data Modes

Data Modes in R (numeric)

is.numeric(A)
[1] TRUE
class(A)
[1] "numeric"
print(A)
[1] 3 4

Data Modes (Character)

b <- c("one", "two", "three")
print(b)
[1] "one"   "two"   "three"
is.numeric(b)
[1] FALSE

Data Modes (Logical)

c <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
is.numeric(c)
[1] FALSE
is.character(c)
[1] FALSE
is.logical(c)  # Results in a logical value
[1] TRUE

Easier way

class(A)
[1] "numeric"
class(b)
[1] "character"
class(c)
[1] "logical"

A Note on Vectors

Factor

myfac <- factor(c("basic", "proficient", "advanced", "minimal"))
class(myfac)
[1] "factor"
myfac  # What order are the factors in?
[1] basic      proficient advanced   minimal   
Levels: advanced basic minimal proficient

Ordering the Factor

myfac_o <- ordered(myfac, levels = c("minimal", "basic", "proficient", "advanced"))
myfac_o
[1] basic      proficient advanced   minimal   
Levels: minimal < basic < proficient < advanced
summary(myfac_o)
   minimal      basic proficient   advanced 
         1          1          1          1 

Reclassifying Factors

class(myfac_o)
[1] "ordered" "factor" 
unclass(myfac_o)
[1] 2 3 4 1
attr(,"levels")
[1] "minimal"    "basic"      "proficient" "advanced"  
defac <- unclass(myfac_o)
defac
[1] 2 3 4 1
attr(,"levels")
[1] "minimal"    "basic"      "proficient" "advanced"  

Defactor

defac <- function(x) {
    x <- as.character(x)
    x
}
defac(myfac_o)
[1] "basic"      "proficient" "advanced"   "minimal"   
defac <- defac(myfac_o)
defac
[1] "basic"      "proficient" "advanced"   "minimal"   

Convert to Numeric?

myfac_o
[1] basic      proficient advanced   minimal   
Levels: minimal < basic < proficient < advanced
as.numeric(myfac_o)
[1] 2 3 4 1
myfac
[1] basic      proficient advanced   minimal   
Levels: advanced basic minimal proficient
as.numeric(myfac)
[1] 2 4 1 3
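
Because as.numeric() on a factor returns the internal level codes rather than the labels, factors whose labels look like numbers are a common trap. A minimal sketch, using a hypothetical factor named grades:

```r
# Hypothetical factor whose labels look like numbers
grades <- factor(c("10", "3", "12", "3"))

# as.numeric() returns the level codes, not the labels;
# the levels are sorted as text, not as numbers
codes <- as.numeric(grades)
codes

# Go through character first to recover the labelled values
values <- as.numeric(as.character(grades))
values
```

Converting through as.character() is the usual way to get the labelled numbers back.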

Dates

mydate <- as.Date("7/20/2012", format = "%m/%d/%Y")
# Input is a character string and a parser
class(mydate)  # this is a Date object
[1] "Date"
weekdays(mydate)  # what day of the week is it?
[1] "Friday"
mydate + 30  # Operate on dates
[1] "2012-08-19"

More Dates

# We can parse other formats of dates
mydate2 <- as.Date("8-5-1988", format = "%d-%m-%Y")
mydate2
[1] "1988-05-08"

mydate - mydate2
Time difference of 8839 days
# Can add and subtract two date objects

A few notes on dates

as.numeric(mydate)  # days since 1-1-1970
[1] 15541
as.Date(56, origin = "2013-4-29")  # we can set our own origin
[1] "2013-06-24"

Other Classes

b <- rnorm(5000)
c <- runif(5000)
a <- b + c
mymod <- lm(a ~ b)
class(mymod)
[1] "lm"
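
One reason the class of an object matters is that generic functions such as summary() choose what to do based on it. The sketch below illustrates this under assumed names (x, y, fit) and an arbitrary seed; none of these come from the original slides.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
x <- rnorm(50)
y <- x + runif(50)
fit <- lm(y ~ x)

# The same generic returns different kinds of objects per class
class(summary(x))    # summary of a numeric vector
class(summary(fit))  # summary of a fitted linear model

# inherits() checks class membership without comparing strings by hand
inherits(fit, "lm")
```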

Why care so much about classes?

Data Structures in R

Vectors

print(1)
[1] 1
# The [1] in brackets is the index of the first element printed on that
# line; the value itself is a vector of length 1
print("This tutorial is awesome")
[1] "This tutorial is awesome"
# This is a vector of length 1 consisting of a single 'string of
# characters'

Vectors 2

print(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
# This vector has 26 character elements
print(LETTERS[6])
[1] "F"
# The sixth element of this vector has length 1
length(LETTERS[6])
[1] 1
# length() itself returns a number, which is also a vector of length 1

Matrices

mymat <- matrix(1:36, nrow = 6, ncol = 6)
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]
class(mymat)
[1] "matrix"

Matrices II

rownames(mymat)
[1] "A" "B" "C" "D" "E" "F"
colnames(mymat)
[1] "G" "H" "I" "J" "K" "L"
mymat
  G  H  I  J  K  L
A 1  7 13 19 25 31
B 2  8 14 20 26 32
C 3  9 15 21 27 33
D 4 10 16 22 28 34
E 5 11 17 23 29 35
F 6 12 18 24 30 36

More Matrices

dim(mymat)  # We have 6 rows and 6 columns
[1] 6 6
myvec <- c(5, 3, 5, 6, 1, 2)
length(myvec)  # What happens when you do dim(myvec)?
[1] 6
newmat <- cbind(mymat, myvec)
newmat
  G  H  I  J  K  L myvec
A 1  7 13 19 25 31     5
B 2  8 14 20 26 32     3
C 3  9 15 21 27 33     5
D 4 10 16 22 28 34     6
E 5 11 17 23 29 35     1
F 6 12 18 24 30 36     2
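
On the question raised in the comment above: a plain vector has a length but no dim attribute. A small sketch, where as_mat is a name introduced only for this example:

```r
myvec <- c(5, 3, 5, 6, 1, 2)

# A plain vector has length but no dimensions
dim(myvec)     # NULL
length(myvec)  # 6

# Assigning dimensions to a copy turns the same data into a
# 2 x 3 matrix, filled column by column
as_mat <- myvec
dim(as_mat) <- c(2, 3)
as_mat
```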

Matrix Functions

foo.mat <- matrix(c(rnorm(100), runif(100), runif(100), rpois(100, 2)), ncol = 4)
head(foo.mat)
        [,1]   [,2]   [,3] [,4]
[1,] -0.5002 0.9585 0.6236    0
[2,] -2.4363 0.7872 0.8873    2
[3,] -1.2383 0.5736 0.9900    1
[4,] -0.9142 0.3383 0.6566    2
[5,] -0.7001 0.1646 0.7553    4
[6,]  0.1098 0.4470 0.9363    1
cor(foo.mat)
           [,1]       [,2]    [,3]     [,4]
[1,]  1.0000000  0.0009384 -0.0840 -0.09061
[2,]  0.0009384  1.0000000 -0.1209 -0.17833
[3,] -0.0839969 -0.1209120  1.0000 -0.09450
[4,] -0.0906143 -0.1783325 -0.0945  1.00000

Converting Matrices

mycorr <- cor(foo.mat)
class(mycorr)
[1] "matrix"
mycorr2 <- as.data.frame(mycorr)
class(mycorr2)
[1] "data.frame"
mycorr2
          V1         V2      V3       V4
1  1.0000000  0.0009384 -0.0840 -0.09061
2  0.0009384  1.0000000 -0.1209 -0.17833
3 -0.0839969 -0.1209120  1.0000 -0.09450
4 -0.0906143 -0.1783325 -0.0945  1.00000

Arrays

myarray <- array(1:42, dim = c(7, 3, 2), dimnames = list(c("tiny", "small", 
    "medium", "medium-ish", "large", "big", "huge"), c("slow", "moderate", "fast"), 
    c("boring", "fun")))
class(myarray)
[1] "array"
dim(myarray)
[1] 7 3 2

Arrays II

dimnames(myarray)
[[1]]
[1] "tiny"       "small"      "medium"     "medium-ish" "large"     
[6] "big"        "huge"      

[[2]]
[1] "slow"     "moderate" "fast"    

[[3]]
[1] "boring" "fun"   
myarray
, , boring

           slow moderate fast
tiny          1        8   15
small         2        9   16
medium        3       10   17
medium-ish    4       11   18
large         5       12   19
big           6       13   20
huge          7       14   21

, , fun

           slow moderate fast
tiny         22       29   36
small        23       30   37
medium       24       31   38
medium-ish   25       32   39
large        26       33   40
big          27       34   41
huge         28       35   42

Lists

mylist <- list(vec = myvec, mat = mymat, arr = myarray, date = mydate)
class(mylist)
[1] "list"
length(mylist)
[1] 4
names(mylist)
[1] "vec"  "mat"  "arr"  "date"

Lists (II)

mylist$vec
[1] 5 3 5 6 1 2
mylist[[2]][1, 3]
[1] 13
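
The difference between single and double brackets on a list is easy to miss. A minimal sketch, using a small list named demo that is invented for this example:

```r
# Small illustrative list (names chosen for this sketch only)
demo <- list(vec = c(5, 3, 5), label = "scores")

# Single brackets return a shorter list
class(demo["vec"])

# Double brackets and $ return the element itself
class(demo[["vec"]])
identical(demo[["vec"]], demo$vec)
```

This is why mylist[[2]][1, 3] works above: the double brackets hand back the matrix itself, which can then be indexed by row and column.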

So what?

attributes(mylist)
$names
[1] "vec"  "mat"  "arr"  "date"
attributes(myarray)[1:2][2]
$dimnames
$dimnames[[1]]
[1] "tiny"       "small"      "medium"     "medium-ish" "large"     
[6] "big"        "huge"      

$dimnames[[2]]
[1] "slow"     "moderate" "fast"    

$dimnames[[3]]
[1] "boring" "fun"   

Dataframes

str(df[, 25:32])
'data.frame':   2700 obs. of  8 variables:
 $ district  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ schoolhigh: int  0 0 0 0 0 0 0 0 0 0 ...
 $ schoolavg : int  1 1 1 1 1 1 1 1 1 1 ...
 $ schoollow : int  0 0 0 0 0 0 0 0 0 0 ...
 $ readSS    : num  357 264 370 347 373 ...
 $ mathSS    : num  387 303 365 344 441 ...
 $ proflvl   : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
 $ race      : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...

Converting Between Types

Summing it Up

Exercises

  1. Create a matrix of 5x6 dimensions. Add a vector to it (as either a row or column). Identify element 2,3.

  2. Convert the matrix to a data frame.

  3. Look at the difference between data frames and matrices.
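
One possible way to approach these exercises is sketched below. It is only an illustration; the object names (m, m2, m_df) and the values are arbitrary choices, not part of the original exercises.

```r
# One possible approach; object names are illustrative
m <- matrix(1:30, nrow = 5, ncol = 6)  # 5 x 6 matrix
m2 <- cbind(m, extra = 101:105)        # add a vector as a 7th column
m2[2, 3]                               # element in row 2, column 3

m_df <- as.data.frame(m2)              # convert to a data frame

# A matrix holds one type only; a data frame keeps a type per column
m_df$label <- letters[1:5]             # adding text leaves numbers numeric
sapply(m_df, class)
class(cbind(m2, letters[1:5])[1, 1])   # the matrix becomes all character
```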

Other References

Books

Session Info

It is good practice to include the session info; for example, this document was produced with knitr version 0.8. Here is my session info:

print(sessionInfo(), locale = FALSE)
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] plyr_1.7.1       Cairo_1.5-2      mgcv_1.7-22      hexbin_1.26.0   
 [5] lattice_0.20-10  ggplot2_0.9.2.1  lmtest_0.9-30    zoo_1.7-9       
 [9] knitr_0.8        shiny_0.1.9      websockets_1.1.5 digest_0.5.2    
[13] caTools_1.13     bitops_1.0-4.2  

loaded via a namespace (and not attached):
 [1] colorspace_1.2-0   dichromat_1.2-4    evaluate_0.4.2    
 [4] formatR_0.6        gtable_0.1.1       labeling_0.1      
 [7] markdown_0.5.3     MASS_7.3-22        Matrix_1.0-10     
[10] memoise_0.1        munsell_0.4        nlme_3.1-105      
[13] proto_0.3-9.2      RColorBrewer_1.0-5 reshape2_1.2.1    
[16] RJSONIO_1.0-1      scales_0.2.2       stringr_0.6.1     
[19] tools_2.15.1       xtable_1.7-0      

Attribution and License

This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.

Public Domain Mark