Overview and History of R

Coding Standards

Workspace and Files

R Console and Evaluation

R Objects and Data Structures

Vectors and Lists

  • atomic vector = contains one data type, most basic object
    • vector <- c(value1, value2, ...) = creates a vector with specified values
    • vector1*vector2 = element by element multiplication (rather than matrix multiplication)
      • if the vectors are of different lengths, shorter vector will be recycled until the longer runs out
      • computations on or between vectors (+, -, ==, /, etc.) are done element by element by default
      • %*% = force matrix multiplication between vectors/matrices
    • vector("class", n) = creates empty vector of length n and specified class
      • vector("numeric", 3) = creates 0 0 0
  • c() = concatenate
    • T, F = shorthand for TRUE and FALSE
    • 1+0i = complex numbers
  • explicit coercion
    • as.numeric(x), as.logical(x), as.character(x), as.complex(x) = convert object from one class to another
    • nonsensical coercion results in NA, with a warning (ex. as.numeric(c("a", "b")))
    • as.list(data.frame) = converts a data.frame object into a list object
    • as.character(list) = converts list into a character vector
  • implicit coercion
    • a matrix/vector can contain only one data type, so combining values of different classes forces coercion of every element to a common class
      • the least-common-denominator approach is used: everything is converted to the most general class that all values can take (logical \(\rightarrow\) numeric \(\rightarrow\) character), and no error is generated
      • x <- c(NA, 2, "D") will create a vector of character class
  • list() = creates a list, a special vector whose elements can be of different classes
    • elements of list use [[]], elements of other vectors use []
  • logical vectors = contain values TRUE, FALSE, and NA, values are generated as result of logical conditions comparing two objects/values
  • paste(characterVector, collapse = " ") = join together elements of the vector and separating with the collapse parameter
  • paste(vec1, vec2, sep = " ") = join together different vectors and separating with the sep parameter
    • Note: vector recycling applies here too
    • LETTERS, letters = predefined vectors of the 26 upper- and lower-case letters
  • unique(values) = returns vector with all duplicates removed
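  • example sketch (not from the original notes; the values are made up for illustration) of recycling, coercion, and paste()

```r
# Element-wise arithmetic with recycling: the shorter vector is reused.
a <- c(1, 2, 3, 4)
b <- c(10, 20)
prod_elem <- a * b                     # 10 40 30 80

# %*% treats the vectors as matrices; for two equal-length vectors
# it yields the inner product as a 1 x 1 matrix.
inner <- a %*% c(1, 1, 1, 1)           # 1 x 1 matrix holding 10

# Implicit coercion: mixing types promotes every element to character.
mixed <- c(NA, 2, "D")
mixed_class <- class(mixed)            # "character"

# Explicit coercion: values that cannot be converted become NA (with a warning).
nums <- suppressWarnings(as.numeric(c("1", "b")))   # 1 NA

# collapse joins the elements of one vector; sep joins vectors pairwise.
joined <- paste(c("a", "b", "c"), collapse = " ")   # "a b c"
paired <- paste(LETTERS[1:3], 1:3, sep = "-")       # "A-1" "B-2" "C-3"
```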

Matrices and Data Frames

  • matrix can contain only 1 type of data
  • data.frame can contain multiple
  • matrix(values, nrow = n, ncol = m) = creates a n by m matrix
    • constructed COLUMN WISE \(\rightarrow\) the elements are placed into the matrix from top to bottom for each column, and by column from left to right
    • matrices can also be created by adding the dimension attribute to vector
      • dim(m) <- c(2, 5)
    • matrices can also be created by binding columns and rows
      • rbind(x, y), cbind(x, y) = combine rows/columns; can be used on vectors or matrices
    • * and / = element by element computation between two matrices
      • %*% = matrix multiplication
  • dim(obj) = dimensions of an object (returns NULL if a vector)
    • dim(obj) <- c(4, 5) = assign dim attribute to an object
      • if object is a vector, R converts the vector to a n by m matrix (i.e. 4 rows by 5 column from the example command)
        • Note: if n times m does not equal the length of the vector, an error is returned
      • example
# initiate a vector
x <- c(NA, 1, "cx", NA, 2, "dsa")
class(x)
## [1] "character"
x
## [1] NA    "1"   "cx"  NA    "2"   "dsa"
# convert to matrix
dim(x) <- c(3, 2)
class(x)
## [1] "matrix"    (R >= 4.0.0 prints "matrix" "array")
x
##      [,1] [,2] 
## [1,] NA   NA   
## [2,] "1"  "2"  
## [3,] "cx" "dsa"
  • data.frame(var = 1:4, var2 = c(….)) = creates a data frame
    • nrow(), ncol() = returns row and column numbers
    • data.frame(vector, matrix) = takes any number of arguments and returns a single object of class “data.frame” composed of original objects
    • as.data.frame(obj) = converts object to data frame
    • data frames store tabular data
    • special type of list where every element (column) has the same length (columns can be of different types)
    • data frames are usually created through read.table() and read.csv()
    • data.matrix() = converts a data frame to a numeric matrix
  • colMeans(matrix) or rowMeans(matrix) = returns means of the columns/rows of a matrix/dataframe in a vector
  • as.numeric(rownames(df)) = returns row indices for rows of a data frame with unnamed rows
  • attributes
    • objects can have attributes: names, dimnames, row.names, dim (matrices, arrays), class, length, or any user-defined ones
    • attributes(obj), class(obj) = return attributes/class for an R object
    • attr(object, "attribute") <- "value" = creates/assigns a value to a new/existing attribute for the object
    • names attribute
      • all objects can have names
      • names(x) = returns names (NULL if no name exists)
        • names(x) <- c("a", …) = can be used to assign names to vectors
      • list(a = 1, b = 2, …) = a, b are names
      • dimnames(matrix) <- list(c("a", "b"), c("c" , "d")) = assign names to matrices
        • use list of two vectors: row, column in that order
    • colnames(data.frame) = return column names (can be used to set column names as well, similar to dim())
    • row.names = names of rows in the data frame (attribute)
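  • example sketch (object names and values are hypothetical) tying together matrix construction, dimnames, data frames, and attributes

```r
# A matrix is filled column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)
dimnames(m) <- list(c("r1", "r2"), c("a", "b", "c"))   # rows first, then columns
cell <- m["r2", "b"]                    # 4
col_means <- colMeans(m)                # a = 1.5, b = 3.5, c = 5.5

# A data frame can hold columns of different types.
df <- data.frame(id = 1:3, label = c("x", "y", "z"), stringsAsFactors = FALSE)
df_dim <- dim(df)                       # 3 2
df_cols <- colnames(df)                 # "id" "label"

# data.matrix() goes from a data frame to a numeric matrix.
dm <- data.matrix(data.frame(u = 1:2, v = c(0.5, 1.5)))

# A user-defined attribute; "source" is an arbitrary name chosen here.
attr(df, "source") <- "hypothetical example"
src <- attributes(df)$source
```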

Arrays

  • multi-dimensional collection of data with \(k\) dimensions
    • matrix = 2 dimensional array
  • array(data, dim, dimnames)
    • data = data to be stored in array
    • dim = dimensions of the array
      • dim = c(2, 2, 5) = 3 dimensional array \(\rightarrow\) creates five 2x2 matrices
    • dimnames = add names to the dimensions
      • input must be a list
      • every element of the list must correspond in length to the dimensions of the array
      • dimnames(x) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i")) = set the names for row, column, and third dimension respectively (2 x 2 x 5 in this case)
  • dim() function can be used to create arrays from vectors or matrices
    • x <- rnorm(20); dim(x) <- c(2, 2, 5) = converts a 20 element vector to a 2x2x5 array
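  • example sketch (uses 1:20 instead of random data so the indexed values are predictable)

```r
# Build a 2 x 2 x 5 array directly, then name every dimension.
a <- array(1:20, dim = c(2, 2, 5))
dimnames(a) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i"))

first <- a["a", "c", "e"]     # 1, the first element
last  <- a["b", "d", "i"]     # 20, the last element
slice <- a[, , "i"]           # the fifth 2 x 2 matrix (17, 18, 19, 20)

# The same shape can be obtained by assigning dim to a plain vector.
x <- 1:20
dim(x) <- c(2, 2, 5)
same_values <- all(x == a)    # TRUE: only the dimnames differ
```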

Factors

  • factors are used to represent categorical data (integer vector where each value has a label)
  • 2 types: unordered vs ordered
  • treated specially by lm(), glm()
  • factors are easier to understand because they are self-describing (vs. 1 and 2)
  • factor(c("a", "b"), levels = c("b", "a")) = creates a factor with "b" as the first (baseline) level
    • the levels argument can be used to specify the baseline level vs. the other levels
      • Note: without explicit specification, R uses alphabetical order
    • table(factorVar) = how many of each are in the factor
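  • example sketch (the "yes"/"no" data are made up) showing default vs. explicit level order

```r
answers <- c("yes", "no", "yes", "yes")

# Default: levels are sorted alphabetically, so "no" becomes the baseline.
f_default <- factor(answers)
# Explicit: the first entry of levels becomes the baseline.
f_custom <- factor(answers, levels = c("yes", "no"))

lv_default <- levels(f_default)     # "no" "yes"
lv_custom  <- levels(f_custom)      # "yes" "no"
codes      <- as.integer(f_custom)  # 1 2 1 1: the underlying integer codes
counts     <- table(f_custom)       # yes: 3, no: 1
```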

Missing Values

Sequence of Numbers

Subsetting

Vectors

  • x[1:10] = first 10 elements of vector x
  • x[is.na(x)] = returns all NA elements
  • x[!is.na(x)] = returns all non NA elements
    • x > 0 = returns a logical vector comparing every element to 0 (TRUE/FALSE for all values except NA, and NA for NA elements, since NA is a placeholder)
  • x[x>"a"] = selects all elements bigger than a (lexicographical order in place)
  • x[logicalIndex] = select all elements where logical index = TRUE
  • x[-c(2, 10)] = returns everything but the second and tenth element
  • vect <- c(a = 1, b = 2, c = 3) = names values of a vector with corresponding names
  • names(vect) = returns element names for object
    • names(vect) <- c("a", "b", "c") = assign/change names of vector
  • identical(obj1, obj2) = returns TRUE if two objects are exactly equal
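  • all.equal(obj1, obj2) = returns TRUE if two objects are nearly equal (otherwise it returns a description of the differences, so wrap it in isTRUE() inside conditions)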
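  • example sketch (values chosen for illustration) of logical, negative, and name-based subsetting

```r
x <- c(3, NA, -1, 7, NA, 0)

non_missing <- x[!is.na(x)]            # 3 -1 7 0
positive    <- x > 0                   # TRUE NA FALSE TRUE NA FALSE
pos_values  <- x[!is.na(x) & x > 0]    # 3 7
dropped     <- x[-c(2, 5)]             # everything but elements 2 and 5

vect <- c(a = 1, b = 2, c = 3)
b_value <- vect["b"]                   # named element b

# identical() is exact; all.equal() tolerates floating-point error.
exact <- identical(0.1 + 0.2, 0.3)           # FALSE
near  <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```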

Lists

  • x <- list(foo = 1:4, bar = 0.6)
  • x[1] or x["foo"] = returns a list containing the element foo
  • x[[2]] or x[["bar"]] or x$bar = returns the content of the second element from the list (in this case vector without name attribute)
    • Note: $ can’t extract multiple elements
  • x[c(1, 3)] = extract multiple elements of list
  • x[[name]] = extract using a variable that holds the element name, whereas $ must be given the literal name of the element
  • x[[c(1, 3)]] or x[[1]][[3]] = extract nested elements of a list: the third element of the first object in the list
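  • example sketch (the third element baz is added here so that nested extraction has something to reach)

```r
x <- list(foo = 1:4, bar = 0.6, baz = list(10, 20, 30))

sub_list <- x[1]          # still a list, holding foo
content  <- x[[1]]        # the integer vector 1 2 3 4
bar_val  <- x$bar         # 0.6

# [[ ]] accepts a computed name; $ only takes the literal name.
name <- "foo"
by_var  <- x[[name]]      # same as x[["foo"]]
literal <- x$name         # NULL: there is no element called "name"

several <- x[c(1, 3)]     # list with foo and baz
nested  <- x[[c(3, 2)]]   # 20, same as x[[3]][[2]]
```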

Matrices

  • x[1, 2] = extract the (row, column) element
    • x[,2] or x[1,] = extract the entire column/row
  • x[ , 11:17] = subset x (matrix or data.frame) with all rows, but only columns 11 to 17
  • when an element from the matrix is retrieved, a vector is returned
    • behavior can be turned off (force return a matrix) by adding drop = FALSE
      • x[1, 2, drop = F]
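  • example sketch (small made-up matrix) showing what drop = FALSE changes

```r
x <- matrix(1:6, 2, 3)

elem <- x[1, 2]                        # 3, returned as a plain vector
row1 <- x[1, ]                         # 1 3 5, dimensions are dropped
row1_dim <- dim(row1)                  # NULL

row1_mat <- x[1, , drop = FALSE]       # stays a 1 x 3 matrix
elem_mat <- x[1, 2, drop = FALSE]      # stays a 1 x 1 matrix
```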

Partial Matching

  • works with [[]] and $
  • $ automatically partial matches the name (x$a)
  • [[]] can partial match by adding exact = FALSE
    • x[["a", exact = FALSE]]
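  • example sketch (the element name aardvark is arbitrary) comparing $ and [[ ]] on a partial name

```r
x <- list(aardvark = 1:5)

via_dollar  <- x$a                       # partial match succeeds: 1 2 3 4 5
via_exact   <- x[["a"]]                  # NULL, [[ ]] matches exactly by default
via_partial <- x[["a", exact = FALSE]]   # partial match allowed: 1 2 3 4 5
```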

Logic

Understanding Data

Split-Apply-Combine Functions

split()

  • takes a vector/object and splits it into groups by a factor or list of factors
  • split(x, f, drop = FALSE)
    • x = vector/list/data frame
    • f = factor/list of factors
    • drop = whether empty factor levels should be dropped
  • interaction(gl(2, 5), gl(5, 2)) = combines two factors into one with levels 1.1, 1.2, … 2.5
    • gl(n, m) = generate factor levels
      • n = number of levels
      • m = number of repetitions
    • split function can do this by passing in list(f1, f2) in argument
      • split(data, list(gl(2, 5), gl(5, 2))) = splits the data into 1.1, 1.2, … 2.5 levels
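  • example sketch (uses 1:10 as stand-in data) of splitting by two factors, with and without empty groups

```r
f1 <- gl(2, 5)    # 1 1 1 1 1 2 2 2 2 2
f2 <- gl(5, 2)    # 1 1 2 2 3 3 4 4 5 5
combined <- interaction(f1, f2)        # one factor with 2 * 5 = 10 levels

x <- 1:10
# Passing a list of factors splits on their interaction.
groups <- split(x, list(f1, f2))                     # 10 groups, some empty
groups_drop <- split(x, list(f1, f2), drop = TRUE)   # only combinations that occur

n_all  <- length(groups)        # 10
n_kept <- length(groups_drop)   # 6
```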

apply()

  • evaluate a function (often anonymous) over the margins of an array
  • often used to apply a function to the row/columns of a matrix
  • can be used to average array of matrices (general arrays)
  • apply(x, MARGIN = 2, FUN, ...)
    • x = array
    • MARGIN = 2 (column), 1 (row)
    • FUN = function
    • ... = other arguments that need to be passed to FUN
  • examples
    • apply(x, 1, sum) or apply(x, 1, mean) = find row sums/means
    • apply(x, 2, sum) or apply(x, 2, mean) = find column sums/means
    • apply(x, 1, quantile, probs = c(0.25, 0.75)) = find the 25th and 75th percentiles of each row
    • a <- array(rnorm(2*2*10), c(2, 2, 10)) = create ten 2x2 matrices
    • apply(a, c(1, 2), mean) = returns a 2x2 matrix holding the element-wise means of the 10 matrices
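  • example sketch (deterministic inputs instead of rnorm() so the results can be checked by hand)

```r
x <- matrix(1:6, nrow = 2)      # rows are (1, 3, 5) and (2, 4, 6)

row_sums  <- apply(x, 1, sum)   # 9 12
col_means <- apply(x, 2, mean)  # 1.5 3.5 5.5

# Extra arguments after FUN are passed on to it (here probs goes to quantile).
q <- apply(x, 1, quantile, probs = c(0.25, 0.75))   # one column per row of x

# Averaging a stack of ten 2 x 2 matrices over the third dimension.
a <- array(1:40, c(2, 2, 10))
avg <- apply(a, c(1, 2), mean)  # 2 x 2 matrix: 19 20 21 22
```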

lapply()

  • loops over a list, evaluates a function on each element, and always returns a list
    • Note: since the input must be a list, conversion may be needed
  • lapply(x, FUN, ...) = takes list/vector as input, applies a function to each element of the list, returns a list of the same length
    • x = list (if not a list, it is coerced through “as.list”; if that is not possible \(\rightarrow\) error)
      • a data.frame is treated as a list of its columns and can be used here
    • FUN = function (without parentheses)
      • anonymous functions are acceptable here as well - (i.e. function(x) x[,1])
    • ... = other/additional arguments to be passed to FUN (i.e. min, max for runif())
  • example
    • lapply(data.frame, class) = the data.frame is a list of vectors, the class value for each vector is returned in a list (name of function, class, is without parentheses)
    • lapply(values, function(elem) elem[2]) = example of an anonymous function
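  • example sketch (values and df are hypothetical objects) of lapply() with an anonymous function, a data frame, and extra arguments

```r
values <- list(a = 1:3, b = c(10, 20, 30))

# Anonymous function: take the second element of each list entry.
second <- lapply(values, function(elem) elem[2])   # list(a = 2, b = 20)

# A data frame is a list of columns, so lapply() visits each column.
df <- data.frame(n = 1:3, s = c("x", "y", "z"), stringsAsFactors = FALSE)
col_classes <- lapply(df, class)                   # list(n = "integer", s = "character")

# Additional arguments (min, max) are forwarded to FUN via '...'.
draws <- lapply(1:3, runif, min = 0, max = 10)     # vectors of length 1, 2, 3
```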

sapply()

  • performs same function as lapply() except it simplifies the result
    • if result is of length 1 in every element, sapply returns vector
    • if result is vectors of the same length (>1) for each element, sapply returns matrix
    • if not possible to simplify, sapply returns a list (same as lapply())
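  • example sketch (made-up list) of the three simplification outcomes listed above

```r
x <- list(a = 1:4, b = 5:10)

# Each result has length 1, so a named vector comes back.
means <- sapply(x, mean)                        # a = 2.5, b = 7.5

# Each result has length 2, so a matrix with one column per element comes back.
ranges <- sapply(x, range)                      # 2 x 2 matrix

# Results have different lengths, so nothing can be simplified: a list.
big <- sapply(x, function(v) v[v > 3])          # list(a = 4, b = 5:10)
```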

vapply()

  • safer version of sapply in that it allows you to specify the format of the result
    • vapply(flags, class, character(1)) = returns the class of values in the flags variable in the form of character of length 1 (1 value)
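  • example sketch: flags here is a small hypothetical stand-in data frame, not the dataset the notes refer to

```r
flags <- data.frame(name = c("A", "B"), area = c(10.5, 3), stringsAsFactors = FALSE)

# Each column's class must be a single character string, or vapply() stops.
col_classes <- vapply(flags, class, character(1))   # name "character", area "numeric"

# range() returns two values per column, which violates character(1),
# so vapply() raises an error instead of silently changing the result shape.
outcome <- tryCatch(
    vapply(flags, range, character(1)),
    error = function(e) "error"
)
```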

tapply()

  • split data into groups, and apply the function to data within each subgroup
  • tapply(data, INDEX, FUN, ..., simplify = TRUE) = apply a function over subsets of a vector
    • data = vector
    • INDEX = factor/list of factors
    • FUN = function
    • ... = arguments to be passed to function
    • simplify = whether to simplify the result
  • example
    • x <- c(rnorm(10), runif(10), rnorm(10, 1))
    • f <- gl(3, 10); tapply(x, f, mean) = returns the mean of each group (f level) of x data

mapply()

  • multivariate apply, applies a function in parallel over a set of arguments
  • mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)
    • FUN = function
    • ... = arguments to apply over
    • MoreArgs = list of other arguments to FUN
    • SIMPLIFY = whether the result should be simplified
  • example
mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4

aggregate()

  • aggregate computes summary statistics of data subsets (similar to multiple tapply at the same time)
  • aggregate(list(name = dataToCompute), list(name1 = factorVar1, name2 = factorVar2), function, na.rm = TRUE)
    • dataToCompute = this is what the function will be applied on
    • factorVar1, factorVar2 = factor variables to split the data by
    • Note: order matters here in terms of how to break down the data
    • function = what is applied to the subsets of data, can be sum/mean/median/etc
    • na.rm = TRUE \(\rightarrow\) removes NA values
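  • example sketch (df and its column names are hypothetical) of aggregate() with two grouping variables

```r
df <- data.frame(
    value = c(1, 2, 3, 4, NA, 6),
    grp1  = c("a", "a", "b", "b", "a", "b"),
    grp2  = c("x", "y", "x", "y", "x", "y"),
    stringsAsFactors = FALSE
)

# Sum value within every grp1/grp2 combination; na.rm is forwarded to sum().
res <- aggregate(list(total = df$value),
                 list(g1 = df$grp1, g2 = df$grp2),
                 sum, na.rm = TRUE)

# One row per combination, e.g. the b/y group holds 4 + 6.
total_by <- res$total[res$g1 == "b" & res$g2 == "y"]   # 10
total_ax <- res$total[res$g1 == "a" & res$g2 == "x"]   # 1 (the NA is removed)
```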

Simulation

Simulation Examples

  • rbinom(1, size = 100, prob = 0.7) = returns a binomial random variable that represents the number of successes in a given number of independent trials
    • 1 = number of observations
    • size = 100 = number of independent trials that make up each observation
    • prob = 0.7 = probability of success
  • rnorm(n, mean = m, sd = s) = generate n random samples from the normal distribution with mean m and standard deviation s (defaults: mean = 0, sd = 1, i.e. the standard normal)
    • rnorm(1000) = 1000 draws from the standard normal distribution
    • n = number of observation generated
    • mean = m = specified mean of distribution
    • sd = s = specified standard deviation of distribution
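  • dnorm(x, mean = 0, sd = 1, log = FALSE) = density of the normal distribution at x
    • log = evaluate on log scale
  • pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) = cumulative distribution function
    • lower.tail = TRUE returns the left-tail probability, FALSE the right-tail probability
  • qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) = quantile function (inverse of pnorm)
    • lower.tail = TRUE treats p as a left-tail probability, FALSE as a right-tail probability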
  • rpois(n, lambda) = generate random samples from the Poisson distribution
    • n = number of observations generated
    • lambda = \(\lambda\) parameter (rate) of the Poisson distribution
  • ppois(q, lambda) = cumulative distribution of the Poisson distribution
    • ppois(2, 2) = \(Pr(X \le 2)\) for rate 2
  • replicate(n, expr) = evaluate an expression n times (e.g. a call to rpois())
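  • example sketch (seed and parameter values are arbitrary) relating the d/p/q/r function families

```r
set.seed(42)                                  # arbitrary seed, for reproducibility

draws  <- rnorm(1000, mean = 5, sd = 2)       # 1000 normal draws
counts <- rbinom(5, size = 100, prob = 0.7)   # 5 observations, 100 trials each

dens  <- dnorm(0)                             # density of N(0, 1) at 0
left  <- pnorm(1.96)                          # P(X <= 1.96), about 0.975
right <- pnorm(1.96, lower.tail = FALSE)      # P(X > 1.96), about 0.025
quant <- qnorm(0.975)                         # about 1.96, the inverse of pnorm

# ppois(2, 2) is the sum of the Poisson probabilities for 0, 1, and 2.
cum <- ppois(2, 2)
cum_by_hand <- sum(dpois(0:2, 2))

# replicate() repeats an expression: three means of 100 Poisson draws each.
pois_means <- replicate(3, mean(rpois(100, lambda = 2)))
```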

Generate Numbers for a Linear Model

  • Linear model \[ y = \beta_0 + \beta_1 x + \epsilon \\ ~\mbox{where}~ \epsilon \sim N(0, 2^2), x \sim N(0, 1^2), \beta_0 = 0.5, \beta_1 = 2 \]
set.seed(20)
x <- rnorm(100)          # normal predictor, as in the model above
x <- rbinom(100, 1, 0.5) # alternative: binary predictor (overwrites the x above)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2* x + e
  • Poisson model \[ Y \sim \mbox{Poisson}(\mu) \\ \log(\mu) = \beta_0 + \beta_1 x \\ ~\mbox{where}~ \beta_0 = 0.5, \beta_1 = 0.3 \]
x <- rnorm(100)
log.mu <- 0.5 + 0.3* x
y <- rpois(100, exp(log.mu))
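  • example sketch (an added sanity check, not part of the original notes): fitting lm() and glm() to data simulated as above should give estimates near the coefficients used in the code (0.5 and 2 for the linear model with a normal predictor, 0.5 and 0.3 for the Poisson model)

```r
set.seed(20)

# Linear model with a normal predictor.
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
fit_lm <- lm(y ~ x)
coef_lm <- coef(fit_lm)         # intercept and slope estimates

# Poisson model, using the coefficients from the simulation code.
x2 <- rnorm(100)
y2 <- rpois(100, exp(0.5 + 0.3 * x2))
fit_pois <- glm(y2 ~ x2, family = poisson)
coef_pois <- coef(fit_pois)     # estimates on the log scale
```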

Dates and Times

Base Graphics

Reading Tabular Data

Larger Tables

  • Note: the help page for read.table is important reading
  • need to know how much RAM is required \(\rightarrow\) calculating memory requirements
    • numRow x numCol x 8 bytes/numeric value = size required in bytes
    • double the above result and convert into GB = amount of memory recommended
  • set comment.char = "" to save time if there are no comments in the file
  • specifying colClasses can make reading data much faster
  • nrows = n = number of rows to read in (can help with memory usage)
    • initial <- read.table("file", nrows = 100) = read first 100 lines
    • classes <- sapply(initial, class) = determine what classes the columns are
    • tabAll <- read.table("file", colClasses = classes) = load in the entire file with determined classes
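  • example sketch (writes a throwaway file to a temporary path; the table dimensions in the memory estimate are hypothetical)

```r
# Create a small file to read back; stands in for a large table on disk.
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:500, score = runif(500), grp = rep(c("a", "b"), 250)),
            path, row.names = FALSE)

# Step 1: read a few rows and let R work out the column classes.
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Step 2: read the whole file with the classes fixed up front.
tabAll <- read.table(path, header = TRUE, colClasses = classes, comment.char = "")

# Rough memory estimate for a hypothetical all-numeric table
# of 1,500,000 rows and 120 columns.
bytes <- 1500000 * 120 * 8            # 1.44e9 bytes
gb_needed <- 2 * bytes / 2^30         # doubled, in GB: roughly 2.7

unlink(path)                          # remove the temporary file
```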

Textual Data Formats

  • dump and dput preserve metadata
  • text formats are editable, not space efficient, and work better with version control systems (which track changes meaningfully only in text files)
  • dput(obj, file = "file.R") = creates R code to store all data and metadata in “file.R” (ex. data, class, names, row.names)
  • dget("file.R") = loads the file/R code and reconstructs the R object
  • dput can only be used on one object, whereas dump can be used on multiple objects
  • dump(c("obj1", "obj2"), file= "file2.R") = stores two objects
  • source("file2.R") = loads the objects
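  • example sketch (temporary files and object names are hypothetical) of a dput()/dget() and dump()/source() round trip

```r
# One object: dput() writes it as R code, dget() rebuilds it.
y <- data.frame(a = 1, b = "a", stringsAsFactors = FALSE)
f1 <- tempfile(fileext = ".R")
dput(y, file = f1)
y2 <- dget(f1)
round_trip_ok <- isTRUE(all.equal(y, y2))   # data and metadata survive

# Several objects: dump() takes their names, source() restores them.
x <- "foo"
z <- 1:3
f2 <- tempfile(fileext = ".R")
dump(c("x", "z"), file = f2)
rm(x, z)
source(f2, local = TRUE)                    # x and z exist again

unlink(c(f1, f2))                           # clean up the temporary files
```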

Interfaces to the Outside World

  • url() = function can read from webpages
  • file() = read uncompressed files
  • gzfile(), bzfile() = read compressed files (gzip, bzip2)
  • file(description = "", open = "") = file syntax, creates connection
    • description = path or name of the file to connect to
    • open = "r" - read only, "w" - writing, "a" - appending, "rb"/"wb"/"ab" - reading/writing/appending binary
    • close() = closes connection
    • readLines() = can be used to read lines after connection has been established
  • download.file(fileURL, destfile = "fileName", method = "curl")
    • fileURL = url of the file that needs to be downloaded
    • destfile = "fileName" = specifies where the file is to be saved
      • dir/fileName = directories can be referenced here
    • method = "curl" = necessary for downloading files from “https://” links on Macs
      • method = "auto" = should work on all other machines
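  • example sketch (uses temporary local files only; url() and download.file() are left out because they need network access)

```r
# Write a small text file through a connection opened for writing.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "w")
writeLines(c("first line", "second line", "third line"), con)
close(con)

# Reopen it read-only and read just the first two lines.
con <- file(path, open = "r")
head2 <- readLines(con, n = 2)
close(con)

# gzfile() works the same way for gzip-compressed files.
gz <- tempfile(fileext = ".gz")
con <- gzfile(gz, open = "w")
writeLines("compressed text", con)
close(con)
con <- gzfile(gz, open = "r")
unzipped <- readLines(con)
close(con)

unlink(c(path, gz))   # remove the temporary files
```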

Control Structures

if - else

# basic structure
if(<condition>) {
    ## do something
} else {
    ## do something else
}

# if tree
if(<condition1>) {
    ## do something
} else if(<condition2>) {
    ## do something different
} else {
    ## do something different
}
  • y <- if(x>3){10} else {0} = if-else returns a value, so the result can be assigned directly

for

# basic structure
for(i in 1:10) {
    # print(i)
}

# nested for loops
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
    for(j in seq_len(ncol(x))) {
        # print(x[i, j])
    }
}
  • for(letter in x) = loop through each element of the character vector x
  • seq_along(vector) = create a number sequence from 1 to length of the vector
  • seq_len(length) = create a number sequence that starts at 1 and ends at length specified
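  • example sketch (an added illustration) of why seq_along()/seq_len() are safer than 1:length(x) when the vector may be empty

```r
x <- c("a", "b", "c")
idx <- seq_along(x)        # 1 2 3
first_two <- seq_len(2)    # 1 2

empty <- character(0)
safe_idx   <- seq_along(empty)     # integer(0): a loop over it never runs
unsafe_idx <- 1:length(empty)      # 1 0: a loop over it runs twice

n_safe <- 0
for (i in seq_along(empty)) n_safe <- n_safe + 1       # stays 0
n_unsafe <- 0
for (i in 1:length(empty)) n_unsafe <- n_unsafe + 1    # becomes 2
```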

while

count <- 0
while(count < 10) {
    # print(count)
    count <- count + 1
}
  • conditions can be combined with logical operators

repeat and break

  • repeat initiates an infinite loop
  • not commonly used in statistical applications but they do have their uses
  • The only way to exit a repeat loop is to call break
x0 <- 1
tol <- 1e-8
repeat {
    x1 <- computeEstimate()
    if(abs(x1 - x0) < tol) {
        break
    } else {
        x0 <- x1 # requires algorithm to converge
    }
}
  • Note: The above loop is a bit dangerous because there’s no guarantee it will stop
    • Better to set a hard limit on the number of iterations (e.g. using a for loop) and then report whether convergence was achieved or not.
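  • example sketch of the hard-limit variant suggested above; update_estimate() is a hypothetical stand-in for computeEstimate() (here one Babylonian step towards sqrt(2)), and max_iter is an assumed cap

```r
# Hypothetical update rule: one Babylonian step for the square root of 2.
update_estimate <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
max_iter <- 50          # hard limit on the number of iterations
converged <- FALSE

for (i in seq_len(max_iter)) {
    x1 <- update_estimate(x0)
    if (abs(x1 - x0) < tol) {
        converged <- TRUE   # record that the tolerance was met
        break
    }
    x0 <- x1
}

# Report the outcome instead of looping forever.
if (!converged) warning("no convergence within max_iter iterations")
```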

next and return

  • next = (no parentheses) skips an element, to continue to the next iteration
  • return = signals that a function should exit and return a given value
for(i in 1:100) {
    if(i <= 20) {
        ## Skip the first 20 iterations
        next
    }
    ## Do something here
}

Functions

Scoping

Scoping Example

make.power <- function(n) {
    pow <- function(x) {
         x^n
    }
    pow
}
cube <- make.power(3)   # returns a function with n fixed at 3 (x^3)
square <- make.power(2) # returns a function with n fixed at 2 (x^2)
cube(3)                 # evaluates 3^3
## [1] 27
square(3)               # evaluates 3^2
## [1] 9
# lists the objects in the function's defining environment
ls(environment(cube))
## [1] "n"   "pow"
# retrieves the value of n in the cube function
get("n", environment(cube))
## [1] 3

Lexical vs Dynamic Scoping Example

y <- 10
f <- function(x) {
    y <- 2
    y^2 + g(x)
}
g <- function(x) {
    x*y
}
  • Lexical Scoping
    1. f(3) \(\rightarrow\) calls g(x)
    2. y isn’t defined locally in g(x) \(\rightarrow\) searches in parent environment (working environment/global workspace)
    3. finds y \(\rightarrow\) y = 10
  • Dynamic Scoping
    1. f(3) \(\rightarrow\) calls g(x)
    2. y isn’t defined locally in g(x) \(\rightarrow\) searches in calling environment (f function)
    3. finds y \(\rightarrow\) y = 2
    • parent frame = refers to calling environment in R, environment from which the function was called
  • Note: when the defining environment and calling environment are the same, lexical and dynamic scoping produce the same result
  • Consequences of Lexical Scoping
    • all objects must be carried in memory
    • all functions carry pointer to their defining environment (memory address)
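  • example sketch evaluating the f/g pair above; g_dyn and f_dyn are hypothetical variants added here that imitate dynamic scoping by looking y up in the calling environment

```r
y <- 10
f <- function(x) {
    y <- 2
    y^2 + g(x)
}
g <- function(x) {
    x * y                  # y is found where g was defined, so y = 10
}
lexical <- f(3)            # 2^2 + 3 * 10 = 34

# Imitating dynamic scoping: fetch y from the caller via parent.frame().
g_dyn <- function(x) {
    x * get("y", envir = parent.frame())
}
f_dyn <- function(x) {
    y <- 2
    y^2 + g_dyn(x)         # g_dyn sees the caller's y = 2
}
dynamic <- f_dyn(3)        # 2^2 + 3 * 2 = 10
```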

Optimization

  • optimization routines in R (optim, nlm, optimize) require you to pass a function whose argument is a vector of parameters
    • Note: these functions minimize, so pass the negative of the objective (e.g. the negative log-likelihood) to maximize a normal likelihood
  • constructor functions = functions to be fed into the optimization routines
  • example
# write constructor function
make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) {
    params <- fixed
    function(p) {
         params[!fixed] <- p
         mu <- params[1]
         sigma <- params[2]
         a <- -0.5*length(data)*log(2*pi*sigma^2)
         b <- -0.5*sum((data-mu)^2) / (sigma^2)
         -(a + b)
    }
}
# initialize seed and print function
set.seed(1); normals <- rnorm(100, 1, 2)
nLL <- make.NegLogLik(normals); nLL
## function(p) {
##          params[!fixed] <- p
##          mu <- params[1]
##          sigma <- params[2]
##          a <- -0.5*length(data)*log(2*pi*sigma^2)
##          b <- -0.5*sum((data-mu)^2) / (sigma^2)
##          -(a + b)
##     }
## <environment: 0x7ff878f72bb8>
# Estimating Parameters
optim(c(mu = 0, sigma = 1), nLL)$par
##       mu    sigma 
## 1.218239 1.787343
# Fixing sigma = 2
nLL <- make.NegLogLik(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
## [1] 1.217775
# Fixing mu = 1
nLL <- make.NegLogLik(normals, c(1, FALSE))
optimize(nLL, c(1e-6, 10))$minimum
## [1] 1.800596

Debugging

R Profiler

# system.time example
system.time({
    n <- 1000
    r <- numeric(n)
    for (i in 1:n) {
        x <- rnorm(n)
        r[i] <- mean(x)
    }
})
##    user  system elapsed 
##   0.155   0.004   0.191

Miscellaneous

  • unlist(rss) = flattens a list object into a vector
  • ls("package:elasticnet") = lists the objects exported by an attached package