# R Programming Language

Notes on R.

NOTE: Some of these notes were copied from paper several months after taking an online course, so there may be errors. Please correct any errors that you find.

## Documentation

help(min)
example(min)

When sharing data, provide:

1. Raw data
2. Tidy data set
3. A code book (metadata) that explains each variable
4. A detailed recipe on how you did everything so that it can be reproduced

### Code Book

• Write it in a plain-text format such as Markdown
• Include a "Study Design" section that describes the data collection
• Include a "Code Book" section that describes each variable
• Include clear step-by-step instructions for reproducing your output. You can automate them with a script in R, Python, or another language; the script should take no parameters.

## Control Structures

• if, else, else if
• for
• while (be sure the condition eventually becomes false)
• repeat
• break
• next (skip an iteration)
• return
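
A minimal sketch of the constructs listed above. The vector `values`, the helper `describe`, and the counters are made up for illustration:

```r
# Hypothetical data for illustration
values <- c(3, -1, 4, 0, 5)

# if / else if / else (the last evaluated value is returned)
describe <- function(v) {
  if (v > 0) {
    "positive"
  } else if (v < 0) {
    "negative"
  } else {
    "zero"
  }
}

# for with next: skip non-positive values
positives <- c()
for (v in values) {
  if (v <= 0) next
  positives <- c(positives, v)
}

# while: the condition must eventually become FALSE
countdown <- 3
while (countdown > 0) {
  countdown <- countdown - 1
}

# repeat runs until break is reached
doublings <- 0
total <- 1
repeat {
  total <- total * 2
  doublings <- doublings + 1
  if (total >= 100) break
}

print(describe(values[2]))  # "negative"
print(positives)            # 3 4 5
print(doublings)            # 7
```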

## Functions

above <- function(x, n) {
  use <- x > n  # values greater than n
  x[use]        # subsetting the vector x
}
x <- 1:127
above(x, 11)      # returns all numbers from x that are above 11
• named arguments -- potentially have default values
• formal arguments -- arguments included in the function definition
• formals() -- returns a list of formal arguments
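
As a small sketch of default values, named arguments, and formals(): the function `above10` below is a hypothetical variant of above(), not something from the course.

```r
# Same idea as above(), but n has a default value of 10
above10 <- function(x, n = 10) {
  x[x > n]
}

above10(1:20)              # n is omitted, so the default 10 is used
above10(n = 15, x = 1:20)  # named arguments can be given in any order

names(formals(above10))    # "x" "n"
formals(above10)$n         # 10
```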

## Objects

objects()
ls()
rm(x, y, z, foo, bar)

Saving the workspace at the end of a session writes the objects to a file named .RData in the current directory. R loads that file automatically the next time it is started from the same directory.

## Data Types

• Vector -- 1D, all items are the same type. Create with c() (combine) or another method: v <- c(1, 2, 3)
• Matrix -- 2D, all items are the same type. Example: m <- matrix(1:12, 3, 4)
• Array -- multi-dimensional, all items are the same type
• List -- an ordered collection whose elements can be of any type, including other lists
• Data Frame -- a kind of list with all elements having the same length
# Find the variable type
typeof(variableName)
• atomic classes: numeric, logical, character, integer, complex

Also:

• vectors -- all are same class. x <- c(10.4, 5.6, 3.1, 12.7)
• lists -- can have different classes
• factors -- x <- factor(c("yes", "yes", "no")); table(x)
• missing values -- is.na(), is.nan(). NaN counts as NA, but NA does not count as NaN
• data frames -- tabular data
• names [?]
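
The difference between NA and NaN, and the atomic types listed earlier, can be checked directly. A small sketch; the vector `x` is made up for illustration:

```r
x <- c(1, NA, NaN, 4)

is.na(x)   # FALSE  TRUE  TRUE FALSE  (NaN counts as missing)
is.nan(x)  # FALSE FALSE  TRUE FALSE  (NA is not NaN)

# typeof() reports the storage type, class() the class
typeof(1)     # "double"
class(1)      # "numeric"
typeof(1L)    # "integer"
typeof("a")   # "character"
typeof(TRUE)  # "logical"
typeof(2+3i)  # "complex"
```

Note that a plain number such as 1 is stored as a double; the L suffix is what makes an integer.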

### Vectors

• +, -, *, /, ^
• Logical: <, <=, >, >=, ==, !=
• exp, sin, cos, tan, sqrt, min, max, range, sum, prod, mean, var, sort, order, sort.list(), pmax, pmin
v <- c(1, 2, 3)
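
Most of these operations work element by element on whole vectors. A short sketch with two made-up vectors, `a` and `b`:

```r
a <- c(4, 9, 1)
b <- c(2, 10, 3)

a + b        # 6 19 4
a > b        # TRUE FALSE FALSE
sqrt(a)      # 2 3 1
range(a)     # 1 9
pmax(a, b)   # 4 10 3  (element-wise maximum)
max(a, b)    # 10      (single overall maximum)
sort(a)      # 1 4 9
order(a)     # 3 1 2   (indices that would sort a)
```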

### Matrices

# Fill with a range
m <- matrix(1:12, 3, 4)

# Fill with a single value
matrix(1, 5, 5)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    1    1    1
# [2,]    1    1    1    1    1
# [3,]    1    1    1    1    1
# [4,]    1    1    1    1    1
# [5,]    1    1    1    1    1

# convert vector to matrix by adding dimensions
x <- 1:12
# [1]  1  2  3  4  5  6  7  8  9 10 11 12
dim(x) <- c(3, 4)
x
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

# Queries
x[3, 2] # 6

# Column
x[,2] # [1] 4 5 6

# Row
x[3,] # [1]  3  6  9 12

# Multiple columns or rows
x[,3:4]
#      [,1] [,2]
# [1,]    7   10
# [2,]    8   11
# [3,]    9   12

x[c(1, 3),]
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    3    6    9   12

### Factors

A factor stores categorical data as a fixed set of preset values, called levels.

Creating a factor with four levels:

# Note the repeat values
species <- c("Human", "Vulcan", "Zeta Reticulum", "Vulcan", "Vogon", "Human")
types <- factor(species)
print(types)
# [1] Human          Vulcan         Zeta Reticulum Vulcan         Vogon          Human
# Levels: Human Vogon Vulcan Zeta Reticulum
as.integer(types)
# [1] 1 3 4 3 2 1
levels(types)
# [1] "Human"          "Vogon"          "Vulcan"         "Zeta Reticulum"

You can then use the labels as categories, for example, as plotting symbols:

outpostPopulation <- c(127, 357, 400, 2101, 992, 5757)
outpostGDP <- c(48392, 38744, 23459, 51939, 77777, 20021)
plot(outpostPopulation, outpostGDP, pch = as.integer(types))
legend("topright", levels(types), pch = 1:length(levels(types)))

(sorry for an example with meaningless data)

### Sequences

Like generating ranges in Python.

c(1, 2, 3, 4, 5) # c is for combine
1:5 # creates a range
5:1 # 5, 4, 3, 2, 1
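s3 <- seq(-5, 5, by=.2)                   # -5.0, -4.8, ..., 5.0
s4 <- seq(from=-5, by=.2, length.out=10)  # ten values starting at -5
s5 <- rep(x, times=5)                     # repeat the whole of x five times
s6 <- rep(x, each=5)                      # repeat each element of x five times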

# Loop examples
for (i in 1:10) { print(i) }
for (i in 1:10) { print(x[i]) }
for (i in 1:10) print(x[i])
for (i in seq_along(x)) { print(x[i]) }
for (letter in x) { print(letter) }

## Names

You can assign names, even to vectors:

wins <- c(5, 21, 7, 1)
names(wins) <- c("Orcs", "Dragons", "Goblins", "Kobolds")
wins
#    Orcs Dragons Goblins Kobolds
#       5      21       7       1
barplot(wins)

## Data Frames

• For tabular data.
• Every element has to have the same length.
• Each element is a column.
• The length of each element is the number of rows.
• While matrices have to contain items of the same class, data frames can hold different classes, like lists.

And:

• have a row.names attribute
• create with read.table() or read.csv()
• convert to a matrix with data.matrix()
x <- 1:3 # vector
names(x) <- c("foo", "bar", "baz") # assign element names (a vector has no columns)

# Or:
x <- list(a=1,b=2, c=3) # list
x[c(1,3)] # Query the list with single brackets.
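
The properties listed above can be seen on a small data frame. This is a sketch with made-up data; the name `df` and its columns are only for illustration:

```r
# Hypothetical data for illustration
df <- data.frame(
  name  = c("Orcs", "Dragons", "Goblins"),
  wins  = c(5, 21, 7),
  elite = c(FALSE, TRUE, FALSE)
)

nrow(df)           # 3 (the length of each column)
ncol(df)           # 3
sapply(df, class)  # one class per column; they can differ
row.names(df)      # "1" "2" "3"

# data.matrix() converts the selected columns to one numeric matrix
m <- data.matrix(df[, c("wins", "elite")])
m
```

The logical column becomes 0/1 in the matrix, because a matrix can hold only one type.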

### Example

outposts <- data.frame(outpostPopulation, outpostGDP, types)
outposts
#   outpostPopulation outpostGDP          types
# 1               127      48392          Human
# 2               357      38744         Vulcan
# 3               400      23459 Zeta Reticulum
# 4              2101      51939         Vulcan
# 5               992      77777          Vogon
# 6              5757      20021          Human

# Getting columns
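
Columns can be pulled out of a data frame by name or by position. A minimal sketch; it rebuilds a smaller version of the outposts data frame so that it runs on its own:

```r
outposts <- data.frame(
  outpostPopulation = c(127, 357, 400),
  outpostGDP = c(48392, 38744, 23459),
  types = factor(c("Human", "Vulcan", "Zeta Reticulum"))
)

outposts$outpostGDP        # a vector, by name with $
outposts[["outpostGDP"]]   # the same vector, by name with [[ ]]
outposts[, 2]              # the same vector, by position
outposts["outpostGDP"]     # single brackets keep a one-column data frame
outposts[, c("outpostPopulation", "types")]  # several columns at once
```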

## data.table Package

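library(data.table)
fread()   # fast file reader; returns a data.table
tables()  # list the data.tables in memory
# quotes not needed for vars
DT[c(2,3)] # selects rows 2 and 3 (a data frame would select columns)
DT[,list(mean(x), sum(z))]
DT[,m:={tmp <- (x+2);log2(tmp+5)}]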

## Files

### Example

dir() # show directory contents
list.files() # show directory contents
getwd() # show the current working directory

# display summary
# If the data are factors, then it shows the number of each item, in this case a column of country names
summary(myData$Country)
names(myData) # column names
attributes(myData)
columnA <- myData$A # just give it the column name after the $ sign

### Estimating Memory

Estimating memory requirements for working with a file.

Example: 1.5 million rows with 120 columns of numeric (8 bytes) data.

$$\begin{aligned}
1{,}500{,}000 \cdot 120 \cdot 8 &= 1{,}440{,}000{,}000 \text{ bytes}\\
1{,}440{,}000{,}000 \text{ bytes} / 2^{20} &\approx 1{,}373.29 \text{ MB}\\
1{,}373.29 \text{ MB} / 2^{10} &\approx 1.34 \text{ GB}
\end{aligned}$$

You probably need twice that (2.68 GB) to read the file in.

### General

• Run the commands in a file: source("commands.R")
• Save subsequent output to a file: sink("record.lis")
• Stop recording output to the external file: sink()

### Read

• read.table -- tabular text; fields separated by whitespace by default
• read.csv -- comma-separated values
• readLines -- lines of a text file
• dget -- R objects written as text by dput
• load -- saved workspaces (binary)
• unserialize -- single serialized R objects (binary)

### Write

• write.table
• writeLines
• dump -- preserve metadata. Multiple objects: dump(c("x", "y"), file="data.R")
• dput -- preserve metadata. Single objects: dput(y, file="y.R")
• save
• serialize

### Network

• download.file()

fileUrl <- "https://example.com/somefile.csv"
download.file(fileUrl, destfile="./data/myfile.csv", method="curl")
list.files("./data")
dateDownloaded <- date()

### Interfaces

• file
• gzfile
• bzfile
• url

con <- gzfile("words.gz")
x <- readLines(con, 10)
con <- url("http://example.com/somefile", "r")
x <- readLines(con)
close(con)
head(x)

### Directories

• getwd() -- get R's working directory
• setwd() -- set R's working directory
• unlink() -- delete files or directories (use recursive=TRUE for directories)

if (!file.exists("data")) {
  dir.create("data")
}

### Excel

library(xlsx)
read.xlsx("./data/somefile.xlsx", sheetIndex=1, header=TRUE)

## Counting Things

length(which(!is.na(a)))
length(which(a != "Foo"))
# the next two lines were reconstructed from ambiguous handwriting;
# both are intended to count the rows where the column equals 5
nrow(data[data$Column == 5 & !is.na(data$Column), ])
sum(!is.na(data$Col[data$Col == 5]))

## Random Samples

When using sample(), the argument replace=TRUE allows an element to be drawn more than once -- otherwise each element is drawn at most once.

rollDice <- function(sides, numRolls) {
  die <- 1:sides
  sample(die, size=numRolls, replace=TRUE)
}

## Distributions

help(Distributions)

Distribution functions have prefixes:

• d -- height of the probability density function
• p -- cumulative distribution function
• q -- inverse cumulative distribution function (quantiles)
• r -- random numbers

## Plotting

### Strip Charts

# Some strip charts from a column of numbers
stripchart(myData$ColumnOfNumbers, method="stack")
stripchart(myData$ColumnOfNumbers, vertical=TRUE)
stripchart(myData$ColumnOfNumbers,
           method="stack",
           xlab="Some text for the bottom")

### Histograms

# Histograms
hist(myData$ColumnOfNumbers)
hist(myData$ColumnOfNumbers, main="Main Header Text", xlab="A label")

# Box plot of the same column
boxplot(myData$ColumnOfNumbers, horizontal=TRUE)

### Scatter Plots

See the relationship between two sets of numbers.

plot(myData$ColumnOfNumbers, myData$AnotherColumn)

# With labels
plot(myData$ColumnOfNumbers, myData$AnotherColumn,
     main="The Main Header",
     xlab="X Label",
     ylab="Y Label")

# Correlation
cor(myData$ColumnOfNumbers, myData$AnotherColumn)

### Quantile-Quantile (QQ) Plots

Plot a sample against a normal distribution:

qqnorm(myData$ColumnOfNumbers,
       main="Normal Q-Q Plot",
       xlab="X Label",
       ylab="Y Label")

qqline(myData$ColumnOfNumbers)

### Contour Maps

contour(volcano)

### Perspective Plots

persp(volcano, expand=0.2)

### Heat Maps

image(volcano)

## Standard Deviation

Standard deviation measures how far values typically spread from the mean. It's the square root of the variance, which can be calculated manually as in the example below.

# Manual example
standardDeviation <- function(v, population = TRUE) {
  numItems <- length(v)
  differences <- v - mean(v)
  sumSquaredDifferences <- sum(differences^2)

  if (population) {
    # Give the POPULATION standard deviation
    variance <- sumSquaredDifferences / numItems
  } else {
    # Give the SAMPLE standard deviation
    variance <- sumSquaredDifferences / (numItems - 1)
  }
  sqrt(variance)
}

s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
standardDeviation(s, population = FALSE)
# [1] 49.39383

The built-in function:

# sd() gives you the sample standard deviation
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
sd(s)
# [1] 49.39383
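
The population and sample versions differ only in the divisor, so one can be derived from the other. A small check using the same numbers; the names `sampleSD` and `populationSD` are chosen here for illustration:

```r
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
n <- length(s)

sampleSD <- sd(s)                            # divides by n - 1
populationSD <- sqrt(mean((s - mean(s))^2))  # divides by n

# Rescaling the sample value gives the population value
all.equal(populationSD, sampleSD * sqrt((n - 1) / n))  # TRUE
all.equal(sampleSD, sqrt(var(s)))                      # TRUE
```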