R Programming Language

From Code Self Study Wiki

Notes on R.

NOTE: Some of these notes were copied from paper several months after taking an online course, so there may be errors. Please correct any errors that you find.

Documentation

help(min)
example(min)

Documenting Your Project

Your final product should include:

  1. Raw data
  2. Tidy data set
  3. A code book (metadata) that explains each variable
  4. A detailed recipe on how you did everything so that it can be reproduced

Code Book

  • You can use something like markdown
  • Include a "Study Design" section that describes the data collection
  • Include a "Code Book" section that describes each variable
  • Include step-by-step instructions for reproducing your output. You can automate the steps with a script in R, Python, or another language. The script should take no parameters. Instructions should be very clear.

Control Structures

  • if, else, else if
  • for
  • while (be sure the condition eventually becomes false)
  • repeat
  • break
  • next (skip an iteration)
  • return
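
A minimal sketch of how several of these constructs fit together. The function name firstEvenSquares and its inputs are made up for illustration:

```r
# Hypothetical helper: collect the squares of the first `count` even numbers in v
firstEvenSquares <- function(v, count) {
    result <- c()
    for (value in v) {
        if (value %% 2 != 0) {
            next                          # skip odd values
        } else if (length(result) >= count) {
            break                         # stop once enough values are collected
        } else {
            result <- c(result, value^2)
        }
    }
    return(result)
}

firstEvenSquares(1:10, 3)
# [1]  4 16 36

# while needs a condition that eventually becomes FALSE
n <- 3
while (n > 0) {
    n <- n - 1
}

# repeat loops forever unless break is called
total <- 0
repeat {
    total <- total + 1
    if (total >= 5) break
}
```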

Functions

above <- function(x, n) {
    use <- x>n    # values greater than n
    x[use]        # subsetting the vector x
}
x <- 1:127
above(x, 11)      # returns all numbers from x that are above 11
  • named arguments -- potentially have default values
  • formal arguments -- arguments included in the function definition
  • formals() -- returns a list of formal arguments
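
The bullets above can be tried out with a small variation of the function. The name aboveWithDefault and the default value n = 10 are assumptions added for this sketch, not part of the original example:

```r
# Hypothetical variant of above() whose second argument has a default value
aboveWithDefault <- function(x, n = 10) {
    x[x > n]
}

aboveWithDefault(1:15)                # n falls back to its default of 10
# [1] 11 12 13 14 15

aboveWithDefault(n = 12, x = 1:15)    # named arguments can be given in any order
# [1] 13 14 15

names(formals(aboveWithDefault))      # the formal arguments
# [1] "x" "n"

formals(aboveWithDefault)$n           # the default value of n
# [1] 10
```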

Objects

objects()
ls()
rm(x, y, z, foo, bar)

Saving the workspace at the end of a session writes the objects to a file named .RData in the current directory. R loads that file automatically if it is later started from the same directory.

Data Types

  • Vector -- 1D, all items are the same type. Create with c() (combine) or another method: v <- c(1, 2, 3)
  • Matrix -- 2D, all items are the same type. Example: m <- matrix(1:12, 3, 4)
  • Array -- multi-dimensional, all items are the same type
  • List -- an ordered collection whose items can be of any type, including other lists
  • Data Frame -- a kind of list with all elements having the same length
# Find the variable type
typeof(variableName)
  • atomic classes:
    • numeric
    • logical
    • character
    • integer
    • complex

Also:

  • vectors -- all are same class. x <- c(10.4, 5.6, 3.1, 12.7)
  • lists -- can have different classes
  • factors -- x <- factor(c("yes", "yes", "no")); table(x)
  • missing values -- test with is.na() and is.nan(). A NaN counts as NA, but an NA is not a NaN
  • data frames -- tabular data
  • names -- labels for the elements of a vector or list
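
A short sketch for checking these claims at the console. The values are arbitrary examples:

```r
typeof(c(1.5, 2))        # "double" (its class is "numeric")
typeof(1:3)              # "integer"
typeof(c(TRUE, FALSE))   # "logical"
typeof("a")              # "character"
typeof(2+3i)             # "complex"

# Mixing types in a vector coerces everything to a single type
typeof(c(1, "a", TRUE))  # "character"

# A list keeps each item's own type
mixed <- list(1, "a", TRUE)
sapply(mixed, typeof)
# [1] "double"    "character" "logical"

# NaN counts as NA, but NA is not NaN
is.na(c(1, NA, NaN))
# [1] FALSE  TRUE  TRUE
is.nan(c(1, NA, NaN))
# [1] FALSE FALSE  TRUE
```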

Vectors

  • +, -, *, /, ^
  • Logical: <, <=, >, >=, ==, !=
  • exp, sin, cos, tan, sqrt, min, max, range, sum, prod, mean, var, sort, order, sort.list(), pmax, pmin
v <- c(1, 2, 3)
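
Most of these operators and functions work element by element on whole vectors. A brief illustration, using made-up vectors w and u:

```r
v <- c(1, 2, 3)
w <- c(10, 20, 30)

v + w                 # [1] 11 22 33
v * 2                 # [1] 2 4 6
v^2                   # [1] 1 4 9
v > 1                 # [1] FALSE  TRUE  TRUE

sum(v)                # [1] 6
mean(v)               # [1] 2
range(w)              # [1] 10 30
pmax(v, c(2, 2, 2))   # [1] 2 2 3  (element-wise maximum)

# sort returns the values; order returns the positions that would sort them
u <- c(30, 10, 20)
sort(u)               # [1] 10 20 30
order(u)              # [1] 2 3 1
```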

Matrices

# Fill with a range
m <- matrix(1:12, 3, 4)
 
# Fill with a single value
matrix(1, 5, 5)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    1    1    1
# [2,]    1    1    1    1    1
# [3,]    1    1    1    1    1
# [4,]    1    1    1    1    1
# [5,]    1    1    1    1    1
 
# convert vector to matrix by adding dimensions
x <- 1:12
# [1]  1  2  3  4  5  6  7  8  9 10 11 12
dim(x) <- c(3, 4)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12
 
# Queries
x[3, 2] # 6
 
# Column
x[,2] # [1] 4 5 6
 
# Row
x[3,] # [1]  3  6  9 12
 
# Multiple columns or rows
x[,3:4]
#      [,1] [,2]
# [1,]    7   10
# [2,]    8   11
# [3,]    9   12
 
x[c(1, 3),]
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    3    6    9   12

Factors

A factor stores categorical data as a fixed set of preset values (levels).

Creating a factor with four levels:

# Note the repeated values
species <- c("Human", "Vulcan", "Zeta Reticulum", "Vulcan", "Vogon", "Human")
types <- factor(species)
print(types)
# [1] Human          Vulcan         Zeta Reticulum Vulcan         Vogon          Human         
# Levels: Human Vogon Vulcan Zeta Reticulum
as.integer(types)
# [1] 1 3 4 3 2 1
levels(types)
# [1] "Human"          "Vogon"          "Vulcan"         "Zeta Reticulum"

You can then use the labels as categories, for example, as plotting symbols:

outpostPopulation <- c(127, 357, 400, 2101, 992, 5757)
outpostGDP <- c(48392, 38744, 23459, 51939, 77777, 20021)
plot(outpostPopulation, outpostGDP, pch = as.integer(types))
legend("topright", levels(types), pch = 1:length(levels(types)))

(sorry for an example with meaningless data)

Alien Outpost GDP.png

Sequences

Like generating ranges in Python.

c(1, 2, 3, 4, 5) # c is for combine
1:5 # creates a range
5:1 # 5, 4, 3, 2, 1
seq(-5, 5, by=.2) -> s3
s4 <- seq(from=-5, by=.2, length.out=10)
s5 <- rep(x, times=5)
s6 <- rep(x, each=5)
 
# Loop examples
for (i in 1:10) { print(i) }
for (i in 1:10) { print(x[i]) }
for (i in 1:10) print(x[i])
for (i in seq_along(x)) { print(x[i]) }
for (letter in x) { print(letter) }

Names

You can assign names, even to vectors:

wins <- c(5, 21, 7, 1)
names(wins) <- c("Orcs", "Dragons", "Goblins", "Kobolds")
wins
# Orcs Dragons Goblins Kobolds 
#    5      21       7       1 
barplot(wins)

Monster Battles.png

Data Frames

  • For tabular data.
  • Every element has to have the same length.
  • Each element is a column.
  • The length of each element is number of rows.
  • While matrices have to contain items of the same class, data frames can hold columns of different classes, like lists.

And:

  • row names are stored in the row.names attribute
  • create with read.table() or read.csv()
  • convert to a matrix with data.matrix()
x <- 1:3 # vector
names(x) <- c("foo", "bar", "baz") # assign element names
 
# Or:
x <- list(a=1,b=2, c=3) # list
x[c(1,3)] # Query the list with single brackets.
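
A small sketch of the points listed above. The data frame df and its column names are invented for illustration, and stringsAsFactors=FALSE is set explicitly because its default differs between R versions:

```r
# Hypothetical data frame: three columns of different classes, each of length 3
df <- data.frame(name = c("a", "b", "c"),
                 score = c(1.5, 2.5, 3.5),
                 passed = c(FALSE, TRUE, TRUE),
                 stringsAsFactors = FALSE)

nrow(df)            # 3 rows, the length of each column
ncol(df)            # 3 columns
sapply(df, class)   # "character", "numeric", "logical"
row.names(df)       # "1" "2" "3"

# Convert the non-character columns to a numeric matrix
m <- data.matrix(df[, c("score", "passed")])
```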

Example

outposts <- data.frame(outpostPopulation, outpostGDP, types)
outposts
#   outpostPopulation outpostGDP          types
# 1               127      48392          Human
# 2               357      38744         Vulcan
# 3               400      23459 Zeta Reticulum
# 4              2101      51939         Vulcan
# 5               992      77777          Vogon
# 6              5757      20021          Human
 
# Getting columns
outposts$outpostGDP
# [1] 48392 38744 23459 51939 77777 20021
outposts[[2]]
# [1] 48392 38744 23459 51939 77777 20021
outposts[["outpostGDP"]]
# [1] 48392 38744 23459 51939 77777 20021

Removing Rows

This returns the rows whose ColumnName contains "keyword". To remove those rows instead, negate the test with !grepl(...):

myDataFrame[(grepl("keyword", myDataFrame$ColumnName)),]
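
A runnable sketch of both directions, keeping and removing the matching rows. The contents of myDataFrame here are invented for illustration:

```r
# Hypothetical data frame for illustration
myDataFrame <- data.frame(ColumnName = c("has keyword", "plain", "keyword too"),
                          Value = 1:3,
                          stringsAsFactors = FALSE)

matches <- grepl("keyword", myDataFrame$ColumnName)

kept <- myDataFrame[matches, ]      # rows 1 and 3, which contain "keyword"
removed <- myDataFrame[!matches, ]  # row 2 only, the matching rows are dropped
```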

data.table Package

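library(data.table)
fread()    # fast file reader; give it a file path, returns a data.table
tables()   # list the data.tables in memory
# quotes not needed for column names inside [ ]
DT[c(2,3)] # rows 2 and 3 (a data frame would return columns 2 and 3)
DT[,list(mean(x), sum(z))]
DT[,w:=z^2] # add new column
DT[,m:={tmp <- (x+2);log2(tmp+5)}] # multi-step expression; the last value is assigned to m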

Files

Example

dir() # show directory contents
list.files() # show directory contents
getwd() # show the current working directory
myData <- read.csv(file="someData.csv", header=TRUE, sep=",") # reads the file into a data frame
 
# display summary
# If the data are factors, then it shows the number of each item, in this case a column of country names
summary(myData$Country)
names(myData) # column names
attributes(myData)
columnA <- myData$A # just give it the column name after the $ sign

Estimating Memory

Estimating memory requirements for working with a file.

Example: 1.5 million rows with 120 columns of numeric (8 bytes) data.

\(1,500,000\cdot120\cdot8=1,440,000,000\text{ bytes}\\ 1,440,000,000\text{ bytes}/2^{20}\approx1,373.29\text{ MB}\\ 1,373.29\text{ MB}/2^{10}\approx1.34\text{ GB}\)

You probably need twice that (2.68 GB) to read the file in.
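
The same arithmetic can be done at the R console. The variable names are arbitrary:

```r
rows <- 1500000
cols <- 120
bytesPerValue <- 8                     # a numeric (double) value takes 8 bytes

bytes <- rows * cols * bytesPerValue   # 1,440,000,000
megabytes <- bytes / 2^20              # about 1373.29
gigabytes <- megabytes / 2^10          # about 1.34

round(gigabytes, 2)
# [1] 1.34
round(2 * gigabytes, 2)                # the rule-of-thumb doubling for reading the file
# [1] 2.68
```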

General

  • Run the commands stored in a file: source("commands.R")
  • Save subsequent output to a file: sink("record.lis")
  • Stop recording output to the external file: sink()

More here.
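
A minimal sketch of redirecting output with sink(). It writes to a temporary file (an assumption made here so the example does not leave files behind) rather than to record.lis:

```r
outFile <- tempfile(fileext = ".lis")

sink(outFile)        # subsequent output goes to the file
print(1:3)
sink()               # output goes back to the console

readLines(outFile)
# [1] "[1] 1 2 3"
```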

Read

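  • read.table -- tabular text (whitespace-separated by default; use sep="\t" for tabs)
  • read.csv -- comma-separated values
  • readLines -- lines of a text file
  • dget -- R code written by dput
  • load -- binary files written by save
  • unserialize -- binary objects written by serialize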

Write

  • write.table
  • writeLines
  • dump -- preserve metadata. Multiple objects: dump(c("x", "y"), file="data.R")
  • dput -- preserve metadata. Single objects: dput(y, file="y.R")
  • save
  • serialize
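
A sketch of a round trip with dput/dget and dump/source. The objects x and y and the temporary file paths are made up for illustration:

```r
x <- "foo"
y <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)

# dput writes R code that recreates a single object; dget reads it back
yFile <- tempfile(fileext = ".R")
dput(y, file = yFile)
y2 <- dget(yFile)

# dump writes several objects by name; source recreates them under the same names
dumpFile <- tempfile(fileext = ".R")
dump(c("x", "y"), file = dumpFile)
rm(x, y)
source(dumpFile)     # x and y exist again
```

Both files are plain text, so they can be inspected or kept under version control, unlike the binary output of save and serialize.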

Network

fileUrl <- "https://example.com/somefile.csv"
download.file(fileUrl, destfile="./data/myfile.csv", method="curl")
list.files("./data")
dateDownloaded <- date()
  • download.file()

Interfaces

  • file
  • gzfile
  • bzfile
  • url
con <- gzfile("words.gz")
x <- readLines(con, 10) # first 10 lines
close(con)
con <- url("http://example.com/somefile", "r")
x <- readLines(con)
close(con)
head(x)

Directories

  • getwd() -- get R's working directory
  • setwd() -- set R's working directory
  • unlink() -- delete files or directories (directories need recursive=TRUE)
if(!file.exists("data")) {
    dir.create("data")
}

Excel

library(xlsx)
read.xlsx("./data/somefile.xlsx", sheetIndex=1, header=TRUE)

Counting Things

length(which(!is.na(a)))
length(which(a != "Foo"))
 
# Count the rows where the column equals 5, ignoring missing values
nrow(data[data$Column==5 & !is.na(data$Column), ])
sum(!is.na(data$Col[data$Col==5]))

Random Samples

When using sample(), the argument replace=TRUE allows repeated values. Without it, each value is drawn at most once.

rollDice <- function(sides, numRolls) {
    die <- 1:sides;
    sample(die, size=numRolls, replace=TRUE);
}
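
A usage sketch for the function above, repeated here so the block runs on its own. The seed value is arbitrary and only makes the rolls reproducible:

```r
rollDice <- function(sides, numRolls) {
    sample(1:sides, size = numRolls, replace = TRUE)
}

set.seed(42)                  # arbitrary seed, so the same rolls come out each run
rolls <- rollDice(6, 100)
table(rolls)                  # how often each face came up

# Without replacement, each value appears at most once
shuffled <- sample(1:6, size = 6)   # a random permutation of 1 to 6
# sample(1:6, size = 7) would fail: there are only 6 values to draw
```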

Distributions

help(Distributions)

Distribution functions have prefixes:

  • d -- density (height of the probability density function)
  • p -- cumulative distribution function
  • q -- quantile function (inverse of the cumulative distribution function)
  • r -- random numbers
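
For example, the four prefixes applied to the normal distribution (norm), plus two other distributions. The seed is arbitrary:

```r
dnorm(0)        # density height at 0, about 0.3989
pnorm(1.96)     # P(X <= 1.96), about 0.975
qnorm(0.975)    # inverse of pnorm, about 1.96
set.seed(1)     # arbitrary seed
rnorm(3)        # three random draws from the standard normal

# The same prefixes work for other distributions
dbinom(2, size = 4, prob = 0.5)   # P(exactly 2 heads in 4 fair flips) = 0.375
punif(0.25)                       # 0.25 for the uniform distribution on [0, 1]
```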

Plotting

Strip Charts

# Some strip charts from a column of numbers
stripchart(myData$ColumnOfNumbers, method="stack")
stripchart(myData$ColumnOfNumbers, vertical=TRUE)
stripchart(myData$ColumnOfNumbers,
           method="stack",
           main="Some Header Text",
           xlab="Some text for the bottom")

Histograms

# Histograms
hist(myData$ColumnOfNumbers)
hist(myData$ColumnOfNumbers, main="Main Header Text", xlab="A label")
hist(myData$ColumnOfNumbers, breaks=10)

Bar Plots

ages <- c(33, 111, 180, 600, 969)
names(ages) <- c("Frodo", "Bilbo", "Isaac", "Noah", "Methuselah")
barplot(ages)
 
# Add a horizontal line at the mean
abline(h=mean(ages))

Box Plots

boxplot(myData$ColumnOfNumbers)
boxplot(myData$ColumnOfNumbers, horizontal=TRUE)

Scatter Plots

See the relationship between two sets of numbers.

plot(myData$ColumnOfNumbers, myData$AnotherColumn)
 
# With labels
plot(myData$ColumnOfNumbers, myData$AnotherColumn,
     main="The Main Header",
     xlab="X Label",
     ylab="Y Label")
 
# Correlation
cor(myData$ColumnOfNumbers, myData$AnotherColumn)

Quantile-Quantile (QQ) Plots

Plot a sample against a normal distribution:

qqnorm(myData$ColumnOfNumbers,
       main="Normal Q-Q Plot",
       xlab="X Label",
       ylab="Y Label")
 
# Add the qqline
qqline(myData$ColumnOfNumbers)

Contour Maps

contour(volcano)

Contour Map Volcano.png

Perspective Plots

persp(volcano, expand=0.2)

Perspective Plot Volcano.png

Heat Maps

image(volcano)

Image Heat Map.png

Standard Deviation

Standard deviation measures how far values typically fall from the mean. It's the square root of the variance, which can be calculated manually as in the example below. (A Python version is here.)

# Manual example
standardDeviation <- function(v, population = TRUE) {
    numItems <- length(v);
    differences <- v - mean(v);
    sumSquaredDifferences <- sum(differences^2);
 
    if (population == TRUE) {
        # Give the POPULATION standard deviation
        variance <- sumSquaredDifferences / numItems;
    } else {
        # Give the SAMPLE standard deviation
        variance <- sumSquaredDifferences / (numItems - 1);
    }
    sqrt(variance);
}
 
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
standardDeviation(s, population = FALSE)
# [1] 49.39383

The built-in function:

# sd() gives you the sample standard deviation
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
sd(s)
# [1] 49.39383
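
R has no built-in population standard deviation. One way to get it, sketched here, is to rescale the sample value:

```r
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
n <- length(s)

sqrt(var(s))                              # same as sd(s): var() is the sample variance
populationSd <- sd(s) * sqrt((n - 1) / n) # rescale from dividing by n - 1 to dividing by n
round(populationSd, 2)
# [1] 46.57
```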

Resources