R Programming Language

From Code Self Study Wiki

Notes on R.

NOTE: Some of these notes were copied from paper several months after taking an online course, so there may be errors. Please correct any errors that you find.

Documentation

help(min)
example(min)

Documenting Your Project

Your final product should include:

  1. Raw data
  2. Tidy data set
  3. A code book (metadata) that explains each variable
  4. A detailed recipe on how you did everything so that it can be reproduced

Code Book

  • You can use something like markdown
  • Include a "Study Design" section that describes the data collection
  • Include a "Code Book" section that describes each variable
  • Include step-by-step instructions for reproducing your output. You can automate it with a script in R, Python, or another language. Don't use parameters on the script. Instructions should be very clear.

Control Structures

  • if, else, else if
  • for
  • while (be sure the loop condition eventually becomes false)
  • repeat
  • break
  • next (skip an iteration)
  • return
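
As a rough illustration of the control structures listed above, here is a minimal sketch. The function sumOdds and the variables countdown, n, and doublings are hypothetical names made up for this example.

```r
# Hypothetical helper: adds up the odd values in v, stopping before the
# total would exceed `limit`.
sumOdds <- function(v, limit = 100) {
    total <- 0
    for (value in v) {
        if (value %% 2 == 0) {
            next                      # skip even values
        } else if (total + value > limit) {
            break                     # leave the loop early
        }
        total <- total + value
    }
    return(total)
}

# while needs a condition that eventually becomes FALSE
countdown <- 3
while (countdown > 0) {
    countdown <- countdown - 1
}

# repeat runs until break is called
n <- 1
doublings <- 0
repeat {
    n <- n * 2
    doublings <- doublings + 1
    if (n >= 64) break
}

sumOdds(1:10)              # 25
sumOdds(1:10, limit = 10)  # 9
```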

Functions

above <- function(x, n) {
    use <- x>n    # values greater than n
    x[use]        # subsetting the vector x
}
x <- 1:127
above(x, 11)      # returns all numbers from x that are above 11
  • named arguments -- potentially have default values
  • formal arguments -- arguments included in the function definition
  • formals() -- returns a list of formal arguments
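
The following sketch shows default values, named arguments, and formals() together. The function aboveDefault is a hypothetical variant of above() written only for this example.

```r
# Hypothetical variant of above() in which n has a default value
aboveDefault <- function(x, n = 10) {
    x[x > n]
}

aboveDefault(1:20)              # n falls back to its default of 10
aboveDefault(n = 15, x = 1:20)  # named arguments can be given in any order
names(formals(aboveDefault))    # "x" "n"
formals(aboveDefault)$n         # 10, the default stored for n
```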

Objects

objects()
ls()
rm(x, y, z, foo, bar)

Saving objects at the end of a session writes them to a file named .RData in the current working directory. That file is loaded automatically if R is later started from the same directory.

Data Types

  • Vector -- 1D, all items are the same type. Create with c() (combine) or another method: v <- c(1, 2, 3)
  • Matrix -- 2D, all items are the same type. Example: m <- matrix(1:12, 3, 4)
  • Array -- multi-dimensional, all items are the same type
  • List -- 1D, but each element can be of any data type, including another list
  • Data Frame -- a kind of list with all elements having the same length
# Find the variable type
typeof(variableName)
  • atomic classes:
    • numeric
    • logical
    • character
    • integer
    • complex
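
A short sketch of how typeof() and class() report on the types above. The variable names are illustrative only, and the class() result for a matrix depends on the R version, as noted in the comment.

```r
v <- c(1, 2, 3)                  # numeric vector
m <- matrix(1:12, 3, 4)          # integer matrix
df <- data.frame(id = 1:2, name = c("a", "b"))

typeof(v)       # "double"
typeof(1:3)     # "integer"
typeof("a")     # "character"
typeof(TRUE)    # "logical"
typeof(2+3i)    # "complex"
class(v)        # "numeric"
class(m)        # "matrix" "array" in R 4.0 and later (just "matrix" in older versions)
typeof(df)      # "list", because a data frame is a list of columns
```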

Also:

  • vectors -- all are same class. x <- c(10.4, 5.6, 3.1, 12.7)
  • lists -- can have different classes
  • factors -- x <- factor(c("yes", "yes", "no")); table(x)
  • missing values -- is.na, is.nan. is.na(NaN) is TRUE, but is.nan(NA) is FALSE
  • data frames -- tabular data
  • names -- labels that can be attached to the elements of vectors and lists
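
A small sketch of how NA and NaN differ under is.na() and is.nan(), using a made-up vector:

```r
x <- c(1, NA, NaN, 4)

is.na(x)               # FALSE  TRUE  TRUE FALSE
is.nan(x)              # FALSE FALSE  TRUE FALSE
mean(x, na.rm = TRUE)  # 2.5, because na.rm drops both NA and NaN
```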

Vectors

  • +, -, *, /, ^
  • Comparison (returns logical values): <, <=, >, >=, ==, !=
  • exp, sin, cos, tan, sqrt, min, max, range, sum, prod, mean, var, sort, order, sort.list(), pmax, pmin
v <- c(1, 2, 3)
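
The operators and functions above work element by element on whole vectors. A minimal sketch with two made-up vectors a and b:

```r
a <- c(3, 1, 2)
b <- c(2, 2, 2)

a + b        # 5 3 4
a > b        # TRUE FALSE FALSE
pmax(a, b)   # 3 2 2 (element-wise maximum)
sort(a)      # 1 2 3
order(a)     # 2 3 1 (the indices that would sort a)
range(a)     # 1 3
```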

Matrices

# Fill with a range
m <- matrix(1:12, 3, 4)
 
# Fill with a single value
matrix(1, 5, 5)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    1    1    1
# [2,]    1    1    1    1    1
# [3,]    1    1    1    1    1
# [4,]    1    1    1    1    1
# [5,]    1    1    1    1    1
 
# convert vector to matrix by adding dimensions
x <- 1:12
# [1]  1  2  3  4  5  6  7  8  9 10 11 12
dim(x) <- c(3, 4)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12
 
# Queries
x[3, 2] # 6
 
# Column
x[,2] # [1] 4 5 6
 
# Row
x[3,] # [1]  3  6  9 12
 
# Multiple columns or rows
x[,3:4]
#      [,1] [,2]
# [1,]    7   10
# [2,]    8   11
# [3,]    9   12
 
x[c(1, 3),]
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    3    6    9   12

Factors

A factor stores categorical data as a fixed set of preset values (levels).

Creating a factor with four levels:

# Note the repeated values
species <- c("Human", "Vulcan", "Zeta Reticulum", "Vulcan", "Vogon", "Human")
types <- factor(species)
print(types)
# [1] Human          Vulcan         Zeta Reticulum Vulcan         Vogon          Human         
# Levels: Human Vogon Vulcan Zeta Reticulum
as.integer(types)
# [1] 1 3 4 3 2 1
levels(types)
# [1] "Human"          "Vogon"          "Vulcan"         "Zeta Reticulum"

You can then use the labels as categories, for example, as plotting symbols:

outpostPopulation <- c(127, 357, 400, 2101, 992, 5757)
outpostGDP <- c(48392, 38744, 23459, 51939, 77777, 20021)
plot(outpostPopulation, outpostGDP, pch = as.integer(types))
legend("topright", levels(types), pch = 1:length(levels(types)))

(sorry for an example with meaningless data)

[Image: Alien Outpost GDP.png]

Sequences

Like generating ranges in Python.

c(1, 2, 3, 4, 5) # c is for combine
1:5 # creates a range
5:1 # 5, 4, 3, 2, 1
seq(-5, 5, by=.2) -> s3
s4 <- seq(length=10, from=-5, by=.2)
s5 <- rep(x, times=5)
s6 <- rep(x, each=5)
 
# Loop examples
for (i in 1:10) { print(i) }
for (i in 1:10) { print(x[i]) }
for (i in 1:10) print(x[i])
for (i in seq_along(x)) { print(x[i]) }
for (letter in x) { print(letter) }

Names

You can assign names, even to vectors:

wins <- c(5, 21, 7, 1)
names(wins) <- c("Orcs", "Dragons", "Goblins", "Kobolds")
wins
# Orcs Dragons Goblins Kobolds 
#    5      21       7       1 
barplot(wins)

[Image: Monster Battles.png]

Data Frames

  • For tabular data.
  • Every element has to have the same length.
  • Each element is a column.
  • The length of each element is the number of rows.
  • While matrices have to contain items of the same class, data frames can hold columns of different classes, like lists.

And:

  • attribute row.names
  • create with read.table() or read.csv()
  • convert to matrix data.matrix()
x <- 1:3 # vector
names(x) <- c("foo", "bar", "baz") # assign a name to each element
 
# Or:
x <- list(a=1,b=2, c=3) # list
x[c(1,3)] # Single brackets return a sub-list (here, elements a and c).

Example

outposts <- data.frame(outpostPopulation, outpostGDP, types)
outposts
#   outpostPopulation outpostGDP          types
# 1               127      48392          Human
# 2               357      38744         Vulcan
# 3               400      23459 Zeta Reticulum
# 4              2101      51939         Vulcan
# 5               992      77777          Vogon
# 6              5757      20021          Human
 
# Getting columns
outposts$outpostGDP
# [1] 48392 38744 23459 51939 77777 20021
outposts[[2]]
# [1] 48392 38744 23459 51939 77777 20021
outposts[["outpostGDP"]]
# [1] 48392 38744 23459 51939 77777 20021

Removing Rows

This returns the rows that contain "keyword" in the given column. To remove those rows instead, negate the condition with !.

myDataFrame[(grepl("keyword", myDataFrame$ColumnName)),]
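
A minimal sketch of keeping versus removing the matching rows. The data frame contents and the names matches, kept, and removed are made up for illustration.

```r
# Hypothetical data frame for illustration
myDataFrame <- data.frame(ColumnName = c("a keyword here", "nothing", "keyword"),
                          Value = 1:3)

matches <- grepl("keyword", myDataFrame$ColumnName)
kept    <- myDataFrame[matches, ]    # rows that contain "keyword"
removed <- myDataFrame[!matches, ]   # what is left after removing those rows
```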

data.table Package

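library(data.table)
fread()  # fast file reader that returns a data.table; pass it a file path
tables() # list the data.tables currently in memory
# quotes not needed for column names inside [ ]
DT[c(2,3)] # rows 2 and 3 (a data frame would select columns 2 and 3)
DT[,list(mean(x), sum(z))]
DT[,w:=z^2] # add new column
DT[,m:={tmp <- (x+2); log2(tmp+5)}] # multi-step expression; the last value becomes column m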

Files

Example

dir() # show directory contents
list.files() # show directory contents
getwd() # show the current working directory
myData <- read.csv(file="someData.csv", header=TRUE, sep=",") # reads the file into a data frame
 
# display summary
# If the data are factors, then it shows the number of each item, in this case a column of country names
summary(myData$Country)
names(myData) # column names
attributes(myData)
columnA <- myData$A # just give it the column name after the $ sign

Estimating Memory

Estimating memory requirements for working with a file.

Example: 1.5 million rows with 120 columns of numeric (8 bytes) data.

\(1{,}500{,}000\cdot120\cdot8=1{,}440{,}000{,}000\text{ bytes}\\ 1{,}440{,}000{,}000\text{ bytes}/2^{20}\approx1{,}373.29\text{ MB}\\ 1{,}373.29\text{ MB}/2^{10}\approx1.34\text{ GB}\)

You probably need twice that (2.68 GB) to read the file in.
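
The same estimate can be reproduced in R. This is only a sketch of the arithmetic above; the variable names are illustrative.

```r
rows          <- 1500000
columns       <- 120
bytesPerValue <- 8            # one numeric value takes 8 bytes

bytes     <- rows * columns * bytesPerValue
megabytes <- bytes / 2^20
gigabytes <- megabytes / 2^10

round(gigabytes, 2)       # 1.34
round(2 * gigabytes, 2)   # 2.68, the rough amount needed to read the file in
```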

General

  • Run the commands stored in a file: source("commands.R")
  • Save subsequent output to a file: sink("record.lis")
  • Stop recording output to the external file: sink()


Read

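  • read.table -- tabular text (fields separated by whitespace by default)
  • read.csv -- comma-separated values
  • readLines -- lines of a text file
  • dget -- an R object written by dput
  • load -- binary workspace files written by save
  • unserialize -- a single R object in binary form written by serialize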

Write

  • write.table
  • writeLines
  • dump -- preserve metadata. Multiple objects: dump(c("x", "y"), file="data.R")
  • dput -- preserve metadata. Single objects: dput(y, file="y.R")
  • save
  • serialize
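
A minimal round-trip sketch for dput/dget and dump/source. The temporary file paths and the environment named restored are assumptions made for this example, so that nothing in the working directory is overwritten.

```r
y <- data.frame(a = 1:3, b = c("x", "y", "z"))
x <- 1:5

# Single object: dput writes it out, dget reads it back
path <- tempfile(fileext = ".R")
dput(y, file = path)
y2 <- dget(path)

# Multiple objects: dump writes assignments, source re-creates them
dumpPath <- tempfile(fileext = ".R")
dump(c("x", "y"), file = dumpPath)
restored <- new.env()
source(dumpPath, local = restored)   # evaluate into a separate environment
ls(restored)                         # "x" "y"
```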

Network

fileUrl <- "https://example.com/somefile.csv"
download.file(fileUrl, destfile="./data/myfile.csv", method="curl")
list.files("./data")
dateDownloaded <- date()
  • download.file()

Interfaces

  • file
  • gzfile
  • bzfile
  • url
con <- gzfile("words.gz")
x <- readLines(con, 10) # read the first 10 lines
close(con)
con <- url("http://example.com/somefile", "r")
x <- readLines(con)
close(con) # close connections when finished
head(x)

Directories

  • getwd() -- get R's working directory
  • setwd() -- set R's working directory
  • unlink() -- delete files or directories (use recursive=TRUE for directories)
if(!file.exists("data")) {
    dir.create("data")
}

Excel

library(xlsx)
read.xlsx("./data/somefile.xlsx", sheetIndex=1, header=TRUE)

Counting Things

length(which(!is.na(a)))
length(which(a != "Foo"))
 
# Count the rows where a column equals 5, ignoring missing values:
nrow(data[data$Column==5 & !is.na(data$Column), ])
sum(!is.na(data$Col[data$Col==5]))
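
A small worked sketch of these counting idioms on made-up data. Because the example data frame has a single column, drop = FALSE is added so that the row subset stays a data frame and nrow() still works.

```r
a <- c("Foo", "Bar", NA, "Baz", "Foo")
length(which(!is.na(a)))        # 4 non-missing values
length(which(a != "Foo"))       # 2, since which() ignores the NA
sum(a != "Foo", na.rm = TRUE)   # 2, an equivalent shorter form

data <- data.frame(Column = c(5, NA, 5, 3))
matching <- data[data$Column == 5 & !is.na(data$Column), , drop = FALSE]
nrow(matching)                        # 2
sum(data$Column == 5, na.rm = TRUE)   # 2
```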

Random Samples

When using sample(), the argument replace=TRUE samples with replacement, so values can repeat. With the default, replace=FALSE, each element is drawn at most once.

rollDice <- function(sides, numRolls) {
    die <- 1:sides;
    sample(die, size=numRolls, replace=TRUE);
}
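
A usage sketch that contrasts sampling with and without replacement. The function is repeated so the block runs on its own, and the seed value is arbitrary, chosen only to make the run repeatable.

```r
rollDice <- function(sides, numRolls) {
    die <- 1:sides
    sample(die, size = numRolls, replace = TRUE)
}

set.seed(1)                  # arbitrary seed for repeatability
rolls <- rollDice(6, 100)    # 100 rolls, so values must repeat
table(rolls)                 # how often each face came up

shuffled <- sample(1:6)      # replace=FALSE by default: a permutation of 1 to 6
anyDuplicated(shuffled)      # 0, meaning no repeated values
```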

Distributions

help(Distributions)

Distribution functions have prefixes:

  • d -- height of probability density function
  • p -- cumulative distribution function
  • q -- inverse cumulative distribution function (quantiles)
  • r -- random numbers
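
As a sketch of the four prefixes, here they are applied to the standard normal distribution (the norm family). The seed is arbitrary.

```r
dnorm(0)       # height of the density at 0, about 0.3989
pnorm(1.96)    # P(X <= 1.96), about 0.975
qnorm(0.975)   # the value with 97.5% of the distribution below it, about 1.96

set.seed(1)    # arbitrary seed for repeatability
draws <- rnorm(5)   # five random draws
```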

Plotting

Strip Charts

# Some strip charts from a column of numbers
stripchart(myData$ColumnOfNumbers, method="stack")
stripchart(myData$ColumnOfNumbers, vertical=TRUE)
stripchart(myData$ColumnOfNumbers,
           method="stack",
           main="Some Header Text",
           xlab="Some text for the bottom")

Histograms

# Histograms
hist(myData$ColumnOfNumbers)
hist(myData$ColumnOfNumbers, main="Main Header Text", xlab="A label")
hist(myData$ColumnOfNumbers, breaks=10)

Bar Plots

ages <- c(33, 111, 180, 600, 969)
names(ages) <- c("Frodo", "Bilbo", "Isaac", "Noah", "Methuselah")
barplot(ages)
 
# Add a horizontal line at the mean
abline(h=mean(ages))

Box Plots

boxplot(myData$ColumnOfNumbers)
boxplot(myData$ColumnOfNumbers, horizontal=TRUE)

Scatter Plots

See the relationship between two sets of numbers.

plot(myData$ColumnOfNumbers, myData$AnotherColumn)
 
# With labels
plot(myData$ColumnOfNumbers, myData$AnotherColumn,
     main="The Main Header",
     xlab="X Label",
     ylab="Y Label")
 
# Correlation
cor(myData$ColumnOfNumbers, myData$AnotherColumn)

Quantile-Quantile (QQ) Plots

Plot a sample against a normal distribution:

qqnorm(myData$ColumnOfNumbers,
       main="Normal Q-Q Plot",
       xlab="X Label",
       ylab="Y Label")
 
# Add the qqline
qqline(myData$ColumnOfNumbers)

Contour Maps

contour(volcano)

[Image: Contour Map Volcano.png]

Perspective Plots

persp(volcano, expand=0.2)

[Image: Perspective Plot Volcano.png]

Heat Maps

image(volcano)

[Image: Image Heat Map.png]

Standard Deviation

Standard deviation measures how far values typically spread from the mean. It is the square root of the variance, which can be calculated manually as in the example below.

# Manual example
standardDeviation <- function(v, population = TRUE) {
    numItems <- length(v);
    differences <- v - mean(v);
    sumSquaredDifferences <- sum(differences^2);
 
    if (population == TRUE) {
        # Give the POPULATION standard deviation
        variance <- sumSquaredDifferences / numItems;
    } else {
        # Give the SAMPLE standard deviation
        variance <- sumSquaredDifferences / (numItems - 1);
    }
    sqrt(variance);
}
 
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
standardDeviation(s, population = FALSE)
# [1] 49.39383

The built-in function:

# sd() gives you the sample standard deviation
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
sd(s)
# [1] 49.39383
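
As a cross-check, the built-in var() can be used to confirm the relationship between variance and standard deviation, and to derive the population value from the sample value. This is a sketch; the names n and populationSd are illustrative.

```r
s <- c(98, 127, 133, 147, 170, 197, 201, 211, 255)
n <- length(s)

sqrt(var(s))   # same as sd(s): the sample standard deviation

# Population standard deviation: rescale from dividing by (n - 1) to dividing by n
populationSd <- sd(s) * sqrt((n - 1) / n)
```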

Resources