Cheatsheet for R & RStudio
?<anything-here> # Help/manual for that thing
Install packages once from CRAN/Bioconductor/devtools
# CRAN
install.packages("dplyr")
# GitHub via devtools (install devtools and its dependencies first)
install.packages("devtools")
devtools::install_github("owner/repo")  # "owner/repo" is a placeholder
Load the package in every new R session
library(dplyr)
Consider doing heavy data manipulation in SAS/SQL/Python/MapReduce etc., as R holds data in memory by default and is not great with manipulations of large-scale data.
Unless the file is plain ASCII, check its encoding first with the shell command file or od.
Note: a BOM in a UTF-8 file can cause issues when moving between Windows and UNIX, so check beforehand whether the file has one.
CSV
mydata <- read.csv("filename.txt")
More Generic
mydata <- read.table("filename.txt", header = TRUE, sep = "\t",
na.strings = "NA", fileEncoding = "", encoding = "unknown")
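As a rough sketch of the BOM note above, the following writes a small temporary file that starts with a UTF-8 BOM, checks for the marker, and reads the file with fileEncoding = "UTF-8-BOM" so the marker is dropped. The file contents and column names are made up for illustration.

```r
# Write a tiny CSV that starts with the UTF-8 byte order mark (EF BB BF).
# The data (id, score) is invented for this example.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(bom, con)
writeBin(charToRaw("id,score\n1,10\n2,20\n"), con)
close(con)

# Check the first three bytes to see whether a BOM is present.
has_bom <- identical(readBin(path, what = "raw", n = 3), bom)

# "UTF-8-BOM" tells R to strip the BOM while reading.
bom_data <- read.csv(path, fileEncoding = "UTF-8-BOM")
print(has_bom)
print(names(bom_data))
```

Without the encoding argument, the BOM may end up attached to the first column name, depending on platform and locale.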
Fixed-width-column formats
read.fwf
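A minimal sketch of read.fwf, assuming a made-up file in which the first 2 characters of each line are a code and the next 3 are a number. The widths and column names are illustrative only.

```r
# Create a small fixed-width file: 2 characters of code, 3 of value.
path <- tempfile(fileext = ".txt")
writeLines(c("AB123", "CD456"), path)

# widths gives the number of characters in each column, left to right.
fwf_data <- read.fwf(path, widths = c(2, 3), col.names = c("code", "value"))
print(fwf_data)
```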
Matrices
A <- matrix(scan("matrix.dat", n = 200*2000), 200, 2000, byrow = TRUE)
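The same pattern on a tiny scale, as a sketch: scan() reads the numbers as one long vector and matrix(..., byrow = TRUE) fills them in row by row. The 2 x 3 file here is invented for the example.

```r
# Write two rows of three numbers to a temporary file.
path <- tempfile(fileext = ".dat")
writeLines(c("1 2 3", "4 5 6"), path)

# Read 2 * 3 values and arrange them as 2 rows and 3 columns.
# quiet = TRUE suppresses the "Read N items" message.
small_matrix <- matrix(scan(path, n = 2 * 3, quiet = TRUE), 2, 3, byrow = TRUE)
print(small_matrix)
```

With byrow = FALSE (the default) the values would fill column by column instead, which would not match the file layout.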
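We are going to use the made-up sample-data dataset.
Note: sample-data and column-name are placeholders. Real R names cannot contain a hyphen (it is parsed as subtraction), so use something like sample_data in actual code.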
# View the Data
sample-data
# First row
head(sample-data, 1)
# Dimensions
dim(sample-data)
# Column Names
names(sample-data)
# Column Structure & Data Type
str(sample-data)
# Investigate the structure of specific columns
sample-data %>%
select(column-name, column-b-name) %>%
str()
# View a column
sample-data$column-name
# Take a random sample of rows (dplyr)
sample_n(sample-data, size = 10)  # size is required; 10 is an arbitrary example
View as a table:
table(object/list)
Sample Size:
- Sample Size -
n()  # dplyr; only valid inside verbs such as summarise(), mutate() and filter()
Measures of Centre:
(Most of these functions can exclude missing (NA) values if you set the argument na.rm = TRUE.)
- Mean -
mean()
- Median -
median()
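- Mode (no built-in function; mode() returns the storage type, so define one) -
getmode <- function(v) {
  uniqv <- unique(v)                           # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]  # most frequent value (first one wins ties)
}
result <- getmode(sample-data$column-name)
print(result)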
Measures of Spread:
- Range -
range()
- Minimum -
min()
- Maximum -
max()
- Standard Deviation -
sd()
- Variance -
var()
- IQR -
IQR()
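To illustrate the measures of centre and spread listed above, here is a small sketch on a made-up vector with one missing value. It shows how na.rm = TRUE changes the result; the numbers are illustrative only.

```r
# A made-up vector with one missing value.
x <- c(2, 4, 4, 6, NA)

# Without na.rm the missing value propagates to the result.
mean_with_na <- mean(x)

# With na.rm = TRUE the NA is dropped before computing.
mean_x   <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)
range_x  <- range(x, na.rm = TRUE)
sd_x     <- sd(x, na.rm = TRUE)
var_x    <- var(x, na.rm = TRUE)
iqr_x    <- IQR(x, na.rm = TRUE)

print(c(mean = mean_x, median = median_x, sd = sd_x, var = var_x, iqr = iqr_x))
```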
Create a list of summary metrics
# dplyr function
sample-data %>%
summarise(mean_dd = mean(column-name), sd_dd = sd(column-name), n = n())
Tabulate Categorical Data:
sample-data %>%
group_by(column-name) %>%
summarise(count = n())
Select Top n
head(sample-data, n)  # first n rows; arrange() first to get the top n by a column
Filter
# dplyr function
# AND
new-data-frame <- sample-data %>%
filter(column-name-a == "text", column-name-b == 999)
# OR
new-data-frame <- sample-data %>%
filter(column-name-a == "text" | column-name-b == 999)
# Exclude missing (NA) values
new-data-frame <- sample-data %>%
filter(!is.na(column-name))
Select
new-data-frame <- sample-data %>%
select(column-name-a, column-name-b)
Group
sample-data %>%
group_by(column-name-a) %>%
summarise(mean = mean(column-name-b), sd = sd(column-name-b), n = n())
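The filter and group steps above can be chained into one pipeline. The sketch below assumes dplyr is installed and uses a made-up data frame (scores, with columns group and value) in place of sample-data.

```r
library(dplyr)

# Made-up data: two groups, one missing value.
scores <- data.frame(
  group = c("a", "a", "b", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30, NA)
)

# Drop rows with a missing value, then summarise each group.
by_group <- scores %>%
  filter(!is.na(value)) %>%
  group_by(group) %>%
  summarise(mean = mean(value), sd = sd(value), n = n())

print(by_group)
```

Filtering out the NA first means n counts only the rows that contribute to the mean and standard deviation.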
We use the ggplot2 package for visualisation.
Example Scatter Plot
ggplot(data = sample-data, aes(x = column-name-a, y = column-name-b)) +
geom_point()
- Scatter Plot -
geom_point()
- Line Plot -
geom_line()
- Histogram -
geom_histogram() or stat_bin()
- Args: number of bins: bins, bin width: binwidth
- Box Plot -
geom_boxplot()
- Requires a categorical variable on the x axis, so wrap it in factor() if the variable is not already a factor.
- Bar Chart -
geom_bar()
- Stacked Bar Chart: aes(fill = column-name-a)
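A short sketch of the histogram, box plot and stacked bar chart options above. It assumes ggplot2 is installed and uses R's built-in mtcars dataset as a stand-in for sample-data. The bin width of 2 is an arbitrary choice.

```r
library(ggplot2)

# Histogram: control the bins with either bins or binwidth.
hist_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2)

# Box plot: cyl is numeric, so factor() makes it categorical on the x axis.
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Stacked bar chart: the fill aesthetic splits each bar by a second variable.
bar_plot <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar()

# Printing a plot object draws it, e.g. print(hist_plot).
```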
Standard Operators:
- Assignment operator
<-
- Piping operator
%>%
Many of the data manipulation verbs in this section come from the dplyr package.
- arrange()
- distinct()
- mutate()
Add a new column called output-column-name to sample-data
sample-data <- sample-data %>%
mutate(output-column-name = column-name-a + column-name-b)
Arrange a column in descending order
sample-data <- sample-data %>% arrange(desc(column-name))
Change a factor type column into numeric data type (via character data type). (Often useful for ordinal categorical variables)
sample-data <- sample-data %>%
mutate(output-column-name = as.numeric(as.character(column-name)))
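The detour via character matters because as.numeric() on a factor returns the internal level codes, not the labels. A small base R sketch with made-up values:

```r
# A factor whose labels look like numbers.
f <- factor(c("10", "20", "10"))

# Direct conversion gives the level codes (1, 2, 1), not the labels.
codes <- as.numeric(f)

# Converting to character first recovers the actual values (10, 20, 10).
values <- as.numeric(as.character(f))

print(codes)
print(values)
```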
- Ternary If/Else -
ifelse(column-name < 1, "FOO", "BAR")
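ifelse() is vectorised: it tests every element and returns a vector of the same length. A minimal sketch with made-up values, including a missing one:

```r
# Made-up input with one missing value.
x <- c(0.2, 1.5, NA, 0.9)

# Each element is tested separately; an NA test yields NA in the result.
flag <- ifelse(x < 1, "FOO", "BAR")
print(flag)
```

Inside mutate() the same call can be used to derive a new column from an existing one.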
Sampling
coin_outcomes <- c("heads", "tails") # c() combines values into a vector
sim_unfair_coin <- sample(coin_outcomes, size = 1000, replace = TRUE, prob = c(0.2, 0.8))
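To check a simulation like the one above, tabulate the outcomes and compare the proportions with the requested probabilities. This sketch repeats the sampling call so it runs on its own; set.seed() and the seed value 42 are additions here, used only to make the draw reproducible.

```r
set.seed(42)  # arbitrary seed, so repeated runs give the same draws

coin_outcomes <- c("heads", "tails")
sim_unfair_coin <- sample(coin_outcomes, size = 1000, replace = TRUE,
                          prob = c(0.2, 0.8))

# Count each outcome and convert the counts to proportions.
counts <- table(sim_unfair_coin)
proportions <- counts / length(sim_unfair_coin)

print(counts)
print(proportions)
```

The proportions should land near 0.2 and 0.8 but will not match them exactly, since each run is a random draw.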