day2/ME114_assignment2_solution.Rmd

---
title: "ME114 Day 2: Solutions for Assignment 2"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---


## Exercise 2.1

This exercise relates to the `College` data set, which can be found in the file `College.csv` on the website for the main course textbook [@james2013] http://www-bcf.usc.edu/~gareth/ISL/data.html. It contains a number of variables for 777 different universities and colleges in the US. The variables are

* `Private` : Public/private indicator
* `Apps` : Number of applications received
* `Accept` : Number of applicants accepted
* `Enroll` : Number of new students enrolled
* `Top10perc` : New students from top 10% of high school class 
* `Top25perc` : New students from top 25% of high school class 
* `F.Undergrad` : Number of full-time undergraduates
* `P.Undergrad` : Number of part-time undergraduates
* `Outstate` : Out-of-state tuition
* `Room.Board` : Room and board costs
* `Books` : Estimated book costs
* `Personal` : Estimated personal spending
* `PhD` : Percent of faculty with Ph.D.'s
* `Terminal` : Percent of faculty with terminal degree
* `S.F.Ratio` : Student/faculty ratio
* `perc.alumni` : Percent of alumni who donate
* `Expend` : Instructional expenditure per student
* `Grad.Rate` : Graduation rate

Before reading the data into `R`, it can be viewed in Excel or a text editor.

(a) Use the `read.csv()` function to read the data into `R`. Call the loaded data `college`. Make sure that you have the directory set to the correct location for the data.

```{r}
college <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/College.csv")
```

(b) Look at the data using the `View()` function.  This loads a matrix or data.frame object into the spreadhseet-like viewer in RStudio, just clicking the name of the object will do in the Environment panel.  You should notice that the first column is just the name of each university. We don't really want `R` to treat this as data. However, it may be handy to have these names for later. Try the following commands:

```{r}
# View(college)
rownames(college) <- college[, 1]
```

You should see that there is now a `row.names` column with the name of each university recorded. This means that `R` has given each row a name corresponding to the appropriate university. `R` will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try

```{r}
college <- college[, -1]
head(college)
```

Now you should see that the first data column is `Private`. Note that another column labeled `row.names` now appears before the `Private` column. However, this is not a data column but rather the name that `R` is giving to each row.

(c)
   i. Use the `summary()` function to produce a numerical summary of the variables in the data set.

```{r}
summary(college)
```

   ii. Use the `pairs()` function to produce a scatterplot matrix of the first ten columns or variables of the data. Recall that you can reference the first ten columns of a matrix `A` using `A[,1:10]`.

```{r}
pairs(college[, 1:10])
```
   
   iii. Use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Private`.
 
```{r}
plot(college$Private, college$Outstate,
     xlab = "Private University", ylab = "Tuition in $")
```

**Boxplots of Outstate versus Private: Private universities have more out of state students** 
   
   iv. Create a new qualitative variable, called `Elite`, by *binning* the `Top10perc` variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

```{r} 
Elite <- rep("No", nrow(college))
Elite[college$Top10perc > 50] <- "Yes"
Elite <- as.factor(Elite)
college <- data.frame(college, Elite)
```

Use the `summary()` function to see how many elite universities there are. Now use the `plot()` function to produce side-by-side boxplots of `Outstate` versus `Elite`.

```{r}
summary(Elite)
plot(college$Elite, college$Outstate, 
     xlab = "Elite University", ylab = "Tuition in $")
```

**Boxplots of Outstate versus Elite: Elite universities have more out of state students**
   
   v. Use the `hist()` function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command `par(mfrow=c(2,2))` useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.
 
```{r} 
par(mfrow=c(2,2))
hist(college$Apps, xlab = "Applications Received", main = "")
hist(college$perc.alumni, col=2, xlab = "% of alumni who donate", main = "")
hist(college$S.F.Ratio, col=3, breaks=10, xlab = "Student/faculty ratio", main = "")
hist(college$Expend, breaks=100, xlab = "Instructional expenditure per student", main = "")
```   

   vi. Continue exploring the data, and provide a brief summary of what you discover.

```{r}

# Some interesting observations:

# what is the university with the most students in the top 10% of class
row.names(college)[which.max(college$Top10perc)]  

acceptance_rate <- college$Accept / college$Apps

# what university has the smallest acceptance rate
row.names(college)[which.min(acceptance_rate)]  

# what university has the most liberal acceptance rate
row.names(college)[which.max(acceptance_rate)]

# High tuition correlates to high graduation rate
plot(college$Outstate, college$Grad.Rate) 

# Colleges with low acceptance rate tend to have low S:F ratio.
plot(college$Accept / college$Apps, college$S.F.Ratio) 

# Colleges with the most students from top 10% perc don't necessarily have
# the highest graduation rate. Also, rate > 100 is erroneous!
plot(college$Top10perc, college$Grad.Rate)
```


## Exercise 2.2

This exercise involves the `Auto` data set available as `Auto.csv` from the website for the main course textbook [@james2013] http://www-bcf.usc.edu/~gareth/ISL/data.html. Make sure that the missing values have been removed from the data.

```{r}
Auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", 
                 header = TRUE, na.strings = "?")
Auto <- na.omit(Auto)
dim(Auto)
summary(Auto)
```

(a) Which of the predictors are quantitative, and which are qualitative?

Note: Sometimes when you load a dataset, a qualitative variable might have a numeric value.  For instance, the `origin` variable is qualitative, but has integer values of 1, 2, 3.  From mysterious sources (Googling), we know that this variable is coded `1 = usa; 2 = europe; 3 = japan`.  So we can covert it into a factor, using:

```{r}
Auto$originf <- factor(Auto$origin, labels = c("usa", "europe", "japan"))
with(Auto, table(originf, origin))
```

**Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year. Qualitative: name, origin, originf**


(b) What is the *range* of each quantitative predictor? You can answer this using the `range()` function.

```{r}
#Pulling together qualitative predictors
qualitative_columns <- which(names(Auto) %in% c("name", "origin", "originf"))
qualitative_columns

# Apply the range function to the columns of Auto data
# that are not qualitative
sapply(Auto[, -qualitative_columns], range)
```

(c) What is the mean and standard deviation of each quantitative predictor?

```{r}
sapply(Auto[, -qualitative_columns], mean)
sapply(Auto[, -qualitative_columns], sd)
```

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

```{r}
sapply(Auto[-seq(10, 85), -qualitative_columns], mean)
sapply(Auto[-seq(10, 85), -qualitative_columns], sd)
```

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

```{r}
# Part (e):
pairs(Auto)

pairs(Auto[, -qualitative_columns])

# Heavier weight correlates with lower mpg.
with(Auto, plot(mpg, weight))
# More cylinders, less mpg.
with(Auto, plot(mpg, cylinders))
# Cars become more efficient over time.
with(Auto, plot(mpg, year))

# Lets plot some mpg vs. some of our qualitative features: 
# sample just 20 observations
Auto.sample <- Auto[sample(1:nrow(Auto), 20), ]
# order them
Auto.sample <- Auto.sample[order(Auto.sample$mpg), ]
# plot them using a "dotchart"
with(Auto.sample, dotchart(mpg, name, xlab = "mpg"))

with(Auto, plot(originf, mpg), ylab = "mpg")
```

(f) Suppose that we wish to predict gas mileage (`mpg`) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your answer.

```{r}
pairs(Auto)
```

**See descriptions of plots in (e). All of the predictors show some correlation with mpg. The name predictor has too little observations per name though, so using this as a predictor is likely to result in overfitting the data and will not generalize well.**


## Exercise 2.3

This exercise involves the `Boston` housing data set.

(a) To begin, load in the `Boston` data set. The `Boston` data set is
part of the `MASS` *package* in `R`, and you can load by executing: 

```{r} 
data(Boston, package = "MASS")
```

Read about the data set:

```{r, eval = FALSE} 
help(Boston, package = "MASS")
```

How many rows are in this data set? How many columns? What do the rows and columns represent?

```{r}
dim(Boston)
```

**506 rows, 14 columns; 14 features, 506 housing values in Boston suburbs.**

(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

```{r}
pairs(Boston)
```

**Template (X correlates with: a, b, c):**
**crim: age, dis, rad, tax, ptratio;**
**zn: indus, nox, age, lstat;**
**indus: age, dis;**
**nox: age, dis;**
**dis: lstat;**
**lstat: medv.**


(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

```{r}
with(Boston, plot(age, crim, log = "xy")) # Older homes, more crime
with(Boston, plot(dis, crim)) # Closer to work-area, more crime
# looks much more linear with transformed on a log-log scale
with(Boston, plot(dis, crim, log = "xy")) # Closer to work-area, more crime
with(Boston, plot(rad, crim, log = "xy")) # Higher index of accessibility to radial highways, more crime
# as box plots, since rad appears to be categorical
with(Boston, 
     plot(as.factor(rad), log(crim), xlab = "Accessibility to radial highways",
          ylab = "log of crime"))
with(Boston, plot(tax, crim, log = "xy")) # Higher tax rate, more crime
with(Boston, plot(ptratio, crim, log = "xy")) # Higher pupil:teacher ratio, more crime

#Looking directly at correlations.
cor(Boston)
```

(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

```{r}
par(mfrow=c(1,3))
with(Boston, hist(crim[crim > 1], breaks=25))
# most cities have low crime rates, but there is a long tail: 18 suburbs appear
# to have a crime rate > 20, reaching to above 80

with(Boston, hist(tax, breaks=25))
# there is a large divide between suburbs with low tax rates and a peak at 660-680

with(Boston, hist(ptratio, breaks=25))
# a skew towards high ratios, but no particularly high ratios
```

(e) How many of the suburbs in this data set bound the Charles river?￼￼

```{r}
sum(Boston$chas == 1)
```

**35 suburbs**

(f) What is the median pupil-teacher ratio among the towns in this data set?

```{r}
median(Boston$ptratio)
```

**19.05**


(g) Which suburb of Boston has lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

```{r}
t(subset(Boston, medv == min(medv)))
#              399      406
# crim     38.3518  67.9208 above 3rd quartile
# zn        0.0000   0.0000 at min
# indus    18.1000  18.1000 at 3rd quartile
# chas      0.0000   0.0000 not bounded by river
# nox       0.6930   0.6930 above 3rd quartile
# rm        5.4530   5.6830 below 1st quartile
# age     100.0000 100.0000 at max
# dis       1.4896   1.4254 below 1st quartile
# rad      24.0000  24.0000 at max
# tax     666.0000 666.0000 at 3rd quartile
# ptratio  20.2000  20.2000 at 3rd quartile
# black   396.9000 384.9700 at max; above 1st quartile
# lstat    30.5900  22.9800 above 3rd quartile
# medv      5.0000   5.0000 at min

summary(Boston)
# Not the best place to live, but certainly not the worst.
```

(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling. 
 
```{r}
sum(Boston$rm > 7)
# 64
sum(Boston$rm > 8)
# 13
summary(subset(Boston, rm > 8))
summary(Boston)
# relatively lower crime (comparing range), lower lstat (comparing range)
```