---
title: "Homework 6"
author: "LIU Aoran"
date: "2024-04-29"
output: pdf_document
---
Full name: LIU Aoran
Preferred name: Aaron
ID: 29246040
```{r,include=FALSE}
# install_github("andrew-griffen/gdata")
library(dplyr)
library(knitr)
library(kableExtra)
library(jtools)
library(devtools)
library(tidyverse)
library(stringi)
library(stringr)
library(ggplot2)
library(gdata)
library(tidyr)
```
# Task 1
```{r}
setwd("/Users/okuran/Desktop/R for empirical economics")
# Read the CSV file into R
netflix_data <- read.csv("netflix_titles.csv")
# Check the structure of the dataset
str(netflix_data)
```
# Task 2
```{r}
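# Delete the columns whose names contain a capital "X".
# contains() ignores case by default, so the match is made case-sensitive here.
netflix_data <- netflix_data |>
select(-contains("X", ignore.case = FALSE))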
# Check the structure of the dataset
str(netflix_data)
```
# Task 3
```{r}
# Create the "length" variable for movies
netflix_data <- netflix_data |>
mutate(length = if_else(type == "Movie",
as.numeric(str_replace(str_squish(duration), " min", "")),
NA_real_))
# Create the "seasons" variable for TV shows
netflix_data <- netflix_data |>
mutate(seasons = if_else(type == "TV Show",
as.numeric(str_replace(str_squish(duration), " Season[s]*", "")),
NA_real_))
# Remove the "duration" column as it's no longer needed
netflix_data <- netflix_data |>
select(-duration)
str(netflix_data)
```
# Task 4
```{r}
# Create scatter plot
ggplot(data = netflix_data, aes(x = release_year, y = seasons)) +
geom_point() +
labs(x = "Release Year", y = "Number of Seasons") +
ggtitle("Relationship Between Release Year and Number of Seasons")
```
The scatter plot shows an increase in the number of seasons for TV shows released in more recent years.
This could be due to various factors such as:
1. Streaming Platforms: With the rise of streaming platforms like Netflix, Amazon Prime, and Hulu, there might be increased demand for long-running TV shows to keep subscribers engaged over extended periods.
2. Audience Preferences: Audience preferences may have shifted towards binge-watching longer series, leading producers to create more seasons for popular shows.
3. Economic Factors: Producing additional seasons of successful TV shows can be financially lucrative for production companies and streaming platforms, leading to a greater incentive to extend series.
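
Because many points overlap in a scatter plot, a per-year summary may help check whether the apparent increase holds on average or mainly reflects a few long-running recent shows. The following is a minimal sketch on made-up data: `toy_shows` and its values are hypothetical stand-ins for the TV show rows of `netflix_data`, not results from the actual dataset.

```{r}
library(dplyr)

# Hypothetical toy data standing in for the TV show rows of netflix_data
toy_shows <- tibble(
  release_year = c(2010, 2010, 2015, 2015, 2020, 2020),
  seasons      = c(1, 2, 1, 3, 2, 4)
)

# Summarise the number of seasons within each release year
seasons_by_year <- toy_shows |>
  filter(!is.na(seasons)) |>
  group_by(release_year) |>
  summarise(n_shows      = n(),
            mean_seasons = mean(seasons),
            max_seasons  = max(seasons))

seasons_by_year
```

Reporting `n_shows` next to the mean shows how many titles each yearly average is based on, which matters if early years contain only a handful of shows.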
# Task 5
```{r}
# Create scatter plot for movie length against release year
ggplot(data = netflix_data, aes(x = release_year, y = length)) +
geom_point() +
labs(x = "Release Year", y = "Movie Length (minutes)") +
ggtitle("Relationship Between Release Year and Movie Length")
```
The scatter plot shows points concentrated at shorter running times among movies released in more recent years, which suggests that more short movies have been produced over time.
This observation could be due to various factors such as:
1. Changing Audience Preferences: Audience preferences may have shifted towards shorter content, influenced by changes in viewing habits and attention spans.
2. Diversity in Filmmaking: There may be a growing diversity of filmmakers and storytelling styles, leading to the production of more short films alongside traditional feature-length movies.
3. Accessibility of Filmmaking Tools: Advances in technology have made filmmaking more accessible, allowing filmmakers to create and distribute short films more easily than before.
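
One way to quantify this impression would be to compute, for each release year, the median length and the share of movies below some cutoff. The sketch below uses made-up data: `toy_movies`, its values, and the 60-minute `short_cutoff` are all assumptions chosen for illustration, not figures from the actual dataset.

```{r}
library(dplyr)

# Hypothetical toy data standing in for the movie rows of netflix_data
toy_movies <- tibble(
  release_year = c(2000, 2000, 2010, 2010, 2020, 2020, 2020),
  length       = c(110, 95, 120, 45, 30, 100, 25)
)

# Assumed threshold (in minutes) below which a movie counts as "short"
short_cutoff <- 60

# Median length and share of short movies within each release year
length_by_year <- toy_movies |>
  filter(!is.na(length)) |>
  group_by(release_year) |>
  summarise(n_movies      = n(),
            median_length = median(length),
            share_short   = mean(length < short_cutoff))

length_by_year
```

A rising `share_short` alongside a fairly stable `median_length` would point to more short titles being added rather than to all movies getting shorter.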
# Task 6 & 7
```{r}
head(expenditure_data1)
# Compute total expenditure for each individual
expenditure_data1$total_expenditure <-
rowSums(expenditure_data1[, c("food", "clothing", "housing", "alcohol")], na.rm = TRUE)
head(expenditure_data1)
```
The dataset is not tidy. Food, clothing, housing, and alcohol are not four different variables; they are values of a single variable, the expenditure category, yet each one is stored as a separate column. The amounts spent likewise form a single variable that is spread across those four columns, which makes the data harder to work with and analyze.
Each row represents an individual, so each row holds several observations (that individual's expenditure in each category). This violates the tidy data principle that each observation should be in its own row, and certain analyses cannot be performed without reshaping the data first.
# Task 8
```{r}
long_expenditure_data <- expenditure_data1 |>
select(-total_expenditure) |> # Exclude the total_expenditure column
pivot_longer(cols = food:alcohol,
names_to = "category",
values_to = "expenditure")
head(long_expenditure_data)
```
# Task 9 & 10
```{r}
total_expenditure_per_individual <- long_expenditure_data |>
group_by(id) |>
summarise(total_expenditure = sum(expenditure, na.rm = TRUE)) # na.rm matches the rowSums() call in Task 6 & 7
total_expenditure_per_individual
```
I prefer the second approach because it is more flexible if we want to add more entries later.
With the longer dataset structure, adding new entries only requires appending rows with the corresponding `id`, `category`, and `expenditure` values. When we re-run the code that calculates total expenditures using `group_by()` and `summarise()`, it automatically includes the new entries without requiring any changes to the code itself.
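
The flexibility argument can be illustrated with a small sketch. The data below are made up: `toy_long`, `new_rows`, and the "transport" category are hypothetical and do not come from `expenditure_data1`.

```{r}
library(dplyr)

# Hypothetical long-format data: one row per individual and category
toy_long <- tibble(
  id          = c(1, 1, 2, 2),
  category    = c("food", "housing", "food", "housing"),
  expenditure = c(100, 500, 80, 450)
)

# A new (hypothetical) category arrives as extra rows, not as a new column
new_rows <- tibble(
  id          = c(1, 2),
  category    = "transport",
  expenditure = c(40, 60)
)

# The summarising code is unchanged; the new rows are simply included
toy_totals <- bind_rows(toy_long, new_rows) |>
  group_by(id) |>
  summarise(total_expenditure = sum(expenditure, na.rm = TRUE))

toy_totals
```

In the wide layout, the same change would add a column and require editing the list of column names passed to `rowSums()`.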
# Task 11~13
```{r}
print(expenditure_data2)
```
The expenditure_data2 dataset has 201 columns.
If we computed total expenditure without making the dataset longer, the code would be long and repetitive, because every expenditure column would have to be referenced explicitly by name (e.g., c(item1, item2, ..., item201)).
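
To make the repetition concrete, here is a minimal sketch on a made-up wide dataset with only three item columns. `toy_wide` and its values are hypothetical; the real `expenditure_data2` has far more columns.

```{r}
library(dplyr)

# Hypothetical wide-format data with just three item columns
toy_wide <- tibble(
  id    = c(1, 2),
  item1 = c(10, 5),
  item2 = c(20, 15),
  item3 = c(30, 25)
)

# Every column is typed out by hand; with hundreds of item columns
# this expression would grow by one term per column
toy_wide_totals <- toy_wide |>
  mutate(total_expenditure = item1 + item2 + item3)

toy_wide_totals
```

With three columns this is manageable, but each additional item column means another term to type and another chance of a typo or an omitted column.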
# Task 14 & 15
```{r}
# Reshape the dataset to longer format
long_expenditure_data2 <- expenditure_data2 |>
pivot_longer(cols = starts_with("item"),
names_to = "item",
values_to = "expenditure")
# Calculate total expenditure for each individual
total_expenditure_per_individual2 <- long_expenditure_data2 |>
group_by(id) |>
summarise(total_expenditure = sum(expenditure, na.rm = TRUE))
total_expenditure_per_individual2
```
Reshaping the dataset into a longer format removes the need for repetitive code. Instead of referencing each expenditure item individually, we turn the item columns into rows and sum within each individual, which keeps the code concise.
The longer format also makes the code easier to read and understand. It separates the data transformation step (reshaping the dataset) from the analysis step (calculating total expenditure), so the code is easier for others to follow.