-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathdata_analysis_documentation_code.Rmd
600 lines (341 loc) · 21.3 KB
/
data_analysis_documentation_code.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
---
title: "Data Challenge Document for Capital One"
author: "Aadish Chopra"
date: "8/7/2019"
output:
html_document:
toc: true
theme: united
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,fig.width = 10,warning = FALSE,message = FALSE)
options(scipen=99999)
```
```{r loadbootstrap,warning=FALSE,message=FALSE,results='hide'}
# data is assumed to be locally available
# otherwise we can clone the repository and then do a pull
# Load all the libraries
required_packages<-c('knitr','htmlTools','dplyr','htmlTable','stringr','ggplot2','prophet','gridExtra','scales','mice')
load_required_packages<-function(required_packages){
if(!require(required_packages,character.only = TRUE))
{
install.packages(required_packages,repos="http://cran.us.r-project.org")
require(required_packages,character.only = TRUE)
}
else
{
require(required_packages,character.only = TRUE)
}
}
lapply(required_packages,load_required_packages )
# Import the dataset
if(!file.exists('data/listings.csv'))
{
listing_tar<-download.file(url="http://data.insideairbnb.com/united-states/ny/new-york-city/2017-05-02/data/listings.csv.gz",destfile = "data/listings.csv.gz")
untar(tarfile = 'data/listings.csv.gz',exdir = 'data')
}
if(!file.exists('data/Zip_Zhvi_2bedroom.csv')){
untar(tarfile = 'data/Zip_Zhvi_2bedroom.csv.zip',exdir = 'data')
}
AirBNB<-read.csv(file = "data/listings.csv",header = T,sep = ",",stringsAsFactors = F)
ZillowData<-read.csv("data/Zip_Zhvi_2bedroom.csv",header = T,sep = ",",stringsAsFactors = F)
```
```{r capitalonelogo,fig.width=5,fig.height=5}
htmltools::img(src = knitr::image_uri(file.path("images","Capital-One.jpg")),
alt = 'logo',
style = 'position:absolute; top:0; right:0; padding:10px;')
```
# Problem Statement
You are consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.
# Objective
Find properties based on zipcodes in the New York city which would yield the maximum return on investment (ROI)
# Assumptions
1. Occupancy rate has been assumed to be constant throughout the year irrespective of the fact that it can change due to holidays
2. Review score rating has been used to adjudicate the occupancy rate as it is the overall score based on
+ accuracy
+ cleanliness
+ check-in
+ communication
+ location
+ value
3. Cost of property has been predicted using time series forecasting from the Zillow Data set. Extraneous variables like natural calamities, slowdown in economy which can affect the cost of the property have been ignored.
4. Mean and median prices have been taken to calculate the breakeven period.
5. The data was last scraped in the third quarter of the FY 2017 and the analysis is done in 2019. Factors like holiday effects, superhost, location, amenities etc. could effect the price variation for short term rentals and hence prediction of rental price is not done.
# Metadata Created
1. predicted_price -This column contains the predicted cost price from the zillow data set. In the data set only the price as of 1 August 2019 is taken and is in dollar amount
2. occupancy_score -Same as review_scores_rating
3. occupancy_rate -Percentage occupancy of the airbnb listing. It is represented as intervals.
4. breakeven_years -Time it takes for the property to return it's cost price. This is also known as breakeven period and it is taken in the form of years
```{r load_helper_functions}
removethesecolumns<-function(pattern_remove,dataset)
{
message(" removing column ")
message(grep(pattern = pattern_remove,x = names(dataset),value = T))
dataset %>% select(-contains(pattern_remove,ignore.case = TRUE))
}
analyze_distinct_values<-function(column)
{
length(unique(column))
}
strip_money<-function(dataset,pattern){
colnames(dataset[grep(pattern,colnames(dataset))])
}
remove_sign<-function(money,column)
{
gsub(money,replacement = '',x =column )
}
noofNA<-function(column)
{
if(sum(is.na(column))>0)
{
sum(is.na(column))
}
}
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
```
# Exploratory data analysis
## 1. Analyzing Zillow data set
```{r top5rows}
kable(head(ZillowData[,c(1:7,8,262)]),align = 'l',caption="Zillow Data")
```
Zillow has 262 dimensions with 8946 observations.
```{r ZillowDimensions}
kable(x=dim(ZillowData),col.names = "Dimensions Zillow",rownames= c('observations','columns'),align = 'l')
```
There are just 3 missing values in the relevant columns. We can remove these rows.
```{r analyze_zillow,warning=FALSE,message=FALSE,cache=TRUE}
#Filtering for new york City
Z_NY<-ZillowData %>% filter(City=="New York") %>%
select(-c(RegionID,City,State,Metro,CountyName,SizeRank))
colnames(Z_NY)[1]<-"zipcode"
trans_zil = setNames(data.frame(t(Z_NY[,-1])), Z_NY[,1])
present<-data.frame(ds=seq.Date(from = as.Date('1996/04/01'),to = as.Date('2017/06/01'),by = 'month'))
# Predicting prices till Aug 2019
future<-data.frame(ds=seq.Date(from = as.Date('1996/04/01'),to = as.Date('2019/08/01'),by = 'month'))
bind_for_prophet<-apply(trans_zil,2,cbind.data.frame,present)
bind_for_prophet<-lapply(bind_for_prophet, setNames,c('y','ds'))
make_model<-lapply(bind_for_prophet, prophet)
forecast<-lapply(make_model, predict,future)
extract_yhat<-lapply(forecast, function(x) x[, 'yhat'][nrow(future)])
# Set predicted price in original data frame
Z_NY$predicted_price<-unlist(extract_yhat)
# select items for merging
cols<-c('zipcode','X2017.06','predicted_price')
Z_NY_predicted<-Z_NY[,cols]
Z_NY_predicted$zipcode<-as.factor(Z_NY_predicted$zipcode)
```
Following graph shows the trend in the price.<br/>
The forecasting method is inspired from facebook's open source **Prophet** package. Prediction has been solely done to see the price trends.
1. Cost Price trend of zipcode **11231 **
2. Cost Price trend of zipcode **11217 **
The plots show an increasing trend in cost prices.
```{r dyplots}
dyplot.prophet(make_model$`11231`,forecast$`11231`)
dyplot.prophet(make_model$`11217`,forecast$`11217`)
```
## 2. Exploring AIRBNB data
```{r AIRBNBdimension}
kable(x=dim(AirBNB),col.names = "Dimensions Airbnb",rownames= c('observations','columns'),align = 'l')
```
AirBNB has 95 dimensions with 40753 observations
```{r Airbnbdata}
kable(head(AirBNB[,c('id','neighbourhood_group_cleansed','city','state','zipcode','property_type','bedrooms','price','review_scores_rating')]),caption = "AIRBNB data")
```
```{r viewdata ,eval=FALSE,echo=FALSE}
# Looking at the first 90 columns
out_table <- cbind(names(AirBNB)[1:30],names(AirBNB)[31:60],names(AirBNB)[61:90])
htmlTable(out_table,
cgroup = c("Set 1:30", "Set 31:60","Set 61:90"),
n.cgroup = c(3),
rnames = FALSE)
```
Price and other columns associated with money are listed in dollars, For example, $50. In order to do analysis, the dollar sign is stripped off of the entire column
```{r columntypes,warning=FALSE,message=FALSE}
#stripping the dollar sign off
#identify which columns needs stripping
remove_dollar<-strip_money(pattern = 'price',dataset = AirBNB)
additional_columns<-c('security_deposit','cleaning_fee','extra_people')
remove_dollar<-c(remove_dollar,additional_columns)
#strip dollar amount from these columns
AirBNB[,remove_dollar]<-apply(AirBNB[,remove_dollar],2,FUN = remove_sign,money = "\\$")
# convert characters to numeric columns
AirBNB[,remove_dollar]<-apply(AirBNB[,remove_dollar],2, FUN =as.numeric)
```
Since the data was loaded with the option string as Factors =FALSE, data type manipulation is done
URL columns would not be useful, therefore removing these columns .
```{r removeurlcolumns}
AirBNB<-removethesecolumns(pattern_remove = "url",dataset = AirBNB)
AirBNB<-removethesecolumns(pattern_remove = "scrape",dataset = AirBNB)
```
### Missing value analysis
```{r NAs}
how_many_NA<-data.frame('missing values'=sort(unlist(apply(AirBNB, 2, noofNA)),decreasing = T))
kable(x = how_many_NA ,caption = "Missing value analysis")
AirBNB<-AirBNB %>% select(-c('license','has_availability'))
```
Columns 'license' and 'has availability' have no data. So, these columns can be removed.
Distribution of **bedroom's variable**
1. 69 missing values or 0.17%
2. frequency distribution shows that there are 3525 properties which are having 0 bedrooms.
```{r dist_bedrooms}
# let us see the distribution of the bedrooms variable
barplot(table(AirBNB$bedrooms),col = "blue",main = "Distribution of the bedrooms variable",xlab = "No of Bedrooms",ylab="Count")
```
Data integrity of **Zipcodes** <br/>
Zipcode should be of length 5
```{r zipcode_integrity}
kable(x=sum(str_count(AirBNB$zipcode,pattern = "[0-9]")==5),caption = "Number of zipcodes having correct length",col.names = 'zipcode',align='l')
kable(x=nrow(AirBNB)-sum(str_count(AirBNB$zipcode,pattern = "[0-9]")==5),caption="Number of zipcodes having incorrect length",col.names = 'zipcode',align = 'l')
```
There are columns which are disseminating no information. Analyzing columns for unique values.
```{r delete_no_information}
x=sort(apply(AirBNB,2,analyze_distinct_values),decreasing =T )
x=x[x==1]
kable(x,col.names = 'Distinct Values')
AirBNB<-AirBNB %>% select(-c('requires_license','experiences_offered'))
```
From distinct value analysis, columns 'requires_license' and 'experiences_offered' have no information at all <br/>
Since the consultancy company has already identified that 2 bedroom properties are the most profitable the Airbnb dataset is filtered for 2 bedrooms
Code can be reused based on consulting company's requirements .If they want to analyze 1 bedroom properties in 'Chicago' the variables can be modified as<br/>
<span style="color:red">**no_of_bedrooms=1**</span><br/>
<span style="color:red">**city='Chicago'**</span>
```{r filter_2_rooms}
# set variable here in case the consulting company needs to do analysis on other properties
no_of_bedrooms=2
air_two_room_property<-AirBNB %>% filter(AirBNB$bedrooms ==no_of_bedrooms)
how_many_NA_2_bedrooms<-data.frame(MissingValues=sort(unlist(apply(air_two_room_property,MARGIN = 2, noofNA)),decreasing = T))
```
```{r results='hide'}
kable(how_many_NA_2_bedrooms,caption = "Missing values in the filtered data set")
```
**Property Type**
```{r property_type}
AirBNB %>% select(property_type)%>% group_by(property_type) %>% summarise(count=n()) %>% arrange(desc(count)) %>% filter(count>100) %>% ggplot(data=.,mapping = aes(x=reorder(property_type,-count),y=count))+geom_bar(stat = 'identity',fill='cyan')+xlab("property_type")+ggtitle("Frequency plot of the different types of properties")
```
Majority of properties are of **Apartment** type which was expected in a city like New York. We would have factored in property_type had there been an equitable distribution.
# Data Integrity and Data Quality
After doing exploratory data analysis and data munging, we can comment on the data quality and data integrity
I concur with the statement *"Bad data is worse than no data"*.
1. Missing values : There were a few columns in which missing values were present. The variables important for our analysis are
+ review_score_rating : occupancy_rate is directly dependent on the score
+ price : There are 41 properties which do not have their price listed.
2. Data Quality : Some zipcodes had length other than 5 which questions the data credibility
3. Quality Check of data was performed by
+ removing inconsistencies : zipcodes of abnormal length, missing values in price columns, rating columns
+ conversion into suitable data type : There are certain columns which were reformatted for calculation,data type was changed from character to factor, or from factor to numeric based on the charting requirements as well as limitation of the tool at hand.
4. Uniqueness/Variance : Some of the columns had a low count of unique values. Although low count of unique values are a distinguished feature of binary variables(0 or 1, TRUE or FALSE, this or that) but there should not be a single value i.e no change in variable throughout the dataset. Such columns have been removed or not considered for analysis
5. Timeliness: The data was last scraped in the third quarter of the FY 2017 and the analysis is done in 2019. Prediction could have been done (is done but not used) for Zillow properties but we don't have 2019 Airbnb data our analysis is restricted to FY 2017
```{r merging_by_zipcode}
two_room_property<-merge(air_two_room_property,Z_NY_predicted,by = 'zipcode',suffixes = c('air','zil'))
two_room_property$neighbourhood_group_cleansed<-as.factor(two_room_property$neighbourhood_group_cleansed)
```
# Visualizations of the properties versus price and other parameters
There are few zipcodes which have over 70 properties. The <span style="color:red">red dotted line</span> indicates the arbitrary y-axis intercept at 70
```{r visualizations,fig.width=10,message=FALSE,warning=FALSE }
theme_set(theme_bw())
two_room_property %>% select(zipcode,price) %>% filter(zipcode>0)%>% group_by(zipcode)%>% summarise(n_count=n()) %>% arrange(desc(n_count))%>% top_n(n = 25)%>%
ggplot(data = .,mapping = aes(reorder(zipcode,-n_count),n_count,group=1))+geom_line(stat = 'identity')+geom_hline(yintercept = 70,linetype='dashed',color='red')+ggtitle("Count of properties plotted against zipcode")+xlab('zipcode')+ylab('count')
```
The boxplot shows
```{r visualizations1,fig.width=10,message=FALSE,warning=FALSE }
# boxplot to show variations in price within a zipcode
ggplot(data=two_room_property,mapping = aes(zipcode,price))+
geom_boxplot(outlier.colour = "red",varwidth = TRUE)+ggtitle("Variation of airbnb price within a zipcode")
```
```{r visualizations2,fig.width=10,message=FALSE,warning=FALSE}
# In which neighborhood these properties are located
two_room_property %>% group_by(neighbourhood_group_cleansed) %>% summarise(avg_price=mean(price,na.rm = T)) %>% arrange(desc(avg_price)) %>% select(neighbourhood_group_cleansed,avg_price)%>% ggplot(.,mapping = aes(neighbourhood_group_cleansed,avg_price,fill=neighbourhood_group_cleansed))+geom_bar(stat='identity')+xlab("neighbourhood")+ggtitle("Variation of average airbnb price by neighborhood")
```
```{r visualizations3,fig.width=12,message=FALSE,warning=FALSE}
# which zipcodes are located in which location
ggplot(data=two_room_property,mapping=aes(zipcode,fill=neighbourhood_group_cleansed))+
geom_histogram(stat='count')+facet_grid(neighbourhood_group_cleansed~.)+ggtitle("Count of properties by neighbourhood and zipcode")
```
```{r visualizations4,fig.width=10,message=FALSE,warning=FALSE}
p1<-ggplot(data = Z_NY_predicted,mapping = aes(x=reorder(zipcode,-predicted_price),y=predicted_price))+
geom_bar(stat='identity')+ggtitle("Median cost price predicted for August 2019 by zipcode",subtitle = "Zillow")+xlab("zipcode")+ylab("predicted price")
p2<-ggplot(data = Z_NY_predicted,mapping = aes(x=reorder(zipcode,-X2017.06),y=X2017.06))+
geom_bar(stat='identity')+ggtitle("Median cost price actual for June 2017 by zipcode",subtitle = "Zillow ")+xlab("zipcode")+ylab("actual price")
grid.arrange(arrangeGrob(p1, p2))
```
1. Three phases can be seen in the cost price
+ Zipcodes with very high price (10013,10014,10011)
+ Zipcodes with medium price (10023,10028...100217)
+ Zipcodes with low price (100215,....100304)
```{r visualizations5,fig.width=10,message=FALSE,warning=FALSE}
two_room_property %>% select(zipcode,price) %>% filter(zipcode>0)%>% group_by(zipcode)%>% summarise(avg_price=mean(price,na.rm = T),count=n()) %>%arrange(desc(avg_price),count)%>%mutate_if(is.numeric,round,digits=0)%>% top_n(n = 25)%>% ggplot(.,mapping = aes(reorder(zipcode,-avg_price),avg_price))+geom_bar(stat = "identity",fill='steelblue')+ggtitle("Plot of average AIRBNB price against zipcode")+xlab('zipcode')
two_room_property %>% select(zipcode,price) %>% filter(zipcode>0)%>% group_by(zipcode)%>% summarise(median_price=median(price,na.rm = T),count=n()) %>%arrange(desc(median_price),desc(count))%>%mutate_if(is.numeric,round,digits=0)%>% top_n(n = 25)%>% ggplot(.,mapping = aes(reorder(zipcode,-median_price),median_price))+geom_bar(stat = "identity",fill="turquoise")+ggtitle("Plot of median AIRBNB price against zipcode")+xlab("zipcode")
```
Average price is decreasing from 320 to 70
Median price is decreasing from 300 to 70
Averages are bit higher than median price suggesting averages are being pulled over because of outliers.
```{r price_distribution}
plot(density(two_room_property$price,na.rm = T),main = "Distribution of airbnb price",xlab = "price")
abline(v=750,col= "red",lty="dashed")
```
The distribution of airbnb price is right skewed. The <span style="color:red">red dotted line</span> indicates the arbitrary x-axis intercept at 750 which is relatively higher compared to other Airbnb prices and corroborated by the outlier plot.
```{r dist_review_score_rating}
plot(density(two_room_property$review_scores_rating,na.rm = T),main = "Distribution of review scores",xlab = "review_score_rating")
```
It is skewed towards the left which means that majority of the properties have received a score greater than 50
# Which properties to invest in ?
#### Formula used to calculate ROI
$$Breakeven-Period=\frac{CostPrice}{ ShortTerm Rental Price* Occupancy Rate * Time Period }$$
<span style="color:blue">Time Period is taken as year </span>
```{r mergeanalysis}
two_room_property$occupancy_score=two_room_property$review_scores_rating
kable(data.frame(occupancy_score=c('75-100','50-75','25-50','0-25'),occupancy_rate=c('75%','65%','55%','45%')),caption = "Occupancy rate based on review score and number of reviews")
two_room_property$occupancy_rate<-cut(two_room_property$occupancy_score,
breaks = c(0,25,50,75,100),
labels = c(".45", ".55", ".65", ".75"),
right = TRUE)
two_room_property$occupancy_rate=as.numeric(as.character(two_room_property$occupancy_rate))
# variable parameters
days=30
months=12
timePeriod=days*months
Cost_Price='X2017.06'
Sell_Price='price'
Occupancy_rate='occupancy_rate'
two_room_property$breakeven_years=two_room_property[,Cost_Price]/(two_room_property[,Sell_Price]*two_room_property[,Occupancy_rate]*timePeriod)
# profit-loss matrix by taking averages across the zipcodes
ROI_mean<-two_room_property %>% group_by(zipcode) %>% summarise(breakeven_return_in_years=mean(breakeven_years,na.rm = T),avg_airbnb_price=mean(price,na.rm = T),avg_cost_price=mean(X2017.06,na.rm = T),count=n()) %>% arrange((breakeven_return_in_years))
kable(ROI_mean,caption = "ROI using the mean price in years")
# profit-loss matrix by taking median across the zipcodes
ROI_median<-two_room_property %>% group_by(zipcode) %>% summarise(breakeven_return_in_years=median(breakeven_years,na.rm = T),median_airbnb_price=median(price,na.rm = T),median_cost_price=median(X2017.06,na.rm = T),count=n()) %>% arrange((breakeven_return_in_years))
kable(ROI_median,caption = "ROI using the median price in years")
```
For the purpose of decision making, sufficient data points are needed to calculate ROI, hence zipcodes with less than 20 properties are filtered out.
```{r atleast20count}
ROI_mean<-ROI_mean %>% filter(count >20)
ROI_median<-ROI_median %>% filter(count >20)
```
<font size="1">* NA in the table is due to missing review_score_rating. Since the count of such properties was 1, it was not imputed but rather left as it is.</font>
Visualing the ROI against the zipcodes
```{r breakevenplots}
ggplot(ROI_mean,mapping = aes(x=zipcode,y=breakeven_return_in_years))+
geom_bar(mapping = aes(reorder(zipcode,breakeven_return_in_years)),stat='identity',fill="orchid")+
ggtitle("Breakeven analysis taking the mean price ")
ggplot(ROI_median,mapping = aes(x=zipcode,y=breakeven_return_in_years))+
geom_bar(mapping = aes(reorder(zipcode,breakeven_return_in_years)),stat='identity',fill="orange")+
ggtitle("Breakeven analysis taking the median price ")
```
Top 5 zipcodes to invest in are the same for the two metrics which is Median and Average.
# Conclusion
The ROI is less for some zipcodes but there are not many properties. We atleast need a few properties to be able to make a firm decision.
Therefore the zipcodes to invest in are
+ <span style="color:blue">11231</span>
+ <span style="color:blue">11217</span>
+ <span style="color:blue">11215</span>
+ <span style="color:blue">10036</span>
+ <span style="color:blue">10025</span>
# Future Steps
1. There are over 35,000 observations besides 2 bedrooms hence further analysis can be done.
2. Factors such as transportation, proximity to work location can be used to map the area for long-term rentals
3. NY being a financial capital has a scope for long term rentals. Hence, the properties can be rented out on a long term basis as well.
4. There are 618 rows which have improper length zipcodes. These can be imputed using the latitude and longitudes
5. Crime score is an important factor in determining the price. Crime score is freely available online and can be used for deeper analysis.
6. Majority of the tourists book short term rentals in NY city hence holidays (variable pricing can be done ) and closeness to tourist locations should be accounted for.