-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy path03-Dataset.Rmd
159 lines (109 loc) · 9.89 KB
/
03-Dataset.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
# Introduction to the Data
Before tackling analytics problem, we start by introducing data to be analyzed in later chapters.
## Customer Data for a Clothing Company
Our first data set represents customers of a clothing company who sells products in physical stores and online. This data is typical of what one might get from a company's marketing data base (the data base will have more data than the one we show here). This data includes 1000 customers:
1. Demography
- `age`: age of the respondent
- `gender`: male/female
- `house`: 0/1 variable indicating if the customer owns a house or not
1. Sales in the past year
- `store_exp`: expense in store
- `online_exp`: expense online
- `store_trans`: times of store purchase
- `online_trans`: times of online purchase
1. Survey on product preference
It is common for companies to survey their customers and draw insights to guide future marketing activities. The survey is as below:
How strongly do you agree or disagree with the following statements:
1. Strong disagree
1. Disagree
1. Neither agree nor disagree
1. Agree
1. Strongly agree
- Q1. I like to buy clothes from different brands
- Q2. I buy almost all my clothes from some of my favorite brands
- Q3. I like to buy premium brands
- Q4. Quality is the most important factor in my purchasing decision
- Q5. Style is the most important factor in my purchasing decision
- Q6. I prefer to buy clothes in store
- Q7. I prefer to buy clothes online
- Q8. Price is important
- Q9. I like to try different styles
- Q10. I like to make decision myself and don't need too much of others' suggestions
There are 4 segments of customers:
1. Price
1. Conspicuous
1. Quality
1. Style
Let's check it:
```{r, echo=FALSE}
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
```
```{r}
str(sim.dat,vec.len=3)
```
Refer to Appendix for the simulation code.
<!--
## Customer Satisfaction Survey Data from Airline Company
This data set is from a customer satisfaction survey for three airline companies. There are `N=1000` respondents and 15 questions. The market researcher asked respondents to recall the experience with different airline companies and assign a score (1-9) to each airline company for all the 15 questions. The higher the score, the more satisfied the customer to the specific item. The 15 questions are of 4 types (the variable names are in the parentheses):
- How satisfied are you with our______?
1. Ticketing
- Ease of making reservation(Easy_Reservation)
- Availability of preferred seats(Preferred_Seats)
- Variety of flight options(Flight_Options)
- Ticket prices(Ticket_Prices)
1. Aircraft
- Seat comfort(Seat_Comfort)
- Roominess of seat area(Seat_Roominess)
- Availability of Overhead(Overhead_Storage)
- Cleanliness of aircraft(Clean_Aircraft)
1. Service
- Courtesy of flight attendant(Courtesy)
- Friendliness(Friendliness)
- Helpfulness(Helpfulness)
- Food and drinks(Service)
1. General
- Overall satisfaction(Satisfaction)
- Purchase again(Fly_Again)
- Willingness to recommend(Recommend)
Now check the data frame we have:
```{r, echo=FALSE}
rating<-read.csv("http://bit.ly/2TNQ6TK")
```
```{r}
str(rating,vec.len=3)
```
Refer to Appendix for the simulation code.
-->
## Swine Disease Breakout Data {#swinediseasedata}
The swine disease data includes 120 simulated survey questions from 800 farms. There are three choices for each question. The outbreak status for the $i^{th}$ farm is generated from a $Bernoulli(1, p_i)$ distribution with $p_i$ being a function of the question answers:
$$ln(\frac{p_i}{1-p_i})=\beta_0 + \Sigma_{g=1}^G\symbf{x_{i,g}^T\beta_{g}}$$
where $\beta_0$ is the intercept, $\mathbf{x_{i,g}}$ is a three-dimensional indication vector for question answer and $\symbf{\beta_g}$ is the parameter vector corresponding to the $g^{th}$ predictor. Three types of questions are considered regarding their effects on the outcome. The first forty survey questions are important questions such that the coefficients of the three answers to these
questions are all different:
$$\symbf{\beta_g}=(1,0,-1)\times \gamma,\ g=1,\dots,40$$
The second forty survey questions are also important questions but only one answer has a coefficient that is different from the other two answers:
$$\symbf{\beta_g}=(1,0,0)\times \gamma,\ g=41,\dots,80$$
The last forty survey questions are also unimportant questions such that all three answers have the same coefficients:
$$\symbf{\beta_g}=(0,0,0)\times \gamma,\ g=81,\dots,120$$
The baseline coefficient $\beta_0$ is set to be $-\frac{40}{3}\gamma$ so that on average a farm have 50% of chance to have an outbreak. The parameter $\gamma$ in the above simulation is set to control the strength of the questions' effect on the outcome. In this simulation study, we consider the situations where $\gamma = 0.1, 0.25, 0.5, 1, 2$. So the parameter settings are:
$$\symbf{\beta^{T}} = \left(\underset{question\ 1}{\frac{40}{3},\underbrace{1,0,-1}},...,\underset{question\ 41}{\underbrace{1,0,0}},...,\underset{question\ 81}{\underbrace{0,0,0}},...,\underset{question\ 120}{\underbrace{0,0,0}}\right)*\gamma$$
For each value of $\gamma$, 20 data sets are simulated. The bigger $\gamma$ is, the larger the corresponding parameter. We provided the data sets with $\gamma = 2$. Let's check the data:
```{r}
disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
# only show the last 7 columns here
head(subset(disease_dat,select=c("Q118.A","Q118.B","Q119.A",
"Q119.B","Q120.A","Q120.B","y")))
```
Here `y` indicates the outbreak situation of the farms. `y=1` means there is an outbreak in 5 years after the survey. The rest columns indicate survey responses. For example `Q120.A = 1` means the respondent chose `A` in Q120. We consider `C` as the baseline.
Refer to Appendix for the simulation code.
## MNIST Dataset
The MNIST dataset is a popular dataset for image classification machine learning model tutorials. It is conveniently included in the Keras library and ready to be loaded with build-in functions for analysis. The WIKI page of MNIST provides a detailed description of the dataset: https://en.wikipedia.org/wiki/MNIST_database. It contains 70,000 images of handwritten digits from American Census Bureau employees and American high school students. There are 60,000 training images and 10,000 testing images. Each image has a resolution of 28 x 28, and the numerical pixel values are in greyscale. Each image is represented by a 28 x 28 matrix with each element of the matrix an integer between 0 and 255. The label of each image is the intended digit of the handwritten image between 0 and 9. We cover the detailed steps to explore the MNIST dataset in the R and Python notebooks. A sample of the dataset is illustrated in figure \@ref(fig:mnistdata). ^[The image is from https://en.wikipedia.org/wiki/File:MnistExamples.png]
```{r mnistdata, fig.cap = "Sample of MNIST dataset", out.width="70%", fig.asp=.75, fig.align="center", echo = FALSE}
knitr::include_graphics("images/MnistExamples.png")
```
## IMDB Dataset
The IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/) is a popular dataset for text and language-related machine learning tutorials. It is also conveniently included in the Keras library, and there are a few build-in functions in Keras for data loading and pre-processing. It contains 50,000 movie reviews (25,000 in training and 25,000 in testing) from IMDB, as well as each movie review’s binary sentiment: positive or negative. The raw data contains the text of each movie review, and it has to be pre-processed before being fitted with any machine learning models. By using Keras’s built-in functions, we can easily get the processed dataset (i.e., a numerical data frame) for machine learning algorithms. Keras’ build-in functions perform the following tasks to convert the raw review text into a data frame:
1. Convert text data into numerical data. Machine learning models cannot work with raw text data directly, and we have to convert text into numbers. There are many different ways for the conversion and Keras’ build-in function uses each word’s rank of frequency in the entire training dataset to replace the raw text in both the training and testing dataset. For example, the 10th most frequent word is replaced by integer 10. There are a few additional setups for this process, including:
a. Skip top frequent words. We usually skip a few top frequent words as they are mainly stopwords like “the” “and” or “a,” which usually do not provide much information. There is a parameter in the build-in function to specify how many top words to skip.
b. Set the maximum number of unique words. The entire vocabulary of the unique words in the training dataset may be large, and many of them have very low frequencies such as just appearing once in the entire training dataset. To keep the size of the vocabulary, we can also set up the maximum number of the unique words using Keras’ built-in function such that any words with least frequencies will be replaced with a special index such as “2”.
2. Padding or truncation to keep all the reviews to be the same length. For most machine learning models, the algorithms expect to see the same number of features (i.e., same number of input columns in the data frame). There is a parameter in the Keras build-in function to set the maximum number of words in each review (i.e., `max_length`). For reviews that have less than `max_length` words, we pad them with “0”. For reviews that have more than `max_length` words, we truncate them.
After the above pre-processing, each review is represented by one row in the data frame. There is one column for the binary positive/negative sentiment, and `max_length` columns input features converted from the raw review text. In the corresponding R and Python notebooks, we will go over the details of the data pre-processing using Keras’ built-in functions.