Data Science Project Plan
KEY INFORMATION
y-variable: funded (1) or expired (0)
Data variations
- All loan data
- All loan data + country data
- All loan data + country data + partner data
- All loan data + country data + partner data + text clustering on loan_use variable
Legend
*- completed
>- next steps
PLAN
1. Data visualisations to give an overview of the data
overview
* loan count: by country, by year, by sector
status_bin
* by country (map)
* by year
* by sector
* by country by year (map)
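The tables behind these charts could be built roughly as follows. This is a dependency-free sketch on made-up rows: the column names country, year and sector are assumptions, and status_bin is assumed to be coded 1 = funded, 0 = expired.

```python
from collections import Counter, defaultdict

# Made-up sample rows; column names country, year and sector are assumed.
# Assumed coding of status_bin: 1 = funded, 0 = expired.
loans = [
    {"country": "Kenya", "year": 2012, "sector": "Agriculture", "status_bin": 1},
    {"country": "Kenya", "year": 2013, "sector": "Retail", "status_bin": 0},
    {"country": "Peru", "year": 2012, "sector": "Agriculture", "status_bin": 1},
    {"country": "Peru", "year": 2013, "sector": "Food", "status_bin": 1},
    {"country": "Kenya", "year": 2013, "sector": "Food", "status_bin": 1},
]

def loan_count(rows, key):
    """Number of loans per value of `key` (e.g. per country)."""
    return Counter(row[key] for row in rows)

def funded_share(rows, key):
    """Share of loans with status_bin == 1 per value of `key`."""
    totals, funded = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        funded[row[key]] += row["status_bin"]
    return {group: funded[group] / totals[group] for group in totals}

# One count table and one funded-share table per grouping in the plan.
for key in ("country", "year", "sector"):
    print(key, dict(loan_count(loans, key)), funded_share(loans, key))
```

The same two helpers cover every grouping listed above; the map views would plot the per-country output.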
2. Clean data
* drop columns that have lots of NAs
* drop rows that have NAs in the variables you want to keep
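A minimal sketch of the two cleaning steps, assuming missing values are stored as None and that "lots of NAs" means more than half of a column. The 0.5 threshold, the helper names and the sample columns are all hypothetical.

```python
def drop_sparse_columns(rows, max_na_share=0.5):
    """Drop columns whose share of missing (None) values exceeds max_na_share."""
    columns = list(rows[0].keys())
    keep = [
        col for col in columns
        if sum(row[col] is None for row in rows) / len(rows) <= max_na_share
    ]
    return [{col: row[col] for col in keep} for row in rows]

def drop_incomplete_rows(rows, required):
    """Drop rows with a missing value in any of the required columns."""
    return [row for row in rows if all(row[col] is not None for col in required)]

# Made-up rows; column names are assumptions.
raw = [
    {"loan_amount": 500, "sector": "Food", "video_id": None},
    {"loan_amount": None, "sector": "Retail", "video_id": None},
    {"loan_amount": 250, "sector": "Food", "video_id": 17},
    {"loan_amount": 900, "sector": None, "video_id": None},
]
step1 = drop_sparse_columns(raw)  # video_id is 75% missing, so it is dropped
clean = drop_incomplete_rows(step1, required=["loan_amount", "sector"])
print(clean)
```

Dropping sparse columns first matters: otherwise a mostly empty column could remove almost every row in the second step.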
3. Logistic Regression
> Run a logistic regression for each variable -> see which variable is the best predictor
> Run a logistic regression with all variables -> see which variables are the most important
> Get features from this and use for decision trees
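The "one regression per variable" step could look something like the sketch below. It fits a one-feature logistic regression by plain gradient descent and ranks variables by log loss. This is a hand-rolled illustration, not a replacement for a statistics library; the feature names and values are made up, and ranking by log loss is one possible reading of "best predictor".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(x, y, lr=0.5, epochs=2000):
    """Fit y ~ sigmoid(w * x + b) by batch gradient descent; return (w, b)."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        errors = [sigmoid(w * xi + b) - yi for xi, yi in zip(x, y)]
        w -= lr * sum(e * xi for e, xi in zip(errors, x)) / n
        b -= lr * sum(errors) / n
    return w, b

def log_loss(x, y, w, b):
    """Mean negative log-likelihood of the fitted model (lower is better)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = min(max(sigmoid(w * xi + b), 1e-12), 1 - 1e-12)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(x)

# Made-up data; assumed coding of status_bin: 1 = funded, 0 = expired.
status_bin = [1, 1, 1, 0, 1, 0, 0, 0]
features = {  # hypothetical variables, already scaled to [0, 1]
    "loan_amount_scaled": [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
    "noise_feature": [0.5, 0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.6],
}

# One regression per variable, then rank by log loss.
losses = {}
for name, values in features.items():
    w, b = fit_logistic_1d(values, status_bin)
    losses[name] = log_loss(values, status_bin, w, b)

best = min(losses, key=losses.get)
print(losses, "best single predictor:", best)
```

On real data the comparison should use a held-out set, since in-sample log loss favours variables that overfit.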
4. Decision Tree
* Create decision tree
* Evaluate (accuracy score and ROC)
* Fine-tune the tree
* Compare against logistic
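The evaluation and comparison steps can be sketched without any modelling library. The snippet below computes accuracy and the area under the ROC curve from predicted scores, using the rank interpretation of AUC (the probability that a random funded loan scores higher than a random expired one). The scores for the two models are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of "funded" from two models.
y_true = [1, 1, 1, 0, 1, 0]
tree_scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.2]
logit_scores = [0.6, 0.7, 0.5, 0.55, 0.4, 0.3]

tree_pred = [1 if s >= 0.5 else 0 for s in tree_scores]  # 0.5 cut-off is an assumption
tree_acc = accuracy(y_true, tree_pred)
tree_auc = roc_auc(y_true, tree_scores)
logit_auc = roc_auc(y_true, logit_scores)
print(tree_acc, tree_auc, logit_auc)
```

AUC compares the models on their score rankings, so unlike accuracy it does not depend on the choice of cut-off.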
5. Random Forest
- Create random forest (RF; likely to give better results than the decision tree)
- Evaluate (accuracy score and ROC)
- Fine-tune
- Compare against Logistic + Decision tree
6. Applying clustering analysis to loan data
* Clean data
* Create document-term matrix
* Apply LDA model
* Visualisation
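As an illustration of the document-term matrix step only, the sketch below turns a few made-up loan_use texts into a matrix of word counts. The stopword list and tokenisation rule are assumptions; fitting the LDA model itself is not shown and would need a topic-modelling library.

```python
import re
from collections import Counter

# Tiny hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"to", "a", "for", "and", "the", "of"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

# Made-up loan_use descriptions.
docs = [
    "to buy seeds and fertilizer",
    "to buy a sewing machine",
    "to pay for school fees",
]

counts = [Counter(tokenize(doc)) for doc in docs]
vocab = sorted(set().union(*counts))              # one column per distinct term
dtm = [[c[term] for term in vocab] for c in counts]  # one row per document

print(vocab)
for row in dtm:
    print(row)
```

Each row of dtm is one loan and each column one term, which is the input shape an LDA implementation expects.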
7. Conclusions
OTHER ITEMS
Things to consider:
- Skewed dependent variable: does it affect my model and results?
- Make sure you define what you are predicting (trying to predict funded or expired?)
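One way to check whether the skew matters is to compare any model against the trivial "always predict the majority class" baseline. The sketch below uses a made-up 90/10 split (not the real class balance) and shows that such a baseline scores high on accuracy but only 0.5 on balanced accuracy.

```python
from collections import Counter

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most common class."""
    return Counter(y).most_common(1)[0][1] / len(y)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Made-up labels; assumed coding: 1 = funded, 0 = expired.
status_bin = [1] * 9 + [0]
always_funded = [1] * len(status_bin)

baseline = majority_baseline_accuracy(status_bin)
balanced = balanced_accuracy(status_bin, always_funded)
print(baseline, balanced)
```

A model is only useful if it beats this baseline, and the choice of positive class (funded or expired) decides which recall is the one to watch.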
Other ideas:
- Restrict sample to the period after the 30-day expiry was introduced (gets rid of unhelpful data)
- Alternative y-variable
Using funded data: time it took to fund as
Using expired data: share of loan that was funded
- Recommendation engines for recommending loans to particular lenders
- Applying clustering analysis to loan data