* The number of rooms
* Distance to employment centres
* How rich or poor the area is
* How many students there are per teacher in local schools, etc.
Imports: "scikit-learn" as "import sklearn"
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- PRICE Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
Use DataFrame.describe() to show the min, max, mean, std dev, count, and quartiles for every column.
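A one-liner, assuming the dataset has been loaded into a DataFrame called `data` (as in the rest of these notes):

```python
# Count, mean, std, min, quartiles, and max for every column
data.describe()
```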
GOAL: Use Seaborn's .displot()
to create a histogram and superimpose the Kernel Density Estimate (see the sketch after this list) for
1. PRICE
2. RM
3. DIS
4. RAD
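A sketch of one approach; displot() is figure-level, so each call creates its own figure:

```python
# Histogram with superimposed KDE for each column of interest
for col in ["PRICE", "RM", "DIS", "RAD"]:
    g = sns.displot(x=data[col], kde=True)
    g.fig.suptitle(f"Distribution of {col}")
plt.show()
```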
Adding titles in #seaborn : instance.fig.suptitle("TITLE")
Some seaborn plots return a matplotlib Axes object; these are axes-level. Others are figure-level and return a seaborn object such as a FacetGrid (example below).
#matplotlib labeling:
Axes.set_xlabel(label)
Axes.set_ylabel(label)
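A quick illustration of the two return types, together with the labeling methods above:

```python
# Axes-level: scatterplot returns a matplotlib Axes
ax = sns.scatterplot(data=data, x="RM", y="PRICE")
ax.set_xlabel("Average rooms per dwelling")
ax.set_ylabel("Price ($1000s)")

# Figure-level: displot returns a seaborn FacetGrid
g = sns.displot(data=data, x="PRICE", kde=True)
g.fig.suptitle("Price distribution")
```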
- What would you expect the relationship to be between pollution (NOX) and the distance to employment (DIS)?
I would expect pollution to increase as distance to employment decreases; an inverse relationship.
- What kind of relationship do you expect between the number of rooms (RM) and the home value (PRICE)?
I would expect home value to increase as the number of rooms increases.
- What about the amount of poverty in an area (LSTAT) and home prices?
I would expect area poverty to decrease as home prices increase.
A #pairplot allows you to visualize relationships between columns. More on Seaborn pairplots
To include a #regression-line:
nox_dis = data[["NOX", "DIS"]]
sns.pairplot(nox_dis, kind="reg", plot_kws={"line_kws": {"color": "cyan"}})
#jointplot documentation
with sns.axes_style('darkgrid'):
    sns.jointplot(data=data, x="DIS", y="NOX",
                  height=8,
                  kind="scatter",
                  color="deeppink",
                  joint_kws={'alpha': 0.5})
Using #train_test_split documentation
from sklearn.model_selection import train_test_split
#regression
regression = LinearRegression()
regression.fit(X_train, y_train)
intercept = regression.intercept_
slope = regression.coef_
r_sqrd = regression.score(X_train, y_train)
print(f"R squared: {r_sqrd:.2}")
- Create subsets
from sklearn.model_selection import train_test_split
X, y = data[[col for col in data.columns if col != "PRICE"]], data["PRICE"]
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)
price_regression = LinearRegression()
price_regression.fit(X_train, y_train)
intercept = price_regression.intercept_
slope = price_regression.coef_
print(f"intercept: {intercept}\nslope:{sorted([s for s in slope])}")
^ coefficients OUT:
intercept: 36.53305138282431
slope: [-16.271988951469734, -1.4830135966050273, -0.8203056992885642, -0.581626431182139, -0.12818065642264795, -0.012082071043592574, -0.00757627601533797, 0.011418989022213357, 0.01629221534560711, 0.06319817864608888, 0.30398820612116106, 1.9745145165622597, 3.1084562454033]
#dataframe
regr_coef = pd.DataFrame(data=price_regression.coef_,
                         index=X_train.columns,
                         columns=["Coefficient"])
regr_coef
price_regression.score(X_train, y_train)
^^ r-squared
Residuals are the difference between our model's predictions and the true values from y_train.
predicted_values = price_regression.predict(X_train)
residuals = y_train - predicted_values
- Actual prices on the x-axis, predicted on the y-axis
- The distances of the data points from the regression line are the residuals
- Predicted price on the x-axis
- Residuals on the y-axis
We can detect flaws in our model by analyzing the errors. If there is a pattern, we have a systematic error, i.e. a flaw in our model. Ideally any errors would be explained entirely by chance.
We are particularly interested in the #skew and #mean of the #residuals
Ideally the residuals form a perfect bell curve: a mean of zero and a skew of zero, meaning the distribution is completely symmetrical.
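A quick numeric check of both, assuming residuals is the Series computed above:

```python
# Both should be close to zero for a well-behaved model
print(f"Residuals mean: {residuals.mean():.2}")
print(f"Residuals skew: {residuals.skew():.2}")
```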
#seaborn #format #title
y_hat = price_regression.predict(X_train)
y_i = y_train
actual_vs_predicted_prices = sns.scatterplot(x=y_hat, y=y_i)
actual_vs_predicted_prices.set_title("Actual v Predicted Prices")
actual_vs_predicted_prices.set_xlabel("Predicted Price (Y-hat)")
actual_vs_predicted_prices.set_ylabel("Actual price (yi)")
![[Pasted image 20240124182816.png]]
Alternatively, using bare #matplotlib :
actual = y_train
predicted_values = regression.predict(X_train)
residuals = actual - predicted_values
plt.figure()
plt.scatter(x=y_train, y=predicted_values, c='indigo', alpha=0.6)
# Plotting y_train against y_train draws the line y = x: points on this line would be perfect predictions
plt.plot(y_train, y_train, color='cyan')
# with sns.axes_style('white'):
#     pred_v_actual_regplot = sns.regplot(x=actual, y=predicted_values)
#     pred_v_actual_regplot.set_xlabel("Actual price")
#     pred_v_actual_regplot.set_ylabel("predicted price")
# with sns.axes_style('darkgrid'):
#     plt.figure()
#     res_vs_predicted_scatter = sns.scatterplot(x=predicted_values, y=residuals)
#     res_vs_predicted_scatter.set_xlabel("Predicted prices")
#     res_vs_predicted_scatter.set_ylabel("Residuals")
Now plot predicted price on the x-axis and the residuals (actual minus predicted) on the y-axis.
#kde superimpose a KDE over the histogram representation of a Series:
res_plot = sns.displot(x=residuals, kde=True)
![[Pasted image 20240124190502.png]]
At this point we must either consider a new model entirely, or transform our data (actual) to make it better fit our linear model.
Is data["PRICE"]
a good candidate for log transformation?
target = data["PRICE"]
target_skew = target.skew()
sns.displot(target, kde=True, color='green')
plt.title(f"Target has a skew of {target_skew:.3}")
price_plot = sns.displot(x=data["PRICE"], kde=True)
price_skew = data.PRICE.skew()
print(f"Our price data has a skew of {price_skew}")
![[Pasted image 20240124191144.png]]
Use np.log() to create a Series of the logarithmic prices.
Compare the skews. Which is closer to zero?
log_price = np.log(data.PRICE)
print(f"The log data has a skew of {log_price.skew()}")
log_price_plot = sns.displot(x=log_price, kde=True)
![[Pasted image 20240124191355.png]]
The log-transformed prices have a skew much closer to 0.
- Every datum is replaced by its natural log (ln).
- Large values are more affected than smaller ones; they are 'compressed' (quick check below).
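A quick check of the compression:

```python
# A 100x gap in raw values shrinks to ~4.6 in log space
print(np.log(10))    # ~2.30
print(np.log(1000))  # ~6.91
```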
If we use log prices, our model becomes: $$ \log(\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \dots + \theta_{13} LSTAT $$
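A sketch of fitting that model, assuming we re-split with the log-transformed prices as the target (the `log_*` names here are my own):

```python
# Split again, this time predicting log prices
X_train_log, X_test_log, log_y_train, log_y_test = train_test_split(
    X, np.log(data["PRICE"]), test_size=0.2, random_state=10)

log_regression = LinearRegression()
log_regression.fit(X_train_log, log_y_train)
print(f"Log-price R squared: {log_regression.score(X_train_log, log_y_train):.2}")
```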
y_hat is predicted prices, y_i is actual prices:
y_hat = price_regression.predict(X_train)
# y_i is actual prices
y_i = y_train
actual_vs_predicted_prices = sns.scatterplot(x=y_hat, y=y_i)
actual_vs_predicted_prices.set_title("Actual v Predicted Prices")
actual_vs_predicted_prices.set_xlabel("Predicted Price (Y-hat)")
actual_vs_predicted_prices.set_ylabel("Actual price (yi)")
residuals = y_i - y_hat
# y_train - price_regression.predict(X_train)
res_vs_predicted = sns.scatterplot(x=y_hat, y=residuals)
res_vs_predicted.set_title("Residuals v Predicted Prices")
res_vs_predicted.set_xlabel("Predicted Price (Y-hat)")
res_vs_predicted.set_ylabel("Residuals")
residuals = y_train - regression_object.predict(X_train)  # actual values minus predicted values
To prevent plots from overlapping, call plt.figure() before each plot instantiation:
plt.title(r'Actual vs Predicted Prices: $y_i$ vs $\hat y_i$', fontsize=17)
plt.xlabel(r"Actual prices 000s $y_i$", fontsize=14)
plt.ylabel(r"Predicted prices 000s $\hat y_i$", fontsize=14)
plt.show()