* The number of rooms
* Distance to employment centres
* How rich or poor the area is
* How many students there are per teacher in local schools, etc.
Imports: "scikit-learn" as "import sklearn"
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- PRICE Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
Use DataFrame.describe() to show the min, max, mean, std dev, count, and quartiles for every column.
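A one-liner, assuming the dataset has been loaded into a DataFrame called `data` (as in the rest of these notes):

```python
# Count, mean, std, min, quartiles, and max for every column
data.describe()
```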
GOAL: Use Seaborn's .displot()
to create a histogram and superimpose the Kernel Density Estimate (see the sketch after this list) for
1. PRICE
2. RM
3. DIS
4. RAD
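A sketch of one approach; displot() is figure-level, so each call creates its own figure:

```python
# Histogram with superimposed KDE for each column of interest
for col in ["PRICE", "RM", "DIS", "RAD"]:
    g = sns.displot(x=data[col], kde=True)
    g.fig.suptitle(f"Distribution of {col}")
plt.show()
```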
Adding titles in #seaborn : instance.fig.suptitle("TITLE")
Some seaborn plots return a matplotlib Axes object; these are axes-level. Others are figure-level and return a seaborn object such as a FacetGrid (example below).
#matplotlib labeling:
Axes.set_xlabel(label)
Axes.set_ylabel(label)
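A quick illustration of the two return types, together with the labeling methods above:

```python
# Axes-level: scatterplot returns a matplotlib Axes
ax = sns.scatterplot(data=data, x="RM", y="PRICE")
ax.set_xlabel("Average rooms per dwelling")
ax.set_ylabel("Price ($1000s)")

# Figure-level: displot returns a seaborn FacetGrid
g = sns.displot(data=data, x="PRICE", kde=True)
g.fig.suptitle("Price distribution")
```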
- What would you expect the relationship to be between pollution (NOX) and the distance to employment (DIS)?
I would expect pollution to increase as distance to employment decreases; an inverse relationship.
- What kind of relationship do you expect between the number of rooms (RM) and the home value (PRICE)?
I would expect home value to increase as the number of rooms increases.
- What about the amount of poverty in an area (LSTAT) and home prices?
I would expect area poverty to decrease as home prices increase.
A #pairplot allows you to visualize relationships between columns. More on Seaborn pairplots
To include a #regression-line:
nox_dis = data[["NOX", "DIS"]]
sns.pairplot(nox_dis, kind="reg", plot_kws={"line_kws": {"color": "cyan"}})
#jointplot documentation
with sns.axes_style('darkgrid'):
    sns.jointplot(data=data, x="DIS", y="NOX",
                  height=8,
                  kind="scatter",
                  color="deeppink",
                  joint_kws={'alpha': 0.5})
Using #train_test_split documentation
from sklearn.model_selection import train_test_split
#regression
regression = LinearRegression()
regression.fit(X_train, y_train)
intercept = regression.intercept_
slope = regression.coef_
r_sqrd = regression.score(X_train, y_train)
print(f"R squared: {r_sqrd:.2}")
- Create subsets
from sklearn.model_selection import train_test_split
X, y = data[[col for col in data.columns if col != "PRICE"]], data["PRICE"]
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)
price_regression = LinearRegression()
price_regression.fit(X_train, y_train)
intercept = price_regression.intercept_
slope = price_regression.coef_
print(f"intercept: {intercept}\nslope:{sorted([s for s in slope])}")
^ coefficients OUT:
intercept: 36.53305138282431
slope: [-16.271988951469734, -1.4830135966050273, -0.8203056992885642, -0.581626431182139, -0.12818065642264795, -0.012082071043592574, -0.00757627601533797, 0.011418989022213357, 0.01629221534560711, 0.06319817864608888, 0.30398820612116106, 1.9745145165622597, 3.1084562454033]
#dataframe
regr_coef = pd.DataFrame(data=price_regression.coef_,
                         index=X_train.columns,
                         columns=["Coefficient"])
regr_coef
price_regression.score(X_train, y_train)
^^ r-squared
Residuals are the difference between our model's predictions and the true values from y_train.
predicted_values = price_regression.predict(X_train)
residuals = y_train - predicted_values
- Actual prices on the x-axis, predicted on the y-axis
- The distances of the data points from the regression line are the residuals
- Predicted price on the x-axis
- Residuals on the y-axis
We can detect flaws in our model by analyzing the errors. If there is a pattern, we have a systematic error, i.e. a flaw in our model. Ideally any errors would be explained entirely by chance.
We are particularly interested in the #skew and #mean of the #residuals
Ideally the residuals form a perfect bell curve: a mean of zero and a skew of zero, meaning the distribution is completely symmetrical.
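A quick numeric check of both, assuming residuals is the Series computed above:

```python
# Both should be close to zero for a well-behaved model
print(f"Residuals mean: {residuals.mean():.2}")
print(f"Residuals skew: {residuals.skew():.2}")
```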
#seaborn #format #title
y_hat = price_regression.predict(X_train)
y_i = y_train
actual_vs_predicted_prices = sns.scatterplot(x=y_hat, y=y_i)
actual_vs_predicted_prices.set_title("Actual v Predicted Prices")
actual_vs_predicted_prices.set_xlabel("Predicted Price (Y-hat)")
actual_vs_predicted_prices.set_ylabel("Actual price (yi)")
![[Pasted image 20240124182816.png]]
Alternatively, using bare #matplotlib :
actual = y_train
predicted_values = regression.predict(X_train)
residuals = actual - predicted_values
plt.figure()
plt.scatter(x=y_train, y=predicted_values, c='indigo', alpha=0.6)
# Plotting y_train against y_train draws the line y = x: points on this line would be perfect predictions
plt.plot(y_train, y_train, color='cyan')
# with sns.axes_style('white'):
#     pred_v_actual_regplot = sns.regplot(x=actual, y=predicted_values)
#     pred_v_actual_regplot.set_xlabel("Actual price")
#     pred_v_actual_regplot.set_ylabel("predicted price")
# with sns.axes_style('darkgrid'):
#     plt.figure()
#     res_vs_predicted_scatter = sns.scatterplot(x=predicted_values, y=residuals)
#     res_vs_predicted_scatter.set_xlabel("Predicted prices")
#     res_vs_predicted_scatter.set_ylabel("Residuals")
Now plot predicted price on the x-axis and the residuals (actual minus predicted) on the y-axis.
#kde superimpose a KDE over the histogram representation of a Series:
res_plot = sns.displot(x=residuals, kde=True)
![[Pasted image 20240124190502.png]]
At this point we must either consider a new model entirely, or transform our data (actual) to make it better fit our linear model.
Is data["PRICE"]
a good candidate for log transformation?
target = data["PRICE"]
target_skew = target.skew()
sns.displot(target, kde=True, color='green')
plt.title(f"Target has a skew of {target_skew:.3}")
price_plot = sns.displot(x=data["PRICE"], kde=True)
price_skew = data.PRICE.skew()
print(f"Our price data has a skew of {price_skew}")
![[Pasted image 20240124191144.png]]
Use np.log() to create a Series of the logarithmic prices.
Compare the skews. Which is closer to zero?
log_price = np.log(data.PRICE)
print(f"The log data has a skew of {log_price.skew()}")
log_price_plot = sns.displot(x=log_price, kde=True)
![[Pasted image 20240124191355.png]]
The log-transformed prices have a skew much closer to 0.
- Every datum is replaced by its natural log (ln).
- Large values are more affected than smaller ones; they are 'compressed' (quick check below).
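A quick check of the compression:

```python
# A 100x gap in raw values shrinks to ~4.6 in log space
print(np.log(10))    # ~2.30
print(np.log(1000))  # ~6.91
```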
If we use log prices, our model becomes: $$ \log(\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \dots + \theta_{13} LSTAT $$
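A sketch of fitting that model, assuming we re-split with the log-transformed prices as the target (the `log_*` names here are my own):

```python
# Split again, this time predicting log prices
X_train_log, X_test_log, log_y_train, log_y_test = train_test_split(
    X, np.log(data["PRICE"]), test_size=0.2, random_state=10)

log_regression = LinearRegression()
log_regression.fit(X_train_log, log_y_train)
print(f"Log-price R squared: {log_regression.score(X_train_log, log_y_train):.2}")
```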
y_hat is predicted prices, y_i is actual prices:
y_hat = price_regression.predict(X_train)
# y_i is actual prices
y_i = y_train
actual_vs_predicted_prices = sns.scatterplot(x=y_hat, y=y_i)
actual_vs_predicted_prices.set_title("Actual v Predicted Prices")
actual_vs_predicted_prices.set_xlabel("Predicted Price (Y-hat)")
actual_vs_predicted_prices.set_ylabel("Actual price (yi)")
residuals = y_i - y_hat
# y_train - price_regression.predict(X_train)
res_vs_predicted = sns.scatterplot(x=y_hat, y=residuals)
res_vs_predicted.set_title("Residuals v Predicted Prices")
res_vs_predicted.set_xlabel("Predicted Price (Y-hat)")
res_vs_predicted.set_ylabel("Residuals")
residuals = y_train - regression_object.predict(X_train)  # actual values minus predicted values
To prevent plots from overlapping, call plt.figure() before each plot instantiation:
plt.title(r'Actual vs Predicted Prices: $y_i$ vs $\hat y_i$', fontsize=17)
plt.xlabel(r"Actual prices 000s $y_i$", fontsize=14)
plt.ylabel(r"Predicted prices 000s $\hat y_i$", fontsize=14)
plt.show()