A statistical deep-dive into Seattle, WA housing sales data — using ANOVA, Tukey's HSD, and regression analysis to identify which property attributes most significantly influence sale price.
Real estate agencies face a critical challenge: understanding which property features actually drive sale prices. This project analyzes approximately 20,000 housing sales records from the Seattle, Washington area (May 2014 – May 2015) to identify the factors that most significantly influence property sale prices.
We focused on three main areas of influence:
- Property age and renovation status — including time elapsed since last renovation
- Condition and grade ratings — individually and in combination
- Living-lot ratio — interior living space relative to total lot size
Statistical methods used include single-factor ANOVA, Tukey's HSD post-hoc testing, simple linear regression, and multiple regression analysis — all implemented in R/RStudio.
Key Result: Combined condition and grade ratings explain 52.89% of price variability (adjusted R² = 0.5289), far outperforming any single-factor model.
| # | Comparison | Research Question | Null Hypothesis | Method |
|---|---|---|---|---|
| H1 | Grade & Price | Does a property's grade rating influence its sale price? | H₀: μ₁ = μ₂ = … = μₖ (all grade means equal) | ANOVA + Tukey's HSD |
| H2 | Condition & Price | Does a property's condition influence its sale price? | H₀: μ₁ = μ₂ = … = μₖ (all condition means equal) | ANOVA + Tukey's HSD |
| H3 | Living-Lot Ratio & Price | How much influence does the living-lot ratio have on sale price? | H₀: β₁ = 0 | Simple Linear Regression |
| H4 | Renovation Age & Price | Does years since last renovation influence sale price? | H₀: β₁ = 0 | Simple Linear Regression |
| H5 | Condition + Grade & Price | Do condition and grade combined influence sale price? | H₀: β₁ = β₂ = 0 | Multiple Regression |
File: Spring_2025_16W_Project_Dataset.xlsx
Source: Seattle, WA housing sales — May 2014 to May 2015
Records: ~20,000 property sales
Sheets: Housing Data, Data Dictionary
| Column | Type | Description |
|---|---|---|
| id | String | Unique ID for each home sold |
| date.sold | Date | Date of the home sale |
| price | Float | Sale price of the property |
| bedrooms | Integer | Number of bedrooms |
| bathrooms | Float | Number of bathrooms (0.5 = toilet only, no shower) |
| sqft_living | Integer | Square footage of interior living space |
| sqft_lot | Integer | Square footage of total land area |
| floors | Float | Number of floors |
| waterfront | Binary | 1 = waterfront view, 0 = no waterfront |
| view | Integer | View quality index (0–4) |
| condition | Integer | Property condition index (1–5) |
| grade | Integer | Construction & design grade (1–13); 1–3 = below standard, 7 = average, 11–13 = high quality |
| sqft_above | Integer | Square footage above ground level |
| sqft_basement | Integer | Square footage of basement |
| yr_built | Integer | Year the property was originally built |
| yr_renovated | Integer | Year of last renovation (0 = never renovated) |
| zipcode | Integer | ZIP code of the property |
File: dataset_Project.xlsx
Additional columns derived for analysis:
| Feature | Description |
|---|---|
| building_age | Age of property at time of sale (year.sold − yr_built) |
| age_buckets | Categorized age groups: Under 20 / 20–50 / More than 50 years |
| yrs_since_reno | Years elapsed since last renovation |
| renovated | Binary flag: 1 = renovated, 0 = never renovated |
| has_basement | Binary flag: 1 = has basement, 0 = no basement |
| living_lot_ratio | sqft_living / sqft_lot (interior space relative to lot size) |
| age_w_reno | Combined age accounting for renovation status |
| year.sold | Year extracted from date.sold |
| ymo.sold | Year-month extracted for time-series grouping |
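As a rough illustration of how the derived columns above follow from the raw ones, here is a minimal Python sketch. The project built these features in Excel, so this is not the original workflow. The function name `derive_features`, the sample record, the handling of the bucket edges at exactly 20 and 50 years, the `YYYY-MM` format for ymo.sold, and the use of `None` for never-renovated homes are all assumptions. age_w_reno is left out because its exact definition is not given.

```python
from datetime import date


def derive_features(row):
    """Return the derived columns for one sale record (a dict of raw columns)."""
    year_sold = row["date.sold"].year
    building_age = year_sold - row["yr_built"]

    # Assumption: ages of exactly 20 and 50 fall into the middle bucket.
    if building_age < 20:
        age_bucket = "Under 20"
    elif building_age <= 50:
        age_bucket = "20–50"
    else:
        age_bucket = "More than 50"

    # yr_renovated == 0 means the home was never renovated.
    renovated = 1 if row["yr_renovated"] > 0 else 0
    # Assumption: never-renovated homes get None rather than a sentinel value.
    yrs_since_reno = year_sold - row["yr_renovated"] if renovated else None

    return {
        "year.sold": year_sold,
        "ymo.sold": row["date.sold"].strftime("%Y-%m"),  # assumed format
        "building_age": building_age,
        "age_buckets": age_bucket,
        "renovated": renovated,
        "yrs_since_reno": yrs_since_reno,
        "has_basement": 1 if row["sqft_basement"] > 0 else 0,
        "living_lot_ratio": row["sqft_living"] / row["sqft_lot"],
    }


# Made-up record for illustration only; it is not taken from the dataset.
sample = {
    "date.sold": date(2014, 10, 13),
    "yr_built": 1955,
    "yr_renovated": 1991,
    "sqft_living": 1180,
    "sqft_lot": 5650,
    "sqft_basement": 0,
}
features = derive_features(sample)
print(features)
```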
Analytical sheets included:
| Sheet | Content |
|---|---|
| summary stats | Descriptive statistics of all numeric variables |
| price_by_grade | Average price broken down by grade rating |
| price_by_condition | Average price broken down by condition rating |
| price_by_condition_and_grade | Combined condition × grade price analysis |
| price_by_ratio | Price analysis by living-lot ratio |
| price_by_age | Price analysis by property age bucket |
| price_by_renovated_age | Price analysis by years since renovation |
- Descriptive statistics for all numeric variables
- Price distribution analysis (variance: 135,982,911,732, which corresponds to a standard deviation of roughly 369,000; highly dispersed)
- Property age distribution across six age brackets (0–120 years)
- Condition and grade frequency analysis
- Living-lot ratio outlier identification (townhomes vs. condominiums)
Used to test whether mean sale prices differ significantly across groups:
H₀: All group means are equal
Hₐ: Not all means are the same
Significance level: α = 0.05
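The F statistic behind this test is the ratio of between-group to within-group mean squares. The analysis itself was run in R/RStudio; the plain-Python sketch below only illustrates the arithmetic. The function name `one_way_anova_f` and the toy price groups are hypothetical. The critical value is not computed here, because Python's standard library has no F distribution.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean(x for g in groups for x in g)
    group_means = [fmean(g) for g in groups]

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy prices (in $100k) for three rating groups; not project data.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(toy_groups)
print(f_stat, df_between, df_within)
```

H₀ is rejected when the computed F exceeds the critical value for (df_between, df_within) at α = 0.05, which is the comparison reported in the ANOVA result tables.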
Post-hoc pairwise comparison to identify which specific groups have statistically different mean prices. Implemented in R/RStudio with 95% confidence intervals.
Tested relationships between a single explanatory variable and price:
Model: Price = β₀ + β₁X + ε
Evaluation: Adjusted R², correlation coefficient, p-value, residual plots
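The quantities used to evaluate each simple regression can be computed directly from sums of squares. The following is a minimal Python sketch of that calculation, not the project's R code. The function name `simple_ols` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import fmean


def simple_ols(x, y):
    """Fit y = b0 + b1 * x by least squares and return the usual summary values."""
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)          # Pearson correlation coefficient
    r2 = r * r                         # share of variation explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # adjusted for one predictor
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    return {
        "slope": slope,
        "intercept": intercept,
        "r": r,
        "r2": r2,
        "adj_r2": adj_r2,
        "residuals": residuals,
    }


# Hypothetical toy data for illustration; not taken from the housing dataset.
fit = simple_ols([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit["slope"], fit["intercept"], fit["adj_r2"])
```

Plotting `residuals` against the explanatory variable gives the residual plot used to check whether a straight line is an adequate description of the relationship.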
Combined condition and grade as co-predictors:
Model: Price = β₀ + β₁(Grade) + β₂(Condition) + ε
Evaluation: Adjusted R², individual p-values per category
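Because the evaluation reports a p-value per category, grade and condition appear to enter the model as categorical predictors, so each level gets its own 0/1 indicator column. The plain-Python sketch below illustrates that idea under this assumption; it is not the project's R code. All function names and the toy records are hypothetical, and the first level of each variable is arbitrarily taken as the reference.

```python
def dummy_code(values, levels):
    """One 0/1 column per level, skipping the first (reference) level."""
    return [[1.0 if v == lvl else 0.0 for lvl in levels[1:]] for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def adjusted_r2(y, fitted, n_predictors):
    """Adjusted R-squared, penalizing R-squared for the number of predictors."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - n_predictors - 1)


# Hypothetical toy records: (grade, condition, price in $1000s). Not project data.
records = [(7, 3, 310), (7, 4, 325), (8, 3, 395), (8, 4, 430),
           (9, 3, 590), (9, 4, 615), (7, 3, 295), (9, 4, 640)]
grades = [g for g, _, _ in records]
conditions = [c for _, c, _ in records]
prices = [p for _, _, p in records]

# Design matrix: intercept, then indicators for grade 8, grade 9, condition 4.
X = [[1.0] + g + c
     for g, c in zip(dummy_code(grades, [7, 8, 9]), dummy_code(conditions, [3, 4]))]
coefs = fit_ols(X, prices)
fitted = [sum(b * x for b, x in zip(coefs, row)) for row in X]
adj = adjusted_r2(prices, fitted, n_predictors=len(X[0]) - 1)
print(coefs, adj)
```

Each coefficient after the intercept is the estimated price difference between that level and the reference level, holding the other variable fixed.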
Table 1 — ANOVA results: Grade vs. Sale Price
| Metric | Value |
|---|---|
| F Statistic | 1937.97 |
| F Critical | 1.789 |
| p-value | ≈ 0 (< 0.05) |
| Decision | ✅ Reject H₀ |
Grade has a statistically significant influence on sale price. Higher grade evaluations (9+) are associated with distinctly higher average sale prices.
Figure 1 — 95% Confidence Intervals for pairwise grade differences (R/RStudio)
- Low-to-average grade pairings (≤ 9) tend to include zero in their confidence intervals → not statistically significant
- Pairings involving grades 9+ have the largest, most distinct average price differences
- The higher the grade, the more pronounced its effect on price
Table 2 — ANOVA results: Condition vs. Sale Price
| Metric | Value |
|---|---|
| F Statistic | 35.146 |
| F Critical | 2.372 |
| p-value | ≈ 0 (< 0.05) |
| Decision | ✅ Reject H₀ |
Condition rating has a statistically significant influence on sale price.
Figure 2 — 95% Confidence Intervals for pairwise condition differences (R/RStudio)
- Confidence intervals for pairings 2-1 and 4-1 include zero → differences are not statistically significant
- Lower condition ratings have wider confidence intervals (fewer observations at conditions 1 & 2)
- Better condition → higher price, most pronounced at condition ratings 4 and 5
Table 3 — Regression results: Living-Lot Ratio vs. Sale Price
Table 4 — Residual plot: Living-Lot Ratio regression model
| Metric | Value |
|---|---|
| Adjusted R² | 0.0132 (1.32% of variation explained) |
| Correlation | 0.1153 (positive, weak) |
| Decision | ❌ Weak predictor on its own |
Living-lot ratio alone explains only 1.32% of the variation in sale price, so it is a poor predictor by itself. However, residual patterns suggest it may contribute meaningfully in a multi-variable model.
Table 5 — Regression results: Renovation Age vs. Sale Price
| Metric | Value |
|---|---|
| Adjusted R² | 0.0074 (0.74% of variation explained) |
| Correlation | −0.0858 (negative, extremely weak) |
| p-value | 1.09E-34 |
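| Decision | ⚠️ Reject H₀ (p < 0.05), but the effect is negligible |
Years since renovation has a statistically detectable relationship with sale price, but it explains less than 1% of price variation, so it is not a meaningful predictor on its own.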
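The p-value and the adjusted R² in Table 5 answer different questions: the first asks whether the slope can be distinguished from zero, the second how much of the price variation it accounts for. With a large sample, even a very weak correlation yields a tiny p-value. The Python sketch below illustrates this using the reported correlation. The sample size of 20,000 is an assumption based on the approximate record count, the normal approximation stands in for the exact t distribution, and the function names are hypothetical.

```python
from math import erfc, sqrt


def slope_t_statistic(r, n):
    """t statistic for H0: beta1 = 0 in simple regression, from r and n."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def two_sided_p_normal(t):
    """Two-sided p-value via the normal approximation (reasonable for large n)."""
    return erfc(abs(t) / sqrt(2))


r = -0.0858   # correlation reported in Table 5
n = 20_000    # assumption: the write-up only gives "~20,000" records

t = slope_t_statistic(r, n)
p = two_sided_p_normal(t)
r_squared = r * r

print(f"t = {t:.2f}, p = {p:.2e}, R^2 = {r_squared:.4f}")
```

Under these assumptions the test statistic is about 12 standard errors from zero, so the p-value is far below 0.05, while R² stays under 1%. The relationship is statistically detectable but practically negligible.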
Table 6 — Multiple regression results: Condition + Grade vs. Sale Price
| Metric | Value |
|---|---|
| Adjusted R² | 0.5289 (52.89% of variation explained) |
| Decision | ✅ Reject H₀ |
The combined condition + grade model is dramatically more powerful than any single-factor model. Grade evaluations of 9 and above and condition rating 5 are the most statistically significant predictors.
| Factor | Impact on Price | Statistical Significance |
|---|---|---|
| Grade (single) | Strong positive effect — higher grade = higher price | ✅ Significant (F = 1937.97) |
| Condition (single) | Positive effect — better condition = higher price | ✅ Significant (F = 35.146) |
| Living-Lot Ratio | Very weak positive relationship | ❌ Negligible explanatory power alone (R² = 0.013) |
| Renovation Age | Extremely weak negative relationship | ❌ Statistically detectable (p ≈ 1.09E-34) but negligible alone (R² = 0.007) |
| Grade + Condition (combined) | Explains 52.89% of price variation | ✅ Strongest model |
Top actionable insights for real estate agents:
- Prioritize grade (9+) and condition (5) when advising clients on pricing strategy
- Properties with above-average grade ratings command the largest price premiums
- Living-lot ratio and renovation age alone are poor price predictors — use them only as part of a broader model
- Future models should incorporate additional variables (location, waterfront, sqft) to push explanatory power beyond 52.89%
| Tool | Purpose |
|---|---|
| R | Statistical analysis (ANOVA, Tukey's HSD, regression) |
| RStudio | R development environment |
| Microsoft Excel | Data cleaning, feature engineering, and summary statistics |
| ggplot2 (R) | Visualization of confidence intervals and regression plots |