Using Neural Networks to Predict Home Values in Los Angeles County

Author

Will Sigal

Published

May 31, 2025


Introduction

This analysis examines real estate values across Los Angeles County using US Census data. We’ll test multiple machine learning approaches to predict median home values from demographic, economic, and housing characteristics at the census-tract level. The goal is to maximize predictive power while using only variables the ACS provides.

Key Objectives

  • Extract and process comprehensive census data for LA County
  • Engineer meaningful features from raw census variables
  • Compare multiple machine learning models for price prediction
  • Visualize model performance and provide practical recommendations

Data Sources

  • US Census Bureau: American Community Survey (ACS) 5-year estimates (2023)
  • Geographic Data: Census tract shapefiles for LA County
  • Variables: Housing characteristics, demographics, income, education, employment

Data Collection

Fetching Census Data via API

We’ll collect a comprehensive set of variables from the Census API, covering the categories below; a sketch of the request follows the list.

  • Housing tenure and characteristics

  • Population demographics

  • Income and employment

  • Educational attainment
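A minimal sketch of such a request against the ACS 5-year endpoint. The variable codes below are a small illustrative subset of the ~45 actually pulled (B25077_001E is the ACS code for median home value of owner-occupied units):

```python
import pandas as pd
import requests

# Illustrative subset of the ~45 ACS variables used in the full analysis
VARIABLES = {
    "B25077_001E": "MedianHomeValue",        # target: median value, owner-occupied units
    "B19013_001E": "MedianHouseholdIncome",
    "B25064_001E": "MedianGrossRent",
    "B01003_001E": "TotalPopulation",
}

def fetch_acs_tracts(year=2023, state="06", county="037"):
    """Fetch ACS 5-year estimates for every census tract in LA County (FIPS 06037)."""
    url = f"https://api.census.gov/data/{year}/acs/acs5"
    params = [
        ("get", ",".join(["NAME"] + list(VARIABLES))),
        ("for", "tract:*"),
        ("in", f"state:{state}"),
        ("in", f"county:{county}"),
    ]
    rows = requests.get(url, params=params, timeout=60).json()
    df = pd.DataFrame(rows[1:], columns=rows[0]).rename(columns=VARIABLES)
    # The API returns everything as strings; coerce the estimates to numeric
    for col in VARIABLES.values():
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Build the 11-digit GEOID used later to join against the tract shapefile
    df["GEOID"] = df["state"] + df["county"] + df["tract"]
    print("Data fetched successfully!")
    return df

census_df = fetch_acs_tracts()
```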

Data fetched successfully!

Data Overview

The census data has been successfully fetched and processed. We now have:

  • 2496 census tracts in Los Angeles County

  • 45 variables covering housing, demographics, income, education, and employment

  • Renamed columns for better readability and analysis

Key variables include:

  • MedianHomeValue: Our target variable for prediction

  • Housing characteristics: Tenure, rent, vacancy rates

  • Demographics: Population, race, age, language

  • Economic indicators: Income, employment, poverty rates

  • Education levels: High school and bachelor’s degree attainment

Feature Engineering

Creating Derived Metrics

We’ll engineer meaningful features that capture relationships between raw variables. These derived metrics often have stronger predictive power than raw counts.
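A sketch of a few such ratios, assuming the renamed column names from the data overview (the exact set in the post is larger, and the diversity formula shown is one common formulation rather than necessarily the post's):

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive ratio features from raw census counts (column names assumed)."""
    out = df.copy()
    # Affordability: monthly gross rent relative to annual household income
    out["Rent_to_Income_Ratio"] = out["MedianGrossRent"] / out["MedianHouseholdIncome"]
    # Labor market and poverty shares
    out["Unemployment_Rate"] = out["Unemployed"] / out["LaborForce"]
    out["Pct_Below_Poverty"] = out["Poverty_Below"] / out["Poverty_Total"]
    # Simpson-style diversity: 1 minus the sum of squared group shares
    race_cols = ["Race_White", "Race_Black", "Race_Asian", "Race_Other"]
    shares = out[race_cols].div(out["TotalPopulation"], axis=0)
    out["Racial_Diversity_Index"] = 1 - (shares ** 2).sum(axis=1)
    # Zero denominators yield infinities; convert to NaN so they can be cleaned
    return out.replace([np.inf, -np.inf], np.nan)
```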

Here is what the merged DataFrame looks like:

STATEFP COUNTYFP TRACTCE GEOID GEOIDFQ NAME NAMELSAD MTFCC FUNCSTAT ALAND ... Unemployment_Rate Industry_Employment_Rate Avg_Travel_Time_Minutes Non_White_Population Racial_Diversity_Index Rent_to_Income_Ratio Pct_Below_Poverty Pct_Speak_Only_English Pct_Speak_Spanish Population_Density
0 06 037 204920 06037204920 1400000US06037204920 2049.20 Census Tract 2049.20 G5020 S 909972 ... 0.113383 0.492006 15.300000 2141.0 0.982879 0.020113 0.150997 NaN NaN 2702.280949
1 06 037 205110 06037205110 1400000US06037205110 2051.10 Census Tract 2051.10 G5020 S 286962 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 06 037 320101 06037320101 1400000US06037320101 3201.01 Census Tract 3201.01 G5020 S 680504 ... 0.093970 0.655636 27.350000 2198.0 0.872756 0.018168 0.083458 NaN NaN 4999.235860
3 06 037 205120 06037205120 1400000US06037205120 2051.20 Census Tract 2051.20 G5020 S 1466242 ... 0.157821 0.479904 18.133333 3088.0 0.990666 0.046816 0.359930 NaN NaN 2324.991373
4 06 037 206010 06037206010 1400000US06037206010 2060.10 Census Tract 2060.10 G5020 S 1418137 ... 0.147111 0.614277 25.300000 2589.0 0.839815 0.021515 0.255874 NaN NaN 2460.975209

5 rows × 72 columns

The merged DataFrame has 2496 rows and 72 columns.

We’ve successfully done the following (a code sketch follows the list):

  1. Extracted mainland LA County by identifying the largest contiguous polygon (excluding islands)

  2. Merged census data with geographic boundaries using GEOID as the key

  3. Created population density metric using land area from shapefiles

  4. Cleaned infinite values that may have resulted from division operations
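A geopandas sketch of those four steps, under assumed file paths and column names:

```python
import geopandas as gpd
import numpy as np

# Load the TIGER/Line tract shapefile and keep LA County (FIPS 037)
tracts = gpd.read_file("tl_2023_06_tract.shp")   # path is an assumption
tracts = tracts[tracts["COUNTYFP"] == "037"].copy()

# 1. Mainland extraction: dissolve to one county shape (a MultiPolygon that
#    includes the Channel Islands) and keep only tracts on the largest polygon
county = tracts.dissolve().geometry.iloc[0]
mainland = max(county.geoms, key=lambda poly: poly.area)
tracts = tracts[tracts.geometry.representative_point().within(mainland)]

# 2. Merge census data with geographic boundaries using GEOID as the key
merged = tracts.merge(census_df, on="GEOID", how="inner")

# 3. Population density: people per square kilometer (ALAND is square meters)
merged["Population_Density"] = merged["TotalPopulation"] / (merged["ALAND"] / 1e6)

# 4. Clean infinite values left over from division operations
merged = merged.replace([np.inf, -np.inf], np.nan)
```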

Model Training and Evaluation

Summary of models:

  1. Random Forest - Ensemble of decision trees that averages many bootstrapped, feature-randomized trees to capture complex non-linear patterns while controlling over-fitting.

  2. Gradient Boosting - Sequentially adds small “weak” trees, each one trained to correct its predecessor’s residuals, producing a powerful model that steadily drives down bias on complex data.

  3. Ridge Regression - Ordinary least-squares with an L² penalty that shrinks coefficients, stabilizing estimates when predictors are correlated and reducing variance without sacrificing all linear interpretability.

  4. Linear Regression - Fits a single linear hyperplane by minimizing squared error, offering a fast, easily interpretable baseline when relationships are approximately linear.

  5. Neural Network - A multi-layer, non-linear function approximator that learns hierarchical feature interactions via back-propagation, excelling when data relationships are intricate and high-dimensional.

    • Our architecture: Input (~50 features) → 256 ReLU → 128 ReLU → 64 ReLU → 1 Sigmoid; Min-Max scaling, engineered ratios, SelectKBest feature filtering (≤50), dropout after each hidden layer, Adam optimizer with LR scheduling, early stopping on validation MSE.
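The post doesn’t pin down a framework, so here is a minimal PyTorch sketch of that architecture; the dropout rate and initial learning rate are assumptions:

```python
import torch
import torch.nn as nn

class HomeValueMLP(nn.Module):
    """256 -> 128 -> 64 ReLU stack with a sigmoid head, as described above.
    Targets are Min-Max scaled into [0, 1], which makes a sigmoid output valid."""
    def __init__(self, n_features: int = 50, dropout: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = HomeValueMLP()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when validation MSE plateaus; pair with early stopping
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=10)
```

The sigmoid head only makes sense because the targets are Min-Max scaled into [0, 1]; predictions are mapped back to dollars with the inverse transform.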

Here is a sketch of the model training and evaluation loop for the classical baselines (column names, preprocessing, and hyperparameters are illustrative rather than the exact code; the tuned models produce the table below):
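```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumed: `merged` is the engineered GeoDataFrame from above
data = merged.dropna(subset=["MedianHomeValue"])
X = data.select_dtypes("number").drop(columns=["MedianHomeValue"])
X = X.fillna(X.median())
y = data["MedianHomeValue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Min-Max scaling plus SelectKBest filtering to at most 50 features
scaler = MinMaxScaler().fit(X_train)
selector = SelectKBest(f_regression, k=min(50, X.shape[1])).fit(scaler.transform(X_train), y_train)
prep = lambda frame: selector.transform(scaler.transform(frame))

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
results = []
for name, est in models.items():
    est.fit(prep(X_train), y_train)
    pred = est.predict(prep(X_test))
    results.append({"Model": name,
                    "MAE": mean_absolute_error(y_test, pred),
                    "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
                    "R²": r2_score(y_test, pred)})
print(pd.DataFrame(results).sort_values("MAE"))
```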

Model                 MAE        RMSE       R²
Random Forest         $131,515   $188,255   0.737
Gradient Boosting     $131,080   $191,637   0.728
3-Layer Sigmoid MLP   $132,183   $192,780   0.725
Linear Regression     $141,345   $195,741   0.716
Ridge Regression      $156,781   $217,740   0.649

Summary of results:

  • The top three models are nearly tied: Gradient Boosting posts the lowest MAE ($131,080), Random Forest the best RMSE and R² (0.737), and the neural network sits within about 1% of both; Linear Regression trails modestly and Ridge Regression lags furthest.
  • Our NN has an MAE of about $132,000, which corresponds to a Mean Absolute Percentage Error of roughly 18.6%.

Feature Importance Analysis:


======================================================================
TOP 10 MOST IMPORTANT FEATURES
======================================================================
 1. Speak_Only_English                   105,408 ±  6,760
 2. Rental_Rate                           14,180 ±  3,175
 3. ACS_Year                              12,497 ±  2,298
 4. Education_BachelorsOrHigher            5,301 ±  1,210
 5. Racial_Diversity_Index                 4,853 ±  1,765
 6. Industry_TotalEmployed                 4,747 ±  1,548
 7. HousingUnits_Owner                     3,212 ±  1,423
 8. Speak_Spanish                          2,866 ±  1,017
 9. Unemployment_Rate                      2,616 ±    752
10. Poverty_Total                          2,473 ±  1,186
======================================================================
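The mean ± spread format suggests a permutation-style importance: shuffle one feature at a time and measure how much the error worsens. A sketch of how such a ranking could be produced with scikit-learn, reusing the fitted objects from the training sketch (shown here for the random forest):

```python
from sklearn.inspection import permutation_importance

# Increase in MAE (in dollars) when each feature is shuffled, mean ± std over repeats
rf = models["Random Forest"]
imp = permutation_importance(rf, prep(X_test), y_test,
                             scoring="neg_mean_absolute_error",
                             n_repeats=10, random_state=42)
names = X.columns[selector.get_support()]
for rank, i in enumerate(imp.importances_mean.argsort()[::-1][:10], start=1):
    print(f"{rank:2d}. {names[i]:<35} {imp.importances_mean[i]:>10,.0f} ± {imp.importances_std[i]:>7,.0f}")
```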

Looking deeper into the residuals:


============================================================
NEURAL NETWORK RESIDUAL ANALYSIS SUMMARY
============================================================
Mean Absolute Error (MAE):        $132,183
Root Mean Square Error (RMSE):    $192,780
Mean Absolute Percentage Error:   18.6%
Standard Deviation of Residuals:  $192,230
Median Absolute Error:            $90,429
Predictions within ±20%:          73.7%
Predictions within ±10%:          45.7%
============================================================
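These summary figures are easy to reproduce. A minimal sketch, assuming `y_test` from the training split and the network's dollar-scale predictions in a hypothetical array `nn_pred`:

```python
import numpy as np

resid = np.asarray(y_test) - nn_pred          # positive = model under-predicts
abs_pct = np.abs(resid) / np.asarray(y_test)  # per-tract absolute percentage error

print(f"MAE:               ${np.mean(np.abs(resid)):,.0f}")
print(f"RMSE:              ${np.sqrt(np.mean(resid ** 2)):,.0f}")
print(f"MAPE:              {np.mean(abs_pct):.1%}")
print(f"Median abs. error: ${np.median(np.abs(resid)):,.0f}")
print(f"Within ±20%:       {np.mean(abs_pct <= 0.20):.1%}")
print(f"Within ±10%:       {np.mean(abs_pct <= 0.10):.1%}")
```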
  • The model is essentially unbiased on average, but the typical absolute error is about $132 K; a handful of extreme misses inflate the tails.

  • There is slight systematic under-prediction (especially on higher-priced homes), and relative accuracy (MAPE ≈ 18.6%) looks tighter than the raw-dollar view because error is scaled by home value.

  • Central residuals are roughly Gaussian, yet heavy tails indicate more extreme errors than a normal model would expect—important for risk assessment.

  • The residuals may be heteroscedastic: their variance is not constant across the range of predicted values, which violates the homoscedasticity assumption behind many statistical tests. The box plot of residuals by prediction range hints at this, with larger residuals at both the low and high ends of the price distribution (a quick formal check is sketched below).
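One way to test this formally is a Breusch-Pagan test, which regresses the squared residuals on the predictions. A sketch using statsmodels, reusing the hypothetical `resid` and `nn_pred` from above:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan: does residual variance move with the predicted value?
# A small p-value is evidence of heteroscedasticity.
exog = sm.add_constant(nn_pred)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")
```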

Conclusions and Recommendations:

General Conclusions:

  • Los Angeles County’s census-tract fundamentals can explain much of the variation in median home values: a three-layer neural network trained on fewer than 50 engineered features delivers MAE ≈ $132 K, RMSE ≈ $193 K, and R² ≈ 0.73, accurate to within ±20% for roughly three-quarters of tracts and within ±10% for almost half of them.

Key Takeaways:

  • Primary language, rental rate, and the share of residents below the poverty line are the most important features. Ratios such as college-degree share and population per housing unit were strong, but less influential.

  • Model choice matters, but not dramatically. The tree ensembles (Random Forest, Gradient Boosting) and the neural net finish within about 1% of each other in MAE, while linear and ridge regressions give up roughly 8–20% of accuracy: still respectable baselines for rapid prototyping.

  • Error isn’t uniform. Residual diagnostics reveal heavier tails than a normal curve and a slight tendency to under-price the most expensive neighborhoods, signalling heteroscedastic risk that should be acknowledged in any underwriting deck.

Implications for CRE:

  • Sharper underwriting - Instant, model-backed price checks flag mis-priced listings early, so you spend diligence dollars only where the spread is real.

  • Tighter pro formas - Smaller valuation bands translate to more confident equity sizing, cleaner DSCR cushions, and clearer exit sensitivities.

  • Data-driven market selection - Feature importances highlight sub-markets where fundamentals—not just momentum—support outsized, risk-adjusted returns.

Next Steps:

  • While this analysis focused on single-family residences, it could be extended to other property types, including multi-family, commercial, and industrial.