Using Neural Networks to Predict Home Values in Los Angeles County
Introduction
This analysis examines real estate values across Los Angeles County using US Census data. We’ll test multiple machine learning approaches to predict median home values based on demographic, economic, and housing characteristics at the census tract level. The goal is to maximize predictive power using only the variables that the ACS provides.
Key Objectives
- Extract and process comprehensive census data for LA County
- Engineer meaningful features from raw census variables
- Compare multiple machine learning models for price prediction
- Visualize model performance and provide practical recommendations
Data Sources
- US Census Bureau: American Community Survey (ACS) 5-year estimates (2023)
- Geographic Data: Census tract shapefiles for LA County
- Variables: Housing characteristics, demographics, income, education, employment
Data Collection
Fetching Census Data via API
We’ll collect a comprehensive set of variables from the Census API covering:
Housing tenure and characteristics
Population demographics
Income and employment
Educational attainment
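As a rough sketch of how such a pull might look (the variable codes shown are a small illustrative subset of the full ~45-variable request, and `fetch_la_county_tracts` / `tracts_to_dataframe` are hypothetical helper names):

```python
import pandas as pd
import requests

# Illustrative subset of the ACS variable codes; the full pull covers ~45 variables.
ACS_VARS = {
    "B25077_001E": "MedianHomeValue",        # median value, owner-occupied units (target)
    "B19013_001E": "MedianHouseholdIncome",
    "B01003_001E": "TotalPopulation",
}

def tracts_to_dataframe(rows):
    """Convert the API's JSON payload (header row + data rows) into a DataFrame."""
    df = pd.DataFrame(rows[1:], columns=rows[0]).rename(columns=ACS_VARS)
    # The 11-digit GEOID (state + county + tract) is the join key for the shapefiles later.
    df["GEOID"] = df["state"] + df["county"] + df["tract"]
    return df

def fetch_la_county_tracts(year=2023, api_key=None):
    """Request ACS 5-year estimates for every tract in LA County (state 06, county 037)."""
    params = {"get": ",".join(ACS_VARS), "for": "tract:*", "in": "state:06 county:037"}
    if api_key:
        params["key"] = api_key
    resp = requests.get(f"https://api.census.gov/data/{year}/acs/acs5",
                        params=params, timeout=30)
    resp.raise_for_status()
    return tracts_to_dataframe(resp.json())
```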
Data Overview
The census data has been successfully fetched and processed. We now have:
2496 census tracts in Los Angeles County
45 variables covering housing, demographics, income, education, and employment
Renamed columns for better readability and analysis
Key variables include:
MedianHomeValue: Our target variable for prediction
Housing characteristics: Tenure, rent, vacancy rates
Demographics: Population, race, age, language
Economic indicators: Income, employment, poverty rates
Education levels: High school and bachelor’s degree attainment
Feature Engineering
Creating Derived Metrics
We’ll engineer meaningful features that capture relationships between raw variables. These derived metrics often have stronger predictive power than raw counts.
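A minimal sketch of what a few of these derived metrics might look like, assuming the renamed column names from the data overview (the exact formulas are assumptions):

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Derive ratio features from raw counts; column names follow the renamed schema."""
    out = df.copy()
    out["Rent_to_Income_Ratio"] = out["MedianGrossRent"] / out["MedianHouseholdIncome"]
    out["Pct_Below_Poverty"] = out["Poverty_Total"] / out["TotalPopulation"]
    out["Rental_Rate"] = out["HousingUnits_Renter"] / (
        out["HousingUnits_Owner"] + out["HousingUnits_Renter"]
    )
    # Division by zero or missing denominators yields inf; normalize those to NaN.
    return out.replace([np.inf, -np.inf], np.nan)
```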
Here is what the merged DataFrame looks like:
 | STATEFP | COUNTYFP | TRACTCE | GEOID | GEOIDFQ | NAME | NAMELSAD | MTFCC | FUNCSTAT | ALAND | ... | Unemployment_Rate | Industry_Employment_Rate | Avg_Travel_Time_Minutes | Non_White_Population | Racial_Diversity_Index | Rent_to_Income_Ratio | Pct_Below_Poverty | Pct_Speak_Only_English | Pct_Speak_Spanish | Population_Density
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 06 | 037 | 204920 | 06037204920 | 1400000US06037204920 | 2049.20 | Census Tract 2049.20 | G5020 | S | 909972 | ... | 0.113383 | 0.492006 | 15.300000 | 2141.0 | 0.982879 | 0.020113 | 0.150997 | NaN | NaN | 2702.280949 |
1 | 06 | 037 | 205110 | 06037205110 | 1400000US06037205110 | 2051.10 | Census Tract 2051.10 | G5020 | S | 286962 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 06 | 037 | 320101 | 06037320101 | 1400000US06037320101 | 3201.01 | Census Tract 3201.01 | G5020 | S | 680504 | ... | 0.093970 | 0.655636 | 27.350000 | 2198.0 | 0.872756 | 0.018168 | 0.083458 | NaN | NaN | 4999.235860 |
3 | 06 | 037 | 205120 | 06037205120 | 1400000US06037205120 | 2051.20 | Census Tract 2051.20 | G5020 | S | 1466242 | ... | 0.157821 | 0.479904 | 18.133333 | 3088.0 | 0.990666 | 0.046816 | 0.359930 | NaN | NaN | 2324.991373 |
4 | 06 | 037 | 206010 | 06037206010 | 1400000US06037206010 | 2060.10 | Census Tract 2060.10 | G5020 | S | 1418137 | ... | 0.147111 | 0.614277 | 25.300000 | 2589.0 | 0.839815 | 0.021515 | 0.255874 | NaN | NaN | 2460.975209 |
5 rows × 72 columns
The merged dataframe has 2496 rows and 72 columns
We’ve successfully:
Extracted mainland LA County by identifying the largest contiguous polygon (excluding islands)
Merged census data with geographic boundaries using GEOID as the key
Created population density metric using land area from shapefiles
Cleaned infinite values that may have resulted from division operations
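The merge and density steps above could be sketched as follows; `merge_with_boundaries` is a hypothetical helper, `tracts_df` would normally come from `geopandas.read_file()` on the tract shapefile, and the square-mile unit for density is an assumption:

```python
import numpy as np
import pandas as pd

SQ_METERS_PER_SQ_MILE = 2_589_988.11  # ALAND from the shapefile is in square meters

def merge_with_boundaries(census_df, tracts_df):
    """Join census attributes to tract records on GEOID and add population density."""
    merged = tracts_df.merge(census_df, on="GEOID", how="left")
    merged["Population_Density"] = (
        merged["TotalPopulation"] / (merged["ALAND"] / SQ_METERS_PER_SQ_MILE)
    )
    # Zero land area (water-only tracts) produces inf; clean it up.
    return merged.replace([np.inf, -np.inf], np.nan)
```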
Model Training and Evaluation
Summary of models:
Random Forest - Ensemble of decision trees that averages many bootstrapped, feature-randomized trees to capture complex non-linear patterns while controlling over-fitting.
Gradient Boosting - Sequentially adds small “weak” trees, each one trained to correct the predecessor’s residuals, producing a powerful model that steadily drives down bias on complex data.
Ridge Regression - Ordinary least-squares with an L² penalty that shrinks coefficients, stabilizing estimates when predictors are correlated and reducing variance without sacrificing all linear interpretability.
Linear Regression - Fits a single linear hyperplane by minimizing squared error, offering a fast, easily interpretable baseline when relationships are approximately linear.
Neural Network - A multi-layer, non-linear function approximator that learns hierarchical feature interactions via back-propagation, excelling when data relationships are intricate and high-dimensional.
- Our architecture: Input (~50 features) → 256 ReLU → 128 ReLU → 64 ReLU → 1 Sigmoid; Min-Max scaling, engineered ratios, SelectKBest feature filtering (≤50), dropout after each hidden layer, Adam optimizer with LR scheduling, early stopping on validation MSE.
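A framework-free sketch of the forward pass implied by this architecture (weights are randomly initialized here for illustration; dropout, Adam, LR scheduling, and early stopping are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes from the text: ~50 inputs → 256 → 128 → 64 → 1 (sigmoid).
sizes = [50, 256, 128, 64, 1]
weights = [rng.normal(0, np.sqrt(2 / m), (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """One forward pass; the sigmoid output matches Min-Max-scaled targets in [0, 1]."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])
```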
The models were trained on the engineered features and evaluated on a held-out test set; the results are summarized below:
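The scikit-learn baselines in the results table might be fit along these lines (a sketch; the neural network is trained separately, and the hyperparameters shown are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_models(X, y, random_state=42):
    """Fit each candidate model and report MAE / RMSE / R² on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Ridge Regression": Ridge(alpha=1.0),
        "Random Forest": RandomForestRegressor(n_estimators=200, random_state=random_state),
        "Gradient Boosting": GradientBoostingRegressor(random_state=random_state),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {
            "MAE": mean_absolute_error(y_te, pred),
            "RMSE": mean_squared_error(y_te, pred) ** 0.5,
            "R2": r2_score(y_te, pred),
        }
    return results
```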
Model | MAE | RMSE | R²
---|---|---|---
Random Forest | $131,515 | $188,255 | 0.737
Gradient Boosting | $131,080 | $191,637 | 0.728
3-Layer Sigmoid MLP | $132,183 | $192,780 | 0.725
Linear Regression | $141,345 | $195,741 | 0.716
Ridge Regression | $156,781 | $217,740 | 0.649
Summary of results:
- Slightly contrary to what we might expect, the tree ensembles edge out the neural network: Random Forest posts the best R² (0.737), with Gradient Boosting and the 3-Layer Sigmoid MLP close behind, and the linear models trailing.
- Our NN has an MAE of about $132,000, which corresponds to a Mean Absolute Percentage Error of about 18.6%.
Feature Importance Analysis:
======================================================================
TOP 10 MOST IMPORTANT FEATURES
======================================================================
1. Speak_Only_English 105,408 ± 6,760
2. Rental_Rate 14,180 ± 3,175
3. ACS_Year 12,497 ± 2,298
4. Education_BachelorsOrHigher 5,301 ± 1,210
5. Racial_Diversity_Index 4,853 ± 1,765
6. Industry_TotalEmployed 4,747 ± 1,548
7. HousingUnits_Owner 3,212 ± 1,423
8. Speak_Spanish 2,866 ± 1,017
9. Unemployment_Rate 2,616 ± 752
10. Poverty_Total 2,473 ± 1,186
======================================================================
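The ± values above suggest permutation importance averaged over repeated shuffles; here is a sketch of how such a ranking could be computed, scoring each feature by how much shuffling it degrades validation MAE (the helper name and settings are assumptions):

```python
import numpy as np
from sklearn.inspection import permutation_importance

def top_features(model, X_val, y_val, feature_names, k=10, n_repeats=10):
    """Rank features by the increase in validation MAE when each is shuffled."""
    r = permutation_importance(
        model, X_val, y_val,
        scoring="neg_mean_absolute_error",
        n_repeats=n_repeats, random_state=0,
    )
    order = np.argsort(r.importances_mean)[::-1][:k]
    # Each entry: (name, mean importance in dollars of MAE, std across repeats)
    return [(feature_names[i], r.importances_mean[i], r.importances_std[i])
            for i in order]
```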
Looking deeper into the residuals:
============================================================
NEURAL NETWORK RESIDUAL ANALYSIS SUMMARY
============================================================
Mean Absolute Error (MAE): $132,183
Root Mean Square Error (RMSE): $192,780
Mean Absolute Percentage Error: 18.6%
Standard Deviation of Residuals: $192,230
Median Absolute Error: $90,429
Predictions within ±20%: 73.7%
Predictions within ±10%: 45.7%
============================================================
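The summary statistics above can be reproduced with a small helper, given arrays of actual and predicted values:

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Compute the residual diagnostics reported in the summary block."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    pct_err = np.abs(resid) / y_true
    return {
        "MAE": np.mean(np.abs(resid)),
        "RMSE": np.sqrt(np.mean(resid ** 2)),
        "MAPE": np.mean(pct_err),
        "Resid_SD": np.std(resid),
        "Median_AE": np.median(np.abs(resid)),
        "Within_10pct": np.mean(pct_err <= 0.10),
        "Within_20pct": np.mean(pct_err <= 0.20),
    }
```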
The model is essentially unbiased on average, but the typical absolute error is about $132 K; a handful of extreme misses inflate the tails.
There is slight systematic under-prediction (especially on higher-priced homes), and relative accuracy (MAPE ≈ 18.6 %) looks tighter than the raw-dollar view because error is scaled by value.
Central residuals are roughly Gaussian, yet heavy tails indicate more extreme errors than a normal model would expect—important for risk assessment.
The residuals also suggest heteroscedasticity: their variance is not constant across the range of predicted values, which violates the homoscedasticity assumption underlying many statistical tests. The box plot of residuals by prediction range shows this, with larger residuals for both the highest- and lowest-priced homes.
Conclusions and Recommendations:
General Conclusions:
- Los Angeles County’s census-tract fundamentals can explain much of the variation in median home values: a three-layer neural network trained on fewer than 50 engineered features delivers MAE ≈ $132 K, RMSE ≈ $193 K and R² ≈ 0.73, accurate to within ±20 % for roughly three-quarters of tracts and within ±10 % for almost half of them.
Key Takeaways:
Primary language (Speak_Only_English), the rental rate, and bachelor's-degree attainment rank among the most important features. Engineered ratios such as the Racial_Diversity_Index were also strong predictors, but not as important as those at the top of the ranking.
Model choice matters, but not dramatically. The tree ensembles (Random Forest, Gradient Boosting) and the neural net land within about 1 % of each other in MAE, while linear and ridge regressions give up roughly 8 – 20 % of accuracy; they remain respectable baselines for rapid prototyping.
Error isn’t uniform. Residual diagnostics reveal heavier tails than a normal curve and a slight tendency to under-price the most expensive neighborhoods, signalling heteroscedastic risk that should be acknowledged in any underwriting deck.
Implications for CRE:
Sharper underwriting - Instant, model-backed price checks flag mis-priced listings early, so you spend diligence dollars only where the spread is real.
Tighter pro formas - Smaller valuation bands translate to more confident equity sizing, cleaner DSCR cushions, and clearer exit sensitivities.
Data-driven market selection - Feature importances highlight sub-markets where fundamentals—not just momentum—support outsized, risk-adjusted returns.
Next Steps:
- While this analysis focused on single-family residences, it could be extended to other property types, including multi-family, commercial, and industrial.