Using Neural Networks to Predict Home Values in Los Angeles County
Introduction
This analysis examines real estate values across Los Angeles County using US Census data. We’ll test multiple machine learning approaches to predict median home values based on demographic, economic, and housing characteristics at the census tract level. The goal is to maximize predictive power using only the variables that the ACS provides.
Key Objectives
- Extract and process comprehensive census data for LA County
- Engineer meaningful features from raw census variables
- Compare multiple machine learning models for price prediction
- Visualize model performance and provide practical recommendations
Data Sources
- US Census Bureau: American Community Survey (ACS) 5-year estimates (2023)
- Geographic Data: Census tract shapefiles for LA County
- Variables: Housing characteristics, demographics, income, education, employment
Data Collection
Fetching Census Data via API
We’ll collect a comprehensive set of variables from the Census API covering:
Housing tenure and characteristics
Population demographics
Income and employment
Educational attainment
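As a rough sketch of how such a pull might look (the variable codes shown are a small illustrative subset of the full ~45-variable request, and `fetch_la_county_tracts` / `tracts_to_dataframe` are hypothetical helper names):

```python
import pandas as pd
import requests

# Illustrative subset of the ACS variable codes; the full pull covers ~45 variables.
ACS_VARS = {
    "B25077_001E": "MedianHomeValue",        # median value, owner-occupied units (target)
    "B19013_001E": "MedianHouseholdIncome",
    "B01003_001E": "TotalPopulation",
}

def tracts_to_dataframe(rows):
    """Convert the API's JSON payload (header row + data rows) into a DataFrame."""
    df = pd.DataFrame(rows[1:], columns=rows[0]).rename(columns=ACS_VARS)
    # The 11-digit GEOID (state + county + tract) is the join key for the shapefiles later.
    df["GEOID"] = df["state"] + df["county"] + df["tract"]
    return df

def fetch_la_county_tracts(year=2023, api_key=None):
    """Request ACS 5-year estimates for every tract in LA County (state 06, county 037)."""
    params = {"get": ",".join(ACS_VARS), "for": "tract:*", "in": "state:06 county:037"}
    if api_key:
        params["key"] = api_key
    resp = requests.get(f"https://api.census.gov/data/{year}/acs/acs5",
                        params=params, timeout=30)
    resp.raise_for_status()
    return tracts_to_dataframe(resp.json())
```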
Data Overview
The census data has been successfully fetched and processed. We now have:
2496 census tracts in Los Angeles County
45 variables covering housing, demographics, income, education, and employment
Renamed columns for better readability and analysis
Key variables include:
MedianHomeValue: Our target variable for prediction
Housing characteristics: Tenure, rent, vacancy rates
Demographics: Population, race, age, language
Economic indicators: Income, employment, poverty rates
Education levels: High school and bachelor’s degree attainment
Feature Engineering
Creating Derived Metrics
We’ll engineer meaningful features that capture relationships between raw variables. These derived metrics often have stronger predictive power than raw counts.
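A minimal sketch of what a few of these derived metrics might look like, assuming the renamed column names from the data overview (the exact formulas are assumptions):

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Derive ratio features from raw counts; column names follow the renamed schema."""
    out = df.copy()
    out["Rent_to_Income_Ratio"] = out["MedianGrossRent"] / out["MedianHouseholdIncome"]
    out["Pct_Below_Poverty"] = out["Poverty_Total"] / out["TotalPopulation"]
    out["Rental_Rate"] = out["HousingUnits_Renter"] / (
        out["HousingUnits_Owner"] + out["HousingUnits_Renter"]
    )
    # Division by zero or missing denominators yields inf; normalize those to NaN.
    return out.replace([np.inf, -np.inf], np.nan)
```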
Here is what the merged DataFrame looks like:
 | STATEFP | COUNTYFP | TRACTCE | GEOID | GEOIDFQ | NAME | NAMELSAD | MTFCC | FUNCSTAT | ALAND | ... | Unemployment_Rate | Industry_Employment_Rate | Avg_Travel_Time_Minutes | Non_White_Population | Racial_Diversity_Index | Rent_to_Income_Ratio | Pct_Below_Poverty | Pct_Speak_Only_English | Pct_Speak_Spanish | Population_Density
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 06 | 037 | 204920 | 06037204920 | 1400000US06037204920 | 2049.20 | Census Tract 2049.20 | G5020 | S | 909972 | ... | 0.113383 | 0.492006 | 15.300000 | 2141.0 | 0.982879 | 0.020113 | 0.150997 | NaN | NaN | 2702.280949 |
1 | 06 | 037 | 205110 | 06037205110 | 1400000US06037205110 | 2051.10 | Census Tract 2051.10 | G5020 | S | 286962 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 06 | 037 | 320101 | 06037320101 | 1400000US06037320101 | 3201.01 | Census Tract 3201.01 | G5020 | S | 680504 | ... | 0.093970 | 0.655636 | 27.350000 | 2198.0 | 0.872756 | 0.018168 | 0.083458 | NaN | NaN | 4999.235860 |
3 | 06 | 037 | 205120 | 06037205120 | 1400000US06037205120 | 2051.20 | Census Tract 2051.20 | G5020 | S | 1466242 | ... | 0.157821 | 0.479904 | 18.133333 | 3088.0 | 0.990666 | 0.046816 | 0.359930 | NaN | NaN | 2324.991373 |
4 | 06 | 037 | 206010 | 06037206010 | 1400000US06037206010 | 2060.10 | Census Tract 2060.10 | G5020 | S | 1418137 | ... | 0.147111 | 0.614277 | 25.300000 | 2589.0 | 0.839815 | 0.021515 | 0.255874 | NaN | NaN | 2460.975209 |
5 rows × 72 columns
The merged dataframe has 2496 rows and 72 columns
We’ve successfully:
Extracted mainland LA County by identifying the largest contiguous polygon (excluding islands)
Merged census data with geographic boundaries using GEOID as the key
Created population density metric using land area from shapefiles
Cleaned infinite values that may have resulted from division operations
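The merge and density steps above could be sketched as follows; `merge_with_boundaries` is a hypothetical helper, `tracts_df` would normally come from `geopandas.read_file()` on the tract shapefile, and the square-mile unit for density is an assumption:

```python
import numpy as np
import pandas as pd

SQ_METERS_PER_SQ_MILE = 2_589_988.11  # ALAND from the shapefile is in square meters

def merge_with_boundaries(census_df, tracts_df):
    """Join census attributes to tract records on GEOID and add population density."""
    merged = tracts_df.merge(census_df, on="GEOID", how="left")
    merged["Population_Density"] = (
        merged["TotalPopulation"] / (merged["ALAND"] / SQ_METERS_PER_SQ_MILE)
    )
    # Zero land area (water-only tracts) produces inf; clean it up.
    return merged.replace([np.inf, -np.inf], np.nan)
```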
Model Training and Evaluation
Summary of models:
Random Forest - Ensemble of decision trees that averages many bootstrapped, feature-randomized trees to capture complex non-linear patterns while controlling over-fitting.
Gradient Boosting - Sequentially adds small “weak” trees, each one trained to correct the predecessor’s residuals, producing a powerful model that steadily drives down bias on complex data.
Ridge Regression - Ordinary least-squares with an L² penalty that shrinks coefficients, stabilizing estimates when predictors are correlated and reducing variance without sacrificing all linear interpretability.
Linear Regression - Fits a single linear hyperplane by minimizing squared error, offering a fast, easily interpretable baseline when relationships are approximately linear.
Neural Network - A multi-layer, non-linear function approximator that learns hierarchical feature interactions via back-propagation, excelling when data relationships are intricate and high-dimensional.
- Our architecture: Input (~50 features) → 256 ReLU → 128 ReLU → 64 ReLU → 1 Sigmoid; Min-Max scaling, engineered ratios, SelectKBest feature filtering (≤50), dropout after each hidden layer, Adam optimizer with LR scheduling, early stopping on validation MSE.
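A framework-free sketch of the forward pass implied by this architecture (weights are randomly initialized here for illustration; dropout, Adam, LR scheduling, and early stopping are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes from the text: ~50 inputs → 256 → 128 → 64 → 1 (sigmoid).
sizes = [50, 256, 128, 64, 1]
weights = [rng.normal(0, np.sqrt(2 / m), (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """One forward pass; the sigmoid output matches Min-Max-scaled targets in [0, 1]."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])
```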
The models were trained on the engineered features and evaluated on a held-out test set; the results are summarized below:
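The scikit-learn baselines in the results table might be fit along these lines (a sketch; the neural network is trained separately, and the hyperparameters shown are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_models(X, y, random_state=42):
    """Fit each candidate model and report MAE / RMSE / R² on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "Linear Regression": LinearRegression(),
        "Ridge Regression": Ridge(alpha=1.0),
        "Random Forest": RandomForestRegressor(n_estimators=200, random_state=random_state),
        "Gradient Boosting": GradientBoostingRegressor(random_state=random_state),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {
            "MAE": mean_absolute_error(y_te, pred),
            "RMSE": mean_squared_error(y_te, pred) ** 0.5,
            "R2": r2_score(y_te, pred),
        }
    return results
```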
Model | MAE | RMSE | R²
---|---|---|---
Random Forest | $131,515 | $188,255 | 0.737
Gradient Boosting | $131,080 | $191,637 | 0.728
3-Layer Sigmoid MLP | $132,183 | $192,780 | 0.725
Linear Regression | $141,345 | $195,741 | 0.716
Ridge Regression | $156,781 | $217,740 | 0.649
Summary of results:
- Slightly contrary to what we might expect, the tree ensembles edge out the neural network: Random Forest posts the best R² (0.737), with Gradient Boosting and the 3-Layer Sigmoid MLP close behind, and the linear models trailing.
- Our NN has an MAE of about $132,000, which corresponds to a Mean Absolute Percentage Error of about 18.6%.
Feature Importance Analysis:
======================================================================
TOP 10 MOST IMPORTANT FEATURES
======================================================================
1. Speak_Only_English 105,408 ± 6,760
2. Rental_Rate 14,180 ± 3,175
3. ACS_Year 12,497 ± 2,298
4. Education_BachelorsOrHigher 5,301 ± 1,210
5. Racial_Diversity_Index 4,853 ± 1,765
6. Industry_TotalEmployed 4,747 ± 1,548
7. HousingUnits_Owner 3,212 ± 1,423
8. Speak_Spanish 2,866 ± 1,017
9. Unemployment_Rate 2,616 ± 752
10. Poverty_Total 2,473 ± 1,186
======================================================================
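The ± values above suggest permutation importance averaged over repeated shuffles; here is a sketch of how such a ranking could be computed, scoring each feature by how much shuffling it degrades validation MAE (the helper name and settings are assumptions):

```python
import numpy as np
from sklearn.inspection import permutation_importance

def top_features(model, X_val, y_val, feature_names, k=10, n_repeats=10):
    """Rank features by the increase in validation MAE when each is shuffled."""
    r = permutation_importance(
        model, X_val, y_val,
        scoring="neg_mean_absolute_error",
        n_repeats=n_repeats, random_state=0,
    )
    order = np.argsort(r.importances_mean)[::-1][:k]
    # Each entry: (name, mean importance in dollars of MAE, std across repeats)
    return [(feature_names[i], r.importances_mean[i], r.importances_std[i])
            for i in order]
```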
Looking deeper into the residuals:
============================================================
NEURAL NETWORK RESIDUAL ANALYSIS SUMMARY
============================================================
Mean Absolute Error (MAE): $132,183
Root Mean Square Error (RMSE): $192,780
Mean Absolute Percentage Error: 18.6%
Standard Deviation of Residuals: $192,230
Median Absolute Error: $90,429
Predictions within ±20%: 73.7%
Predictions within ±10%: 45.7%
============================================================
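The summary statistics above can be reproduced with a small helper, given arrays of actual and predicted values:

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Compute the residual diagnostics reported in the summary block."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    pct_err = np.abs(resid) / y_true
    return {
        "MAE": np.mean(np.abs(resid)),
        "RMSE": np.sqrt(np.mean(resid ** 2)),
        "MAPE": np.mean(pct_err),
        "Resid_SD": np.std(resid),
        "Median_AE": np.median(np.abs(resid)),
        "Within_10pct": np.mean(pct_err <= 0.10),
        "Within_20pct": np.mean(pct_err <= 0.20),
    }
```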
The model is essentially unbiased on average, but the typical absolute error is about $132 K; a handful of extreme misses inflate the tails.
There is slight systematic under-prediction (especially on higher-priced homes), and relative accuracy (MAPE ≈ 18.6 %) looks tighter than the raw-dollar view because error is scaled by value.
Central residuals are roughly Gaussian, yet heavy tails indicate more extreme errors than a normal model would expect—important for risk assessment.
The residuals also suggest heteroscedasticity: their variance is not constant across the range of predicted values, which violates the homoscedasticity assumption underlying many statistical tests. The box plot of residuals by prediction range shows this, with larger residuals for both the highest- and lowest-priced homes.
Conclusions and Recommendations:
General Conclusions:
- Los Angeles County’s census-tract fundamentals can explain much of the variation in median home values: a three-layer neural network trained on fewer than 50 engineered features delivers MAE ≈ $132 K, RMSE ≈ $193 K and R² ≈ 0.73, accurate to within ±20 % for roughly three-quarters of tracts and within ±10 % for almost half of them.
Key Takeaways:
Primary language (Speak_Only_English), the rental rate, and bachelor's-degree attainment rank among the most important features. Engineered ratios such as the Racial_Diversity_Index were also strong predictors, but not as important as those at the top of the ranking.
Model choice matters, but not dramatically. The tree ensembles (Random Forest, Gradient Boosting) and the neural net land within about 1 % of each other in MAE, while linear and ridge regressions give up roughly 8 – 20 % of accuracy; they remain respectable baselines for rapid prototyping.
Error isn’t uniform. Residual diagnostics reveal heavier tails than a normal curve and a slight tendency to under-price the most expensive neighborhoods, signalling heteroscedastic risk that should be acknowledged in any underwriting deck.
Implications for CRE:
Sharper underwriting - Instant, model-backed price checks flag mis-priced listings early, so you spend diligence dollars only where the spread is real.
Tighter pro formas - Smaller valuation bands translate to more confident equity sizing, cleaner DSCR cushions, and clearer exit sensitivities.
Data-driven market selection - Feature importances highlight sub-markets where fundamentals—not just momentum—support outsized, risk-adjusted returns.
Next Steps:
- While this analysis focused on single-family residences, it could be extended to other property types, including multi-family, commercial, and industrial.