England Environmental Justice & Health Inequalities Analysis

Executive Summary

This project provides actionable intelligence for tackling environmental injustice in England by identifying specific communities facing a 'double burden' of high pollution and deprivation, linked to poorer respiratory health. Using an integrated approach combining spatial statistics, machine learning, and quasi-causal methods, key findings reveal:

Significant spatial clustering: Environmental injustice is geographically concentrated (Finding 1), which allows interventions to be targeted precisely rather than applied in broad strokes. Spatial autocorrelation techniques such as Local Indicators of Spatial Association (LISA) confirmed that these patterns are statistically significant.
PM2.5 Health Association: Higher exposure to fine particulate matter under 2.5 micrometres in diameter (PM2.5) is significantly associated with poorer respiratory health outcomes, even after controlling for area deprivation (Finding 2, Propensity Score Matching (PSM) Average Treatment Effect on the Treated (ATT) ≈ -0.0399, p ≈ 0.027). This is consistent with a health burden from fine particulate matter that measured deprivation alone does not explain.
Distinct Area Profiles: Local Authority Districts (LADs) cluster into typologies (e.g., 'Urban Deprived/Polluted') with unique challenge combinations, requiring tailored policy responses (Finding 3).

The analysis enables efficient resource allocation by pinpointing high-risk Lower Layer Super Output Areas (LSOAs) and LADs. Policy simulations suggest targeted PM2.5 reduction in priority areas could yield measurable health improvements. Note: Preliminary impact estimates (e.g., ~1.4M residents potentially benefiting, ~£60M potential NHS savings from a 20% PM2.5 reduction) require validation with official ONS population data but indicate substantial potential.

Geographic Concentration of Environmental Injustice

LISA Cluster Map showing high-high clusters for combined pollution and deprivation index — LISA Cluster Map: Red areas show statistically significant High-High clusters (high pollution/deprivation), highlighting geographic concentrations of environmental injustice across England. Blue areas indicate Low-Low clusters (low pollution/deprivation).

The Challenge: An Unequal Burden

Across England, disadvantaged communities often face disproportionately high levels of air pollution (like Nitrogen Dioxide (NO₂) and PM2.5). This recognised environmental injustice is strongly linked to poorer health outcomes, particularly respiratory conditions, exacerbating existing inequalities and placing significant burdens on public services like the NHS. Addressing this effectively requires moving beyond simple correlations to understand the complex interplay between where people live, their socioeconomic status, and their health.

My Goal: To rigorously analyse these connections using an integrated data science approach, identify high-risk areas with precision, and provide robust, data-driven evidence to support effective, targeted policy interventions.

My Analytical Approach: Integrating Diverse Methods

To tackle this multi-faceted problem, I designed and implemented an integrated analytical pipeline primarily using Python and key libraries (Pandas, GeoPandas, PySAL, Scikit-learn, Statsmodels, SHAP). The innovation lies in combining spatial, machine learning, and causal inference perspectives for a more holistic understanding, with careful consideration of methodological choices:

Integrated analytical workflow combining diverse data sources and methodologies.

Spatial Statistics (GeoPandas, PySAL): Employed spatial autocorrelation techniques (Moran's I, LISA, Getis-Ord Gi*) to rigorously test if patterns of pollution and deprivation were clustered or random. Justification: Standard statistics often assume independence, which is violated by geographic data; spatial methods account for this. LISA specifically identifies statistically significant local clusters (hotspots/coldspots), crucial for targeted interventions. Queen contiguity was chosen for spatial weights, suitable for administrative areas sharing borders. Spatial lag models were used later to account for spillover effects between neighbouring LADs, addressing potential bias in standard regression (Ordinary Least Squares (OLS)) from spatial dependence (indicated by the model's rho coefficient).
Machine Learning (Scikit-learn, SHAP):
- Unsupervised Clustering (K-Means): Chosen for its computational efficiency on this dataset and its ability to partition LADs into distinct, interpretable typologies based on pollution, deprivation, and health profiles using cluster centroids. Silhouette scores guided cluster number selection. Justification: While alternatives like DBSCAN (density-based) or GMM (probabilistic) were considered (and explored in advanced_cluster_analysis.py), K-Means provided the clearest separation into relatively balanced, policy-relevant groups for this specific dataset and research question.
- Predictive Modelling (Gradient Boosting Regressor (GBR)): Selected for policy simulation due to its strong predictive power, ability to model complex non-linear relationships and feature interactions (e.g., how deprivation modifies pollution impact), and robustness to outliers compared to linear models. Justification: GBR allowed simulating the potential impact of interventions (like PM2.5 reduction) on the respiratory health index, providing quantitative estimates for policy evaluation. Performance was validated using cross-validation (R²/MSE) to mitigate overfitting.
- Interpretability (SHAP): Used with GBR to understand why the model makes certain predictions, identifying key drivers (e.g., specific IMD domains) and their impact, moving beyond black-box predictions.
Causal Inference (Associational - Statsmodels): Applied PSM to estimate the associational effect of high PM2.5 exposure (treatment) on respiratory health (outcome) at the LAD level, controlling for observed confounding variables (IMD domains). Justification: PSM was chosen as a pragmatic approach given observational data limitations. It aims to create comparable groups (high vs. low PM2.5) based on observed characteristics (checking for sufficient overlap via propensity score distributions), approximating a quasi-experimental setup to isolate the association of interest. Diagnostic checks confirmed that propensity score matching successfully balanced observed covariates like IMD domains between the high and low PM2.5 groups (Standardised Mean Differences (SMDs) < 0.1 post-matching), strengthening the comparison. While acknowledging limitations (cannot control for unobserved confounders), rigorous diagnostics (sensitivity analysis via Rosenbaum bounds) were performed to assess the robustness of the association found (ATT).
Data Integration & Index Construction (Pandas, GeoPandas): Merged complex datasets (ONS, DEFRA, DLUHC-IMD, NHS) across LSOA/LAD scales. Developed custom indices (env_justice_index, respiratory_health_index) with specific weighting rationales (see DATA_DICTIONARY.md) to capture the core concepts of 'double burden' and respiratory health burden effectively.
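The data integration and index construction step could look roughly like the sketch below. The column names, the toy values, the min-max scaling and the equal 50/50 weighting are all assumptions made for illustration; the project's actual weighting rationale is the one documented in DATA_DICTIONARY.md.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a series to the 0-1 range (all zeros if it is constant)."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


# Hypothetical LSOA-level inputs standing in for the DEFRA and IMD extracts.
pollution = pd.DataFrame({
    "lsoa_code": ["E01", "E02", "E03", "E04"],
    "pm25": [8.0, 12.0, 10.0, 14.0],
    "no2": [15.0, 30.0, 20.0, 35.0],
})
deprivation = pd.DataFrame({
    "lsoa_code": ["E01", "E02", "E03", "E04"],
    "lad_code": ["LAD_A", "LAD_A", "LAD_B", "LAD_B"],
    "imd_score": [10.0, 40.0, 20.0, 50.0],
})

# validate= makes pandas raise if either side has duplicate LSOA codes.
lsoa = pollution.merge(deprivation, on="lsoa_code", how="inner",
                       validate="one_to_one")

# Assumed weighting: pollutants averaged, then combined 50/50 with deprivation.
pollution_score = (min_max(lsoa["pm25"]) + min_max(lsoa["no2"])) / 2
lsoa["env_justice_index"] = 0.5 * pollution_score + 0.5 * min_max(lsoa["imd_score"])

# Aggregate the LSOA index to LAD level with a simple (unweighted) mean.
lad = lsoa.groupby("lad_code", as_index=False)["env_justice_index"].mean()
print(lad)
```

A population-weighted mean would arguably be a better LSOA-to-LAD aggregation; the unweighted mean is used here only to keep the sketch short.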

Key Findings & Visual Insights

Finding 1: Environmental Injustice is Spatially Concentrated

The analysis revealed that areas facing the "double disadvantage" of high pollution and high deprivation are not randomly distributed but cluster significantly, particularly in urban centres and specific post-industrial regions (as visually highlighted by the LISA map). Spatial autocorrelation analysis confirmed these patterns were statistically significant.

Refer back to the Key Visual (LISA Map) shown previously.
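To make the clustering test concrete, here is a from-scratch NumPy illustration of global Moran's I with a permutation test and a Moran scatterplot quadrant labelling. It is a sketch only: the project itself uses PySAL on real LSOA/LAD geometries, whereas this example assumes a hypothetical 6×6 grid with synthetic values and builds the queen-contiguity weights by hand.

```python
import numpy as np


def queen_weights(n_rows, n_cols):
    """Row-standardised queen-contiguity weights for a regular grid."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if (dr, dc) != (0, 0) and inside:
                        w[r * n_cols + c, rr * n_cols + cc] = 1.0
    return w / w.sum(axis=1, keepdims=True)


def morans_i(values, w):
    """Global Moran's I for a 1-D array of values and a weights matrix."""
    z = values - values.mean()
    return (len(values) / w.sum()) * (z @ w @ z) / (z @ z)


# Synthetic index: high in the left half of the grid, low in the right half.
rng = np.random.default_rng(42)
n_rows, n_cols = 6, 6
grid = np.zeros((n_rows, n_cols))
grid[:, :3] = 1.0
values = (grid + rng.normal(0, 0.05, grid.shape)).ravel()

w = queen_weights(n_rows, n_cols)
observed_i = morans_i(values, w)

# Permutation test: shuffle values across cells to simulate spatial randomness.
n_perm = 999
simulated = np.array([morans_i(rng.permutation(values), w) for _ in range(n_perm)])
p_value = (np.sum(simulated >= observed_i) + 1) / (n_perm + 1)

# Moran scatterplot quadrants: High-High means a high value with high neighbours.
z = (values - values.mean()) / values.std()
lag = w @ z
high_high = ((z > 0) & (lag > 0)).reshape(n_rows, n_cols)

print(f"Moran's I = {observed_i:.3f}, pseudo p-value = {p_value:.3f}")
print(high_high.astype(int))
```

The quadrant labels here are not significance-tested. A full LISA analysis additionally runs a conditional permutation test per location, which is what distinguishes the statistically significant High-High clusters on the map from cells that merely fall in the High-High quadrant.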

Finding 2: High PM2.5 Exposure Linked to Poorer Respiratory Health (Post-Matching)

Using PSM to control for observed IMD domains, I found a statistically significant negative association between higher PM2.5 levels (above median) and the respiratory health index at the LAD level (ATT ≈ -0.0399, p-value ≈ 0.027). Interpretation: Among matched LADs with similar observed deprivation profiles, those with higher PM2.5 exposure have a respiratory health index roughly 0.04 points lower than their lower-exposure matches. This is an absolute difference in index units; expressing it as a relative (percentage) change would require dividing by the mean index of the matched control group. While PSM cannot establish definitive causation due to potential unobserved confounders, the statistically significant result (p<0.05) provides stronger evidence than a simple correlation for a link between PM2.5 and poorer respiratory health, independent of measured deprivation. The matching process successfully balanced observed covariates, as shown below.

Box plot comparing respiratory health index post-matching for high vs low PM2.5 groups — Post-Matching Health Outcomes (PM2.5): Shows lower average respiratory health index (poorer health) in the high-PM2.5 group compared to matched controls after balancing covariates.

Histogram showing propensity score distribution for high and low PM2.5 groups — Propensity Score Overlap (PM2.5): Demonstrates sufficient overlap ('common support') in characteristics between high-PM2.5 (blue) and low-PM2.5 (purple) groups, enabling effective matching.
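The matching logic behind Finding 2 can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the data are synthetic with a built-in effect of -0.04, the propensity model is fitted with scikit-learn's LogisticRegression rather than Statsmodels, and matching is 1-nearest-neighbour with replacement on the propensity score, with no caliper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Synthetic LAD-like data: two hypothetical deprivation covariates drive both
# treatment (high PM2.5) and the outcome (respiratory health index).
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
true_ps = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 1])))
treated = rng.random(n) < true_ps
outcome = (0.5 - 0.10 * X[:, 0] - 0.05 * X[:, 1]
           - 0.04 * treated + rng.normal(0, 0.02, n))

# Step 1: estimate propensity scores P(treated | covariates).
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control with the closest score.
treated_idx = np.flatnonzero(treated)
control_idx = np.flatnonzero(~treated)
nn = NearestNeighbors(n_neighbors=1).fit(ps[control_idx].reshape(-1, 1))
_, pos = nn.kneighbors(ps[treated_idx].reshape(-1, 1))
matched_idx = control_idx[pos.ravel()]

# Step 3: ATT = mean outcome gap between treated units and their matches.
att = np.mean(outcome[treated_idx] - outcome[matched_idx])
naive_diff = outcome[treated].mean() - outcome[~treated].mean()


def smd(a, b):
    """Standardised mean difference per covariate (pooled SD denominator)."""
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / pooled_sd


smd_before = smd(X[treated_idx], X[control_idx])
smd_after = smd(X[treated_idx], X[matched_idx])

print(f"Naive difference: {naive_diff:.4f}, matched ATT: {att:.4f}")
print("SMD before:", np.round(smd_before, 3), "after:", np.round(smd_after, 3))
```

The naive difference mixes the exposure effect with the effect of deprivation, while the matched estimate should land much closer to the built-in value. The SMD comparison mirrors the balance diagnostic described above, where post-matching values below 0.1 were taken as adequate balance.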

Finding 3: Distinct Area Typologies Emerge from Clustering

K-Means clustering identified distinct LAD profiles based on their combined pollution, deprivation, and health characteristics. For example, some areas suffer from high pollution and high deprivation ('Urban Deprived/Polluted'), while others might have moderate pollution but significantly worse health outcomes ('Deprived/Poor Health Focus'), suggesting that a one-size-fits-all policy approach is unlikely to be optimal. Tailored strategies are needed for different cluster types.

Radar chart showing normalised profiles of different LAD clusters across key variables — Cluster Profiles: Illustrates the different combinations of deprivation, pollution, and health challenges faced by distinct groups of Local Authority Districts, enabling targeted policy design.
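A minimal sketch of silhouette-guided K-Means is shown below. It assumes synthetic data with three well-separated groups and hypothetical feature names (pollution, deprivation, poor_health); the real LAD features, the candidate range of k and the final cluster count in the project may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for LAD profiles: three groups in a 3-feature space.
centres = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]
features, _ = make_blobs(n_samples=300, centers=centres,
                         cluster_std=0.5, random_state=0)
lads = pd.DataFrame(features, columns=["pollution", "deprivation", "poor_health"])

# K-Means is distance-based, so put every feature on a comparable scale first.
scaled = StandardScaler().fit_transform(lads)

# Score each candidate k by its mean silhouette (higher = better separated).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)

# Refit with the chosen k and summarise each cluster by its feature means,
# which is the information a radar chart of cluster profiles would display.
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(scaled)
lads["cluster"] = final.labels_
profiles = lads.groupby("cluster").mean()

print({k: round(v, 3) for k, v in scores.items()})
print(profiles.round(2))
```

On real data the silhouette curve is rarely this clear-cut, so the score is better treated as a guide alongside interpretability and cluster balance than as an automatic rule.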

Finding 4: Targeted PM2.5 Reductions Show Potential Health Benefits

Policy simulations using the trained GBR estimated the potential impact of PM2.5 reduction. Example Scenario: A simulated 20% reduction in PM2.5 levels was predicted to yield an average improvement of ~0.0013 in the respiratory health index across relevant LADs, with larger gains in specific areas (e.g., Southend-on-Sea, Wigan, Bury, Leeds, Westminster, Eastbourne, Portsmouth, Salford, Manchester, Blaby). Implication: This quantifies the potential benefit of targeted interventions, allowing policymakers to prioritise areas like specific LAD clusters (e.g., Cluster 2: 'Urban Deprived/Polluted') where PM2.5 reduction could improve health outcomes for residents. While precise NHS cost savings are complex to model, reducing respiratory illness burden through such measures could hypothetically lead to substantial long-term savings via reduced hospital admissions and treatments (further economic analysis recommended).

Line plot showing simulated average health index improvement vs percentage reduction in PM2.5 — Policy Simulation (PM2.5): Estimated average improvement in the respiratory health index resulting from various percentage reductions in PM2.5 levels across relevant LADs, quantifying potential policy impact.

Bar chart showing top 10 LADs predicted to benefit most from a 20% PM2.5 reduction — Top Benefiting Areas (Simulated): Identifies specific LADs (e.g., Southend-on-Sea, Wigan) predicted to see the largest respiratory health gains from a hypothetical 20% PM2.5 reduction, aiding resource allocation.
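The simulation idea can be sketched as "fit a model, scale PM2.5 down, compare predictions". The example below assumes synthetic data, hypothetical column names and default GradientBoostingRegressor settings; it is not the project's trained model or feature set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic LAD table: health worsens with PM2.5 and with deprivation.
rng = np.random.default_rng(1)
n = 300
lads = pd.DataFrame({
    "lad": [f"LAD_{i:03d}" for i in range(n)],
    "pm25": rng.uniform(6, 14, n),
    "deprivation": rng.uniform(0, 1, n),
})
lads["health_index"] = (0.6 - 0.02 * lads["pm25"] - 0.1 * lads["deprivation"]
                        + rng.normal(0, 0.01, n))

feature_cols = ["pm25", "deprivation"]
model = GradientBoostingRegressor(random_state=0)

# Cross-validated R² as a check on out-of-sample fit before simulating.
cv_r2 = cross_val_score(model, lads[feature_cols], lads["health_index"],
                        cv=5, scoring="r2")
model.fit(lads[feature_cols], lads["health_index"])

# Scenario: cut PM2.5 by 20% everywhere, hold other features fixed.
scenario = lads[feature_cols].copy()
scenario["pm25"] *= 0.8
lads["predicted_gain"] = model.predict(scenario) - model.predict(lads[feature_cols])

# Rank areas by predicted improvement in the health index.
top10 = lads.nlargest(10, "predicted_gain")[["lad", "predicted_gain"]]

print(f"Mean CV R²: {cv_r2.mean():.3f}")
print(f"Mean predicted gain: {lads['predicted_gain'].mean():.4f}")
print(top10)
```

Two caveats apply to any simulation of this kind. Tree ensembles cannot extrapolate, so scenario values that fall below the lowest PM2.5 seen in training receive the prediction of the nearest observed range. The predicted gains are also model-based associations, not causal effects.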

Finding 5: Deprivation Nuances Matter in Predicting Pollution Exposure

Feature importance analysis (using SHAP values from relevant models) indicated that specific IMD domains, particularly 'Living Environment Deprivation' and 'Barriers to Housing & Services', were often stronger predictors of local pollution levels than income or employment deprivation alone. This suggests interventions need to consider these specific aspects of deprivation.

Heatmap showing importance or correlation of IMD domains for predicting pollution levels — IMD Domain Importance: Highlights that factors like poor housing conditions and local environmental quality are key correlates of pollution exposure, suggesting multi-faceted intervention points.
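As a dependency-light illustration of ranking IMD domains by predictive importance, the sketch below uses scikit-learn's permutation importance. This is a different technique from the SHAP values used in the project: it measures the drop in held-out score when a feature is shuffled, rather than per-prediction attributions. The data are synthetic and the domain column names and coefficients are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic IMD domain scores; by construction, living environment and
# barriers to housing & services drive pollution more than income does.
rng = np.random.default_rng(2)
n = 600
domains = ["living_environment", "barriers_housing_services", "income", "employment"]
imd = pd.DataFrame(rng.uniform(0, 1, size=(n, len(domains))), columns=domains)
pm25 = (8
        + 3.0 * imd["living_environment"]
        + 1.5 * imd["barriers_housing_services"]
        + 0.2 * imd["income"]
        + rng.normal(0, 0.2, n))

X_train, X_test, y_train, y_test = train_test_split(
    imd, pm25, test_size=0.25, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Importance = mean drop in held-out R² when one feature's values are shuffled.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=20, random_state=0)
importance = (pd.Series(result.importances_mean, index=imd.columns)
              .sort_values(ascending=False))

print(importance.round(3))
```

Permutation importance can be misleading when features are strongly correlated, which IMD domains typically are, so rankings from real data deserve a cross-check against a second method.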

Impact & Actionable Policy Recommendations

This analysis translates directly into actionable strategies for policymakers and public health bodies, enabling a shift from broad strokes to targeted interventions:

Laser-Focus Resource Allocation: Prioritise interventions and funding towards the specific LSOA-level High-High LISA clusters (Finding 1) and high-risk LAD typologies (e.g., 'Urban Deprived/Polluted', Finding 3). This data-driven approach maximises impact by concentrating efforts where the need is greatest.
Implement Quantified PM2.5 Reduction Policies: Strategically deploy PM2.5 reduction measures (Clean Air Zones, domestic burning restrictions, industrial controls) in the LADs that the GBR simulations predict will benefit most (Finding 4). The significant PSM association (Finding 2, ATT ≈ -0.0399, p<0.05) strengthens the case for acting on PM2.5 specifically. Preliminary impact estimates indicate substantial potential NHS savings, subject to validation with official ONS population data.
Design Holistic, Place-Based Interventions: Address the interconnected nature of pollution and deprivation. Policy should integrate pollution control with measures tackling key deprivation drivers like poor housing ('Living Environment Deprivation') and access barriers ('Barriers to Housing & Services') identified as critical factors (Finding 5).
Foster Coordinated Regional Strategies: Recognise that environmental injustice transcends administrative boundaries (Finding 1). Encourage collaboration between neighbouring LADs within identified high-risk clusters to address shared challenges and leverage potential positive spillover effects from interventions.

Limitations & Key Assumptions

While providing robust insights, this analysis acknowledges key limitations and assumptions:

Causal Claims (PSM): Propensity Score Matching strengthens associational claims by controlling for observed confounders (IMD domains), but cannot establish definitive causation due to potential unobserved factors. Findings should be read as statistically significant associations, not causal effects.
Ecological Fallacy: Conclusions are based on area-level data (LAD/LSOA) and may not perfectly reflect individual-level risks.
Data Timeliness & Granularity: Findings are based on specific data snapshots (e.g., IMD 2019). Relationships may evolve.
Model Assumptions: Results are contingent on the assumptions of the chosen models (e.g., PSM assumptions, K-Means/GBR parameters).
Ethical Considerations & Bias: Potential biases in source data (e.g., monitor placement, health reporting) could influence results. Responsible AI principles guided the analysis.
NO₂ Analysis Results: The PSM analysis for NO₂ yielded non-significant results (p ≈ 0.30) after controls, suggesting PM2.5 has a stronger independent association in this analysis.
Impact Quantification Data: Population-based impact quantifications require validation using official ONS figures.

Future Work & Potential Enhancements

Building on this foundation, future work could enhance the analysis and operationalise insights:

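Longitudinal & Causal Analysis: Incorporate time-series data and stronger quasi-experimental methods such as Difference-in-Differences (DiD) and Regression Discontinuity Design (RDD).
Advanced Spatial/ML Modelling: Explore Geographically Weighted Regression (GWR), Graph Neural Networks (GNNs), or deep learning.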
Granularity & Context: Integrate finer-grained data and qualitative research.
AI-Powered Synthesis & Augmentation: Leverage LLMs/Transformers for contextual analysis and reporting.
Broadened Scope: Include other environmental factors (noise, green space) and health outcomes (mental health).
MLOps & Productionisation: Formalise monitoring (MLflow, Evidently AI), automation (Airflow/Prefect), and scalability (Dask/Spark).
Responsible AI Deep Dive: Further investigate fairness metrics and mitigation strategies.

Technology Stack

Python
Pandas
GeoPandas
NumPy
Scikit-learn
Statsmodels
PySAL
SHAP
Matplotlib
Seaborn
Plotly

View Code & Technical Details on GitHub