Enterprise-Scale MLOps for Healthcare Prediction

Showcasing an enterprise-ready Machine Learning Operations (MLOps) pipeline for clinical readmission prediction. This strategic Proof-of-Concept (PoC), using the Medical Information Mart for Intensive Care (MIMIC)-III Demo data, emphasises methodological rigour, end-to-end best practices, and production readiness for reliable healthcare Artificial Intelligence (AI).

Project Overview

A production-ready MLOps pipeline for healthcare prediction

About This Project

This project presents a Strategic Proof-of-Concept (PoC) focused on building an MLOps pipeline using the MIMIC-III Clinical Database Demo (v1.4). While the demo dataset limits the statistical power of the results, it provides a complex, realistic environment to demonstrate an enterprise-ready MLOps architecture capable of handling real-world healthcare data challenges, specifically targeting the Electronic Health Record (EHR) complexity. The primary goal is showcasing methodological rigour, software engineering best practices, and an end-to-end workflow designed for reliability, reproducibility, and scalability.

Focus on Process & Scalability: The pipeline architecture, code structure, and MLOps components (testing, monitoring, deployment) are the key deliverables. The architecture is designed to scale to the full MIMIC dataset and production EHR systems; the performance metrics shown are illustrative of the insights achievable via this robust process.

Current Status

  • Implemented: Core MLOps Pipeline (Data Processing, Training, Basic Evaluation)
  • Implemented: Imbalance Handling Analysis & SMOTE Integration
  • In Progress: Advanced Temporal Modelling (Time-Aware LSTM) Integration
  • In Progress: Basic Continuous Integration (CI) Workflow Setup (GitHub Actions)
  • Conceptual: Causal Inference Exploration (Initial ATE Estimate)
  • Conceptual: Advanced Monitoring & Deployment Strategies

Project Goals

  • Implement a complete MLOps pipeline from data processing to deployment concept.
  • Demonstrate best practices in feature engineering for clinical data.
  • Showcase advanced techniques for handling class imbalance.
  • Implement ethical AI considerations (Explainable AI (XAI) and fairness analysis).
  • Provide comprehensive documentation and reproducible workflows.
  • Provide a production-ready foundation designed to scale to the full MIMIC dataset and beyond.

Scalability Considerations

Transitioning from this PoC to the full MIMIC dataset or a production EHR system involves addressing key scalability challenges:

Data Engineering at Scale
  • ETL: Utilise distributed processing frameworks like Apache Spark or Dask for handling terabytes of raw data efficiently.
  • Feature Stores: Implement feature stores such as Feast or Tecton for managing versioned features, ensuring consistency between training and serving, and enabling feature reuse.
  • Data Validation: Employ robust data validation tools (e.g., Great Expectations) within the pipeline to handle data quality issues at scale.
Training & Infrastructure
  • Distributed Training: Leverage frameworks like Horovod or PyTorch DDP for training complex models (like LSTMs/Transformers) on large datasets across multiple GPUs/nodes.
  • Hyperparameter Tuning: Employ efficient, parallelised tuning libraries like Ray Tune or Optuna (see the sketch after this list).
  • Infrastructure: Utilise managed cloud ML platforms (AWS SageMaker, GCP Vertex AI, Azure ML) or Kubernetes for scalable resource provisioning and orchestration (including GPU management).
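
For instance, the hyperparameter tuning step could be parallelised with Optuna along these lines (a minimal sketch with placeholder search ranges and synthetic stand-in data, not the project's actual configuration):

import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Stand-in data; in the pipeline this would come from the feature-engineering step
X, y = make_classification(n_samples=1000, weights=[0.88], random_state=42)

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
    }
    model = LGBMClassifier(**params)
    # Optimise PR AUC (average precision), the metric emphasised for imbalanced data
    return cross_val_score(model, X, y, scoring="average_precision", cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50, n_jobs=4)   # trials run in parallel on one node
print(study.best_params)
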
Monitoring at Scale

Drift detection and performance monitoring must handle high throughput. This involves:

  • Specialised monitoring tools (e.g., Arize, Fiddler, WhyLabs) or robust custom statistical process control (SPC) implementations.
  • Efficient logging and aggregation strategies for handling millions of daily predictions.
  • Automated alerting based on predefined drift thresholds and performance degradation.
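
A minimal sketch of the kind of custom SPC-style check and threshold alerting described above (assuming NumPy; bin count, threshold, and data are illustrative):

import numpy as np

def population_stability_index(reference, current, bins=10):
    # Compare binned distributions of a feature between training and serving data
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)   # avoid log(0) for empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Synthetic stand-ins for logged feature values (training reference vs. recent serving window)
rng = np.random.default_rng(0)
reference_values = rng.normal(size=5000)
recent_values = rng.normal(0.3, 1.0, size=5000)

psi = population_stability_index(reference_values, recent_values)
if psi > 0.2:   # 0.2 is a commonly used "significant drift" threshold
    print(f"ALERT: feature drift detected (PSI = {psi:.3f})")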

Resource Estimation (Illustrative)

| Resource | Demo Implementation | Full-Scale Estimate | Scaling Factor |
|---|---|---|---|
| Compute (Training) | 4-core CPU, 16GB RAM | 32+ core distributed cluster, 128GB+ RAM, multi-GPU | 8-16x+ |
| Storage (incl. versions) | ~5GB | ~2-5TB+ | 400-1000x+ |
| Inference Throughput | ~50 req/sec (single node) | 1,000+ req/sec (auto-scaling cluster) | 20x+ |
| Est. Monthly Cloud Cost | ~$50 (~£40) | ~$2,500-$10,000+ (~£2,000-£8,000+) | 50-200x+ |
Note: Full-scale costs are highly dependent on data volume, model complexity, query frequency, and chosen cloud services.

Prediction Tasks

The project focuses on three key prediction tasks in critical care:

  1. 30-day hospital readmission risk (Primary Focus)
  2. Intensive Care Unit (ICU) mortality prediction
  3. Length of stay estimation

System Architecture

A modular, scalable architecture designed for research and production

Architecture Overview

The project follows a structured MLOps pipeline from data ingestion to deployment:

Data Sources

MIMIC-III clinical database with patient demographics, vital signs, lab values, medications, procedures, and diagnoses.

Data Processing

Robust Extract, Transform, Load (ETL) pipeline with data cleaning, preprocessing, and transformation.

Feature Engineering

Comprehensive feature extraction, including temporal patterns and domain-specific transformations.

Model Development

Training, hyperparameter tuning, evaluation, and interpretation with a focus on handling class imbalance.

Deployment & Monitoring

REST Application Programming Interface (API), interactive dashboard, and continuous monitoring concepts for model performance and data drift.

Directory Structure

mimic-readmission-predictor/
├── api/                     # FastAPI implementation
├── assets/                  # Generated plots & results
├── configs/                 # Configuration files (YAML)
├── dashboard/               # Streamlit dashboard
├── data/                    # Raw, processed data
├── docs/                    # Project documentation
├── models/                  # Saved model artefacts
├── notebooks/               # Exploration & PoCs
├── src/                     # Source code (ETL, features, models)
│   ├── data/
│   ├── features/
│   ├── models/
│   ├── visualisation/
│   └── utils/
├── tests/                   # Unit & integration tests
├── .github/                 # CI/CD workflows (conceptual)
├── .gitignore
├── FUTURE_WORK.md           # Detailed future enhancements
├── index.html               # This page
├── LICENSE
├── README.md
└── requirements.txt

The project follows a modular structure facilitating maintainability, testability, and adherence to MLOps principles.

For a detailed view of the architecture, including component descriptions and data flow, see the Architecture Documentation.

Key Features & Innovations

Advanced techniques and pipeline design concepts

Advanced Temporal Modelling

Exploring the sequential nature of EHR data using a Time-Aware Long Short-Term Memory (LSTM) with Attention PoC (see notebooks/time_aware_lstm.py). This approach aims to capture irregular time intervals and identify critical time points.

import torch
import torch.nn as nn

class TimeAwareLSTM(nn.Module):
    # ... (Initialisation of self.time_encoder, self.lstm, self.attention and
    #      self.classifier -- see notebooks/time_aware_lstm.py) ...
    def forward(self, x, time_intervals):
        # Incorporate learned time embeddings for the irregular gaps between events
        time_encoding = self.time_encoder(time_intervals)
        x_combined = torch.cat([x, time_encoding], dim=2)

        # LSTM over the time-augmented sequence (the full PoC packs/unpacks
        # variable-length sequences around this call)
        lstm_out, _ = self.lstm(x_combined)

        # Attention mechanism identifies key time steps
        context, attn_weights = self.attention(lstm_out)

        # Classifier head maps the attention context to a readmission logit
        return self.classifier(context)

This model explicitly incorporates time dynamics, a critical factor often ignored by static models. Performance evaluation on the demo dataset is presented in the Visualisations section.

Imbalance Handling Analysis

Systematic analysis of techniques for handling significant class imbalance (7.2:1 ratio in demo data). Compared Baseline, Class Weights, Random Oversampling, SMOTE (Synthetic Minority Over-sampling Technique), and Random Undersampling using cross-validation.

Key Finding: SMOTE demonstrated the best-balanced performance (F1-score) on the demo dataset, effectively improving recall without excessively sacrificing precision.

See src/models/imbalance_analysis.py and Visualisations for implementation and results.
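
A condensed sketch of the comparison loop (assuming imbalanced-learn and scikit-learn, with synthetic stand-in data; the full five-strategy analysis lives in src/models/imbalance_analysis.py):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with roughly the demo dataset's 7:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.875], random_state=42)

strategies = {
    "baseline": Pipeline([("clf", LogisticRegression(max_iter=1000))]),
    "class_weights": Pipeline([("clf", LogisticRegression(class_weight="balanced", max_iter=1000))]),
    # imblearn's Pipeline applies SMOTE only to the training folds, avoiding leakage
    "smote": Pipeline([("smote", SMOTE(random_state=42)),
                       ("clf", LogisticRegression(max_iter=1000))]),
}

for name, pipe in strategies.items():
    f1 = cross_val_score(pipe, X, y, scoring="f1", cv=5).mean()
    print(f"{name}: mean F1 = {f1:.3f}")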

Causal Inference Exploration

Exploring techniques to estimate the causal effect of interventions or features on readmission, moving beyond simple correlations found in observational data like MIMIC.

Technique Example: Utilised Doubly Robust Estimation (via libraries like EconML) to estimate the Average Treatment Effect (ATE) of a specific intervention (e.g., prior circulatory diagnosis) while controlling for measured confounders (specifically `age` in this PoC) present in the observational MIMIC data.

Preliminary Finding (Highly Illustrative): Initial analysis on this demo subset suggests the intervention has an estimated ATE of -0.03 (a 3% reduction) on 30-day readmission probability. This is purely illustrative due to demo data limitations and minimal confounder control.

Definitive causal claims require significant further work, including larger datasets, rigorous sensitivity analysis, and exploring alternative identification strategies. See notebook for PoC.
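
A heavily simplified sketch of the doubly robust ATE estimation (the cohort here is synthetic and the nuisance models are assumptions for illustration; the notebook contains the actual PoC analysis):

import numpy as np
from econml.dr import LinearDRLearner
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in cohort; in the PoC these columns come from the processed MIMIC demo data
rng = np.random.default_rng(0)
age = rng.normal(65, 15, 500).reshape(-1, 1)          # measured confounder controlled for
treatment = rng.binomial(1, 0.3, 500)                 # e.g. prior circulatory diagnosis (0/1)
readmit = rng.binomial(1, 0.12, 500)                  # 30-day readmission outcome (0/1)

est = LinearDRLearner(
    model_propensity=LogisticRegression(max_iter=1000),   # nuisance model for P(T | X)
    model_regression=LinearRegression(),                  # nuisance model for E[Y | T, X]
)
est.fit(readmit, treatment, X=age)
print("Estimated ATE:", est.ate(age))                 # the demo-subset estimate was around -0.03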

MLOps Framework Design

Designing an end-to-end MLOps framework including Continuous Integration/Continuous Deployment (CI/CD), monitoring, and deployment strategies for reliability at scale.

Conceptual CI/CD Flow

Code Push (Git) → Automated Tests → Model Retrain → Validation → Deploy (API/Dash)

Conceptual Monitoring Strategy
  • Data Drift: Plan includes monitoring key features using Population Stability Index (PSI) and statistical tests (e.g., Kolmogorov-Smirnov (KS)).
  • Concept Drift: Plan includes tracking model performance metrics (AUC, F1-score) over time.
  • Tooling: Plan includes integration with logging (MLflow), dashboards (Grafana), and potentially drift detection tools (Evidently AI).
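
As a sketch of the planned concept-drift tracking, performance on incoming labelled batches could be logged to MLflow over time (run and metric names below are illustrative, and the data is synthetic):

import mlflow
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

# Stand-in model and data; in production these would be the deployed model and labelled batches
rng = np.random.default_rng(0)
model = LogisticRegression().fit(rng.normal(size=(500, 5)), rng.integers(0, 2, 500))

with mlflow.start_run(run_name="weekly-drift-check"):
    for step in range(4):                                # e.g. four weekly evaluation batches
        X_batch, y_batch = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
        scores = model.predict_proba(X_batch)[:, 1]
        # Logging metrics per step makes performance degradation visible in the MLflow UI
        mlflow.log_metric("rolling_auc", roc_auc_score(y_batch, scores), step=step)
        mlflow.log_metric("rolling_f1", f1_score(y_batch, (scores > 0.5).astype(int)), step=step)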

See MLOps Status for current implementation details and Future Work for advanced plans.

Strategic Impact Analysis

Connecting model performance to potential real-world value.

Estimated Impact (Illustrative): Based on the SMOTE-enhanced model's performance on this demo subset (Precision-Recall Area Under the Curve (PR AUC) ~0.35) and reported average readmission costs, scaling this pipeline could *potentially* identify an additional 5-10% of at-risk patients compared to simpler rules. This could translate to potential annual savings on the order of approx. £100k-£200k per 10,000 relevant patient discharges in a large hospital system by enabling targeted interventions (highly dependent on actual performance and intervention effectiveness).

Workflow Integration: The prediction score output by the API could be integrated into the EHR system as a clinical decision support alert for care coordinators or physicians during discharge planning.
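
For example, a care-coordination system could call the API and apply an alert threshold along these lines (the endpoint path, payload fields, and threshold are hypothetical):

import requests

# Hypothetical payload; real field names depend on the deployed feature schema
patient_features = {"age": 67, "length_of_stay": 9, "num_prior_admissions": 2}

resp = requests.post("http://localhost:8000/predict", json=patient_features, timeout=5)
risk = resp.json().get("readmission_risk", 0.0)

# Illustrative decision-support rule applied during discharge planning
if risk >= 0.4:
    print(f"Flag for care-coordinator review (predicted 30-day readmission risk {risk:.0%})")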

See the Strategic Impact Documentation for a detailed breakdown.

Responsible & Explainable AI (XAI)

Integrating fairness analysis and interpretability techniques like SHAP (SHapley Additive exPlanations) for responsible and transparent AI.

SHAP analysis provides insights into individual feature impacts on predictions, aiding clinical trust and understanding.

Fairness analysis setup in src/visualisation/. See Visualisations for SHAP plots and Ethical AI for fairness discussion.

MLOps Implementation Status

Current state of the operational pipeline components

CI/CD & Tracking

  • CI Implementation (Basic): GitHub Actions workflow (.github/workflows/) currently implements automated linting (flake8) and formatting checks (black) triggered on push.
  • Testing Framework: Basic unit test structure using pytest established in tests/; an illustrative test sketch follows this list.
  • Experiment Tracking (Conceptual): Designed for integration with MLflow for comprehensive tracking of parameters, metrics, code versions, and artefacts. Basic logging placeholders may exist in scripts.
  • CD Implementation (Conceptual): Design includes automated testing, model validation, and deployment triggers (e.g., API/dashboard updates), but these are not yet implemented in CI workflows.
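
An illustrative unit test of the kind the tests/ structure is intended to hold (the imported helper and expected column are hypothetical):

# tests/test_features.py (illustrative; the helper below is a hypothetical example)
import pandas as pd
from src.features.build_features import add_length_of_stay   # assumed helper

def test_length_of_stay_is_non_negative():
    df = pd.DataFrame({
        "admittime": pd.to_datetime(["2126-01-01", "2126-02-10"]),
        "dischtime": pd.to_datetime(["2126-01-05", "2126-02-12"]),
    })
    out = add_length_of_stay(df)
    assert (out["length_of_stay_days"] >= 0).all()
    assert out.loc[0, "length_of_stay_days"] == 4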

Full CI/CD automation, model registry integration, and deployment strategies (Shadow/Canary) are outlined in Future Work.

Monitoring & Deployment

  • Data/Concept Drift Monitoring (Design): Strategy involves tracking PSI and KS-tests for key features, and monitoring performance metrics (AUC, F1) over time. Conceptualised use of tools like Evidently AI or custom dashboards (Grafana).
  • Operational Monitoring (Basic): API includes basic health checks (/health) and request logging. Design includes monitoring latency, errors, and throughput via standard tools (e.g., Prometheus/Grafana).
  • Deployment (Implemented):
    • REST API serving predictions via FastAPI (api/); a trimmed-down sketch follows this list.
    • Interactive dashboard via Streamlit (dashboard/).
    • Containerisation concepts using Docker outlined for portability.
  • Scalable Deployment (Conceptual): Future plans include Kubernetes orchestration or serverless deployments for scalability and advanced patterns (Blue/Green, Canary).
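
A trimmed-down sketch of the serving pattern in api/ (the request schema, feature names, and model path are simplified assumptions):

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI(title="Readmission Risk API")
model = joblib.load("models/readmission_model.joblib")   # assumed artefact location

class PatientFeatures(BaseModel):
    age: float
    length_of_stay: float
    num_prior_admissions: int

@app.get("/health")
def health():
    # Basic liveness check used by operational monitoring
    return {"status": "ok"}

@app.post("/predict")
def predict(features: PatientFeatures):
    X = [[features.age, features.length_of_stay, features.num_prior_admissions]]
    risk = float(model.predict_proba(X)[0, 1])
    return {"readmission_risk": risk}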

See Future Work Documentation for advanced MLOps design details.

Project Visualisations

Key insights from analysis and modelling on the MIMIC Demo dataset

Imbalance Handling Techniques Comparison

Addressing class imbalance is crucial in healthcare. In the MIMIC demo subset, readmissions represent only ~12% of cases (approx. 7.2:1 ratio). We compared several techniques:

Key Findings (Metrics):

  • Baseline Failure: Without handling imbalance, the model struggled to identify positive cases (low recall).
  • SMOTE Advantage: SMOTE achieved the highest F1 score (0.373), indicating the best balance between precision and recall on this demo dataset.
  • Recall vs. Precision: Class weights and random oversampling boosted recall significantly (>0.81) but resulted in low precision (<0.20).
  • Synthetic Data Benefit: SMOTE's synthetic sample generation appeared more effective than simple duplication (Random Oversampling).

See imbalance_analysis.py for details.

Interactive Precision-Recall (PR) Curve:

The PR curve illustrates the trade-off between precision (correct positive predictions out of all positive predictions) and recall (correct positive predictions out of all actual positives). A curve closer to the top-right indicates better performance.

Takeaway: This interactive visualisation shows SMOTE (green line) maintaining higher precision across a wider range of recall values compared to the baseline (grey dashed line), indicating its effectiveness in identifying readmissions without an excessive number of false alarms in this dataset.

Hover over the curves to explore thresholds. PR Area Under the Curve (AUC) summarises this trade-off.
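
For reference, the curve and its PR AUC summary are computed with scikit-learn roughly as follows (stand-in arrays shown here in place of the held-out labels and model probabilities):

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Stand-in arrays; in the project these are the held-out labels and predicted probabilities
rng = np.random.default_rng(42)
y_test = rng.integers(0, 2, 200)
y_scores = np.clip(0.4 * y_test + 0.6 * rng.uniform(size=200), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
pr_auc = average_precision_score(y_test, y_scores)   # summarises the precision-recall trade-off
print(f"PR AUC = {pr_auc:.3f}")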

Feature Importance Analysis (SHAP)

Understanding feature contributions using SHAP (SHapley Additive exPlanations) helps interpret model behaviour. The plot below shows the impact of features on the model's output (predicting readmission) for the baseline Logistic Regression model.

SHAP Summary Plot

Takeaway: Features like gender (`gender_m`, `gender_f`) and marital status (`marital_status_single`) show distinct impacts. Red points indicate higher feature values, blue indicates lower. Points further from the centre line have a larger impact on pushing the prediction towards (right) or away from (left) readmission. This provides global insights into feature influence.

See generate_shap_plots.py.
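
A condensed version of the plotting step (assuming the baseline Logistic Regression and a small stand-in feature frame; generate_shap_plots.py is the authoritative script):

import numpy as np
import pandas as pd
import shap
from sklearn.linear_model import LogisticRegression

# Stand-in features; in the project these are the processed MIMIC demo features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)),
                 columns=["age", "gender_m", "gender_f", "marital_status_single"])
y = rng.integers(0, 2, 300)
model = LogisticRegression(max_iter=1000).fit(X, y)

explainer = shap.LinearExplainer(model, X)   # linear model, so SHAP values are exact
shap_values = explainer.shap_values(X)

# Beeswarm summary: one point per patient/feature, colour encodes the feature value
shap.summary_plot(shap_values, X, show=False)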

Time-Aware LSTM Model Performance

Performance of the Time-Aware LSTM PoC compared against the LightGBM baseline on the same train/test split of the *synthetic* temporal data created for this demonstration.

ROC and PR Curves for Time-Aware LSTM vs LightGBM

ROC and PR Curve Comparison

Takeaway: On this synthetic dataset, the Time-Aware LSTM (green) shows competitive performance against the LightGBM baseline (red), particularly noticeable in the PR curve (bottom plot), which is crucial for imbalanced data. The LSTM achieves a PR AUC of approx. 0.357 against LightGBM's 0.240, suggesting potential benefits from temporal modelling.

Training Curves for Time-Aware LSTM

Training Progress

Takeaway: The LSTM training loss (left plot, blue) decreases steadily over epochs, while the test loss (red) stabilises, indicating successful learning without significant overfitting on this data. Test Receiver Operating Characteristic Area Under the Curve (ROC AUC) and PR AUC (centre and right plots, green) show performance improving and plateauing, demonstrating effective learning within the configured epochs.

Process Enabling Reliable Performance: Our rigorous MLOps pipeline design, including automated data validation concepts and planned experiment tracking (e.g., using MLflow), allows systematic evaluation. Crucially, the performance metrics shown (e.g., PR AUC ~0.35) are derived from the limited MIMIC-III demo dataset using *synthetically generated* temporal data for this PoC. They illustrate the capability of the process and the potential of temporal models, not definitive real-world performance. The robust pipeline is designed to ensure reproducibility and improvement as data scales.

Ethical AI Considerations

Addressing bias and fairness in healthcare AI

Fairness Evaluation & Explainability

Building responsible AI requires actively measuring and addressing potential biases and ensuring model transparency.

Fairness Metrics Approach

Evaluation Strategy: We plan to evaluate fairness using established metrics like Equalized Odds Difference across sensitive attributes (e.g., gender, ethnicity) to quantify disparities in model performance between subgroups.

Demo Data Limitation: Due to the statistical limitations of the small MIMIC demo dataset, displaying subgroup fairness metrics could be misleading. Therefore, specific fairness plots (like Recall by Gender shown in the assets) are generated by the code but omitted here to uphold responsible reporting practices.

Mitigation/Monitoring Plan: While specific mitigation techniques (e.g., re-weighting) are planned for the full-scale implementation, the pipeline includes steps for generating these metrics (generate_fairness_plots.py) to enable continuous monitoring and ensure transparency around potential disparities when using larger datasets.
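
A sketch of the planned subgroup evaluation (assuming fairlearn; y_test, y_pred, and the sensitive-attribute column are taken from a held-out split, and as noted above the demo-data numbers would be too noisy to report):

from fairlearn.metrics import MetricFrame, equalized_odds_difference
from sklearn.metrics import recall_score

# y_test, y_pred and df_test["gender"] are assumed to come from the held-out evaluation split
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=df_test["gender"])
per_group_recall = MetricFrame(metrics=recall_score, y_true=y_test, y_pred=y_pred,
                               sensitive_features=df_test["gender"])
print("Equalized odds difference:", eod)
print(per_group_recall.by_group)   # recall per gender subgroup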

Explainability in Action (XAI)

SHAP Integration: We integrated SHAP (SHapley Additive exPlanations) to provide both global feature importance insights (understanding overall model behaviour) and local, per-prediction explanations.

Value Proposition: SHAP values allow clinicians to understand the key drivers (e.g., specific lab results, length of stay) contributing to an individual patient's predicted risk score, fostering trust and enabling more informed decision-making.

Global SHAP analysis provides feature importance insights (see Visualisations section).

Fairness-Performance Trade-offs: Addressing fairness often involves trade-offs with overall predictive performance. Our approach prioritises transparency and includes generating fairness metrics to inform stakeholder discussions on acceptable trade-offs in a clinical context (e.g., balancing the harms of false negatives vs. false positives across groups). This requires ongoing assessment.

For a detailed discussion of our ethical framework, see the Ethical Considerations documentation.

Future Directions & Enhancements

Leveraging cutting-edge AI for deeper insights

This PoC establishes a strong foundation. Future iterations could incorporate advanced AI techniques:

  • Large Language Models (LLMs) for Feature Extraction: Utilise models like ClinicalBERT to automatically extract structured clinical concepts (symptoms, findings) from unstructured notes, enriching the feature set.
  • Retrieval-Augmented Generation (RAG) for Explainability: Implement RAG systems using vector databases to retrieve relevant medical literature or similar patient cases, providing clinicians with evidence-based context alongside model predictions and SHAP explanations.
  • Transformer-Based Temporal Models: Explore architectures like BEHRT for potentially better modelling of long-range dependencies in patient histories compared to LSTMs.
  • Graph Neural Networks (GNNs): Model patient relationships or disease progression pathways as graphs to uncover complex relational patterns missed by sequence models.
  • Advanced Causal Inference: Implement techniques like Causal Forests or Targeted Maximum Likelihood Estimation (TMLE) for more robust causal effect estimation and understanding heterogeneous treatment effects.
  • Federated Learning: Investigate training models across multiple institutions without sharing raw patient data, enhancing privacy and generalisability.

See the Future Work Documentation for more detailed plans on these enhancements.