Showcasing an enterprise-ready Machine Learning Operations (MLOps) pipeline for clinical readmission prediction. This strategic Proof-of-Concept (PoC), using the Medical Information Mart for Intensive Care (MIMIC)-III Demo data, emphasises methodological rigour, end-to-end best practices, and production readiness for reliable healthcare Artificial Intelligence (AI).
A production-ready MLOps pipeline for healthcare prediction
This project presents a Strategic Proof-of-Concept (PoC) focused on building an MLOps pipeline using the MIMIC-III Clinical Database Demo (v1.4). While the demo dataset limits the statistical power of the results, it provides a complex, realistic environment in which to demonstrate an enterprise-ready MLOps architecture capable of handling real-world healthcare data challenges, in particular the complexity of Electronic Health Record (EHR) data. The primary goal is to showcase methodological rigour, software engineering best practices, and an end-to-end workflow designed for reliability, reproducibility, and scalability.
Transitioning from this PoC to the full MIMIC dataset or a production EHR system involves addressing key scalability challenges:
Drift detection and performance monitoring must handle high throughput. This involves:
Resource | Demo Implementation | Full-Scale Estimate | Scaling Factor |
---|---|---|---|
Compute (Training) | 4-core CPU, 16GB RAM | 32+ core Dist. Cluster, 128GB+ RAM, Multi-GPU | 8-16x+ |
Storage (Inc. Versions) | ~5GB | ~2-5TB+ | 400-1000x+ |
Inference Throughput | ~50 req/sec (Single Node) | 1000+ req/sec (Auto-scaling Cluster) | 20x+ |
Est. Monthly Cloud Cost | ~$50 (~£40) | ~$2,500 - $10,000+ (~£2,000 - £8,000+) | 50-200x+ |
The project focuses on three key prediction tasks in critical care:
A modular, scalable architecture designed for research and production
The project follows a structured MLOps pipeline from data ingestion to deployment:
MIMIC-III clinical database with patient demographics, vital signs, lab values, medications, procedures, and diagnoses.
Robust Extract, Transform, Load (ETL) pipeline with data cleaning, preprocessing, and transformation.
Comprehensive feature extraction, including temporal patterns and domain-specific transformations.
Training, hyperparameter tuning, evaluation, and interpretation with a focus on handling class imbalance.
REST Application Programming Interface (API), interactive dashboard, and continuous monitoring concepts for model performance and data drift.
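The data drift monitoring mentioned in the final stage can be illustrated with a Population Stability Index (PSI) check on a single numeric feature. This is a minimal sketch, not the project's monitoring code: the function name, the equal-width binning, and the `eps` floor are assumptions made for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))
```

In a monitoring job, `reference` would typically be the training-time values of a feature and `current` a recent window of production inputs; the alerting threshold is a policy choice and is deliberately left out of the sketch.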
mimic-readmission-predictor/
├── api/                  # FastAPI implementation
├── assets/               # Generated plots & results
├── configs/              # Configuration files (YAML)
├── dashboard/            # Streamlit dashboard
├── data/                 # Raw, processed data
├── docs/                 # Project documentation
├── models/               # Saved model artefacts
├── notebooks/            # Exploration & PoCs
├── src/                  # Source code (ETL, features, models)
│   ├── data/
│   ├── features/
│   ├── models/
│   ├── visualisation/
│   └── utils/
├── tests/                # Unit & integration tests
├── .github/              # CI/CD workflows (conceptual)
├── .gitignore
├── FUTURE_WORK.md        # Detailed future enhancements
├── index.html            # This page
├── LICENSE
├── README.md
└── requirements.txt
The project follows a modular structure facilitating maintainability, testability, and adherence to MLOps principles.
For a detailed view of the architecture, including component descriptions and data flow, see the Architecture Documentation.
Advanced techniques and pipeline design concepts
Exploring the sequential nature of EHR data using a Time-Aware Long Short-Term Memory (LSTM) with Attention PoC (see notebooks/time_aware_lstm.py). This approach aims to capture irregular time intervals and identify critical time points.
import torch
import torch.nn as nn

class TimeAwareLSTM(nn.Module):
    # ... (Initialisation) ...

    def forward(self, x, time_intervals):
        # Incorporate learned time embeddings
        time_encoding = self.time_encoder(time_intervals)
        x_combined = torch.cat([x, time_encoding], dim=2)
        # Pack, LSTM, Unpack...
        lstm_out, _ = self.lstm(...)
        # Attention mechanism identifies key time steps
        context, attn_weights = self.attention(lstm_out)
        # ... Classifier ...
This model explicitly incorporates time dynamics, a critical factor often ignored by static models. Performance evaluation on the demo dataset is presented in the Visualisations section.
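To make the attention step concrete, the sketch below pools a sequence of per-time-step hidden vectors into one context vector using softmax weights. It is a dependency-free illustration rather than the notebook's PyTorch module: `attention_pool` and the fixed `query` vector are hypothetical stand-ins for parameters that the real model would learn.

```python
import math
from typing import Sequence


def attention_pool(
    hidden_states: Sequence[Sequence[float]],
    query: Sequence[float],
) -> tuple[list[float], list[float]]:
    """Return (context vector, attention weights) for one sequence."""
    # One relevance score per time step: dot product with the query vector.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax turns scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


# Hypothetical hidden states for three time steps of one admission.
states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
context, weights = attention_pool(states, query=[1.0, 1.0])
```

The weights are what make the approach inspectable: the time step with the largest weight is the one the pooled representation leans on most.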
Systematic analysis of techniques for handling significant class imbalance (7.2:1 ratio in demo data). Compared Baseline, Class Weights, Random Oversampling, SMOTE (Synthetic Minority Over-sampling Technique), and Random Undersampling using cross-validation.
Key Finding: SMOTE demonstrated the best-balanced performance (F1-score) on the demo dataset, effectively improving recall without excessively sacrificing precision.
See src/models/imbalance_analysis.py and Visualisations for implementation and results.
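The core idea behind SMOTE can be sketched in a few lines: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function below is an illustrative, standard-library-only version (the name `smote_sketch` and its parameters are assumptions); it is not the code in imbalance_analysis.py, and libraries such as imbalanced-learn provide maintained implementations.

```python
import math
import random
from typing import Sequence


def smote_sketch(
    minority: Sequence[Sequence[float]],
    n_synthetic: int,
    k: int = 5,
    seed: int = 0,
) -> list[list[float]]:
    """Generate synthetic minority samples by interpolating towards neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority-class neighbours of the base point (excluding itself).
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # New point lies at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

When combined with cross-validation, oversampling of this kind should be applied to the training folds only, so that synthetic points never leak into the data used for evaluation.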
Exploring techniques to estimate the causal effect of interventions or features on readmission, moving beyond simple correlations found in observational data like MIMIC.
Technique Example: Utilised Doubly Robust Estimation (via libraries like EconML) to estimate the Average Treatment Effect (ATE) of a specific intervention (e.g., prior circulatory diagnosis) while controlling for measured confounders (specifically `age` in this PoC) present in the observational MIMIC data.
Preliminary Finding (Highly Illustrative): Initial analysis on this demo subset suggests the intervention has an estimated ATE of -0.03 (a 3 percentage-point reduction) on 30-day readmission probability. This is purely illustrative due to demo data limitations and minimal confounder control.
Definitive causal claims require significant further work, including larger datasets, rigorous sensitivity analysis, and exploring alternative identification strategies. See notebook for PoC.
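For readers unfamiliar with doubly robust estimation, the estimator combines an outcome model with propensity weighting, and remains consistent if either of the two is correctly specified. The sketch below writes out the augmented inverse-probability-weighted (AIPW) form directly. It is not EconML's API and not the notebook's code: `aipw_ate` and its argument names are hypothetical, and the propensity scores and outcome-model predictions are assumed to have been fitted elsewhere.

```python
from typing import Sequence


def aipw_ate(
    outcome: Sequence[float],
    treated: Sequence[int],
    propensity: Sequence[float],
    mu_treated: Sequence[float],
    mu_control: Sequence[float],
) -> float:
    """Augmented inverse-probability-weighted (doubly robust) ATE estimate.

    propensity[i] is the estimated P(treated = 1 | covariates);
    mu_treated[i] / mu_control[i] are outcome-model predictions under each arm.
    """
    total = 0.0
    for y, t, e, m1, m0 in zip(outcome, treated, propensity, mu_treated, mu_control):
        # Outcome-model contrast plus inverse-propensity-weighted residual corrections.
        total += (m1 - m0) + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
    return total / len(outcome)
```

Propensity scores close to 0 or 1 make the correction terms unstable, which is one reason why small samples with limited confounder control yield estimates that are illustrative at best.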
Designing an end-to-end MLOps framework including Continuous Integration/Continuous Deployment (CI/CD), monitoring, and deployment strategies for reliability at scale.
Code Push (Git) → Automated Tests → Model Retrain → Validation → Deploy (API/Dash)
See MLOps Status for current implementation details and Future Work for advanced plans.
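The Validation stage in the flow above is where a retrained model is either promoted or held back. One common way to express such a gate is sketched here; the names (`GateDecision`, `validation_gate`), the choice of PR AUC as the gating metric, and the tolerance are assumptions for illustration, not the project's implemented checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    promote: bool
    reason: str


def validation_gate(
    candidate_pr_auc: float,
    production_pr_auc: float,
    min_pr_auc: float,
    tolerance: float = 0.01,
) -> GateDecision:
    """Decide whether a retrained model may proceed to the deploy stage."""
    # Absolute floor: never deploy a model below the agreed minimum.
    if candidate_pr_auc < min_pr_auc:
        return GateDecision(False, "below absolute PR AUC floor")
    # Relative check: allow small noise, block clear regressions.
    if candidate_pr_auc < production_pr_auc - tolerance:
        return GateDecision(False, "regression versus production model")
    return GateDecision(True, "passed validation")
```

A CI job could call this after retraining and fail the pipeline when `promote` is false, so that the reason string appears in the build log.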
Connecting model performance to potential real-world value.
Estimated Impact (Illustrative): Based on the SMOTE-enhanced model's performance on this demo subset (Precision-Recall Area Under the Curve (PR AUC) ~0.35) and reported average readmission costs, scaling this pipeline could *potentially* identify an additional 5-10% of at-risk patients compared to simpler rules. This could translate to potential annual savings on the order of £100k-£200k per 10,000 relevant patient discharges in a large hospital system by enabling targeted interventions (highly dependent on actual performance and intervention effectiveness).
Workflow Integration: The prediction score output by the API could be integrated into the EHR system as a clinical decision support alert for care coordinators or physicians during discharge planning.
See the Strategic Impact Documentation for a detailed breakdown.
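The shape of an impact estimate like the one above can be made explicit as a chain of multiplications. The sketch below is not the calculation from the Strategic Impact Documentation: the function, its parameter names, and the reading of "additional at-risk patients" as a fraction of readmitted patients are all assumptions, and every input in the usage line is a placeholder rather than a project figure.

```python
def illustrative_annual_savings(
    discharges: int,
    readmission_rate: float,
    extra_identified_fraction: float,
    intervention_success_rate: float,
    cost_per_readmission: float,
) -> float:
    """Chain the assumptions: discharges -> readmissions -> extra flagged -> avoided."""
    readmissions = discharges * readmission_rate
    extra_flagged = readmissions * extra_identified_fraction
    avoided = extra_flagged * intervention_success_rate
    return avoided * cost_per_readmission


# Placeholder inputs only; replace each with locally validated values.
estimate = illustrative_annual_savings(
    discharges=10_000,
    readmission_rate=0.12,
    extra_identified_fraction=0.10,
    intervention_success_rate=0.20,
    cost_per_readmission=10_000.0,
)
```

Writing the estimate this way shows how sensitive it is: halving the intervention success rate or the cost per readmission halves the result.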
Integrating fairness analysis and interpretability techniques like SHAP (SHapley Additive exPlanations) for responsible and transparent AI.
SHAP analysis provides insights into individual feature impacts on predictions, aiding clinical trust and understanding.
Fairness analysis setup in src/visualisation/. See Visualisations for SHAP plots and Ethical AI for fairness discussion.
Current state of the operational pipeline components
CI workflow (.github/workflows/) currently implements automated linting (black, flake8) and formatting checks triggered on push.
Test suite using pytest established in tests/.
Full CI/CD automation with model registry integration and deployment strategies (Shadow/Canary) is outlined in Future Work.
Health check endpoint (/health) and request logging. Design includes monitoring latency, errors, and throughput via standard tools (e.g., Prometheus/Grafana).
FastAPI service (api/).
Streamlit dashboard (dashboard/).
See Future Work Documentation for advanced MLOps design details.
Key insights from analysis and modelling on the MIMIC Demo dataset
Addressing class imbalance is crucial in healthcare. In the MIMIC demo subset, readmissions represent only ~12% of cases (a negative-to-positive ratio of approx. 7.2:1). We compared a Baseline, Class Weights, Random Oversampling, SMOTE, and Random Undersampling.
See src/models/imbalance_analysis.py for details.
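The relationship between the 7.2:1 ratio and the ~12% positive rate, and the weights that a class-weighting approach would assign, can be checked with a short calculation. The helper names are illustrative; the weighting rule shown is the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn applies for `class_weight="balanced"`.

```python
def positive_rate_from_ratio(negatives_per_positive: float) -> float:
    """Convert an N:1 negative-to-positive ratio into the positive class share."""
    return 1.0 / (1.0 + negatives_per_positive)


def balanced_class_weights(n_negative: int, n_positive: int) -> dict[int, float]:
    """Weights inversely proportional to class frequency: n / (n_classes * count)."""
    n_total = n_negative + n_positive
    return {0: n_total / (2 * n_negative), 1: n_total / (2 * n_positive)}


rate = positive_rate_from_ratio(7.2)  # about 0.122, i.e. ~12% readmissions
```

Under this heuristic the minority class weight exceeds the majority class weight by exactly the imbalance ratio, so each readmission counts about 7.2 times as much as a non-readmission in the loss.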
The PR curve illustrates the trade-off between precision (correct positive predictions out of all positive predictions) and recall (correct positive predictions out of all actual positives). A curve closer to the top-right indicates better performance.
Takeaway: This interactive visualisation shows SMOTE (green line) maintaining higher precision across a wider range of recall values compared to the baseline (grey dashed line), indicating its effectiveness in identifying readmissions without an excessive number of false alarms in this dataset.
Hover over the curves to explore thresholds. PR Area Under the Curve (AUC) summarises this trade-off.
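The definitions of precision and recall above translate directly into code. The sketch below computes the points of a PR curve and the step-wise area under it (average precision) from labels and scores, using only the standard library; the function names are illustrative and this is not the plotting code behind the interactive chart.

```python
from typing import Sequence


def precision_recall_points(
    y_true: Sequence[int], scores: Sequence[float]
) -> list[tuple[float, float, float]]:
    """Return (threshold, precision, recall) at each distinct score, high to low."""
    n_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only once all samples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((score, tp / (tp + fp), tp / n_pos))
    return points


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the PR curve: sum of precision * recall increment."""
    area, prev_recall = 0.0, 0.0
    for _, precision, recall in precision_recall_points(y_true, scores):
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```

As a point of reference when reading PR AUC values, a classifier with no skill scores roughly the positive rate, which is about 0.12 at a 7.2:1 imbalance.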
Understanding feature contributions using SHAP (SHapley Additive exPlanations) helps interpret model behaviour. The plot below shows the impact of features on the model's output (predicting readmission) for the baseline Logistic Regression model.
Takeaway: Features like gender (`gender_m`, `gender_f`) and marital status (`marital_status_single`) show distinct impacts. Red points indicate higher feature values, blue indicates lower. Points further from the centre line have a larger impact on pushing the prediction towards (right) or away from (left) readmission. This provides global insights into feature influence.
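For a linear model such as Logistic Regression, SHAP values have a simple closed form when features are treated as independent: each feature's contribution to the log-odds is its coefficient times the feature's deviation from its mean. The sketch below illustrates that formula; the function name and the coefficient values in the usage lines are hypothetical, and only the feature names come from the plot described above.

```python
from typing import Sequence


def linear_shap_values(
    coefficients: Sequence[float],
    x: Sequence[float],
    feature_means: Sequence[float],
) -> list[float]:
    """Per-feature contributions to the log-odds for a linear model,
    assuming independent features: phi_i = w_i * (x_i - mean_i)."""
    return [w * (xi - mu) for w, xi, mu in zip(coefficients, x, feature_means)]


# Hypothetical coefficients for gender_m and marital_status_single.
phi = linear_shap_values(
    coefficients=[0.5, -1.0],
    x=[1.0, 0.0],
    feature_means=[0.4, 0.3],
)
```

The contributions sum to the difference between this patient's log-odds and the average log-odds, which is the additive property that makes per-prediction explanations possible.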
Performance of the Time-Aware LSTM PoC compared against the LightGBM baseline on the same train/test split of the *synthetic* temporal data created for this demonstration.
Takeaway: On this synthetic dataset, the Time-Aware LSTM (green) outperforms the LightGBM baseline (red), most visibly in the PR curve (bottom plot), which is the more informative view for imbalanced data. The LSTM achieves a PR AUC of approx. 0.357 against LightGBM's 0.240, suggesting potential benefits from temporal modelling.
Takeaway: The LSTM training loss (left plot, blue) decreases steadily over epochs, while the test loss (red) stabilises, indicating successful learning without significant overfitting on this data. Test Receiver Operating Characteristic Area Under the Curve (ROC AUC) and PR AUC (centre and right plots, green) show performance improving and plateauing, demonstrating effective learning within the configured epochs.
Addressing bias and fairness in healthcare AI
Building responsible AI requires actively measuring and addressing potential biases and ensuring model transparency.
Evaluation Strategy: We plan to evaluate fairness using established metrics like Equalized Odds Difference across sensitive attributes (e.g., gender, ethnicity) to quantify disparities in model performance between subgroups.
Demo Data Limitation: Due to the statistical limitations of the small MIMIC demo dataset, displaying subgroup fairness metrics could be misleading. Therefore, specific fairness plots (like Recall by Gender shown in the assets) are generated by the code but omitted here to uphold responsible reporting practices.
Mitigation/Monitoring Plan: While specific mitigation techniques (e.g., re-weighting) are planned for the full-scale implementation, the pipeline includes steps for generating these metrics (generate_fairness_plots.py) to enable continuous monitoring and ensure transparency around potential disparities when using larger datasets.
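The Equalized Odds Difference named in the evaluation strategy can be computed as the larger of two gaps between subgroups: the gap in true positive rate and the gap in false positive rate. The sketch below follows that definition with the standard library only; it is not the code in generate_fairness_plots.py, and the function names are illustrative. It raises a `ZeroDivisionError` when a subgroup has no positives or no negatives, which is a real risk on a dataset as small as the demo.

```python
from typing import Sequence


def _rates(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) for one subgroup."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg


def equalized_odds_difference(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[str]
) -> float:
    """Largest between-group gap in TPR or FPR (0 means parity)."""
    tprs, fprs = [], []
    for g in sorted(set(groups)):
        idx = [i for i, value in enumerate(groups) if value == g]
        tpr, fpr = _rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
        tprs.append(tpr)
        fprs.append(fpr)
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

A value near 0 indicates that the model's error rates are similar across subgroups; with only a handful of patients per subgroup, the metric can swing widely from one split to the next.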
SHAP Integration: We integrated SHAP (SHapley Additive exPlanations) to provide both global feature importance insights (understanding overall model behaviour) and local, per-prediction explanations.
Value Proposition: SHAP values allow clinicians to understand the key drivers (e.g., specific lab results, length of stay) contributing to an individual patient's predicted risk score, fostering trust and enabling more informed decision-making.
Global SHAP analysis provides feature importance insights (see Visualisations section).
For a detailed discussion of our ethical framework, see the Ethical Considerations documentation.
Leveraging cutting-edge AI for deeper insights
This PoC establishes a strong foundation. Future iterations could incorporate advanced AI techniques:
See the Future Work Documentation for more detailed plans on these enhancements.