We developed and validated ML models for burn sepsis prediction using a retrospective dataset. The database was derived from a previous multicenter (5-site) randomized controlled trial (ClinicalTrials.gov #NCT01140269) evaluating the clinical impact of molecular pathogen detection in burn sepsis patients, in which vital signs and laboratory data were recorded daily for the duration of each patient’s intensive care unit (ICU) stay (Supplemental Data Fig. S1)10. Vital signs and laboratory data were collected consistently per study protocol. Contributing sites were ABA-verified burn centers in the United States. Human subjects approval was obtained at each study site and through the United States Army Human Research Protection Office. From this study population, we used the data to predict sepsis with traditional statistics and with ML built through traditional non-automated programming, and then compared both against our novel automated ML approach. Study methods are described below.

Study population

The database consisted of 211 adult (age ≥ 18 years) patients with ≥ 20% total body surface area (TBSA) burns enrolled across five United States academic hospitals. Patients with non-survivable injuries or lacking the ability to provide informed/surrogate consent were excluded. Relevant medical information, including patient demographics, vital signs (i.e., heart rate, respiratory rate, systolic/diastolic blood pressure, mean arterial pressure, central venous pressure), laboratory results (i.e., blood gas indices, complete blood count, chemistry panels, coagulation status, and microbiology results), Glasgow Coma Scale (GCS) score, medical/surgical procedures (e.g., surgery, intravascular line placement/removal), mechanical ventilator settings, and prescribed antimicrobial medications, was recorded daily over the course of each patient’s ICU stay. Outcome measures, including sepsis status and mortality, were also recorded. Sepsis status was based on the 2007 ABA Consensus Guidelines1. The study population included patients with respiratory, urinary tract, soft tissue, and/or bloodstream infections. Recorded data variables are outlined in Table 2 and were used for curating the data to determine sepsis status, for performing traditional statistical analyses, and for ML model development and generalization.

Table 2 Daily recorded variables for enrolled subjects.

Traditional machine learning method

Machine learning sepsis algorithms were first developed using our exhaustive “traditional” non-automated ML approach8,11. The process entailed manually selecting various feature-set combinations, aided by a select-percentile technique (ANOVA F-classification) applied to the original dataset, followed by building a large number of models with various supervised algorithms. The five supervised ML algorithms employed for this task were: (a) logistic regression (LR), (b) k-nearest neighbor (k-NN), (c) random forest (RF), (d) support vector machine (SVM), and (e) a multi-layer perceptron deep neural network (DNN). Scikit-learn version 0.20.2 was used to construct the models, as in previous studies11. Cross-validation and hyperparameter tuning studies were also performed for the LR, RF, k-NN, SVM, and DNN methods using Scikit-learn’s cross-validation and grid-search tools. This technique, together with the grid-search hyperparameter variations, allowed us to develop and compare a large number (49,940) of unique models based on various feature-set combinations (identified from select-percentile feature selection) across the five ML methods/algorithms. This approach enabled us to empirically assess and compare all models and to identify the best performing ML model for a given set of unique hyperparameters and feature-set combinations.
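The combination of select-percentile feature selection with grid-search hyperparameter tuning described above can be sketched with Scikit-learn. This is a minimal illustration on synthetic placeholder data (the trial dataset, variable counts, and the exact hyperparameter grids are not reproduced here), showing one of the five algorithms (LR); the other algorithms follow the same pattern with their own grids.

```python
# Sketch of the "traditional" approach: select-percentile feature
# selection followed by a grid search over hyperparameters for one
# algorithm (logistic regression shown). Data are synthetic
# placeholders, not the trial dataset.
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))           # 30 daily variables (placeholder)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in sepsis labels

pipe = Pipeline([
    ("select", SelectPercentile(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Each percentile/hyperparameter combination yields a distinct model,
# which is how thousands of unique models arise across five algorithms.
grid = GridSearchCV(pipe, {
    "select__percentile": [25, 50, 75, 100],
    "clf__C": [0.01, 0.1, 1.0, 10.0],
}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Cross-validation inside `GridSearchCV` scores every feature-percentile/hyperparameter combination, so the best model is chosen empirically rather than by manual inspection.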

Automated ML (auto-ML) platform

In addition to the above manual ML approach, we developed the Machine Intelligence Learning Optimizer (MILO) platform to perform a similar task in a fully automated fashion (Fig. 1)12. The MILO infrastructure includes an automated data processor, data feature selectors (ANOVA F select-percentile and RF feature-importances selectors) and a data transformer (e.g., principal component analysis), followed by custom supervised ML model building that pairs custom hyperparameter ranges with search tools (i.e., grid search and random search) to identify the optimal hyperparameter combinations for DNN, LR, naïve Bayes (NB), k-NN, SVM, RF, and XGBoost gradient boosting machine (GBM) techniques. Additionally, MILO supports the addition of other algorithms and hyperparameter combinations, which allowed us to easily add NB and GBM to the analysis.

Figure 1

Machine intelligence learning optimizer: the MILO auto-machine learning (ML) infrastructure begins with two datasets: (a) a balanced dataset (Data Set 1) used for training and validation, and (b) an unbalanced dataset (Data Set 2) used for generalization. MILO removes missing values, then assesses and scales the data. Unsupervised ML is then used for feature selection and engineering. The generated models are trained and then tested on Data Set 1 during the supervised ML stage. Primary validation is then performed using Data Set 1, followed by generalization using Data Set 2. Selected models can then be deployed as predictive model markup language (PMML) files.

Following the training and validation of models, MILO executes an automated performance assessment with results exported for user viewing. In the end, MILO employs a combination of unsupervised and supervised ML methods drawn from a large set of algorithms, scalers, and feature selectors/transformers to create more than 1,000 unique pipelines (i.e., sets of automated machine learning steps), ultimately generating > 100,000 models that are then statistically assessed to identify optimal algorithms for use (Supplemental Data Table S1).
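The way scaler, feature-selector, and algorithm choices multiply into pipelines, and pipelines into models, can be illustrated with a short enumeration. The component names and grid sizes below are placeholders for illustration only; MILO's actual search space is larger.

```python
# Illustrative count of how scaler x selector x algorithm choices
# multiply into pipelines, and hyperparameter grids into models.
# All names and grid sizes are placeholders, not MILO internals.
from itertools import product

scalers = ["unscaled", "standard", "minmax"]
selectors = [f"anova_percentile_{p}" for p in (10, 25, 50, 75, 100)] + [
    "rf_importances", "pca"]
# Hypothetical number of hyperparameter combinations per algorithm.
grid_sizes = {"LR": 8, "kNN": 12, "SVM": 16, "RF": 24,
              "NB": 2, "GBM": 36, "DNN": 48}

pipelines = list(product(scalers, selectors, grid_sizes))
total_models = sum(len(scalers) * len(selectors) * n
                   for n in grid_sizes.values())
print(len(pipelines), "pipelines,", total_models, "models")
```

Even these modest placeholder grids yield thousands of candidate models, which is why the full platform reaches six-figure model counts.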

For this study, we imported the trial data into MILO using sepsis status as the outcome measure for analysis. The following functions are then performed automatically by MILO. First, rows with any missing values are removed (e.g., laboratory results that were not performed for a given day). Next, the data are assessed to ensure that model training and the initial validation step are based on a balanced dataset. A balanced dataset is used for training because the system was built to work with small amounts of training data and because accuracy was a scoring discriminator; measured accuracy can then be better assessed against a lower null-accuracy baseline, which ultimately minimizes overfitting. This balanced dataset is then split into training and validation test sets in an 80–20 split, respectively. Since many algorithms benefit from scaling, in addition to the unscaled data, the training dataset also underwent two possible scaling transformations (i.e., standard scaler and min-max scaler). To evaluate the effect of various features within the datasets, combinations of statistically significant features were then selected to build new datasets with fewer or transformed features. The features selected in this step are derived from several well-established feature-selection techniques, including ANOVA F-statistic select percentile and RF feature importances, or are transformed using a principal component analysis approach9. A large number of ML models are then built from these datasets with optimal parameters across a large number of pipelines, each combining algorithms (i.e., DNN, SVM, NB, LR, k-NN, RF, and GBM), scalers, hyperparameters, and feature sets. All pipelines generated by MILO undergo generalization assessment regardless of their performance. For model validation, MILO creates and assesses hundreds of thousands of models.
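The automated preprocessing steps (missing-value removal, class balancing, and the 80–20 split) can be sketched as follows. The data here are synthetic placeholders, the balancing shown is simple majority-class downsampling assumed for illustration, and the sizes do not reflect the trial dataset.

```python
# Minimal sketch of the preprocessing sequence on synthetic data:
# (1) drop rows with missing values, (2) balance classes by
# downsampling the majority class, (3) make an 80-20 split.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
X[rng.random(300) < 0.1, 0] = np.nan     # simulate missing lab values
y = (rng.random(300) < 0.3).astype(int)  # imbalanced stand-in outcome

# 1) Remove any row containing a missing value.
keep = ~np.isnan(X).any(axis=1)
X, y = X[keep], y[keep]

# 2) Balance: downsample the majority class to the minority count,
#    lowering the null-accuracy baseline to 50%.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
n = min(len(pos), len(neg))
idx = np.concatenate([pos[:n], neg[:n]])
Xb, yb = X[idx], y[idx]

# 3) 80-20 split for training vs. primary validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    Xb, yb, test_size=0.2, stratify=yb, random_state=0)
print(len(X_tr), len(X_val))
```

With equal class counts, a model must beat a 50% null accuracy to look good, which is the rationale given above for training on balanced data.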
All models for each category are then identified and passed to the next phase of the software pipeline for generalization assessment. Machine learning model performance data are then tabulated by MILO and reported as clinical sensitivity, specificity, accuracy, F1 score, receiver operating characteristic (ROC) curves, and reliability curves. Finally, to evaluate whether the ABA Consensus Guidelines1 and Sepsis-3 criteria4 are compatible with ML applications, MILO algorithms were generated using their respective parameters (ABA Consensus Guidelines: body temperature, respiratory rate, heart rate, platelet count (PLT), and glucose; Sepsis-3: respiratory rate, PaO2/FiO2, GCS, PLT, MAP, and total bilirubin) and compared against models using optimized features from the dataset. American Burn Association Consensus Guideline1 criteria such as insulin rates/resistance, intolerance to enteral feedings, abdominal distension, and uncontrollable diarrhea were not available in the study dataset.
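The tabulated performance metrics all derive from a model's confusion-matrix counts. The counts below are illustrative numbers, not results from this study, and simply show how each reported metric is computed.

```python
# How the reported metrics relate to confusion-matrix counts
# (true/false positives and negatives). Counts are illustrative only.
tp, fp, tn, fn = 80, 10, 90, 20

sensitivity = tp / (tp + fn)               # fraction of septic cases caught
specificity = tn / (tn + fp)               # fraction of non-septic cases cleared
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(sensitivity, specificity, accuracy, round(f1, 4))
```

Sensitivity and specificity summarize performance per class, while the F1 score balances precision against sensitivity; ROC and reliability curves extend these point metrics across decision thresholds.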

Traditional statistical analysis

JMP software (SAS Institute, Cary, NC) was used for statistical analysis. Descriptive statistics were calculated for patient demographics. Data were also assessed for normality using the Ryan-Joiner test. Continuous parametric variables were analyzed using the two-sample t-test, while discrete variables were compared using the non-parametric chi-square test. For continuous non-parametric variables, the Mann–Whitney U test was used as appropriate. Multivariate LR was used to determine predictors of sepsis, with age and burn size serving as covariates and 95% confidence intervals (CI) reported. Repeated-measures ANOVA was used for time-series data. A P-value < 0.05 was considered statistically significant, and ROC analysis was also performed to compare sepsis biomarker performance. Bootstrapping (minimum of 2,500 bootstrap samples) via JMP software was employed to calculate 95% CI for the area under the ROC curves.
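The percentile-bootstrap approach to a 95% CI for an area under the ROC curve can be sketched in Python as an analogue of the JMP procedure. The labels and marker scores below are synthetic placeholders; only the resampling logic mirrors the method described.

```python
# Sketch of a percentile-bootstrap 95% CI for an ROC AUC using
# 2,500 resamples, analogous to the JMP bootstrap described above.
# Labels and marker scores are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200)                   # stand-in sepsis labels
scores = y + rng.normal(scale=1.0, size=200)  # informative stand-in biomarker

aucs = []
for _ in range(2500):
    idx = rng.integers(0, len(y), len(y))     # resample with replacement
    if len(np.unique(y[idx])) < 2:            # AUC needs both classes
        continue
    aucs.append(roc_auc_score(y[idx], scores[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])     # percentile 95% CI
print(round(lo, 3), round(hi, 3))
```

Each resample recomputes the AUC on a same-size draw with replacement, and the 2.5th and 97.5th percentiles of the resulting distribution form the interval.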