Image from Gebauer Company

Build and Run Machine Learning Projects in Just 5 Steps

As innovations such as machine learning, cloud computing, and robotic process automation continue to reshape the health sector, there has been a sustained push to make structured (e.g. electronic health records) and unstructured (e.g. medical images and biosignals) healthcare data accessible to the public. Since many individuals and organizations in industry and academia are attempting to build learning models on this data, it is vital to understand how to design a machine learning workflow that is tailored to the type of data and problem you are working with. A well-designed workflow yields an organized, efficient product that delivers the desired results.

Step 1: Understand the Problem

Before wrangling the data and building your model, it is important to define your problem. Most inexperienced data scientists want to get their hands on data quickly, perform some basic EDA, and only then settle on a problem they would like to solve. In both industry and academic settings, a well-defined problem and plan of operation lay the foundation for the rest of the project. Some domain expertise is also important, for example to know whether you should expect balanced classes or should be looking for outliers.


There are three overarching categories of machine learning problems in the healthcare space: diagnosis, prediction, and recommendation. Although there are many other types of problems in machine learning, these three cover most healthcare applications.


Diagnosis

The most common application in healthcare ML is diagnosis, which is framed as a classification problem. This supervised learning approach determines whether a patient has a specific disease or illness given a set of features that describe their symptoms. The features can be presented in the form of tabular data, medical images, text, or signals. In some cases, the objective is to distinguish between two classes (i.e. binary classification); in others, multiple classes are taken into consideration (i.e. multi-class classification).


Prediction

Another supervised learning approach, prediction seeks to answer questions about quantities, likelihoods, and other continuous outcomes. As with diagnosis, a prediction model is fit to training data in order to produce the best possible estimate. This category can be expanded further into topics like survival analysis, linear regression, and time series forecasting.


Recommendations

Recommendation has been gaining popularity over the years. It utilizes the power of combinatorics to suggest items of interest to users of a given system. Health recommender systems are able to suggest the best plan of action given a patient’s symptoms and circumstances. Recommendations can range from which medication to take to which doctor to see or which hospital to visit. It is a very powerful tool, but it is evaluated with ranking metrics rather than the classification and regression metrics used for the categories above.

Image by ScienceSoft

Step 2: Gather the Data

Once the problem is defined and the plan is set in motion, it is time to gather the appropriate data. The healthcare sector generates a tremendous amount of data every day. Clinical data is a staple resource for most health and medical research. It is collected either during the course of ongoing patient care or as part of a formal clinical trial program. The most common example is the electronic health record (EHR), which collects information about the patient’s health from a number of sources. An EHR includes test results, clinical observations, diagnoses, current health problems, medications taken by the patient, and the procedures they underwent. Likewise, text and image data are beginning to play a major role in deep learning applications. Examples include medical images, handwritten prescriptions, and physician notes.

Since the healthcare data ecosystem is extremely large and complex, whatever data you decide to use may come with a high storage requirement. Before you simply download a random CSV off a website, be sure to check the size of the files you are using. The smart approach is to utilize the cloud. Many companies provide cheap, user-friendly software-as-a-service (SaaS) offerings that can deploy a cloud database. From there, all you have to do is query the data from whichever IDE you are using.

Step 3: Exploratory Data Analysis (EDA) & Pre-Processing

Data Cleaning

Since many EHR systems are still created and updated manually, there is room for human error and data quality issues. Spending time to clean your data will end up saving you a lot of time dealing with processing, training, testing, and evaluation issues. A proper data cleaning pipeline includes preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. Many beginner data scientists simply remove data that is not clean. Instead, attempt to manipulate and augment the data to preserve as much as you can.
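As a rough illustration of repairing records rather than discarding them, here is a minimal pandas sketch. The table, its column names, and the choice of median imputation are all assumptions made up for this example, not part of any real EHR schema, and the right repair strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical EHR-style records; every column name and value is invented.
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "sex": ["F", "m", "m", "Female", "M"],
    "age": [34, 51, 51, np.nan, 47],
    "systolic_bp": [120.0, 135.0, 135.0, 128.0, -1.0],  # -1 used as a "missing" sentinel
})


def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, standardize labels, and impute instead of dropping rows."""
    out = df.drop_duplicates().copy()
    # Improperly formatted: collapse "F", "Female", "m", ... to a single letter.
    out["sex"] = out["sex"].str.strip().str.upper().str[0]
    # Incorrect: a non-positive blood pressure is impossible, so mark it missing.
    out.loc[out["systolic_bp"] <= 0, "systolic_bp"] = np.nan
    # Incomplete: fill gaps with the column median (one assumed strategy of many).
    for col in ["age", "systolic_bp"]:
        out[col] = out[col].fillna(out[col].median())
    return out.reset_index(drop=True)


cleaned = clean_records(raw)
print(cleaned)
```

Only the exact duplicate row is removed here; the rows with a missing age or an impossible blood pressure reading are kept and repaired.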


Exploratory Data Analysis

Before you begin building your ML model, it is important to create informative visualizations, statistical tests, and relationship matrices tailored to your data and the problem at hand. EDA is a window into key characteristics like class imbalance, feature distributions, and correlation coefficients. While some see it as a waste of time, successful workflows include visualizations that support the model’s output.

Correlation Matrix
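Two of the checks mentioned above, class imbalance and correlation coefficients, can be sketched in a few lines of pandas. The data frame below is invented for illustration, and `has_disease` is an assumed name for the target column.

```python
import pandas as pd

# Hypothetical patient table; values are invented for illustration.
df = pd.DataFrame({
    "age": [30, 40, 50, 60, 70, 80],
    "resting_hr": [60, 64, 70, 72, 78, 82],
    "has_disease": [0, 0, 0, 0, 1, 1],
})

# Class imbalance: share of each target class.
class_balance = df["has_disease"].value_counts(normalize=True)

# Pearson correlation matrix (the pandas default) over all numeric columns.
corr = df.corr()

# Features ordered by the strength of their correlation with the target.
target_corr = (
    corr["has_disease"].drop("has_disease").abs().sort_values(ascending=False)
)

print(class_balance)
print(corr.round(2))
print(target_corr.round(2))
```

A matrix like `corr` is what a correlation heatmap visualizes; the plotting step is left out to keep the sketch free of extra dependencies.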

Feature Extraction

Now that you know how the features relate to one another and to the target feature, it’s time to select the features that best capture the variance in the data without significantly increasing the model’s complexity. There are numerous methods for selecting the best features, but they are outside the scope of this article.

Step 4: Build & Train

Determining which model to go forward with is not an easy decision, and it requires a lot of trial and error. Every model is unique and requires a balance between complexity and efficiency. According to the No Free Lunch Theorem, no one model works best for all possible situations, so it is best to test as many as you can.

“No one model works best for all possible situations”

Even if a certain model worked for a previous project of yours, it does not mean that the same model will be the best choice for your current project. A data scientist is a SCIENTIST first and foremost, so it is part of our job to experiment with as many approaches as we reasonably can. Once you feel confident in your choice, it’s time to run it on the test set.
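One common way to run this kind of experiment is to compare several candidates with cross-validation on the training data and touch the test set only once at the end. The sketch below assumes scikit-learn and a synthetic dataset from `make_classification`; the three candidate models and the ROC AUC scoring choice are illustrative, not a recommendation for any particular healthcare problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (an assumption for this sketch).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k_nearest_neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare candidates with 5-fold cross-validation on the training split only.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Refit the best candidate on all training data, then evaluate on the test set once.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)

print(cv_scores)
print(best_name, round(test_accuracy, 3))
```

Keeping the test set out of the comparison loop is the design choice that matters here: it means the final score is an estimate on data none of the candidates were selected on.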

Supervised Learning

By far the most common category, supervised learning (SL) models are trained on prior data with accurate labels. As a result, the model learns how the features are distributed within each class. Some common SL algorithms include Logistic Regression, K-Nearest Neighbors, Support Vector Machines, and Random Forest.

Unsupervised Learning

Using unlabeled data, an unsupervised learning (UL) model is forced to find natural associations within the data. This eliminates the need for labeled training examples; the model instead uses a mathematical process to deduce groupings, alternative representations, and hierarchies. Common UL approaches include K-Means, dimensionality reduction methods such as principal component analysis, Hierarchical Clustering, and Density-Based Clustering.
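As a small illustration of grouping without labels, the sketch below runs scikit-learn's `KMeans` on two synthetic, well-separated blobs. The data and the choice of two clusters are assumptions for the example; on real data the number of clusters usually has to be explored.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic, well-separated groups of points (invented data, no labels used).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# K-Means assigns each point to the nearest of k learned centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.round(2))
print(np.bincount(labels))
```

The cluster ids themselves are arbitrary; what the model recovers is the grouping, which still has to be interpreted with domain knowledge.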

Semi-Supervised Learning

A gray area between the previous two categories, semi-supervised learning (SSL) models use a mixture of labeled and unlabeled data to make classifications or predictions. The model needs to properly understand the structure of the labeled data in order to make sense of the unlabeled data. A common application for an SSL approach is fraud detection. Since fraud is difficult to find and we do not know in advance who is fraudulent, the model needs to understand the complexities of non-fraudulent entities in order to spot the abnormal ones.

Step 5: Evaluate the Model

Depending on the type of problem you have set out to solve, there are specific evaluation metrics that give insight into how well your model is doing and which hyperparameters might need to be tuned. Evaluating a model is a holistic process that is not based solely on achieving high accuracy. Other metrics may be more important depending on the problem you defined in Step 1 and the analysis you performed in Step 3.

Classification Metrics

  • Accuracy: Out of all my predictions, how many did I get right?
  • Precision: Out of all the predictions I made for a certain class, how many did I get right?
  • Recall: What proportion of a certain class was identified correctly?
  • Receiver Operating Characteristic (ROC) Curve: Plot showing the performance of the model at different classification thresholds. The area under the curve (AUC) summarizes the model’s overall performance across those thresholds.
ROC Curve with AUC values
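The classification metrics above can be computed by hand for a binary problem, which makes their definitions concrete. This is a plain-Python sketch with invented labels and scores and an assumed threshold of 0.5; the AUC is computed through its pairwise interpretation, the probability that a randomly chosen positive is scored above a randomly chosen negative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: true labels and model scores for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred))   # 0.625
print(precision(y_true, y_pred))  # 0.75
print(recall(y_true, y_pred))     # 0.6
print(roc_auc(y_true, scores))    # 11 of 15 pairs ranked correctly
```

Note how accuracy, precision, and recall all depend on the threshold, while the AUC is computed from the raw scores and does not.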

Regression Metrics

  • Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values
  • Root Mean Squared Error (RMSE): Square root of the MSE, expressed in the same units as the target
  • Mean Absolute Error (MAE): The average of the absolute differences between the predicted values and the actual values
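The three regression metrics differ only in how each error is transformed before averaging, which a short plain-Python sketch makes explicit. The values below are invented for illustration.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared errors; large errors are penalized more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


def mean_absolute_error(y_true, y_pred):
    """Average of absolute errors; every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented example, e.g. actual vs. predicted length of stay in days.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

print(mean_squared_error(y_true, y_pred))       # 1.5
print(root_mean_squared_error(y_true, y_pred))  # sqrt(1.5), about 1.22
print(mean_absolute_error(y_true, y_pred))      # 1.0
```

Here the single two-day miss pushes the RMSE above the MAE, which is the usual sign that a few large errors dominate.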

Ranking Metrics

  • Mean Reciprocal Rank (MRR): For each user, take the reciprocal of the rank of the first relevant item, then average across all users
  • Mean Average Precision (MAP): For each relevant item, compute the precision at the position of that item, average these values per user, then average across all users
  • Normalized Discounted Cumulative Gain (NDCG): Assuming that some items are more relevant than others, sum the relevance scores discounted by rank position and normalize by the score of the ideal ordering
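The ranking metrics above can be sketched in plain Python as follows. Each input list holds the relevance of the recommended items in ranked order (binary for MRR and MAP, graded for NDCG); the example lists are invented, and the NDCG uses the common log2 discount, which is one of several conventions.

```python
import math


def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance):
    """Mean of precision@k taken at each position k that holds a relevant item."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def dcg(relevance):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))


def ndcg(relevance):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal else 0.0


# Invented binary relevance of the ranked recommendations for three users.
users = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
mrr = sum(reciprocal_rank(u) for u in users) / len(users)
mean_ap = sum(average_precision(u) for u in users) / len(users)

print(round(mrr, 4))      # 0.6111
print(round(mean_ap, 4))  # 0.6389
print(ndcg([3, 3, 2, 1, 0]))  # ideal ordering, so 1.0
print(ndcg([3, 2, 3, 0, 1]))  # slightly below 1.0
```

MRR only looks at the first relevant item per user, while MAP and NDCG reward placing every relevant item high in the list.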


Developing and implementing a machine learning workflow for a healthcare setting is not a quick process, but I hope this article gives you a foundation for anticipating the obstacles you might come across. Keep in mind that even if you build your model and it works well, you still need to deploy it. Although deploying a machine learning system can be simple from a technical standpoint, healthcare regulations and controls require you to take certain measures to ensure standardization and quality assurance.

Designing a Healthcare Machine Learning Workflow was originally published in Towards Data Science on Medium.