Corey Jaskolski is the founder and CEO of Synthetaic, the leading synthetic data company for impossible AI.


One of the biggest miscalculations made around Covid-19 last year was the assumption that, because it was a novel virus, we were starting from a place of zero data. This caused a number of problems. As researchers worked to understand the significance of the limited data available at any given time, there was little clarity on what recommendations and steps should follow.

Instead, decision-makers at every level began the slow process of gathering data around the virus to inform next steps. The lack of data left many waiting for the moment when there would be enough of it to dictate responsible policy, regulations and treatments.

This paradigm of waiting for enough data was particularly challenging for developing AI models for applications such as Covid-19 screening, which are incredibly data-hungry. If we had instead leveraged synthetic data from the outset to build datasets early, we could have moved more quickly and with more certainty in understanding the virus and how to treat it effectively. Synthetic data could have improved decision-making in a time of global crisis.

The Danger Of Data-Sparse Models

On the website Papers with Code, there are hundreds of datasets that AI, machine learning and data science practitioners benchmark against. The site provides free open resources on machine learning papers, code and algorithm performance rankings. It’s a valuable site for data scientists, allowing them to test new algorithms and set benchmarks.

One popular dataset, ImageNet, contains more than a million images across 1,000 categories (e.g., dog, car and airplane). ImageNet is used to evaluate nearly every new AI algorithm. When a new model trained on a dataset like this achieves even 0.1% higher accuracy, everyone in the AI field knows it.

With massive, general image datasets like ImageNet, benchmarking a new algorithm and testing it reliably is easier for models that aim to detect or classify common objects, because each new algorithm can be compared against the previous state of the art. However, these existing datasets don’t help when the model is built to detect something novel or rare, such as Covid-19 chest X-rays in the early days of the pandemic.

To fill this gap, one valuable resource that emerged on the site last spring was the COVIDx dataset. This dataset was used to build COVID-ResNet, a deep-learning framework for the screening of Covid-19 from radiographs. The project, by Muhammad Farooq and Abdul Hafeez, aimed to develop a model to identify the presence of Covid-19 in chest X-rays and distinguish it from bacterial and viral pneumonia.

Understanding that research had shown certain abnormalities in the chest X-rays of Covid-19 patients, Farooq and Hafeez sought to use open-source, open-access data to develop deep learning models for differentiating Covid-19 cases from pneumonia cases. Medical data like this is often not shared with the research community, but the COVIDx dataset drew on X-ray images available from medical journals to create the image set used to develop the models.
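For readers who want a concrete sense of how such a model is typically built, the general recipe is to fine-tune an image classifier pretrained on a large dataset like ImageNet on the four X-ray classes. The sketch below is a hypothetical illustration of that recipe in PyTorch, not the COVID-ResNet authors' code; the folder path, model choice and training settings are assumptions.

```python
# Hypothetical sketch: fine-tune a pretrained ResNet to classify chest X-rays
# into four classes (normal, bacterial pneumonia, viral pneumonia, Covid-19).
# "xrays/train" is an assumed layout with one subdirectory per class.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # X-rays are single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("xrays/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

# Start from ImageNet weights and replace the final layer for our classes.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # a few epochs, purely for illustration
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```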

The limited images available, however, made this approach challenging. Although the dataset contains 1,203 normal chest X-rays and images from 931 patients with bacterial pneumonia and 660 patients with (non-Covid-19) viral pneumonia, it has images for only 45 Covid-19 patients. The models subsequently developed could only be benchmarked against this small dataset, and even as the media picked up on the approach, the dataset was so small that, for many of the models, the apparently successful results were shown to owe more to luck than to a well-trained AI.

COVIDx was a valuable start, but with only 45 Covid-19 patients and the typical practice of holding out 10% to 20% of the data, this set could only yield around five to 10 images for testing. At that size, any model that classifies even one extra image correctly by chance can appear up to 20 percentage points more accurate. However, if researchers instead ran these models against large synthetic datasets and held out more of the ground-truth data for validation and testing, we could start to reach the accuracy needed in these moments of acute diagnostic need.
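As a rough back-of-the-envelope check (a minimal sketch using only the counts cited above, not a result from the dataset's authors), the arithmetic looks like this:

```python
import math

# Minimal sketch: why a tiny held-out test set makes reported accuracy
# unstable. Counts mirror the COVIDx figures cited above; the 10%-20%
# splits are the typical holdout practice mentioned in the text.
covid_images = 45

for holdout_fraction in (0.10, 0.20):
    # Roughly five to nine Covid-19 images land in the test set.
    test_size = math.ceil(covid_images * holdout_fraction)
    # One image classified correctly (or not) by chance moves the
    # measured accuracy by a full 1 / test_size.
    swing_points = 100.0 / test_size
    print(f"{holdout_fraction:.0%} holdout -> {test_size} test images; "
          f"one image shifts accuracy by {swing_points:.0f} points")
```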

Anticipatory Modeling With Synthetic Data

All this highlights the need for us to be better data-prepared for the next global crisis so we can start on better footing with the tools for fast screening. The truth is that we needn’t wait on data when we can generate it digitally, and we have the capabilities to do that now. We just need the proper protocols in place to use this data effectively.

If such a pipeline had already been in place when the Covid-19 crisis began, it could have saved valuable time. Yet we never assumed we'd be so data-poor. As a result of this crisis, we at least have more capability to act quickly now. But we're still not optimally preparing for the other data-sparse inevitabilities ahead.

We need to figure out how to deal with sparse data because that's the world in which we're working. This doesn't just apply to virus screening, of course. It's relevant to any number of areas in which new challenges will arise before we have the real-world data to adequately form a response. For example, data sparsity presents unique challenges for rare diseases that we might be able to treat if we had more data, as well as for the challenges we face around climate change and conservation. Collecting the necessary data to address these emerging issues too often isn't functionally feasible, so we can't even begin working on potential solutions. Synthetic data allows us to test frameworks and set benchmarks ahead of a crisis and to move more quickly when the unexpected strikes.

When you develop tools with sparse data, you’re forcing yourself to solve the hard questions immediately, against a ticking clock. As we face increasingly new and difficult global challenges, having strong synthetic data capabilities will allow us to build effective responses more quickly, when time is paramount and lives are on the line.

