Too Much and Too Little Data

Artificial Intelligence has the potential to be an essential tool against the COVID-19 pandemic. However, as Georgios Petropoulos at Bruegel concludes, “AI systems are still at a preliminary stage, and it will take time before the results of such AI measures are visible.” The current use of AI is limited on the one hand by a lack of data and on the other by too much data. There is too little historical data with which to train AI models, while at the same time there are the potential problems of big data hubris, non-adjustment of algorithms, and a deluge of outlier data that needs to be evaluated before eventually being put through clinical trials.

Firstly, more new training data on COVID-19 is needed. This requires more openness and sharing of information, as well as more collaborative and multidisciplinary research, to improve the ability of AI.

So far, there has been promising progress, with a number of notable activities recognizing the importance of building and sharing datasets and information about the epidemic. One of the first was the World Health Organization’s (WHO) Global Research on Coronavirus Disease database, which links to other similar initiatives.

Kaggle, a data science competition platform, has issued a data competition based on this data, the “COVID-19 Open Research Dataset Challenge”. Zindi, Africa’s largest data competition platform, has also launched a competition to “accurately predict the spread of COVID-19 around the world over the next few months”.

Google has made COVID-19 Public Datasets available on its Cloud (until 15 September 2020), and Amazon has launched a public AWS COVID-19 data lake, which it describes as “a centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel corona virus (SARS-CoV-2) and its associated illness, COVID-19”.

It is not only the large tech companies, publishers, and universities (such as UC Berkeley) that are promoting open access to data and scientific literature on COVID-19, but also smaller start-ups and NGOs. For instance, Newspeak House, a UK-based independent residential college, has started a crowdsourcing initiative, the Coronavirus Tech Handbook, to which it has invited the public to contribute.

As the pandemic progresses and dominates the news and social media, it generates so much big data noise and outlier data that algorithms risk being overwhelmed. Scientists also have to deal with a deluge of scientific papers and new data. More than 100 scientific articles on the pandemic are now published daily. This potential information overload is where data analytics tools can play an important role. One such initiative is the COVID-19 Evidence Navigator, which provides computer-generated evidence maps of scientific publications on the pandemic, updated daily from PubMed.

Gruenwald et al.’s COVID-19 Evidence Navigator, 1st April 2020


AI isn’t yet playing a significant role in the fight against COVID-19, at least from the epidemiological, diagnostic and pharmaceutical points of view. Its use is limited. The creation of unbiased time series data for AI training is essential. The growing number of international initiatives in this regard is encouraging; however, more diagnostic testing is imperative, not only to provide training datasets that get AI models operational, but also to manage the situation more effectively and reduce its cost in terms of human lives and economic damage.

More diagnostic testing will help to eventually halt the pandemic, limit the economic damage from lockdowns, and avoid a rebound once restrictions are relaxed. At present, we simply do not know how many people are infected. A study in Science suggests that 86 percent of all infections may be undocumented. If this is the case, then a rebound of the pandemic is highly likely. Thus, overcoming the limited data on who is infectious is critical.

Finally, data is pivotal to whether AI will be an effective tool against future epidemics and pandemics. The fear is, as already mentioned, that public health concerns will outweigh data privacy concerns. Governments may want to continue the extraordinary surveillance of their citizens long after the pandemic is over. Thus, concerns about the erosion of data privacy are justified.

Further discussion of the legal and ethical dimensions of data management falls outside the scope of this article.

Important links and references: