The problems of prediction

As the pandemic broke, Julie Shah, an AI researcher and roboticist at MIT, and Neel Shah, a physician at a major hospital and a public health researcher at Harvard, wrote in a Harvard Business Review article of April 2020 that this would be our most meaningful Big Data and analytics challenge yet. “We have the people. We have the data. We have the computational force. We need to deploy them now,” they wrote. In terms of the volume of data, COVID-19 is something like Ali Baba’s cave. Many data scientists, statisticians, computer analysts, and epidemiologists took up the fashionable exercise of predicting the trajectories of case and death counts, and the different waves of the pandemic and their peaks. Many of the predictions contradicted one another and eventually proved misleading, although most were made by academics at reputed universities and institutes. But why did such predictions fail?

A dubious track record

In a paper in the International Journal of Forecasting in August 2020, John P.A. Ioannidis, Sally Cripps, and Martin A. Tanner recalled that epidemic forecasting has a dubious track record, and that its failures became more prominent with COVID-19. They pointed out several important causes of such forecasting failures: poor input data, wrong modelling assumptions, high sensitivity of estimates, failure to incorporate epidemiological features, poor past evidence on the effects of available interventions, lack of transparency, errors, lack of determinacy, consideration of only one or a few dimensions of the problem at hand, lack of expertise in crucial disciplines, selective reporting, and so on.

The models used in most COVID-19 predictions were simple statistical models, like regression or time-series models, or standard epidemiological models such as SIR, SEIR, or their simple variants. These are well studied and are included in standard software packages. Thus, most COVID-19 data analysts did little beyond feeding the daily counts of cases and deaths into the computer; the software immediately produced forecasts along with fancy graphs.
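To see how little machinery such forecasts involve, consider a minimal SIR (Susceptible–Infected–Recovered) simulation. This is only an illustrative sketch: the parameter values below (transmission rate, recovery rate, population size) are invented, not fitted to any real COVID-19 data.

```python
# Minimal SIR epidemic model with a daily time step (Euler method).
# All parameter values below are illustrative, not fitted to real data.

def sir(beta, gamma, s0, i0, r0, days):
    """Simulate SIR dynamics; returns the daily (S, I, R) trajectory."""
    s, i, r = s0, i0, r0
    n = s0 + i0 + r0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / n  # mass-action transmission
        new_recoveries = gamma * i         # constant recovery rate
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Example: a population of 1,000,000 with one initial case,
# beta = 0.3 per day and a 10-day infectious period (gamma = 0.1).
traj = sir(beta=0.3, gamma=0.1, s0=999_999, i0=1, r0=0, days=200)
peak_day = max(range(len(traj)), key=lambda t: traj[t][1])
print(f"peak on day {peak_day} with ~{traj[peak_day][1]:,.0f} infected")
```

The epidemic curve such a model produces depends entirely on the assumed values of beta and gamma, which is precisely the weakness the article goes on to describe.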

Each of these models depends on underlying assumptions about the nature of the disease and the behaviour of the community concerned. The dynamics of a new and unknown disease, however, may be far more complicated and unpredictable, and if the assumptions are not satisfied in reality, the predictions can go badly wrong. The nature of COVID-19 was unknown from the beginning. Even now, there are many unknowns concerning the disease: new variants of the virus, the effectiveness and coverage of vaccines, and so on all play important roles. So which of the existing models is really ‘suitable’ for predicting COVID-19? It is difficult to answer.

Quality of data

Although waves of data are available, official death tolls from COVID-19 are likely to be a “significant undercount”, the World Health Organization (WHO) said in May, estimating that the true figure of direct and indirect deaths in many countries “would truly be two to three times higher”. The WHO estimated that COVID-19 deaths during 2020 in the European region, which has relatively reliable reporting systems, were double the reported number. An unknown but huge number of asymptomatic cases, inadequate and inefficient testing facilities, overwhelmed hospital infrastructure, social stigma, and other factors may be some of the reasons. The Centers for Disease Control and Prevention (CDC) in the U.S. even issued an official warning about the quality of data: “Counting exact numbers of COVID-19 cases is not possible because COVID-19 can cause mild illness, symptoms might not appear immediately, there are delays in reporting and testing, not everyone who is infected gets tested or seeks medical care, and there are differences in how completely states and territories report their cases.” The CDC even uses an ‘ensemble’ forecast, which combines many independently developed forecasts into one to improve prediction.
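The ensemble idea can be sketched very simply: pool several independent point forecasts and take a robust summary such as the median, which damps the influence of any single badly wrong model. The model names and numbers below are entirely made up for illustration; real ensembles, including the CDC’s, use more elaborate combination schemes.

```python
from statistics import median

# Hypothetical one-week-ahead forecasts from four independent models.
model_forecasts = {
    "model_A": 1200,
    "model_B": 950,
    "model_C": 1500,
    "model_D": 1100,
}

# A simple ensemble: the median of the individual forecasts.
# The median ignores how extreme the outlying forecast (1500) is,
# so one wildly wrong model cannot drag the combined estimate far.
ensemble = median(model_forecasts.values())
print(ensemble)  # 1150.0
```

The appeal of combining forecasts is that the individual models’ errors are partly independent, so the ensemble tends to be more reliable than any one of them.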

Thus, almost everywhere, the reality of COVID-19 prediction is restricted to forecasting official figures from official data, using a possibly wrong model. The magnitudes of error in both the model and the data are unknown. Policymakers may still need some such predictions to plan healthcare infrastructure. But it is unclear why many academics, who should understand that they are engaged in futile exercises, remain busy in the prediction business. The alarm or false sense of security caused by such wrong predictions is undesirable.

In their 2020 paper, Ioannidis, Cripps, and Tanner opined that some, but not all, of the problems of epidemiological forecasting can be fixed. Careful modelling of predictive distributions rather than focusing on point estimates, considering multiple dimensions of impact, and continuously reappraising models based on their validated performance may help.
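One way to report a predictive distribution rather than a point estimate is to run the model many times over plausible parameter ranges and quote quantiles of the outcomes. The sketch below reuses a crude SIR-style simulation; the parameter ranges are invented for illustration and the resulting interval reflects only the assumed parameter uncertainty, not data or model error.

```python
import random
import statistics

random.seed(0)

def simulate_peak_infected(beta, gamma, s0=999_999, i0=1, days=365):
    """Crude daily-step SIR run; returns the peak number infected."""
    s, i, r = s0, i0, 0
    n = s0 + i0
    peak = i
    for _ in range(days):
        new_inf = beta * s * i / n
        new_rec = gamma * i
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
        peak = max(peak, i)
    return peak

# Draw uncertain parameters from plausible (invented) ranges and
# summarise the distribution of peaks instead of quoting one number.
peaks = sorted(
    simulate_peak_infected(beta=random.uniform(0.2, 0.4),
                           gamma=random.uniform(0.08, 0.12))
    for _ in range(500)
)
lo = statistics.quantiles(peaks, n=20)[0]    # 5th percentile
mid = statistics.median(peaks)
hi = statistics.quantiles(peaks, n=20)[-1]   # 95th percentile
print(f"peak infected: median {mid:,.0f}, 90% interval [{lo:,.0f}, {hi:,.0f}]")
```

Reporting the interval alongside the median makes the forecast’s uncertainty visible, which is exactly what a lone point estimate hides.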

Atanu Biswas is Professor of Statistics at the Indian Statistical Institute, Kolkata

Sep 28, 2021
