COVID-19 surge preparedness with AI, genomic surveillance

New and emerging variants of the SARS-CoV-2 virus continue to pose a threat to the health of populations across the globe. Counting the unique and observable changes in the sequences available until January 2022 shows that more than 6,000 mutations have accumulated in the spike gene of the virus. Initial studies claimed SARS-CoV-2 to be a fast-mutating virus, which may make it fitter over time. More recent studies have estimated moderate substitution rates of 0.00067 substitutions per site per year for the whole genome and 0.00081 for the spike gene. A preprint claimed that the fitness of the SARS-CoV-2 virus is increasing because of the natural phenomenon of purifying selection of the spike protein.
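To give a feel for what these per-site rates mean, the short sketch below converts them into expected substitutions per year along a single lineage. The lengths used (29,903 bases for the reference genome and 3,822 bases for the spike gene) are assumptions taken from the commonly used Wuhan-Hu-1 reference and are not stated in this article; the function name is illustrative.

```python
# Assumed reference lengths (Wuhan-Hu-1), not given in the article.
GENOME_LENGTH = 29_903  # nucleotides in the whole genome
SPIKE_LENGTH = 3_822    # nucleotides in the spike gene

def expected_substitutions_per_year(rate_per_site: float, n_sites: int) -> float:
    """Expected substitutions per year = per-site rate x number of sites."""
    return rate_per_site * n_sites

genome_per_year = expected_substitutions_per_year(0.00067, GENOME_LENGTH)
spike_per_year = expected_substitutions_per_year(0.00081, SPIKE_LENGTH)

print(f"whole genome: about {genome_per_year:.0f} substitutions per year")
print(f"spike gene:   about {spike_per_year:.0f} substitutions per year")
```

Under these assumptions, a lineage would pick up roughly 20 substitutions a year across the genome, about three of them in the spike gene.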

Open-data initiatives

These observations, when seen in the light of newer waves of infection such as the Delta and Omicron surges, highlight the crucial need to use genomic features to predict surges. Consortia and open-data initiatives across the globe, such as the Indian SARS-CoV-2 Genomics Consortium (INSACOG) and GISAID, have been instrumental in the identification of new variants. However, most of the inferences from genomic surveillance have so far been retrospective in nature, explaining the past rather than predicting the future.

Most of the currently available predictive models are variations of standard epidemiological models such as compartmental or agent-based models, which predict the future trends based on the reported infections and deaths. These models do not incorporate features from the virus sequences in a predictive manner. Our recently published model, Strainflow, plugs this gap by taking a sequence-driven approach to predict future surges using a novel artificial intelligence pipeline. This study was based on a simple hypothesis — virus sequences can be treated as documents that can be read like a book by natural language understanding (NLU) models. Further, the models can discover the underlying “grammar” patterns which are causally predictive of future surges.

The three-base codons in these sequences were treated as words in the document, with each nucleotide base as a letter. A caveat with NLU models is that they need millions of documents for training. Fortunately, GISAID plugged this gap for us. In my lab at IIIT-Delhi, we processed more than 2.8 million high-quality sequences from December 2019 to January 2022. These included data from 17 countries, including India. We experimented with several NLU models optimised for efficiently learning the “grammar” of the spike gene. The best model compressed the viral sequences into 36 dimensions, technically known as low-dimensional embeddings. Each of these 36 dimensions is a different cocktail mix of codon-level relationships. We proposed that some of these 36 cocktail mixtures may encode the patterns that make the virus spread faster.
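The idea of reading a sequence as a document of codon “words” can be sketched in a few lines. This is only an illustration of the tokenisation step, not the Strainflow pipeline itself: the function name and the short toy sequence are hypothetical, and the embedding model that would consume these tokens is not shown because the article does not name it.

```python
def codon_tokens(sequence: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping three-base codons.

    Trailing bases that do not fill a complete codon are dropped.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # length rounded down to a multiple of 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# Toy 12-base sequence used purely for illustration.
toy_sequence = "ATGTTTGTTTTT"
print(codon_tokens(toy_sequence))  # ['ATG', 'TTT', 'GTT', 'TTT']
```

Each sequence then becomes a list of codon “words”, which is the form of input that word-embedding style language models generally expect.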

Clear temporal patterns

We then trained models to extract the dimensions that predicted the number of cases in the 17 countries with a two-month lead time. To our surprise, our models were accurate in predicting the surges during the Delta and Omicron waves, as well as the current surges we are seeing in India. We examined the features of these cocktail mixes and found that the diversity of some of them, captured technically as a quantity known as entropy, showed clear temporal patterns predictive of surges. The entropy was seen to start dipping sharply up to two months before cases surged. Although we have not tested this, we propose that it is biologically plausible because virus strain evolution follows an explore-exploit pattern: exploring the possible combinations of mutations and then exploiting some of them to establish a dominant strain.
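The link between diversity and entropy can be illustrated with a minimal Shannon entropy calculation. This is a sketch under simplifying assumptions: it uses discrete labels (letters standing in for distinct sequence patterns seen in a month), whereas the article does not specify how entropy was computed over the 36 embedding dimensions. The function name and the toy monthly samples are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(items) -> float:
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# "Explore" phase: many patterns circulate in similar proportions.
exploring_month = ["A", "B", "C", "D"]
# "Exploit" phase: one pattern has started to dominate.
exploiting_month = ["A", "A", "A", "B"]

print(round(shannon_entropy(exploring_month), 3))   # 2.0
print(round(shannon_entropy(exploiting_month), 3))  # 0.811
```

In this toy example, entropy falls from 2.0 bits to about 0.81 bits as one pattern takes over, which is the kind of dip described above.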

A natural question then arises: can we predict the actual number of cases using Strainflow? Although possible in the future, our current models do not support this as exact case numbers may be dependent on many other factors such as social behaviors, vaccination, mask mandates and lockdowns. Although the Strainflow approach does not predict the actual number, we do get an accurate sense of how sharp the surge might be. The Strainflow model has proven to be effective for predicting whether there is a likely surge with a two-month lead time, which could help the healthcare systems to be prepared. The Strainflow model was used for creating an interactive dashboard, updated monthly and publicly available at (http://strainflow.tavlab.iiitd.edu.in/).

De-novo approach

A key feature of Strainflow was its data-driven, de-novo approach, which does not need an expert understanding of what the individual mutations may entail. Although experts can make guesstimates, our model uses complex mixtures for predictions. This is a difficult task for the human brain, which cannot hold many pieces of information in attention at once. The team involved in creating Strainflow included undergraduate, graduate, doctoral and post-doctoral students and represented a diverse set of backgrounds in computer science, biology and medicine.

A key factor for the successful construction and execution of the Strainflow model and dashboard was the support and encouragement received from the City Knowledge Innovation Cluster Delhi Research Implementation and Innovation (CKIC-DRIIV), an initiative of the Office of the Principal Scientific Adviser.

What does the future hold for AI, models and COVID-19 preparedness? Our strongest learning from this exercise has been the power of open data and interdisciplinary thinking. Strainflow is just one piece in the puzzle for solving the surge-preparedness of COVID-19 which could lay the foundation for general infectious disease preparedness. Imagine the future as multiple signals such as Strainflow, traditional epidemiological models, testing results and non-conventional sources such as mobility, demographics, social media signals feeding into a unified model for preparedness. This is exactly our current endeavour, with projects geared towards district level preparedness in collaboration with the ICMR and understanding the relationship of mobility patterns and strain emergence in collaboration with Meta (formerly Facebook) as a part of their Data for Good initiative.

The famous physicist Niels Bohr remarked, “Prediction is very difficult, especially if it's about the future!” But we hope to change that, especially for a smaller subset of important problems in healthcare, one model at a time.

(Tavpritesh Sethi, Associate Professor, Department of Computational Biology, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), Faculty Lead, TavLab, and Founding Head, Centre of Excellence in Healthcare.)
