Introduction
COVID-19, a novel coronavirus disease caused by SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2), continues to pose a critical and urgent threat to global health. The first case of COVID-19 was discovered in December 2019 in the Hubei province of the People’s Republic of China (UMBERS 2020). The disease then spread around the world, and in March 2020 the World Health Organization (WHO) declared it a pandemic, as the number of patients confirmed to have the disease had exceeded 750,178 in 144 countries. However, the number of infected people was probably much higher, and more than 36,398 people had died from COVID-19. In a short period, the epidemic spread to over 200 countries. The virus that causes COVID-19 spreads when an infected person comes in contact with a healthy individual, or when the infected person sneezes or coughs. As a result, the overall number of patients had increased to 425,659,334 around the world, with 351,561,123 deaths, as of 21st February 2022 (UMBERS 2020).1
This virus affects the upper and lower respiratory tract, with symptoms including cough, fever, weakness, shortness of breath, and loss of taste and smell. Based on the incubation periods of MERS (Middle East Respiratory Syndrome) and SARS (Severe Acute Respiratory Syndrome), an infected person develops signs within 2-14 days, and patients are at high risk of death. To confront the rapid spread of COVID-19, most countries resorted to complete lockdowns to control the outbreak, but at a high human and economic cost.2
With the massive loss of human life and destructive economic impact, the second wave of the pandemic presented a looming threat to society. Unlike India, South Korea was able to control the situation thanks to a good testing and tracing system. European countries also faced a second wave that was worse than the first.3 Figure 1 shows the total active COVID-19 cases, total recovered cases, and total deaths around the world.
Despite the protective measures taken by countries around the world, as of the time of writing this article (February 12, 2022), 425,659,334 total cases have been reported with 351,561,123 deaths. This shows that COVID-19 is highly contagious and spreads very rapidly. Most countries were slowly lifting lockdown and travel restrictions, so the risk of new infections was very high. Although the development of vaccines has reduced the chances of infection, overpopulation, limited medical facilities, and fear of vaccines are factors that will increase the chances of infection during the next waves. Hence, this paper aims to simulate the possible wave outbreak in countries around the world. Mathematical modeling is an efficient tool for studying the spread of a contagious disease, its persistence, and when the world might return to its earlier situation.4
Materials and Methods
This section covers the data set of the study, principal component analysis (PCA), the Pearson correlation coefficient, the neural network, and polynomial regression curve fitting. These mathematical tools were also used to predict the third wave in India. The present study covers COVID-19 prediction around the world, with 195 countries in consideration. The Web of Science collection database was searched for global literature regarding COVID-19 published between 2020 and 2023. The search terms "COVID-19," "Novel Coronavirus," "2019-nCoV," and "SARS-CoV-2" were used to find the pertinent publications. Bibliometric analysis of the articles was carried out using VOSviewer.
Dataset
This study uses data sets of the number of COVID-19 patients, the number of deaths, and the total number of recovered patients around the world up to 12th February 2022.
Principal component analysis (PCA)
PCA converts a large dataset in the form of a multidimensional matrix into a smaller matrix that is cheaper to compute with. It removes redundancy in the data by transforming the matrix into linearly uncorrelated variables. These variables are called principal components, and they explain the variation in the data. Similar components of the dataset are grouped together, which also helps in drawing insights from large multidimensional data.
Principal component analysis algorithm
Step 1: The entire dataset, containing p countries with values for q dates, is converted to a p × q matrix M.
Step 2: Calculate the covariance matrix Ω and its eigenvalues,
where β for the ith value is defined as
Step 3: Calculate the cumulative explained variance ratio η(t) associated with the tth sample, where λ is the eigenvalue of the eigenvector e
Step 4: To convert the p × q matrix into a p × t matrix, choose the eigenvectors for which η(t) > 0.95, where t is the number of eigenvectors chosen
Step 5: The final reduced dataset in terms of principal components is represented as
where Et is the set of t eigenvectors
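The equations belonging to Steps 2, 3, and 5 are not reproduced in the text above. One standard PCA formulation that is consistent with those steps is sketched below. It is an assumption, not necessarily the authors' exact notation: M_i is taken to be the ith row of M, Y is a name introduced here for the reduced dataset, and the eigenvalues are assumed to be sorted in decreasing order.

```latex
% Assumed standard formulation; M_i is the i-th row of M (a 1 x q vector).
\begin{align*}
\beta_i &= M_i - \frac{1}{p}\sum_{j=1}^{p} M_j,
&
\Omega &= \frac{1}{p}\sum_{i=1}^{p} \beta_i^{\top}\beta_i
\\
\eta(t) &= \frac{\sum_{k=1}^{t}\lambda_k}{\sum_{k=1}^{q}\lambda_k},
&
\lambda_1 &\ge \lambda_2 \ge \dots \ge \lambda_q
\\
Y &= M E_t
\end{align*}
```

With these definitions Ω is q × q, E_t is q × t, and Y is p × t, which matches the dimensions stated in Steps 1 and 4. Many implementations project the mean-centered rows β_i instead of M itself; the two choices differ only by a constant offset in each component.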
In PCA, features are extracted based on variance, and the reduced dataset retains the directions along which the original data vary most. This yields a far more compact, nonredundant feature matrix that is more useful and requires less computation. It is done by finding the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors with the largest eigenvalues account for the most variation in the dataset and are called the principal components.5
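The PCA procedure described above can be illustrated with a minimal NumPy sketch. The function name pca_reduce, the synthetic 20 × 60 matrix, and the choice to center the data before projecting are assumptions made for illustration; the authors' own implementation is not given in the text.

```python
import numpy as np

def pca_reduce(M, threshold=0.95):
    """Project a p x q matrix onto the fewest principal components whose
    cumulative explained variance ratio reaches `threshold`."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=0)              # center each date column
    cov = np.cov(centered, rowvar=False)       # q x q covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    eta = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
    t = int(np.searchsorted(eta, threshold)) + 1
    return centered @ eigvecs[:, :t], eta[:t]

# Hypothetical data: 20 "countries" x 60 "dates" (not the study's dataset).
rng = np.random.default_rng(0)
growth = rng.uniform(0.5, 2.0, size=(20, 1))
days = np.arange(1, 61)
M = growth * days**2 + rng.normal(0.0, 5.0, size=(20, 60))

reduced, eta = pca_reduce(M)
print(reduced.shape, eta)
```

The returned eta array holds the cumulative explained variance ratio for each retained component, so its last entry is the first value to reach the 0.95 threshold.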
Neural network
A neural network is a supervised machine learning model that minimizes a cost function to obtain the best-fit curve for regression models. Although generally used for classification, neural networks can also be trained to predict numerical values directly and hence can be used for regression. The data was extracted from https://www.worldometers.info/coronavirus/ by text mining with Beautiful Soup. This data was fed into the neural network to predict the trend of the growth of the curve. The neural network was implemented using the MLPRegressor of the scikit-learn package for Python, running under Anaconda.5 To predict the trend of COVID cases worldwide, a neural network of 500 hidden layers was used. The input data covered cases from January 3, 2020, to February 12, 2022.
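A minimal sketch of such a regression with scikit-learn's MLPRegressor is shown below, using the hyperparameter values reported in Table 3. Several details are assumptions rather than the authors' setup: the curve is synthetic, the inputs and targets are scaled to [0, 1], "500" is read as a single hidden layer of 500 units, and max_iter is reduced from the reported 19958 to keep the example fast.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for a cumulative case curve (not real data).
days = np.arange(200, dtype=float)
cases = 1e6 / (1.0 + np.exp(-(days - 100.0) / 20.0))

# Scale inputs and targets to [0, 1]; this scaling is an assumption of the sketch.
X = (days / days.max()).reshape(-1, 1)
y = cases / cases.max()

model = MLPRegressor(
    hidden_layer_sizes=(500,),   # assumption: one hidden layer of 500 units
    activation="relu",
    solver="adam",
    alpha=1e-6,
    learning_rate_init=0.0008,
    batch_size=32,
    tol=1e-6,
    max_iter=2000,               # reduced from 19958 to keep the sketch fast
    random_state=0,
)
model.fit(X, y)

pred = model.predict(X) * cases.max()   # undo the target scaling
r2 = r2_score(cases, pred)
print(f"R^2 on the training curve: {r2:.4f}")
```

Forecasting future days with this setup means evaluating the model at scaled inputs greater than 1, which is extrapolation outside the training range and should be interpreted with caution.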
Polynomial regression
Polynomial regression is a technique for fitting already available data with a polynomial curve. A sixth-degree polynomial model is used in this paper to fit the trend of cases. This gives a prediction of how the data will change with respect to time.6 A sample polynomial as shown in equation 1 is fitted to the available data to obtain the values of the parameters, viz. a, b, c, d, e, f, g.
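As an illustration, a sixth-degree fit can be sketched as follows, assuming the polynomial has the form a + bX + cX^2 + dX^3 + eX^4 + fX^5 + gX^6. The sketch uses NumPy's Polynomial.fit rather than scikit-learn, and the data are synthetic, so it shows the technique and not the authors' code.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for a cumulative curve (not real data).
days = np.arange(300, dtype=float)
cases = 5e4 + 120.0 * days**2 + 0.3 * days**3

# Polynomial.fit rescales X internally, which keeps a degree-6 fit well conditioned.
fit = Polynomial.fit(days, cases, deg=6)
coeffs = fit.convert().coef          # a, b, ..., g in increasing-power order

# Coefficient of determination of the fitted curve.
pred = fit(days)
ss_res = np.sum((cases - pred) ** 2)
ss_tot = np.sum((cases - cases.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Extrapolate 100 days beyond the observed range.
future = np.arange(300, 400, dtype=float)
forecast = fit(future)
print(coeffs.round(3), r2)
```

High-degree polynomials can diverge quickly outside the fitted range, so an R^2 close to 1 on the observed data does not by itself guarantee a reliable 100-day forecast.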
Pearson correlation coefficient (PCC)
Correlation between sets of data is a measure of how well they are related. One of the most common measures of correlation is the Pearson correlation coefficient, which measures the linear correlation between two sets of data.6 It is the covariance of two variables divided by the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, with results between −1 and +1.
Pearson's correlation algorithm
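When Pearson's correlation coefficient is applied to a population, it is commonly represented by the Greek letter ρ (rho) and referred to as the population correlation coefficient or the population Pearson correlation coefficient. Given a pair of random variables (A, B), the formula for ρ is
\[ \rho_{A,B} = \frac{\operatorname{cov}(A, B)}{\sigma_a \sigma_b} \]
Where:
cov is the covariance of the variables A and B
σa is the standard deviation of A
σb is the standard deviation of B
The formula for ρ can be expressed in terms of mean and expectation. Since
\[ \operatorname{cov}(A, B) = E[(A - \mu_a)(B - \mu_b)] \]
The formula for ρ can also be written as
\[ \rho_{A,B} = \frac{E[(A - \mu_a)(B - \mu_b)]}{\sigma_a \sigma_b} \]
Where:
σa and σb are defined as above
µa is the mean of A
µb is the mean of B
E is the expectation
The formula for ρ can be expressed in terms of uncentered moments. Since
\[ \mu_a = E[A], \quad \mu_b = E[B], \quad \sigma_a^2 = E[A^2] - (E[A])^2, \quad \sigma_b^2 = E[B^2] - (E[B])^2, \quad E[(A - \mu_a)(B - \mu_b)] = E[AB] - E[A]E[B] \]
The formula for ρ can also be written as
\[ \rho_{A,B} = \frac{E[AB] - E[A]E[B]}{\sqrt{E[A^2] - (E[A])^2}\,\sqrt{E[B^2] - (E[B])^2}} \]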
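The equivalence of the covariance form and the uncentered-moment form can be checked numerically. The sketch below uses two hypothetical series; the function names pearson and pearson_uncentered are introduced here for illustration, and NumPy's corrcoef serves as a reference.

```python
import numpy as np

def pearson(a, b):
    """Population Pearson coefficient: cov(A, B) / (sigma_a * sigma_b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return cov / (a.std() * b.std())

def pearson_uncentered(a, b):
    """Same coefficient written with uncentered moments E[AB], E[A^2], E[B^2]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    num = np.mean(a * b) - a.mean() * b.mean()
    den = (np.sqrt(np.mean(a**2) - a.mean() ** 2)
           * np.sqrt(np.mean(b**2) - b.mean() ** 2))
    return num / den

# Hypothetical series standing in for two countries' cumulative counts.
days = np.arange(1, 101, dtype=float)
series_a = days**2
series_b = 3.0 * days**2 + 50.0 * np.sin(days / 5.0)

rho = pearson(series_a, series_b)
rho_u = pearson_uncentered(series_a, series_b)
rho_np = np.corrcoef(series_a, series_b)[0, 1]   # reference value
print(rho, rho_u, rho_np)
```

Both series grow monotonically, so the coefficient comes out close to 1 even though the curves differ in detail, which illustrates why cumulative pandemic curves tend to show very high Pearson coefficients.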
A bibliometric analysis
Bibliometrics is a statistical technique that uses mathematical methods to quantitatively analyze research papers concerned with a particular topic. It can also evaluate the major research fields, assess the quality of the studies, and forecast the course of future research. Nearly all significant research publications are included in the Web of Science (WOS) online database, which also offers built-in analysis tools to provide representative results. Co-occurrence, co-authorship, citation, bibliographic coupling, co-citation, and themes were examined using VOSviewer (version 1.6.10).7
Result
PCA for number of patients
Figure 2a presents the PCA results for the classification of countries based on the number of patients. It is a 3D figure because three principal components were analyzed for the countries in the dataset. Table 1 shows that the first principal component accounts for 97.1% of the variation in the data.
PCA for number of deaths
Table 2 shows the PCA of the death data for 20 countries from 03/01/2020 to 12/02/2022. The eigenvalues are tabulated, and the principal components are ranked by the percentage of variation in the data that each accounts for.
Table 2
Principal Component Number | Total average Eigenvalue | Total average Percentage of Variance (%) | Total average Cumulative (%)
1-222 cases of different countries | 0.945945856 | 0.450450502 | 99.84036653
Neural network prediction
The parameters of the neural network are shown in Table 3. Figure 2a shows the 3D plot of the predicted cases worldwide from the neural network and the polynomial regression model.8
Table 3
Type of Framework | Neural Network
Alpha | 1e-6
Learning rate | 0.0008
Number of hidden layers | 500
Batch size | 32
Number of Iterations | 19958
Tolerance | 1e-6
Activation Function | RELU
Optimizer | Adam
The data cover 506 days; the neural network was applied with the parameters shown, and the R^2 value was calculated.
Polynomial regression
The R^2 values obtained, shown in Table 4, indicate an almost perfect fit. Figure 4 shows the box plot of the predicted cases and deaths worldwide, with the normal distribution of the data, using polynomial regression.
Pearson’s correlation
Principal component analysis was performed by transforming the data into new sets of variables. The results are shown in graphs in which the countries and states are plotted on the axes of the first two principal components, together with the variance explained. The principal components are formed from the variables total cases, total deaths, and total recovered cases. The first principal component accounted for about 35% of the variability.9
Discussion
PCA for number of patients
According to the principal components, the countries can be divided into two groups. The first group consists of South Africa, Peru, Mexico, Brazil, India, Colombia, Russia, and Argentina. The second group consists of the United States, Spain, Iran, Indonesia, France, the Netherlands, the United Kingdom, Germany, Poland, and Turkey.10 Countries in the same group may therefore experience similar growth patterns if other conditions remain the same. Day-wise data were taken, and a 3D graph was drawn from the top three principal components. The larger the magnitude of an eigenvalue, the more of the variation in the data it accounts for. Chen C. et al. also studied PCA for the number of patients and concluded that there was a significant difference in the number of patients among countries.11, 12
PCA for number of deaths
Table 2 shows that when the countries are divided into groups by principal components based on the total number of deaths, the first group consists of Ukraine, Poland, Turkey, India, Brazil, Germany, and Italy. The second group consists of Russia, Indonesia, Colombia, Argentina, France, Spain, Peru, the Netherlands, Iran, the United States, the United Kingdom, Mexico, and South Africa. A significant difference between the groups of countries can be seen in cases and deaths.
Neural network prediction
In Figure 3, the X axis shows the days, and the Y and Z axes show the number of cases predicted by the neural network model and the polynomial regression model, respectively. The polynomial model predicts that cases will increase over the next 100 days, while the neural network model predicts a slow rise in the number of cases.
Figure 2b shows the 3D contour plot of the predicted deaths worldwide over a period of 100 days. Deaths were predicted to increase with time and reach a maximum of 6,888,152 according to the polynomial model and 4,974,181 according to the neural network model. The neural network model therefore predicts fewer worldwide deaths than the polynomial model.13
Figure 3 shows the trend of the curve-fitting data for worldwide cases (Figure 3a) and deaths (Figure 3b) from the neural network model. Figure 4 shows the box plot of the predicted cases and deaths worldwide, with the normal distribution of the data, using the neural network model discussed above.
Polynomial regression
Implementing the polynomial regression model with Python 3.8 and the scikit-learn module gave satisfactory results. The R^2 values are shown in Table 4. For the world dataset, the fitted equation for the trend of cases is f(X) = 461933.758 - 75264.227X^1 + 2132.323X^2 - 17.387X^3 + 0.083X^4 - 0.000X^5 + 0.000X^6, with an R^2 value of 0.999. For the trend of worldwide deaths, the equation is f(X) = 75895.500 - 9159.493X^1 + 211.012X^2 - 1.424X^3 + 0.005X^4 - 0.000X^5 + 0.000X^6, with an R^2 of 0.999756857. The X^5 and X^6 coefficients appear as 0.000 because the values are rounded to three decimal places. Figure 3 shows the trend of the curve-fitting data for worldwide cases (Figure 3a) and deaths (Figure 3b) from the polynomial regression model.
Pearson’s correlation
Figure 5 depicts the heat map of Pearson correlation coefficients among 20 countries. Pairs of countries with a Pearson correlation coefficient below 0.90 are comparatively less correlated with each other than pairs with a value above 0.95. Because cases increased everywhere during the pandemic, a correlation is easily established from the common growth trend.14 The usual rule, under which a Pearson coefficient greater than 0.50 indicates high correlation, therefore cannot be applied here. The Pearson correlation coefficient of the United Kingdom and India is 0.86, so they are less correlated than pairs of countries with values above 0.90 or even above 0.95, such as France and India (0.93) and Russia and France (0.98). Some pairs even have a Pearson correlation coefficient of 1, for example Italy and France, which indicates a great similarity in the trends in the number of cases in these countries and suggests the adoption of similar measures.15 Examining the correlation matrix over time showed that confirmed cases were not uniform across the countries and states considered. It was also observed that the eigenvalues for deaths were greater than those for confirmed cases.
Countries with a Pearson correlation coefficient below 0.90 are less correlated with each other than countries with higher values. The Pearson coefficient of the United Kingdom and India is 0.84, so they are less correlated than pairs with higher values, such as France and Russia. Because cases are increasing all around the world, a high correlation here generally corresponds to a value of more than 0.95. This includes pairs such as Brazil and USA (0.96) and Russia and France (0.99). For some pairs the Pearson correlation coefficient is 1, for example the Netherlands and Brazil, which means that the trends of deaths in these countries are similar and similar growth can be expected. The coefficient shows how the countries are correlated with each other in terms of deaths.16
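The banding used in this discussion (below 0.90 comparatively low, above 0.95 high) can be applied to a full correlation matrix with pandas. The sketch below uses three hypothetical series labeled A, B, and C; the band function and its "intermediate" label for values between the two thresholds are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative series for three regions (not the study's data).
days = np.arange(1, 201, dtype=float)
data = pd.DataFrame({
    "A": days**2,
    "B": 1.1 * days**2 + 300.0 * np.sin(days / 10.0),
    "C": 4000.0 * np.log1p(days),
})

corr = data.corr(method="pearson")   # symmetric matrix of pairwise coefficients

def band(r, low=0.90, high=0.95):
    """Label a coefficient using the thresholds discussed in the text."""
    if r > high:
        return "high"
    if r < low:
        return "comparatively low"
    return "intermediate"

# Visit each unordered pair once (upper triangle of the matrix).
pairs = {
    (a, b): (round(float(corr.loc[a, b]), 3), band(corr.loc[a, b]))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
}
for pair, (r, label) in pairs.items():
    print(pair, r, label)
```

The same corr matrix is what a heat map such as the one described for Figure 5 would visualize.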
Case Study of India
In India, COVID-19 was first confirmed in the state of Kerala on January 27, 2020, when a student returned from Wuhan, China. To prevent stage 3 human-to-human transmission that could accelerate the spread of the disease, the Government of India introduced social distancing as a precautionary measure. In addition, to make people aware of the epidemiological traits that distinguish the disease from the two previous epidemics of SARS-CoV and MERS-CoV, the Indian Government imposed a 14-hour voluntary public curfew (Janta Curfew) on March 22, 2020. A 21-day nationwide lockdown from March 25, 2020 to April 14, 2020 was then declared by the Government of India to prevent the spread of the disease among India's population of 1.3 billion.17
To combat the COVID-19 pandemic in India, the lockdown was extended in a second phase up to May 03, 2020. It was further extended to 17th May 2020 by the Government of India, and NDMA then extended it to 31st May 2020. Finally, from 8 June 2020, services began to resume under “Unlock 1”. As the restrictions were eased following the decrease in the number of cases, people came out in large numbers, and the policies of the government supported them. In October 2020, a model suggested that COVID had peaked in India and that Indians had achieved herd immunity. Although India’s coronavirus case count crossed the 1 crore (10 million) mark on December 19, people paid little attention because of rumors and because they were weary of staying inside their houses. This carelessness was followed by a larger second wave, which brought the tally of total cases to more than 30 million. Figure 7 shows the total number of cases, deaths, and recovered cases due to COVID-19 in Indian states as of 27 June 2021. During the second wave, there was no adequate vaccination or healthcare. Later, several drugs were researched, for example by DRDO, and other countries also prepared medicines for COVID.18 As a result of proper healthcare and large-scale vaccination, the death rate has decreased and the recovery rate is higher. However, some people were afraid of the vaccines because of rumors about their side effects. Owing to these factors, the third wave was predicted to have fewer cases than the second wave but more than the first wave, as the growth rate was increasing steadily.19
PCA
The PCA of the number of cases in India shows that the first principal component accounts for 95.77% of the variation in the data. According to the principal components, the states can be divided into four groups, as given in Table 5. Further divisions of the states into four groups are shown in Table 6 and Table 7.
Table 5
Table 6
Table 7
Polynomial regression
Polynomial regression was carried out on the data of daily Indian cases and daily Indian deaths.19 The equations are f(X) = -1622721.034 + 186954.462X^1 - 4354.908X^2 + 38.021X^3 - 0.144X^4 + 0.000X^5 - 0.000X^6 for the number of cases and f(X) = -11137.405 + 1355.512X^1 - 32.661X^2 + 0.294X^3 - 0.001X^4 + 0.000X^5 - 0.000X^6 for the number of deaths.20 Figure 6 shows the box plot of the predicted cases and deaths in India, with the normal distribution of the data, using the polynomial regression and neural network models.
Pearson’s correlation coefficient
As depicted in Figure 7, the number of cases and their trends in different Indian states are plotted as a numeric heat map of Pearson coefficients. In India, there has been a general increase in the number of cases because of the super-spreading nature of the virus.21 High correlation coefficients are therefore expected, because cases were increasing everywhere. Even among high correlation coefficients, however, some important correlations can be established by taking geographic and demographic data into consideration.22 Usually a coefficient above 0.5 is considered a high correlation,23 but here all the coefficients are above 0.80. A correlation of less than 0.90, for example the 0.87 between Andaman and Nicobar Islands and Punjab, can be regarded as comparatively low. Generally, a coefficient above 0.95 can be considered high similarity for this dataset, because cases were increasing everywhere.23 Some pairs of states exhibit a high coefficient, such as Punjab and Odisha (0.97) and Telangana and Puducherry (0.95). In some cases the value is equal to 1, as for Uttarakhand and Rajasthan, indicating the same trend and high correlation.
It can be inferred that states/union territories with a Pearson correlation coefficient below 0.90 are comparatively less correlated with each other. For example, the Pearson correlation coefficient for the number of deaths for the pair Tripura and Uttarakhand is 0.81. This would usually indicate high correlation, but it is somewhat low compared with other state pairs, whose values are usually above 0.95. The Pearson coefficient is expected to be high because there is usually migration between the states and the trends of increase and decrease in cases are similar across states. Some pairs, such as Tamil Nadu and Puducherry (0.99) and Telangana and Odisha (0.96), have a high correlation. This is also plausible geographically, because these states are near each other and have similar demographics. Some pairs of states/union territories have a Pearson correlation coefficient of 1, for example Odisha and West Bengal,10 which means they have approximately the same trend of deaths.
States/union territories with a Pearson correlation coefficient below 0.90 are comparatively less correlated with each other than those with a coefficient above 0.95. For example, Andaman and Nicobar Islands and Punjab have a Pearson correlation coefficient of 0.87, so they are less correlated than the pairs of states/union territories with coefficients above 0.90 or even above 0.95, which are highly correlated with each other.24
The highly correlated pairs include Punjab and Odisha (0.97) and Telangana and Puducherry (0.95). Some pairs of states/union territories have a Pearson correlation coefficient of 1, for example Uttarakhand and Rajasthan, which means they are strongly correlated, with similar increases and decreases in the number of confirmed cases. The largest eigenvalues and their corresponding eigenvectors were also examined. The largest eigenvalues of the countries tracked the state of the disease. For the countries and states that had the largest eigenvalues at the start of the pandemic (April 2020), the values decreased gradually (September 2020). After the peak, they gradually started increasing again with small fluctuations (January 2021). The next largest eigenvalues increased from September 2021 to October 2021, following the trend of the first largest eigenvalues, and continued to increase gradually after the peak. Death data were reliable until January 2022.25
Bibliometric analysis of the keywords
The final analysis included keywords that were submitted by the papers' authors and appeared more than five times in the WOS core database; 4,348 of the 9,886 keywords met this requirement. The keywords with the highest frequency were "COVID-19" and "coronavirus", which were strongly associated with "pneumonia" and "epidemiology". A word cloud was also constructed to display the frequency of the keywords that appeared more than ten times.26 The most prevalent keyword was "COVID-19," followed by "pneumonia," "outbreak," and "infection." Figure 8 shows the bibliometric analysis of the keywords used and the annual scientific production for COVID. The bibliometric analysis identified six clusters of cited references, displayed in different colors. The most often referenced author in the red cluster is Bierman I. The same data were used to analyze co-authorship by country. Data on a strong network of international collaboration were filtered and collected using simple criteria; as a result, just 93 of the 195 nations met the requirements.27 Figure 8 also displays the relationship of the co-citations of authors, from which it was inferred that the top three clusters, colored red, green, and blue, represent the research areas of clinical characteristics, disease transmission, and treatment. It also depicts the pattern of the countries of the corresponding authors: the USA has the strongest influence on the research, with the greatest number of publications from a single nation, followed by China and then India.
Conclusion
In this paper, principal component analysis, polynomial regression, and neural networks were used to predict the trend of COVID-19 cases both worldwide and for India. Considering the rate of deaths and cases in different countries, similar policies can be adopted for countries in the same group. A case study of India was carried out, and the modelling algorithms were implemented, leading to significant results. R^2 values > 0.999 were obtained for the curve-fitting data, and the trends for the next 100 days were predicted. With 222 countries considered, the number of cases and deaths in the world is predicted to increase, and immediate preventive measures are suggested. The Pearson correlation study shows high correlation between the data, as the average value of the coefficients is more than 0.99. This study was performed for both Indian states and countries worldwide, with similar results, indicating that the reasons for the increase in cases in one state are linked with those in others. In the case study of India, data from 36 Indian states and Union Territories were studied. Cases and deaths are predicted to increase steadily over the forthcoming 100 days according to the neural network model, while the polynomial regression model predicts a peak in the number of deaths in India on 13th January, 2022. Hence, preventive measures are recommended, along with strengthening of medical facilities. The bibliometric analysis concluded that more and more scholarly papers are being produced as the pandemic spreads. Scientific and medical research makes it possible both to understand COVID-19 and to develop strategies to stop its spread. Future directions still include the development of vaccines and effective pharmacological therapies.