Suppose you are commissioned by the CDC to investigate the role of some key socio-economic factors that may be affecting the death rate due to COVID-19. What regression model would you specify? Please write out that model and explain why the variables you chose may be of interest. How would you perform a hypothesis test to validate your arguments?
The model was executed as a three-step strategy. First, to visualise a base mortality risk assessment (the pre-COVID mortality risk scenario), a multi-criteria Analytic Hierarchy Process (AHP) [31] was used to compute weights (relative importance) for the nine static indicators. Pairwise comparison in AHP is a common technique for assessing the significance of each indicator [32] while tolerating a small degree of inconsistency in each comparison [33]. The first and second authors independently evaluated the relative importance of the factors, and discrepancies were then resolved. The weights were chosen in accordance with the research literature, which has established the impact of various indicators on COVID-19 mortality [26]. The computed weights are summarised in Table 1. The baseline scenario represented health risk in general terms, without focusing on the COVID-19 pandemic; it showed the strength of each nation based on its economy, health infrastructure, and demography.

Second, a multiple linear regression model was fitted in which the dependent variable was the normalised COVID-19 mortality for a country as of 13 May 2020 and the independent variables were the nine static socio-economic factors described earlier.

Third, the regression was repeated with the addition of six dynamic factors associated with COVID-19, giving a total of 15 independent variables. This third scenario, which combined COVID-19 data and stringency data with the static variables, reflected the pandemic state of the world at that date.
Table 1
Base scenario weights of static factors.
Variable | Weight |
---|---|
Average Population Density | 0.027 |
Population | 0.039 |
Health Expenditure | 0.058 |
GDP | 0.09 |
DALY | 0.157 |
Nurses | 0.157 |
Physicians | 0.157 |
Hospital Beds | 0.157 |
A65abp | 0.157 |
Consistency Ratio < 0.01 |
For the regression models, the predictors were then assessed for relative importance by assigning weights with the relaimpo package [34]. Lastly, the weights obtained from the modelling were combined with the factor ranks in a weighted sum (see Equation (1)):
$$\mathrm{Risk}_i = \sum_{j=1}^{n} w_j a_{ij} \qquad (1)$$
where $w_j$ is the weight of factor $j$, $a_{ij}$ is the rank value of factor $j$ for country $i$, and $n$ is the number of factors.
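As a minimal sketch of the weighted sum in Equation (1), the snippet below applies the Table 1 weights to rank values. The function name `risk_score` and the two sets of rank values are hypothetical and only illustrate the arithmetic; they are not taken from the paper's data or R code.

```python
# Weights from Table 1 (static factors, base scenario).
WEIGHTS = {
    "Average Population Density": 0.027,
    "Population": 0.039,
    "Health Expenditure": 0.058,
    "GDP": 0.09,
    "DALY": 0.157,
    "Nurses": 0.157,
    "Physicians": 0.157,
    "Hospital Beds": 0.157,
    "A65abp": 0.157,
}


def risk_score(ranks, weights=WEIGHTS):
    """Weighted sum of rank values for one country: sum over j of w_j * a_ij."""
    return sum(weights[name] * ranks[name] for name in weights)


# Hypothetical rank values for two made-up countries (assumption, not paper data).
country_a = {name: 10 for name in WEIGHTS}
country_b = dict(country_a, A65abp=50)  # same ranks, but a much higher elderly rank

print(round(risk_score(country_a), 3))
print(round(risk_score(country_b), 3))
```

Because A65abp carries one of the largest weights, changing only that rank moves the score noticeably, which is the behaviour the weighting is meant to produce.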
The second stage of the analysis was a linear regression model using the nine static variables as independent factors and normalised COVID-19 mortality on 13 May 2020 as the dependent variable. The results are summarised in Table 2. The model explained 69% of the variance in the data (R2 = 0.69). The ratio of the elderly in the population (A65abp) emerged as a significant predictor, and the GDP of countries and the number of hospital beds approached significance. Consistent with this, the relaimpo package in R assigned the highest relative weights to these predictors: 19% for A65abp and 22% for GDP. As a check for multicollinearity, the variance inflation factor for all predictor variables was lower than nine.
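The significance of a regression predictor is usually judged with a t-test of the null hypothesis that its coefficient is zero. The sketch below shows that test for a single predictor, which is a simplification of the multiple regression described here. The helper `slope_t_test` and the data are made up for illustration. With 8 observations there are 6 degrees of freedom, for which the two-sided 5% critical value of the t distribution is about 2.447.

```python
import math


def slope_t_test(x, y):
    """Fit y = b0 + b1*x by least squares; return (b1, standard error of b1, t statistic)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1, se_b1, b1 / se_b1


# Made-up illustrative data (assumption): share of population aged 65+ (%)
# against a normalised mortality figure.
elderly_share = [5.0, 8.0, 12.0, 15.0, 18.0, 20.0, 22.0, 23.0]
mortality = [2.1, 3.9, 7.2, 8.1, 11.5, 11.9, 14.2, 13.8]

b1, se, t = slope_t_test(elderly_share, mortality)
print(f"slope={b1:.3f}, se={se:.3f}, t={t:.2f}")
print("reject H0: slope = 0 at the 5% level" if abs(t) > 2.447 else "fail to reject H0")
```

In a multiple regression the same idea applies to each coefficient, with degrees of freedom equal to the number of observations minus the number of fitted parameters.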
Table 2
Regression results for risk of mortality. For p-values, “***” represents p<0.001, “**” represents p<0.01, and “*” represents p<0.05.
Regression Model | R2 | Significant p-Values | Top Weights |
---|---|---|---|
Static factors | 0.69 | A65abp *** | A65abp (0.19), GDP (0.22) |
Static and dynamic factors | 0.88 | A65abp ***, nurses *, susceptible *, active ***, mortality growth ** | active (0.20), susceptibles (0.15), mortality growth (0.11), A65abp (0.10) |
For the third step in the analysis, the regression modelling was repeated with the addition of six dynamic variables associated with COVID-19, giving a total of 15 independent variables. This model explained up to 88% of the variance in the data. The dynamic variables tended to dominate the static socio-economic factors, with three dynamic factors having significant predictive power. The ratio of the elderly was again a significant predictor of COVID-19 mortality risk, as was the number of nurses. The government-enforced stringency level did not emerge as a significant predictor in this model. As a check for multicollinearity, the variance inflation factor for all predictor variables (except GDP) was lower than 7. Table 3 lists the top 10 countries sorted by actual mortality rate, together with their predicted risk rankings: the pre-COVID ranking and the ranking on 13 May 2020, the latter from the model comprising both static and dynamic indicators. For this subset of 10 countries at least, the table suggests that they are at a COVID-19 mortality risk level consistent with what their baseline risk assessment anticipated.
Table 3
Top 10 countries ranked on actual mortality rate and their predicted risk assessment.
Country Name | Mortality Rate (Actual) | Pre-COVID-19 Mortality Risk Rank (Predicted) | COVID-19 Mortality Risk Rank as at 13 May 2020 (Predicted) |
---|---|---|---|
San Marino | 1213.6 | 41 | 3 |
Belgium | 774.2 | 7 | 8 |
Andorra | 636.3 | 46 | 60 |
Spain | 580.1 | 35 | 41 |
Italy | 514.1 | 14 | 17 |
United Kingdom | 499.1 | 25 | 16 |
France | 403.5 | 11 | 13 |
Sweden | 339.8 | 9 | 11 |
Netherlands | 322.8 | 10 | 12 |
Ireland | 308.4 | 27 | 33 |
A spatial map illustrates the mortality risk of COVID-19 as predicted by the third step of the analysis (see Figure 2). A second spatial map shows the change from baseline in COVID-19 mortality risk as projected by the linear regression model that combined static and dynamic factors (see Figure 3, essentially a difference between Figure 2 and Figure 3). The map indicates that most countries were at their expected level of risk or lower on 13 May 2020 compared with what was originally predicted in the base scenario: most countries are coloured in shades of yellow, orange, or green, which denote a reduction or equivalence in risk relative to what was expected. All materials related to the modelling, such as the R code, output, and base data, are provided in a supplementary file.
Conclusions
In this paper, a mortality risk-based evaluation of COVID-19 on a global scale using data as at 13 May 2020 is presented. Using a multi-weighted approach, a range of scenarios combining static and dynamic variables were examined. The main finding was that the ratio of the elderly in a population clearly emerged as a significant mortality risk predictor for COVID-19; however, this must be considered in light of the residency makeup of individual countries. In addition, a combination of static socio-economic factors and dynamic factors associated with COVID-19 growth and spread had higher predictive capability than the static factors alone. The current stringency of government-imposed restrictions was not observed to have an impact. In general, as at 13 May 2020, the spatial analysis suggests that the mortality risk projections of COVID-19 were lower than or as expected for most countries around the world.
The earliest COVID-19 patients in the data set were recorded on January 22, 2020. We use records from January 22, 2020 to June 29, 2020. The data set consists of 160 instances and five attributes: the date of recording, confirmed cases, recovered cases, deaths, and the growth (increase) rate of COVID-19 cases. The following estimates were computed from the data set to explore it and extract useful information.
Correlation coefficients
The correlation coefficient is a statistical measure of the strength of the linear relationship between two variables. Its values range from -1 to +1; a computed value greater than +1 or less than -1 indicates an error in the correlation measurement. A value of -1 indicates a perfect negative correlation, a value of +1 a perfect positive correlation, and a value of 0.0 no linear relationship between the two variables [24].
Correlation statistics can be used to describe the relationships between different attributes of the disease. We computed correlation coefficients between confirmed cases, recovered cases, deaths, and the increase rate under the current pandemic situation, as shown in Table 1 and Figure 3. The correlation between confirmed cases and recovered cases is highly positive.
Table 1: Correlation Coefficients of attributes
 | Confirmed | Recovered | Deaths | Increase rate |
---|---|---|---|---|
Confirmed | 1.000000 | 0.986051 | 0.988177 | -0.378478 |
Recovered | 0.986051 | 1.000000 | 0.950569 | -0.337027 |
Deaths | 0.988177 | 0.950569 | 1.000000 | -0.401742 |
Increase rate | -0.378478 | -0.337027 | -0.401742 | 1.000000 |
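For reference, a Pearson correlation coefficient of the kind tabulated above can be computed as in the following sketch. The `pearson` helper and the three short series are made up for illustration and do not reproduce the values in the table.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical cumulative counts and a falling daily increase rate (assumption).
confirmed = [100, 250, 600, 1200, 2100, 3300]
recovered = [5, 30, 110, 300, 700, 1400]
increase_rate = [1.5, 1.4, 1.0, 0.75, 0.57, 0.4]

print(round(pearson(confirmed, recovered), 3))      # strongly positive
print(round(pearson(confirmed, increase_rate), 3))  # negative
```

Two cumulative series that both rise over time will almost always be strongly correlated, so the high values between confirmed, recovered, and death counts say little about causation.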
ARIMA Model Results
An ARIMA model requires choosing the parameters p, d, and q [28]. Rather than selecting them by inspecting plots, we use auto_arima to find appropriate parameters. The auto_arima function works by conducting differencing tests (Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller, or Phillips–Perron) to determine the order of differencing, d, and then fitting models within the defined start_p, max_p, start_q, and max_q ranges [25]. If the seasonal option is enabled, auto_arima also seeks to identify the optimal P and Q hyperparameters after conducting the Canova-Hansen test to determine the optimal order of seasonal differencing, D. Figure 4 shows the parameters obtained by auto_arima.
Figure 5 shows the residual diagnostic plots from the auto_arima model.
The output of the auto_arima model is interpreted as follows:
Standardized residual: The residual errors fluctuate around a mean of zero and have a uniform variance.
Histogram and density plot: The density plot suggests a normal distribution with a mean of zero.
QQ-plot: All the blue dots (the ordered distribution of residuals) fall on the red line; any significant deviation would imply that the distribution is skewed. The residuals therefore follow N(0, 1) and can be considered normally distributed.
Correlogram: The correlogram, or ACF plot, shows that the residual errors are not autocorrelated. Any autocorrelation would imply that there is some pattern in the residual errors that the model does not explain.
The optimal values of p, d, and q obtained by auto_arima are 1, 2, and 2, respectively. Using these parameters (1, 2, 2), we fit an ARIMA model; the results are shown in Figure 6.
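The value d = 2 means the series is differenced twice before the autoregressive and moving-average terms are fitted. The sketch below illustrates only that differencing step on a made-up series with a quadratic trend. The `difference` helper is hypothetical and is not part of auto_arima.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y[t] - y[t-1]."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


# Hypothetical cumulative series with a quadratic trend (assumption).
series = [t * t + 3 * t + 2 for t in range(10)]

print(difference(series, d=1))  # still trending upward
print(difference(series, d=2))  # constant: the trend has been removed
```

Each round of differencing shortens the series by one observation, so a twice-differenced series of 160 points leaves 158 values for fitting.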
Figure 6 shows the summary of the ARIMA model. We focus on the coefficient table. The coef column shows the weight of each term and how it affects the time series. The P>|z| column indicates the significance of each weight. Here, the p-value of each weight is less than or close to 0.05, so it is reasonable to keep all of them in our model.
These diagnostics suggest that the model fits well, which helps us understand the time series and forecast future values. Although the fit is reasonable, some parameters of the ARIMA model could still be changed to improve it. Having obtained a model for the time series, we can now use it to produce forecasts [26]. We first compare the predicted values with the actual values of the time series, which helps us assess the accuracy of the forecasts. The forecasts and their associated confidence intervals can then be used to further understand the time series and anticipate what is to come. Our forecasts indicate that the time series is expected to maintain a consistent growth rate.
As we forecast further into the future, it is natural to become less confident in the predicted values. This is reflected in the confidence intervals generated by our model, which grow wider the further ahead we forecast. We predict death cases on the test data set at a 95% confidence level. Figure 7 shows the prediction results.
In the figure, the actual deaths in the training data set are shown by the blue line, and the predicted deaths by the red line. The predicted deaths on the red line decline, which suggests that the incidence of deaths will keep falling in the future as more people recover quickly and people maintain social distancing during the pandemic.
Using statistical measures, we created summary metrics that condense the residuals into single values describing the model's predictive ability.
To judge the prediction results, we apply commonly used accuracy measures; the results are shown in Table 2.
Table 2: Measures of accuracy
Measures of Accuracy | Value |
---|---|
Mean Absolute Error (MAE) | 0.12044588473307338 |
Mean Squared Error (MSE) | 0.023012953284359018 |
Root Mean Squared Error (RMSE) | 0.15170020858376898 |
Mean Absolute Percentage Error (MAPE) | 0.009196691386663233 |
The MAE of our model is 0.1204, which is quite small, assuming the death-case values in our data start at 0.01.
The MSE of 0.0230 is smaller than the MAE, as expected when the errors are smaller than 1, because squaring shrinks them.
The RMSE of 0.1517 is analogous to a standard deviation and measures how spread out the residuals are.
A MAPE of about 0.92% implies that the model is about 99.08% accurate in predicting the test set observations.
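The four accuracy measures are related in simple ways, which the following sketch makes explicit. The `accuracy_metrics` helper and the small actual/predicted lists are hypothetical. The last lines check that the reported RMSE is the square root of the reported MSE.

```python
import math


def accuracy_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE (as a fraction) for paired observations."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides each absolute error by the actual value, so actual must be non-zero.
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical values, chosen only to make the arithmetic easy to follow.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.0, 16.0, 18.0]
metrics = accuracy_metrics(actual, predicted)
print(metrics)

# Consistency check on the reported figures: RMSE should equal sqrt(MSE).
reported_mse = 0.023012953284359018
reported_rmse = 0.15170020858376898
print(abs(math.sqrt(reported_mse) - reported_rmse) < 1e-9)
```

MAPE is returned as a fraction, so a value such as 0.0092 corresponds to roughly 0.92%.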
Regression Model Results
To find out which factors have the most significant influence on the predicted output and how the various factors relate to each other, we consider the input features "confirmed case", "recovered case", and "increase rate". Based on these features, we predict the deaths of COVID-19 patients. The data set was split 80%:20% into training and testing sets, respectively.
In multiple linear regression, the model selects the best-fitting coefficients for all attributes [27]. The coefficients of the regression model are shown in Table 3.
Table 3: coefficients of regression model
Attributes | Coefficient |
---|---|
Confirmed | 0.103305 |
Recovered | -0.100568 |
Increase rate | 69.616876 |
From Table 3, an increase of 1 unit in “recovered case” is associated with a decrease of 0.1006 units in “death case”, and vice versa, holding the other attributes fixed. Similarly, an increase of 1 unit in “confirmed case” or “increase rate” is associated with an increase of 0.1033 or 69.6169 units in “death case”, respectively.
We now predict on the test data to compare the actual and predicted values, shown in Table 4.
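The coefficient interpretation can be turned into a small calculation. The intercept of the fitted model is not reported in Table 3, so the sketch below computes only the change in predicted deaths for a given change in the inputs, not the prediction itself. The `predicted_change` helper and the example input changes are hypothetical.

```python
# Coefficients as reported in Table 3.
COEFFICIENTS = {
    "Confirmed": 0.103305,
    "Recovered": -0.100568,
    "Increase rate": 69.616876,
}


def predicted_change(deltas, coefficients=COEFFICIENTS):
    """Change in predicted deaths for the given changes in each attribute.

    Attributes missing from `deltas` are treated as unchanged.
    """
    return sum(coefficients[name] * deltas.get(name, 0.0) for name in coefficients)


# Hypothetical scenario: 1000 more confirmed cases and 500 more recoveries.
change = predicted_change({"Confirmed": 1000, "Recovered": 500})
print(round(change, 3))
```

Because confirmed and recovered counts are highly correlated in this data set, the individual coefficients should be read with caution. A "one unit change with the others held fixed" rarely occurs in practice.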
Table 4: Difference between the actual value and predicted value
Instance Number | Actual Value | Predicted Value |
---|---|---|
110 | 286697 | 221975.301362 |
112 | 297539 | 286646.565236 |
143 | 430047 | 423127.482077 |
7 | 133 | -6528.684075 |
44 | 3459 | -2713.950271 |
101 | 244129 | 236968.993751 |
122 | 342565 | 329894.990367 |
66 | 31990 | 47224.597929 |
85 | 148157 | 160515.287829 |
86 | 157022 | 167041.159151 |
133 | 386298 | 376198.729391 |
92 | 193926 | 198189.689192 |
26 | 1868 | -1385.556916 |
146 | 443685 | 438945.896459 |
119 | 328483 | 318945.015040 |
62 | 19026 | 25233.066196 |
51 | 5411 | 808.770349 |
97 | 221109 | 221511.564448 |
128 | 365380 | 355638.073651 |
90 | 180475 | 187102.115303 |
Figure 8 plots the actual values against the predicted values.
As Table 4 and Figure 8 show for the multiple regression model, the predicted number of deaths initially exceeds the actual deaths, but further along in the data, from May 2, 2020 onward, the predicted deaths fall below the actual deaths.
Overall, this study shows that the reduction in deaths worldwide is a good sign for human society.