COVID 19 STUDY OF STATE OF KERALA USING DATA SCIENCE AS ON 08-05-2020

The software used for this research work is:

Anaconda Individual Edition, available at https://www.anaconda.com/

Python 3, available at https://www.python.org/

Datasets obtained from

https://www.kaggle.com/sudalairajkumar/covid19-in-india.

A very special thanks to kaggle.com for the daily datasets provided on its website. The good part is that the data obtained from Kaggle.com is clean: it has no missing data, so it is easy to work with.

The dataset used here was updated on 8/5/2020 at 8:00 am.

I have also used Google's Colab research resource, https://colab.research.google.com/, which is extremely useful.

Hardware used: Intel i3 processor with 4 GB of memory.

I have loaded the India Covid CSV dataset into Python as a pandas data frame named df.
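The loading step itself is not shown in this post. With the Kaggle download saved locally it would presumably be a single pd.read_csv call; the file name covid_19_india.csv used in the comment below is an assumption. So that the sketch runs without the download, it reads two illustrative rows from a string instead.

```python
import io
import pandas as pd

# Illustrative rows only, not the Kaggle data. The column names follow
# the ones listed in this post.
sample_csv = io.StringIO(
    "Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,"
    "ConfirmedForeignNational,Cured,Deaths,Confirmed\n"
    "1,30/01/2020,6:00 PM,Kerala,1,0,0,0,1\n"
    "2,31/01/2020,6:00 PM,Kerala,1,0,0,0,1\n"
)

# With the real download this would be something like
# pd.read_csv("covid_19_india.csv")  (hypothetical local file name).
df = pd.read_csv(sample_csv)
print(df.shape)
```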

print(df)

The data frame has 1737 rows and 9 columns, which is extensive enough for our study of the spread of Covid in India. Note the columns: there is an index running from 0 to 1736, and the other columns are Sno, Date, Time ... Cured, Deaths and Confirmed.
Let us list all the column names of this data frame
list(df.columns)

We will print the first five rows
print(df.head())

Now we will list the last five rows

print(df.tail())

Print the data types used
print(df.dtypes)

Note that Date has dtype object, while Cured, Deaths and Confirmed are int64.
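Because Date is stored as text (dtype object), sorting by date or doing date arithmetic needs a conversion first. A minimal sketch follows, assuming the dates are written as day/month/four-digit year; that layout is an assumption about the file, and the rows are made up.

```python
import pandas as pd

# Made-up rows; the day/month/year layout is an assumption about the file.
df_dates = pd.DataFrame({"Date": ["30/01/2020", "08/05/2020"],
                         "Confirmed": [1, 5]})

# Parse the text dates explicitly as day/month/year.
df_dates["Date"] = pd.to_datetime(df_dates["Date"], format="%d/%m/%Y")

print(df_dates.dtypes)
print(df_dates["Date"].dt.month.tolist())
```

If the real file uses a two-digit year, the format string would need %y instead of %Y.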

Let's obtain the full summary of the pandas data frame by using

df.info()
We are interested only in Date, State/UnionTerritory, Confirmed, Cured and Deaths.

The other columns are of no use for this study.
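The earlier remark that the data has no missing values can be checked directly by counting them per column. The sketch below uses a small made-up frame with one deliberately missing value to show what the check reports. On the real data the same call would be df.isna().sum(), and every count should be zero if the remark holds.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one deliberately missing value.
check = pd.DataFrame({"Cured": [0, 2, np.nan], "Deaths": [0, 0, 1]})

# Number of missing entries in each column.
missing_per_column = check.isna().sum()
print(missing_per_column)

# True if any cell anywhere in the frame is missing.
any_missing = bool(check.isna().any().any())
print("Any missing values:", any_missing)
```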

For the sake of simplicity, let's keep only these columns and name the resulting data frame subset

subset=df[['State/UnionTerritory','Date','Confirmed','Cured','Deaths']]
print(subset)
You can see that this data includes all states, ordered not alphabetically but by date, starting with Kerala on 30/01/2020 and running up to West Bengal on 8/5/2020.

We rename the column headings for convenience and ease of use in further data analysis. We call the renamed data frame df1

# columns not listed here (Sno, Date, Time, Cured, Deaths, Confirmed) keep their names
df1 = df.rename(columns = {'State/UnionTerritory': 'STUT',
                           'ConfirmedIndianNational': 'CInNat',
                           'ConfirmedForeignNational': 'CFnNat'})
df1.head()

The output is as follows

The columns have been renamed for our convenience.
We first study the state of Kerala, which reported the first case, has had a very good rate of recovery, and has seen its number of confirmed cases come down substantially.

# keep only the columns of interest, using the renamed headings from df1
subset1 = df1[['STUT','Date','Confirmed','Cured','Deaths']]
is_subset1_kerala = subset1.STUT == "Kerala"
dfKerala = subset1[is_subset1_kerala]
dfKerala
You can see that there are 100 rows and 5 columns in this data frame, which we call dfKerala.
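The filtering step relies on a boolean mask: comparing a column with a value gives one True/False per row, and indexing the data frame with that mask keeps only the True rows. A small self-contained sketch with made-up rows (the name kerala_demo is just for illustration):

```python
import pandas as pd

# Made-up rows in date order, mixing several states.
subset_demo = pd.DataFrame({
    "STUT": ["Kerala", "Delhi", "Kerala", "West Bengal"],
    "Date": ["30/01/2020", "02/03/2020", "31/01/2020", "08/05/2020"],
    "Confirmed": [1, 1, 1, 10],
})

# One True/False value per row.
is_kerala = subset_demo["STUT"] == "Kerala"

# Keep only the rows where the mask is True and renumber them from 0.
kerala_demo = subset_demo[is_kerala].reset_index(drop=True)
print(kerala_demo)
```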
We plot the regression line of Confirmed vs Cured for the state of Kerala and we get a graph as below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

X = dfKerala.drop('Confirmed',axis = 1)
y = dfKerala[['Confirmed']]
seed = 10
test_data_size = 0.3
# hold out 30% of the rows; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_data_size, random_state = seed)
train_data = pd.concat([X_train, y_train], axis = 1)
test_data = pd.concat([X_test, y_test], axis = 1)
fig, ax = plt.subplots(figsize=(12, 6))
sns.regplot(x='Confirmed', y='Cured', ci=None, data=train_data, ax=ax, color='k', scatter_kws={"s": 20,"color":"red", "alpha":1})

# same training data, with the logarithm of Confirmed on the y-axis
fig, ax = plt.subplots(figsize=(12, 6))
# use a separate name so that y (the Confirmed column) is not overwritten
log_confirmed = np.log(train_data['Confirmed'])
sns.regplot(x='Cured', y=log_confirmed, ci=95, data=train_data, ax=ax, color='k', scatter_kws={"s": 10,"color": "royalblue", "alpha":1})
ax.set_ylabel('log of Confirmed', fontsize=15,fontname='DejaVu Sans')
ax.set_xlabel("Cured",fontsize=15, fontname='DejaVu Sans')
ax.set_xlim(left=None, right=None)
ax.set_ylim(bottom=None, top=None)
ax.tick_params(axis='both', which='major', labelsize=12)
fig.tight_layout()
The data is split into training data (70%) and test data (30%) with the seed fixed at 10, so that the same split is obtained on every run. The value 10 itself is arbitrary.
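What the fixed seed does can be seen in a small sketch: splitting the same data twice with the same random_state gives identical train and test rows. The frame below is made up and merely has the same number of rows (100) as the Kerala data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows, the same size as the Kerala data frame.
demo = pd.DataFrame({"Cured": range(100), "Confirmed": range(100, 200)})
X_demo = demo.drop("Confirmed", axis=1)
y_demo = demo[["Confirmed"]]

# Two independent splits with the same seed.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)

print(len(a_train), len(a_test))          # sizes of the 70/30 split
print(a_test.index.equals(b_test.index))  # same rows picked both times
```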
Now we plot the linear model for this data
fig, ax = plt.subplots(figsize=(12, 6))
sns.regplot(x='Confirmed', y='Deaths', ci=None, data=train_data, ax=ax, color='k', scatter_kws={"s": 20,"color":"royalblue", "alpha":1})
The output is as follows
From the graph we can see that the blue points lie very close to the regression line, which indicates that the number of cured cases follows an approximately linear relation with the number of confirmed cases.

One conclusion that can be drawn is that Kerala has more cured cases and that the number of confirmed cases is coming down.
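How close the points lie to the regression line can also be expressed as a number, the coefficient of determination R² (1 means a perfect straight-line fit). The sketch below fits a least-squares line with numpy.polyfit and computes R² by hand. The counts are made up for illustration and are not the Kerala figures; this is one possible check, not the method used for the plots.

```python
import numpy as np

# Made-up cumulative counts for illustration only.
confirmed = np.array([10, 50, 120, 200, 300, 400, 450, 500], dtype=float)
cured = np.array([0, 5, 30, 80, 150, 250, 350, 460], dtype=float)

# Least-squares straight line: cured is approximated by slope * confirmed + intercept.
slope, intercept = np.polyfit(confirmed, cured, deg=1)
predicted = slope * confirmed + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((cured - predicted) ** 2)
ss_tot = np.sum((cured - cured.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.1f}, R^2={r_squared:.3f}")
```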
If we need a logarithmic linear regression graph for the state of Kerala, the code is as follows
fig, ax = plt.subplots(figsize=(12, 6)) 
y = np.log(train_data['Confirmed'])
sns.regplot(x='Deaths', y=y, ci=95, data=train_data, ax=ax, color='k', scatter_kws={"s": 10,"color": "royalblue", "alpha":1})
ax.set_ylabel('log of Confirmed', fontsize=15,fontname='DejaVu Sans') 
ax.set_xlabel("Deaths",fontsize=15, fontname='DejaVu Sans') 
ax.set_xlim(left=None, right=None) 
ax.set_ylim(bottom=None, top=None) 
ax.tick_params(axis='both', which='major', labelsize=12) 
fig.tight_layout()
Again, the same conclusion can be drawn from this linear regression model for the state of Kerala.
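A note on reading such a log plot: if log(Confirmed) is a straight line in x with slope b, then Confirmed is multiplied by exp(b) for every unit step in x. The sketch below shows this on synthetic data that is exactly exponential, so the fit recovers the chosen slope. The numbers are invented, and np.log needs strictly positive counts.

```python
import numpy as np

# Synthetic data: x stands in for the x-axis variable (for example Deaths),
# and confirmed is exactly 2 * exp(0.5 * x).
x = np.arange(0, 6, dtype=float)
confirmed = 2.0 * np.exp(0.5 * x)

# Straight-line fit on the log scale: log(confirmed) = intercept + slope * x.
slope, intercept = np.polyfit(x, np.log(confirmed), deg=1)

# Each unit step in x multiplies confirmed by this factor.
growth_factor = np.exp(slope)
print(round(slope, 3), round(float(np.exp(intercept)), 3), round(float(growth_factor), 3))
```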
Let us now look at the heatmap of the correlation between the variables considered for this data.
For this correlation matrix, I have used the Pearson coefficient.
The Python code for this is
# plot_corr as provided by statsmodels
from statsmodels.graphics.correlation import plot_corr

# correlate only the numeric columns; STUT and Date are text
corrMatrix = train_data[['Cured','Deaths','Confirmed']].corr(method = 'pearson')
xnames=list(corrMatrix.columns)
ynames=list(corrMatrix.columns)
plot_corr(corrMatrix, xnames=xnames, ynames=ynames,title=None,normcolor=False, cmap='RdYlBu_r')
The heatmap for the state of Kerala is as follows

The red areas show a remarkably high degree of correlation: for Confirmed vs Cured the coefficient is 0.92 to 0.94, which indicates a strong positive correlation for the state of Kerala as of 08-05-2020. A conclusion that can be drawn is that the recovery rate is high for this state.
We can verify this by obtaining the Pearson correlation coefficients with the code
# restrict to the numeric columns, since STUT and Date are text
train_data[['Cured','Deaths','Confirmed']].corr(method = 'pearson')
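For reference, the Pearson coefficient is the sum of the products of the deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The sketch below computes it by hand on made-up counts and compares the result with what pandas returns.

```python
import numpy as np
import pandas as pd

# Made-up counts for illustration only.
demo = pd.DataFrame({
    "Confirmed": [1, 3, 10, 50, 120, 300],
    "Cured": [0, 0, 3, 20, 60, 200],
})

# Deviations of each column from its own mean.
dx = demo["Confirmed"] - demo["Confirmed"].mean()
dy = demo["Cured"] - demo["Cured"].mean()

# Pearson r computed from the definition.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient taken from the pandas correlation matrix.
r_pandas = demo.corr(method="pearson").loc["Confirmed", "Cured"]
print(r_manual, r_pandas)
```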
The state of Kerala has shown remarkable progress: the number of deaths has been kept low, the number of cured cases is very high, and the number of confirmed cases is low.
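A high correlation between Confirmed and Cured does not by itself give the recovery rate. That can be read directly from the cumulative totals on a given date as Cured divided by Confirmed. A minimal sketch with hypothetical totals (not the actual Kerala figures):

```python
import pandas as pd

# Hypothetical cumulative totals for one state on one date (not real figures).
latest = pd.Series({"Confirmed": 1000, "Cured": 900, "Deaths": 10})

# Share of confirmed cases that have recovered or died so far.
recovery_rate = latest["Cured"] / latest["Confirmed"]
fatality_rate = latest["Deaths"] / latest["Confirmed"]

# Cases that are neither cured nor dead are still active.
active_cases = latest["Confirmed"] - latest["Cured"] - latest["Deaths"]

print(f"recovery {recovery_rate:.1%}, fatality {fatality_rate:.1%}, active {active_cases}")
```

On the real data, the last row of dfKerala would supply these three totals.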



 
