COVID 19 STUDY OF STATE OF KERALA USING DATA SCIENCE AS ON 08-05-2020
The software used for this research work is:
Anaconda Individual Edition, available at https://www.anaconda.com/
Python 3, available at https://www.python.org/
The datasets were obtained from
https://www.kaggle.com/sudalairajkumar/covid19-in-india.
A very special thanks to kaggle.com for the day-to-day datasets provided on its website. The good part about the data obtained from Kaggle.com is that it is clean: it does not have any missing data, so it is easy to work with.
The dataset used was last updated on 8/5/2020 at 8:00 am.
I have also used Google's Colab research resource, https://colab.research.google.com/, which is extremely useful.
The processor used was an Intel i3 with 4 GB of memory.
I have loaded the India Covid CSV dataset into Python and formed the data frame df.
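The loading step itself is not shown, so here is a minimal sketch of how it might look. The helper name load_covid_data and the file name covid_19_india.csv are assumptions for illustration, and the two sample rows are placeholders that only mimic the column layout.

```python
import io
import pandas as pd

def load_covid_data(source):
    """Read the India COVID-19 CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(source)

# Tiny in-memory stand-in with the same nine columns, for illustration only;
# the rows are placeholders, not figures from the Kaggle download.
sample_csv = io.StringIO(
    "Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,"
    "ConfirmedForeignNational,Cured,Deaths,Confirmed\n"
    "1,30/01/20,6:00 PM,Kerala,1,0,0,0,1\n"
    "2,31/01/20,6:00 PM,Kerala,1,0,0,0,1\n"
)
df = load_covid_data(sample_csv)
print(df.shape)

# With the real download (the file name is an assumption):
# df = load_covid_data("covid_19_india.csv")
```

Passing a path instead of the in-memory buffer works the same way, since pd.read_csv accepts both.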
print(df)
The data frame has 1737 rows and 9 columns, which is extensive enough for our study of the spread of Covid in India. Note the columns: there is an index running from 0 to 1736, and the other columns are Sno, Date, Time ... Cured, Deaths and Confirmed.
Let us enumerate all the column names of this data frame
list(df.columns)
We will print the first five rows
print(df.head())
Now we will list the last five rows
print(df.tail())
Print the data types used
print(df.dtypes)
Let's obtain the full summary of the pandas data frame by using
df.info()
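The earlier remark that the Kaggle data has no missing values can be checked directly rather than taken on trust. The sketch below counts missing entries per column; the helper name missing_report and the small demo frame (which deliberately contains one gap) are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_report(frame):
    """Return the number of missing values in each column."""
    return frame.isna().sum()

# Demo frame with one deliberately missing value, for illustration only.
demo = pd.DataFrame({
    "Cured": [0, 1, np.nan],
    "Deaths": [0, 0, 0],
    "Confirmed": [1, 2, 3],
})
report = missing_report(demo)
print(report)
print("Any missing values:", bool(report.sum() > 0))
```

Running missing_report(df) on the real data frame should show zeros everywhere if the dataset is as clean as described.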
We are interested only in Date, State/UnionTerritory, Confirmed, Cured and Deaths; the other columns are of no use for this study.
For the sake of simplicity, let's rename the column headings for convenience and ease of use in further data analysis, and then filter the data down to only these columns.
We call the renamed data frame df1 and the filtered subset subset1.
df1 = df.rename(columns={
    'State/UnionTerritory': 'STUT',
    'ConfirmedIndianNational': 'CInNat',
    'ConfirmedForeignNational': 'CFnNat',
})
df1.head()
The output is as follows
The columns have been renamed for our convenience.
We first take up for our study the state of Kerala, which reported the first case in India, has had a very good rate of recovery, and has seen the number of newly confirmed cases come down substantially.
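subset1 = df1[['Date', 'STUT', 'Cured', 'Deaths', 'Confirmed']]  # keep only the five columns of interest
is_subset1_kerala = subset1.STUT == "Kerala"
dfKerala = subset1[is_subset1_kerala]
dfKerala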
You can see that there are 100 rows and 5 columns in this data frame.
We call this data frame dfKerala.
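The filtering step above can be sketched on a small stand-in frame. This version also converts the Date column to real dates and sorts by it, assuming the dates are stored as day/month/two-digit-year strings such as 30/01/20; that format is an assumption about the CSV, and the rows are placeholders.

```python
import pandas as pd

# Placeholder rows using the renamed columns; not real figures.
demo = pd.DataFrame({
    "Date": ["01/02/20", "30/01/20", "31/01/20", "30/01/20"],
    "STUT": ["Kerala", "Kerala", "Delhi", "Delhi"],
    "Cured": [0, 0, 0, 0],
    "Deaths": [0, 0, 0, 0],
    "Confirmed": [2, 1, 0, 0],
})

# Boolean mask selects the Kerala rows; .copy() avoids modifying a view.
kerala = demo[demo["STUT"] == "Kerala"].copy()

# Assumed format: day/month/two-digit year.
kerala["Date"] = pd.to_datetime(kerala["Date"], format="%d/%m/%y")
kerala = kerala.sort_values("Date").reset_index(drop=True)

print(kerala.shape)
print(kerala)
```

Sorting by a parsed date keeps the rows in chronological order, which plain string sorting of day-first dates would not guarantee.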
We plot the regression line of Confirmed vs Cured for the state of Kerala and get the graphs below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Features are every column except the target, Confirmed
X = dfKerala.drop('Confirmed', axis=1)
y = dfKerala[['Confirmed']]

# 70/30 train/test split with a fixed seed so the split is reproducible
seed = 10
test_data_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_data_size, random_state=seed)
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

# Linear regression plot of Cured against Confirmed
fig, ax = plt.subplots(figsize=(12, 6))
sns.regplot(x='Confirmed', y='Cured', ci=None, data=train_data, ax=ax, color='k', scatter_kws={"s": 20, "color": "red", "alpha": 1})

# log(Confirmed) against Cured with a 95% confidence band;
# a separate name is used so that y (the target) is not overwritten
fig, ax = plt.subplots(figsize=(12, 6))
log_confirmed = np.log(train_data['Confirmed'])
sns.regplot(x='Cured', y=log_confirmed, ci=95, data=train_data, ax=ax, color='k', scatter_kws={"s": 10, "color": "royalblue", "alpha": 1})
ax.set_ylabel('log of Confirmed', fontsize=15, fontname='DejaVu Sans')
ax.set_xlabel("Cured", fontsize=15, fontname='DejaVu Sans')
ax.set_xlim(left=None, right=None)
ax.set_ylim(bottom=None, top=None)
ax.tick_params(axis='both', which='major', labelsize=12)
fig.tight_layout()
We have split the data into a training set (70%) and a test set (30%) with a seed value of 10, so that the split is reproducible for this linear model.
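To illustrate what the fixed seed buys us, here is a minimal sketch of a reproducible 70/30 hold-out split. It uses pandas' DataFrame.sample rather than scikit-learn's train_test_split, so the rows it picks will differ from those in the post; the helper name holdout_split and the demo frame are hypothetical.

```python
import pandas as pd

def holdout_split(frame, test_fraction=0.3, seed=10):
    """Split a frame into (train, test) using a fixed random seed."""
    test = frame.sample(frac=test_fraction, random_state=seed)
    train = frame.drop(test.index)
    return train, test

# Placeholder data, only to show the mechanics of the split.
demo = pd.DataFrame({"Cured": range(10), "Confirmed": range(10, 20)})

train_a, test_a = holdout_split(demo)
train_b, test_b = holdout_split(demo)

print(len(train_a), len(test_a))
# Same seed, same rows: the two test sets are identical.
print(test_a.index.equals(test_b.index))
```

The seed value itself carries no special meaning; any fixed integer gives the same repeatability.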
Now we plot the linear model for this data
fig, ax = plt.subplots(figsize=(12, 6))
sns.regplot(x='Confirmed', y='Deaths', ci=None, data=train_data, ax=ax, color='k', scatter_kws={"s": 20,"color":"royalblue", "alpha":1})
The output is as follows
From the graph we can see that the blue coloured points lie very close to the regression line, which indicates that the number of cured cases follows an approximately linear relation with the number of confirmed cases.
One conclusion which can be drawn is that Kerala has a high number of cured cases and that the number of newly confirmed cases is coming down.
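How close the points lie to the regression line can also be expressed as a number, the coefficient of determination R^2. The sketch below fits a least-squares line with NumPy and computes R^2; the helper name linear_fit_r2 is hypothetical and the input values are made up so that they fall exactly on a line, not taken from the Kaggle data.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Fit y = slope * x + intercept by least squares and return (slope, intercept, R^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot

# Made-up illustrative values: cured = 0.5 * confirmed - 1 exactly.
confirmed = [10, 20, 30, 40, 50]
cured = [4, 9, 14, 19, 24]

slope, intercept, r2 = linear_fit_r2(confirmed, cured)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

An R^2 close to 1 on the real train_data would support the visual impression that the relation is nearly linear.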
If we need a logarithmic linear regression graph for the state of Kerala, the code is as follows
fig, ax = plt.subplots(figsize=(12, 6))
y = np.log(train_data['Confirmed'])
sns.regplot(x='Deaths', y=y, ci=95, data=train_data, ax=ax, color='k', scatter_kws={"s": 10,"color": "royalblue", "alpha":1})
ax.set_ylabel('log of Confirmed', fontsize=15,fontname='DejaVu Sans')
ax.set_xlabel("Deaths",fontsize=15, fontname='DejaVu Sans')
ax.set_xlim(left=None, right=None)
ax.set_ylim(bottom=None, top=None)
ax.tick_params(axis='both', which='major', labelsize=12)
fig.tight_layout()
Again, the same conclusion can be drawn from this linear regression model for the state of Kerala.
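For readers who want the numbers behind a log-scale plot like this, here is a minimal sketch of a log-linear fit: a straight line is fitted to log(y), and the slope is read as a multiplicative growth factor. The helper name log_linear_fit is hypothetical and the values are constructed for illustration, not drawn from the dataset.

```python
import numpy as np

def log_linear_fit(x, y):
    """Fit log(y) = slope * x + intercept by least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("log is undefined for non-positive values")
    slope, intercept = np.polyfit(x, np.log(y), 1)
    return slope, intercept

# Made-up values constructed so that confirmed = 5 * exp(0.7 * deaths).
deaths = np.array([0, 1, 2, 3, 4])
confirmed = 5.0 * np.exp(0.7 * deaths)

slope, intercept = log_linear_fit(deaths, confirmed)
growth_factor = np.exp(slope)  # factor by which y is multiplied per unit of x
print(round(slope, 3), round(float(np.exp(intercept)), 3), round(float(growth_factor), 3))
```

Because the fit is done on the log scale, rows with a zero count would have to be dropped or handled separately before fitting.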
Let us now look at the heatmap of the correlation between the variables considered for this data.
For this correlation matrix, I have used the Pearson coefficient.
The Python code for this is
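from statsmodels.graphics.correlation import plot_corr
# Correlate only the numeric columns (Cured, Deaths, Confirmed)
corrMatrix = train_data.select_dtypes('number').corr(method='pearson')
# Take the axis labels from the matrix itself so that they match its shape
xnames = list(corrMatrix.columns)
ynames = list(corrMatrix.columns)
plot_corr(corrMatrix, xnames=xnames, ynames=ynames, title=None, normcolor=False, cmap='RdYlBu_r')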
The heatmap for the state of Kerala is as follows
The red area shows a remarkably high degree of correlation: Confirmed vs Cured is 0.92 to 0.94, which indicates a high positive correlation for the state of Kerala as of 08-05-2020. A conclusion which can be drawn is that the recovery rate is high for this state.
We can verify this by obtaining the Pearson correlation coefficients with the code
train_data.select_dtypes('number').corr(method='pearson')
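To make clear what the Pearson coefficient measures, the sketch below computes it from its definition (covariance divided by the product of the standard deviations) and compares the result with pandas. The helper name pearson_r and the demo values are hypothetical and not taken from the Kerala data.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Placeholder values, for illustration only.
demo = pd.DataFrame({
    "Confirmed": [1, 3, 6, 10, 15],
    "Cured": [0, 1, 3, 4, 9],
})

manual = pearson_r(demo["Confirmed"], demo["Cured"])
from_pandas = demo["Confirmed"].corr(demo["Cured"], method="pearson")
print(round(manual, 4), round(from_pandas, 4))
```

The two values agree, which is a quick sanity check that the heatmap is showing the quantity described in the text.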
The state of Kerala has shown remarkable progress: the number of deaths is low, the number of cured cases is very high, and the number of confirmed cases is low.















