CLUSTERING STUDY OF COVID CASES FOR INDIA AND FOR MAHARASHTRA, GUJARAT, KERALA, KARNATAKA UPDATED 09-05-2020

Clustering algorithms are very popular in the data science industry for grouping similar data points and detecting outliers.

Clustering analysis performed on data would uncover natural patterns by grouping similar data points.

We will analyse the COVID19INDIA data using clustering techniques, one of the most popular of which is clustering with k-means.

k-means is one of the most popular clustering algorithms (if not the most popular) among data scientists due to its simplicity and high performance.

Its origins date back as early as 1956, when a famous mathematician named Hugo Steinhaus laid its foundations, but it was a decade later that another researcher, James MacQueen, named this approach k-means.

The objective of k-means is to group similar data points (or observations) together into clusters. The grouping is derived automatically from the data.
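To make the grouping idea concrete, here is a minimal sketch of the two alternating steps behind k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function name kmeans_sketch and the toy numbers are invented for illustration. This is not scikit-learn's implementation, which adds smarter initialisation and convergence checks.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Toy k-means (Lloyd's algorithm); illustrative only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Made-up (Confirmed, Cured) pairs: three small values and three large ones
toy_points = np.array([[10, 2], [12, 3], [11, 1],
                       [500, 200], [520, 210], [510, 190]], dtype=float)
labels, centroids = kmeans_sketch(toy_points, k=2)
print(labels)
```

The three small points end up sharing one label and the three large points the other. Which group gets label 0 depends on the random start.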

I have applied the k-means algorithm to find clusters for India as a whole and then for a few states: Maharashtra and Gujarat, which have still not recovered from the pandemic, and Kerala and Karnataka, which are on the way to becoming COVID free.

The entire India dataframe, updated to 09-05-2020, is loaded here.
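The loading code itself is not reproduced in this post. A minimal sketch, assuming the data sits in a CSV file with the column names used in the rename step below, might look like the following. The two sample rows are invented stand-ins so that the snippet runs on its own. With the real data, the path to the CSV file would be passed to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real file (values are invented;
# only the column names follow the rename step in this post)
sample_csv = io.StringIO(
    "Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,"
    "ConfirmedForeignNational,Cured,Deaths,Confirmed\n"
    "1,01/01/20,6:00 PM,StateA,1,0,0,0,1\n"
    "2,02/01/20,6:00 PM,StateB,2,0,1,0,2\n"
)

# read_csv accepts a file path or, as here, any file-like object
df = pd.read_csv(sample_csv)
print(df.columns.tolist())
```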

To rename the columns for our convenience, we use the following code:

# Shorten the long column names; columns not listed here keep their original names
df1 = df.rename(columns={'State/UnionTerritory': 'STUT',
                         'ConfirmedIndianNational': 'CInNat',
                         'ConfirmedForeignNational': 'CFnNat'})
df1
# Keep only the columns needed for the clustering study
subset1 = df1[['STUT', 'Date', 'Confirmed', 'Cured', 'Deaths']]
print(subset1)
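We rename the subset as dfIndia:
# .copy() makes dfIndia an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
dfIndia = subset1.copy()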
We now fit k-means on these variables and use a value of 8 for the random_state hyperparameter. Since n_clusters is not specified, scikit-learn uses its default of 8 clusters:
from sklearn.cluster import KMeans
X = dfIndia[['Confirmed', 'Cured']]
kmeans = KMeans(random_state=8)
kmeans.fit(X)
We then use the predict method from the sklearn package to predict the cluster assignment for the input variable, save the results into a new variable called y_preds, and display the last 10 predictions. The code for the same is as follows:
y_preds = kmeans.predict(X)
y_preds[-10:]
We then save the predicted clusters back to the DataFrame by creating a new column called 'cluster' and print the last 10 rows of the DataFrame using the .tail() method from the pandas package. The code for the same is as follows:
dfIndia['cluster'] = y_preds
dfIndia.tail(10)
We then generate a pivot table with the averages of the two columns, 'Confirmed' and 'Cured', for each cluster value using the pivot_table method from the pandas package.
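The pivot table step can be sketched as follows. This is a minimal illustration on an invented four-row frame, assuming the 'cluster' column has already been added. The names toy and cluster_means are made up for the example. With the real data, dfIndia would take the place of toy.

```python
import pandas as pd

# Invented rows standing in for dfIndia after the 'cluster' column has been added
toy = pd.DataFrame({
    'Confirmed': [10, 20, 1000, 1200],
    'Cured':     [2, 4, 300, 500],
    'cluster':   [0, 0, 1, 1],
})

# One row per cluster value, holding the mean of each listed column
cluster_means = toy.pivot_table(values=['Confirmed', 'Cured'],
                                index='cluster', aggfunc='mean')
print(cluster_means)
```

Reading the table row by row gives a quick profile of each cluster, which helps when interpreting the colours in the scatter plot.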
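We draw the scatter plot using the altair package with the following code:

import altair as alt
scatter_plot = alt.Chart(dfIndia).mark_circle()
# Colour each point by its cluster; the tooltip shows the row details on hover
scatter_plot.encode(x='Confirmed', y='Cured', color='cluster:N', tooltip=['cluster', 'STUT', 'Date', 'Confirmed', 'Cured', 'Deaths']).interactive()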
We can see that k-means has grouped the observations into eight different clusters based on the values of the two variables.
For instance, all the low-value data points have been assigned to cluster 7, while the ones with extremely high values belong to cluster 2.
So k-means has grouped together the data points that behave similarly.
For the interactive graph, you can visit this link: CLUSTER MAP INDIA INTERACTIVE

We now refit k-means with the optimum number of clusters, three, using the following code:

kmeans = KMeans(random_state=42, n_clusters=3)
kmeans.fit(X)
dfIndia['cluster2'] = kmeans.predict(X)
# Colour by the new 3-cluster assignment and show it in the tooltip as well
scatter_plot.encode(x='Confirmed', y='Cured', color='cluster2:N', tooltip=['cluster2', 'Date', 'STUT', 'Confirmed', 'Cured', 'Deaths']).interactive()




This is very different from our initial results.
Looking at the three clusters, we can see that:
• The first cluster (blue) represents low values for both Confirmed and Cured.
• The second cluster (orange) represents medium values for Confirmed and Cured.
• The third cluster (red) groups all dates with Confirmed values above

The interactive graph for the above can be seen here: 3 CLUSTER INTERACTIVE MAP COVID INDIA

I have obtained the same interactive cluster graphs for Maharashtra state.

The interactive graph for the same is here: INTERACTIVE 8 CLUSTER GRAPH MAHARASHTRA

The optimum 3-cluster graph is shown here.

The interactive graph for the above is here: 3 CLUSTER INTERACTIVE MAP MAHARASHTRA
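The state-level graphs come from the same steps applied to the rows of a single state. A minimal sketch, assuming the state name is stored in the STUT column as 'Maharashtra', could look like this. The frame toy_india and its values are invented, and df_state, X_state and kmeans_state are names chosen for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented rows standing in for dfIndia; the real frame has one row per state and date
toy_india = pd.DataFrame({
    'STUT': ['Maharashtra'] * 6 + ['Kerala'] * 2,
    'Confirmed': [10, 12, 500, 520, 5000, 5200, 3, 4],
    'Cured': [1, 2, 100, 110, 900, 950, 1, 2],
})

# Keep only the rows of one state; .copy() so a new column can be added safely
df_state = toy_india[toy_india['STUT'] == 'Maharashtra'].copy()

# Same clustering step as for the whole country, here with 3 clusters
X_state = df_state[['Confirmed', 'Cured']]
kmeans_state = KMeans(n_clusters=3, random_state=42, n_init=10)
df_state['cluster'] = kmeans_state.fit_predict(X_state)
print(df_state)
```

Changing the state name in the filter gives the corresponding frame for Gujarat, Kerala or Karnataka, which can then be passed to the same altair chart code.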

We could also choose 4 clusters, but the overall grouping of the given sample would remain much the same.
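This post does not show how the optimum number of clusters was found. One common heuristic is the elbow method: fit k-means for several values of n_clusters and look for the point where the inertia stops dropping sharply. The sketch below assumes that approach and runs on invented data with three well-separated groups. The names X_toy, group_centres and inertias are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented (Confirmed, Cured) points forming three well-separated groups
rng = np.random.default_rng(0)
group_centres = np.array([[50, 10], [1000, 300], [5000, 1500]], dtype=float)
X_toy = np.vstack([centre + rng.normal(scale=20, size=(30, 2))
                   for centre in group_centres])

# Fit k-means for several cluster counts and record the inertia
# (sum of squared distances from each point to its nearest centroid)
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_toy)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia))
```

On this toy data the inertia falls steeply up to 3 clusters and only slightly afterwards, so 3 would be read as the elbow. On real data the bend is usually less clear, which is why 3 and 4 can both be reasonable choices.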
