CLUSTERING STUDY OF COVID CASES FOR INDIA AND FOR MAHARASHTRA, GUJARAT, KERALA, KARNATAKA, UPDATED 09-05-2020
Clustering algorithms are very popular in the data science industry for grouping similar data points and detecting outliers.
Clustering analysis performed on data would uncover natural patterns by grouping similar data points.
We will analyse the COVID19INDIA data using clustering techniques, one of the more popular being clustering with k-means.
k-means is one of the most popular clustering algorithms (if not the most popular) among data scientists due to its simplicity and high performance.
Its origins date back as early as 1956, when the mathematician Hugo Steinhaus laid its foundations, but it was about a decade later that another researcher, James MacQueen, named this approach k-means.
The objective of k-means is to group similar data points (or observations) together into clusters, which the algorithm forms automatically from the data.
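To make the idea concrete, here is a minimal sketch of the assign-and-update loop behind k-means, written in plain Python. It is an illustration under simplifying assumptions (a fixed number of iterations, the first k points as starting centroids, made-up toy data), not the scikit-learn implementation used later in this post. The function name simple_kmeans and the sample points are hypothetical.

```python
import math


def simple_kmeans(points, k, n_iter=10):
    """Toy k-means: returns (centroids, labels) for a list of (x, y) tuples."""
    # Assumption for illustration: use the first k points as starting centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels


# Made-up (Confirmed, Cured) pairs: three small values and three large ones.
toy_points = [(1, 0), (2, 1), (3, 1), (100, 40), (110, 45), (120, 50)]
centroids, labels = simple_kmeans(toy_points, k=2)
print(labels)
```

On this toy data the three small points end up in one cluster and the three large points in the other, which is the "similar points together" behaviour described above.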
I have applied the k-means algorithm to find clusters for India as a whole and then for a few states: Maharashtra and Gujarat, which have still not recovered from the pandemic, and Kerala and Karnataka, which are on the way to becoming COVID free.
We first load the entire India DataFrame, updated to 09-05-2020.
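The loading code is not shown in this post. A minimal sketch of that step, assuming the data comes as a CSV file with the column names used in the rename step, could look like this. The two rows are made up so the snippet runs on its own, and the file name in the comment is hypothetical.

```python
import io

import pandas as pd

# Two made-up rows with the column names the rename step expects.
sample_csv = io.StringIO(
    "Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,"
    "ConfirmedForeignNational,Cured,Deaths,Confirmed\n"
    "1,30/01/20,6:00 PM,Kerala,1,0,0,0,1\n"
    "2,31/01/20,6:00 PM,Kerala,1,0,0,0,1\n"
)

# With the real data, pass the path of the downloaded CSV file instead,
# e.g. pd.read_csv("covid_19_india.csv") (hypothetical file name).
df = pd.read_csv(sample_csv)
print(df.shape)
```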
To rename the columns for our convenience, we use the following code:
df1 = df.rename(columns={'Sno': 'Sno',
                         'Date': 'Date',
                         'Time': 'Time',
                         'State/UnionTerritory': 'STUT',
                         'ConfirmedIndianNational': 'CInNat',
                         'ConfirmedForeignNational': 'CFnNat',
                         'Cured': 'Cured',
                         'Deaths': 'Deaths',
                         'Confirmed': 'Confirmed'})
df1
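subset1 = df1[['STUT', 'Date', 'Confirmed', 'Cured', 'Deaths']]
print(subset1)
We rename the subset as dfIndia, taking a copy so that a column can be added to it later without a pandas warning:
dfIndia = subset1.copy()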
We now fit kmeans on these two variables and use a value of 8 for the random_state hyperparameter. Since n_clusters is left at its scikit-learn default of 8, the model looks for eight clusters:
from sklearn.cluster import KMeans
X = dfIndia[['Confirmed', 'Cured']]
kmeans = KMeans(random_state=8)
kmeans.fit(X)
We then use the predict method from the sklearn package to predict the cluster assignment from the input variables, save the results into a new variable called y_preds, and display the last 10 predictions.
The code for the same is as follows:
y_preds = kmeans.predict(X)
y_preds[-10:]
We then save the predicted clusters back to the DataFrame by creating a new column called 'cluster'
and print the last 10 rows of the DataFrame using the .tail() method from the pandas package.
The code for the same is as follows:
dfIndia['cluster'] = y_preds
dfIndia.tail(10)
We then generate a pivot table with the averages of the two columns, 'Confirmed' and 'Cured', for each cluster value using the pivot_table method from
the pandas package.
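The pivot table code is not shown in this post, so the following is only a sketch of how the step could be written. The small DataFrame is made-up stand-in data (not the real COVID19INDIA figures) so that the snippet runs on its own; with the real data you would call pivot_table on dfIndia instead.

```python
import pandas as pd

# Made-up stand-in for dfIndia with an already assigned 'cluster' column.
toy = pd.DataFrame({
    'Confirmed': [10, 20, 1000, 1200],
    'Cured': [2, 4, 300, 500],
    'cluster': [0, 0, 1, 1],
})

# Average Confirmed and Cured for each cluster value.
cluster_means = toy.pivot_table(
    values=['Confirmed', 'Cured'],
    index='cluster',
    aggfunc='mean',
)
print(cluster_means)
```

Each row of the result is one cluster, and each column holds the mean of that variable within the cluster, which makes it easy to see which clusters hold the low and the high values.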
We draw the scatter plot using altair with the following code:
import altair as alt
scatter_plot = alt.Chart(dfIndia).mark_circle()
scatter_plot.encode(x='Confirmed', y='Cured', color='cluster:N', tooltip=['cluster', 'STUT', 'Date', 'Confirmed', 'Cured', 'Deaths']).interactive()
We can see that k-means has grouped the observations into eight different clusters based on the value of the two variables.
For instance, all the low-value data points have been assigned to cluster 7, while the ones with extremely
high values belong to cluster 2.
So, k-means has grouped the data points that share similar behaviors.
For the interactive graph, you can visit this link: CLUSTER MAP INDIA INTERACTIVE
Next, we find the optimum number of clusters and refit k-means with that value, which gives three clusters.
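The post does not show which method was used to pick the optimum number of clusters. One common choice is the elbow method, which is assumed here: fit k-means for several values of k and look at where the inertia (the within-cluster sum of squared distances) stops dropping sharply. The data below is synthetic stand-in data, not the COVID19INDIA figures.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs of
# (Confirmed, Cured)-like pairs, 30 points each.
rng = np.random.default_rng(8)
centers = np.array([[0, 0], [50, 20], [100, 60]])
X_toy = np.vstack([c + rng.normal(scale=2.0, size=(30, 2)) for c in centers])

# Fit k-means for a range of k and record the inertia for each.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, random_state=8, n_init=10)
    model.fit(X_toy)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting inertia against k and picking the point where the curve bends (the "elbow") gives the suggested number of clusters. On the real data, X_toy would be replaced by dfIndia[['Confirmed', 'Cured']].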
This is very different compared to our initial results.
Looking at the three clusters, we can see that:
• The first cluster (blue) represents low values for both Confirmed and Cured.
• The second cluster (orange) is for medium Confirmed and Cured.
• The third cluster (red) is grouping all dates with Confirmed values above
The interactive graph for the above can be seen here: 3 CLUSTER INTERACTIVE MAP COVID INDIA
For Maharashtra, the interactive graph for the same eight-cluster analysis is here: INTERACTIVE 8 CLUSTER GRAPH MAHARASHTRA
The graph with the optimum 3 clusters is as shown here.
The interactive graph for the above is here: 3 CLUSTER INTERACTIVE MAP MAHARASHTRA
We could also choose 4 clusters, but the overall clustering pattern for the given sample would remain the same.