Key Points
- Hierarchical clustering is a type of unsupervised learning that groups observations based on their similarity or dissimilarity without specifying the number of clusters beforehand.
- To perform hierarchical clustering in RStudio, you must install and load two packages: factoextra and cluster. Then, you need to scale your data using the scale() function and perform hierarchical clustering using the agnes() function from the cluster package.
- To visualize and interpret your clustering results, you can use a dendrogram, a tree-like diagram showing how the clusters are nested within each other. You can plot a dendrogram using the fviz_dend() function from the factoextra package.
Hi, I'm Zubair Goraya, a data analyst and a writer for Data Analysis, a website that provides tutorials on how to use RStudio for various data analysis tasks. In this article, I will show you how to perform hierarchical clustering in RStudio, a powerful technique for finding groups of similar observations in your data.
Performing Hierarchical Clustering in RStudio: Your Complete Guide
Hierarchical clustering is a type of unsupervised learning, meaning you don't need to have predefined labels or categories for your data. Instead, you let the algorithm discover the structure and patterns in your data by grouping observations based on their similarity or dissimilarity.
One of the advantages of hierarchical clustering is that you don't need to specify the number of clusters beforehand, unlike other methods, such as k-means clustering. Instead, you can use a graphical representation called a dendrogram to visualize the hierarchy of clusters and decide how many clusters you want to use based on your analysis goals.
What is Hierarchical Clustering, and How Does It Work?
The basic idea of agglomerative hierarchical clustering, the approach used in this article, is to start with each observation as its own cluster and then repeatedly merge the two most similar clusters until all observations are in one big cluster. The result is a tree-like structure that shows how the clusters are nested within each other.
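As a rough illustration of this merge process, the sketch below runs base R's hclust() function (from the stats package, not the agnes() function used later in this article) on five made-up one-dimensional points. The point values and the choice of single linkage are assumptions made only for this example.

```r
# Five made-up points on a line (hypothetical values for illustration)
x <- c(a = 1, b = 2, c = 6, d = 7.5, e = 15)

# Pairwise Euclidean distances, then agglomerative clustering
d <- dist(x)
tree <- hclust(d, method = "single")

# Each row of $merge is one fusion step: negative numbers are single
# observations, positive numbers refer to clusters formed in earlier rows
tree$merge

# Distance at which each fusion happened, from the first merge to the last
tree$height
```

With n observations the tree always records n - 1 merges. Here the two closest points, a and b, are fused first, and the isolated point e joins last.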
There are two main steps in hierarchical clustering:
- Calculate the pairwise dissimilarity between every pair of observations in the dataset. Choose a distance metric that suits your data type and analysis objectives. For example, you can use Euclidean distance for continuous numerical data or Jaccard distance for binary data.
- Fuse observations into clusters. You need to choose a method for determining how close two clusters are and which ones to merge at each step. There are several methods available, such as:
- Complete linkage: Use the maximum distance between two observations from different clusters as the cluster distance.
- Single linkage: Use the minimum distance between two observations from different clusters as the cluster distance.
- Average linkage: Use the average distance between all pairs of observations from different clusters as the cluster distance.
- Centroid linkage: Use the distance between the centroids (mean vectors) of two clusters as the cluster distance.
- Ward's method: Use the increase in the total within-cluster variance after merging two clusters as the cluster distance.
Some methods may produce better results than others depending on your data and analysis goals. For example, complete linkage tends to produce compact clusters of similar diameter, while single linkage tends to produce long, chain-like clusters.
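To make the linkage definitions above concrete, here is a small sketch that computes four of them by hand for two tiny clusters. The coordinates and the names cluster_a and cluster_b are made up for illustration and are not part of the USArrests analysis.

```r
# Two small made-up clusters in two dimensions (hypothetical values)
cluster_a <- rbind(c(0, 0), c(1, 0))
cluster_b <- rbind(c(4, 0), c(4, 3))

# All pairwise Euclidean distances between members of A and members of B
pairs <- as.matrix(dist(rbind(cluster_a, cluster_b)))[1:2, 3:4]

single_link   <- min(pairs)    # closest cross-cluster pair
complete_link <- max(pairs)    # farthest cross-cluster pair
average_link  <- mean(pairs)   # mean over all cross-cluster pairs

# Centroid linkage: distance between the two mean vectors
centroid_link <- sqrt(sum((colMeans(cluster_a) - colMeans(cluster_b))^2))

c(single = single_link, complete = complete_link,
  average = average_link, centroid = centroid_link)
```

The same two clusters are 3 units apart under single linkage but 5 units apart under complete linkage, which is why the choice of method can change which clusters get merged next.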
How to Perform Hierarchical Clustering in RStudio
To perform hierarchical clustering in RStudio, you must install and load two packages: factoextra and cluster. The factoextra package provides several functions for visualizing and evaluating clustering results, while the cluster package provides the agnes() function for performing agglomerative hierarchical clustering.
You can install and load these packages using the following code:
# Install packages
install.packages("factoextra")
install.packages("cluster")
# Load packages
library(factoextra)
library(cluster)
Next, you need to load and prepare your data. For this tutorial, I will use a sample dataset called USArrests, which contains violent crime statistics for each US state in 1973. The dataset has four variables: Murder, Assault, UrbanPop (percentage of the population living in urban areas), and Rape. You can load this dataset using the following code:
# Load data
data("USArrests")
# View first six rows
head(USArrests)
Standardize Each Variable
Before performing hierarchical clustering, you need to scale your data so that each variable has the same mean and variance. This is important because the distance metric used for clustering is sensitive to the scale of the variables.
If you don't scale your data, variables with larger values will have more influence on the clustering results than variables with smaller values.
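A quick way to check how uneven the scales are is to compare the variances of the raw variables. The snippet below is only a sketch of that check; the name raw_var is introduced here for illustration.

```r
# USArrests ships with R in the datasets package
data("USArrests")

# Variance of each raw variable: Assault is on a much larger scale
raw_var <- sapply(USArrests, var)
round(raw_var, 1)

# Share of the total variance contributed by each variable
round(raw_var / sum(raw_var), 3)
```

Because Assault accounts for most of the total variance, Euclidean distances on the unscaled data would be driven almost entirely by that one variable.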
You can scale your data using the scale() function, which standardizes each variable by subtracting its mean and dividing by its standard deviation. This way, each variable will have a mean of zero and a standard deviation of one.
You can scale your data using the following code:
# Scale data
USArrests.scaled <- scale(USArrests)
# View first six rows
head(USArrests.scaled)
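If you want to confirm that the standardization behaved as described, a check along these lines should work. The object name scaled is used here only to keep the example self-contained.

```r
data("USArrests")
scaled <- scale(USArrests)

# Column means should be (numerically) zero and standard deviations one
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# scale() keeps the centering and scaling values as attributes,
# which is handy if you later need to map results back to original units
attr(scaled, "scaled:center")
attr(scaled, "scaled:scale")
```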
Perform Hierarchical Clustering
Now you can perform hierarchical clustering using the agnes() function from the cluster package.
The agnes() function has the following syntax:
agnes(data, method)
where:
- data is the data frame or matrix to cluster (agnes() also accepts a dissimilarity object such as the output of dist()).
- method is the linkage method used to measure the distance between clusters.
You can choose one of the following methods: "average" for average linkage, "single" for single linkage, "complete" for complete linkage, "ward" for Ward's method, or "gaverage" for generalized average linkage.
For this tutorial, I will use the "ward" method, which tends to produce compact and spherical clusters.
You can perform hierarchical clustering using the following code:
# Perform hierarchical clustering
hc <- agnes(USArrests.scaled, method = "ward")
# View results
hc
The output summarizes the clustering process: the call, the agglomerative coefficient (a measure of the strength of the clustering structure, where values closer to 1 suggest a stronger structure), the order of the objects, and the merge heights.
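One way to use the agglomerative coefficient is to compare linkage methods on the same data. The sketch below assumes the cluster package is installed and reads the coefficient from the ac component of each agnes() result; the names methods and ac are introduced here for illustration.

```r
library(cluster)

data("USArrests")
scaled <- scale(USArrests)

# Linkage methods to compare (all are valid values of the method argument)
methods <- c(average = "average", single = "single",
             complete = "complete", ward = "ward")

# Agglomerative coefficient for each method, stored in the $ac component
ac <- sapply(methods, function(m) agnes(scaled, method = m)$ac)
round(ac, 3)
```

Keep in mind that the coefficient tends to grow with the number of observations, so it is better suited to comparing methods on one dataset than to comparing datasets of very different sizes.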
How to Plot and Interpret a Dendrogram
To visualize the results of hierarchical clustering, you can use a dendrogram, a tree-like diagram showing how the clusters are nested within each other.
You can plot a dendrogram using the fviz_dend() function from the factoextra package.
The fviz_dend() function has the following syntax:
fviz_dend(object, ...)
where:
- object is the output of the agnes() function.
- ... are other arguments that you can use to customize the appearance of the dendrogram, such as the color, the labels, the size, etc.
You can plot a dendrogram using the following code:
# Plot dendrogram
fviz_dend(hc, cex = 0.6, k = 4, rect = TRUE, show_labels = TRUE)
The dendrogram shows the hierarchical structure of the clusters. Each leaf represents an observation, and each node represents a cluster. The height of each node indicates the distance between the merged clusters. The lower the height, the more similar the clusters are.
You can also use the k argument to specify the number of clusters you want to use and the rect argument to draw rectangles around each cluster. The cex argument controls the size of the labels, and the show_labels argument controls whether to show or hide the labels.
In this example, I used k = 4 to obtain four clusters and rect = TRUE to highlight them. You can see that each cluster has a different color and a different number of observations.
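If factoextra is not available, a similar picture can be drawn with base R graphics. This is only an alternative sketch, not the fviz_dend() approach used above: it converts the agnes() result with as.hclust() and then uses plot() and rect.hclust() from the stats package. The name hc_tree and the plot title are chosen here for illustration.

```r
library(cluster)

data("USArrests")
hc <- agnes(scale(USArrests), method = "ward")

# Convert the agnes result to an hclust object so base plotting tools apply
hc_tree <- as.hclust(hc)

# Draw the dendrogram and outline four clusters with rectangles
plot(hc_tree, cex = 0.6, hang = -1, main = "Dendrogram of USArrests (Ward)")
rect.hclust(hc_tree, k = 4, border = 2:5)
```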
How to Cut the Dendrogram at Different Levels
One of the benefits of hierarchical clustering is that you can choose the number of clusters based on your analysis objectives and your interpretation of the dendrogram. You can cut the dendrogram at different levels to obtain different numbers of clusters.
You can use the cutree() function from the stats package, which is loaded by default in R, to cut the dendrogram at a certain level.
The cutree() function has the following syntax:
cutree(tree, k)
where:
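- tree is a tree object such as the one produced by hclust(); an agnes() result can be converted with as.hclust() so that the observation labels are kept.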
- k is the number of clusters you want to obtain.
You can cut the dendrogram at different levels using the following code:
# Cut dendrogram at k = 2 (as.hclust() keeps the state names as labels)
cutree(as.hclust(hc), k = 2)
The output is a vector of cluster memberships with one integer per observation. For example, Alabama is assigned to cluster 1, because cutree() numbers the clusters in the order in which they first appear in the data.
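To compare several cuts at once, cutree() also accepts a vector of values for k. The sketch below assumes the same Ward clustering as above and converts it with as.hclust() so the state names are carried along; the name groups is introduced here for illustration.

```r
library(cluster)

data("USArrests")
hc <- agnes(scale(USArrests), method = "ward")
hc_tree <- as.hclust(hc)   # hclust form keeps the state names as labels

# Passing a vector to k returns one column of memberships per value of k
groups <- cutree(hc_tree, k = 2:4)
head(groups)

# Number of states in each cluster when cutting at k = 4
table(groups[, "4"])
```

Looking at how the group sizes change as k increases is a simple way to see which clusters are being split at each level.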
Cutting the dendrogram at k = 2 produces two large clusters that roughly correspond to high-crime and low-crime states.
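One way to check this interpretation is to average the original, unscaled variables within each cluster. The following is a minimal sketch under the same assumptions as above (Ward linkage on the scaled data); cluster_means is a name chosen here for illustration.

```r
library(cluster)

data("USArrests")
hc <- agnes(scale(USArrests), method = "ward")
groups <- cutree(as.hclust(hc), k = 2)

# Mean of each original (unscaled) variable within the two clusters
cluster_means <- aggregate(USArrests, by = list(cluster = groups), FUN = mean)
cluster_means
```

If one cluster shows clearly higher means for Murder, Assault, and Rape, that supports reading the split as a high-crime versus low-crime grouping.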
Conclusion
In this article, you learned how to perform hierarchical clustering in RStudio using the agnes() function from the cluster package, how to visualize the results with the fviz_dend() function from the factoextra package, and how to cut the tree into clusters with cutree().
Hierarchical clustering is a valuable technique for finding groups of similar observations in your data without specifying the number of clusters beforehand. You can use a dendrogram to explore the hierarchy of clusters and choose the optimal number of clusters based on your analysis objectives. You can also use various metrics to measure the quality of your clusters and identify potential outliers or misclassified observations.
I hope you found this article helpful and informative. If you have any questions or feedback, please comment below. Thank you for reading!
FAQs
Q: What is hierarchical clustering?
A: Hierarchical clustering is a type of unsupervised learning that groups observations based on their similarity or dissimilarity without specifying the number of clusters beforehand.
Q: How does hierarchical clustering work?
A: Agglomerative hierarchical clustering works by starting with each observation as its own cluster and then repeatedly merging the two most similar clusters until all observations are in one big cluster. The result is a tree-like structure that shows how the clusters are nested within each other.
Q: How to perform hierarchical clustering in RStudio?
A: To perform hierarchical clustering in RStudio, you must install and load two packages: factoextra and cluster. Then, you need to scale your data using the scale() function and perform hierarchical clustering using the agnes() function from the cluster package. Using the method argument, you can choose the method for measuring the distance between clusters.
Q: How to plot and interpret a dendrogram?
A: To plot and interpret a dendrogram, use the fviz_dend() function from the factoextra package. A dendrogram is a tree-like diagram that shows how the clusters are nested within each other. Each leaf represents an observation, and each node represents a cluster. The height of each node indicates the distance between the merged clusters. The lower the height, the more similar the clusters are. You can also use the k argument to specify the number of clusters you want to use and the rect argument to draw rectangles around each cluster.
Q: How to cut the dendrogram at different levels?
A: To cut the dendrogram at different levels, you can use the cutree() function from the stats package, which is loaded by default in R. This function returns a vector of cluster memberships for each observation based on the number of clusters you specify using the k argument. You can cut the dendrogram at different levels to obtain different numbers of clusters based on your analysis objectives.
Q: How to evaluate the quality of clusters?
A: To evaluate the quality of clusters, you can use various metrics such as the silhouette coefficient, the Dunn index, and the Calinski-Harabasz index. These metrics measure how compact the clusters are and how well they are separated from each other. You can calculate and plot them with functions such as fviz_silhouette() from the factoextra package, dunn() from the clValid package, and calinhara() from the fpc package.
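As an example of one of these metrics, the sketch below computes silhouette widths with the silhouette() function from the cluster package, so it needs no packages beyond the ones already used in this article. The choice of k = 4 and the names scaled, groups, and sil are assumptions made for illustration.

```r
library(cluster)

data("USArrests")
scaled <- scale(USArrests)
hc <- agnes(scaled, method = "ward")
groups <- cutree(as.hclust(hc), k = 4)

# Silhouette widths need the cluster labels and the pairwise distances
sil <- silhouette(groups, dist(scaled))

# Average silhouette width overall and per cluster (values range from -1 to 1)
mean(sil[, "sil_width"])
summary(sil)$clus.avg.widths
```

Higher average widths suggest that observations sit closer to their own cluster than to the nearest neighboring one, and repeating the calculation for several values of k is a common way to compare candidate cuts.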
Q: What are the advantages of hierarchical clustering?
A: Some of the advantages of hierarchical clustering are:
- It does not require you to specify the number of clusters beforehand
- It produces a graphical representation of the clustering process
- It allows you to explore different levels of granularity in your data
- It can handle outliers and noise in your data
Q: What are the disadvantages of hierarchical clustering?
A: Some of the disadvantages of hierarchical clustering are:
- It can be computationally expensive for large datasets
- It can be sensitive to the choice of distance metric and linkage method
- It can produce different results depending on the order of observations in your data, mainly when several pairs of clusters are tied at the same distance
- It does not allow you to reassign observations to different clusters once they are merged
Q: What are some applications of hierarchical clustering?
A: Some of the applications of hierarchical clustering are:
- Market segmentation
- Document classification
- Image analysis
- Bioinformatics
Q: What are some alternatives to hierarchical clustering?
A: Some of the alternatives to hierarchical clustering are:
- K-means clustering
- DBSCAN clustering
- Spectral clustering
- Gaussian mixture models