1. **Agglomerative Hierarchical Clustering** (bottom-up approach):
- Starts with each data point as its own cluster.
- The two closest clusters are merged step by step, based on a distance or similarity measure (such as Euclidean distance or correlation), until all points belong to a single cluster or a stopping criterion is reached.
- A **dendrogram** represents the hierarchy: each merge appears as a node where two branches join, and the height of the join reflects the distance between the merged clusters.
2. **Divisive Hierarchical Clustering** (top-down approach):
- Starts with all data points in a single cluster.
- The cluster is recursively split until each point is in its own cluster, or a predefined stopping condition is met (a short sketch of this variant follows this list).
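The worked example later in this post covers the agglomerative variant with `hclust()`. Base R has no built-in divisive function, but the recommended `cluster` package provides `diana()`. Here is a minimal sketch, assuming that package is available:

```r
# Divisive (top-down) hierarchical clustering with diana() from the
# recommended 'cluster' package (illustrative sketch)
library(cluster)

data(iris)
diana_result <- diana(iris[, -5], metric = "euclidean")

# Plot the divisive dendrogram
pltree(diana_result, main = "DIANA on Iris Data")

# Cut into 3 clusters, just as with an agglomerative tree
diana_clusters <- cutree(as.hclust(diana_result), k = 3)
table(diana_clusters)
```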
### Key Steps:
- **Distance Matrix Calculation**: A matrix of pairwise distances between data points is computed.
- **Linkage Criteria**: Determine how the distance between two clusters is computed (illustrated in the sketch after this list). Common methods include:
- **Single Linkage**: Distance between the closest pair of points in two clusters.
- **Complete Linkage**: Distance between the farthest pair of points.
- **Average Linkage**: Average distance between points in the clusters.
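To make these three definitions concrete, here is a minimal base-R sketch (using made-up 2-D points, not the iris data) that computes each linkage distance between two small clusters by hand:

```r
# Two toy clusters of 2-D points (illustrative values only)
cluster_a <- matrix(c(1, 1,
                      2, 1), ncol = 2, byrow = TRUE)
cluster_b <- matrix(c(5, 4,
                      6, 5), ncol = 2, byrow = TRUE)

# All pairwise Euclidean distances between points of A (rows) and B (columns)
pairwise <- as.matrix(dist(rbind(cluster_a, cluster_b)))[1:2, 3:4]

min(pairwise)   # single linkage: closest pair of points
max(pairwise)   # complete linkage: farthest pair of points
mean(pairwise)  # average linkage: mean over all pairs
```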
### Pros and Cons:
- **Advantages**:
- Does not require the number of clusters to be specified in advance.
- Works well for small datasets and can reveal the underlying structure in data.
- **Disadvantages**:
- Computationally expensive for large datasets: the distance matrix alone needs O(n²) memory, and standard agglomerative algorithms run in O(n²) to O(n³) time.
- Sensitive to noise and outliers.
Hierarchical clustering is commonly used in bioinformatics, market segmentation, and social network analysis, where relationships between data points are complex and a nested grouping is informative.
------
Here’s an example of performing **Hierarchical Clustering** using **base R**, including how to visualize the results with a dendrogram:
### 1. **Step-by-Step Example**:
We'll use the `iris` dataset, which is included in R.
#### Step 1: Load the dataset
```r
# Load the iris dataset
data(iris)
# Remove the species column for clustering
iris_data <- iris[, -5]
```
#### Step 2: Compute the distance matrix
We compute the Euclidean distance between data points.
```r
# Compute Euclidean distance
dist_matrix <- dist(iris_data, method = "euclidean")
```
#### Step 3: Perform Hierarchical Clustering
Use the `hclust()` function to apply agglomerative hierarchical clustering.
```r
# Perform hierarchical clustering using complete linkage
hclust_result <- hclust(dist_matrix, method = "complete")
```
#### Step 4: Plot the Dendrogram
A dendrogram is a tree-like diagram showing the hierarchy of clusters.
```r
# Plot the dendrogram
plot(hclust_result, main = "Dendrogram of Iris Data", xlab = "", sub = "")
```
#### Step 5: Cut the Tree to Form Clusters
To form a specific number of clusters (e.g., 3), you can cut the dendrogram:
```r
# Cut tree into 3 clusters
clusters <- cutree(hclust_result, k = 3)
# Add the cluster assignments to the dataset
iris$Cluster <- clusters
head(iris)
```
### 2. **Explanation of Output**:
- The dendrogram plot shows the hierarchical structure of the clusters. You can decide the number of clusters by cutting the tree at a desired height.
- The `cutree()` function assigns cluster labels to each observation based on the number of clusters you specify.
### 3. **Interpretation**:
- By visualizing the dendrogram, you can explore how data points are grouped together.
- Cutting the tree at a specific height determines how many clusters you get and assigns each data point to one of them (see the sketch below).
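As a sketch of cutting by height rather than by cluster count (the height of 4 is just an illustrative value for this dendrogram), and of outlining a chosen solution on the plot:

```r
# Cut at a specific height instead of asking for k clusters
clusters_by_height <- cutree(hclust_result, h = 4)
table(clusters_by_height)

# Redraw the dendrogram and outline the k = 3 solution
plot(hclust_result, main = "Dendrogram of Iris Data", xlab = "", sub = "")
rect.hclust(hclust_result, k = 3, border = "red")
```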
This example uses **complete linkage**, but you can pass other linkage methods to `hclust()`, such as `"single"`, `"average"`, or `"ward.D2"`.
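For instance, a quick (illustrative) way to see how much the linkage choice matters is to re-run the clustering with Ward's method and cross-tabulate the two sets of labels:

```r
# Re-cluster with Ward's method (ward.D2 expects Euclidean distances, as used here)
ward_result <- hclust(dist_matrix, method = "ward.D2")
ward_clusters <- cutree(ward_result, k = 3)

# Cross-tabulate complete-linkage vs. Ward cluster assignments
table(Complete = clusters, Ward = ward_clusters)
```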