Monday, September 23, 2024

Hierarchical clustering

Hierarchical clustering is a method of cluster analysis in machine learning that builds a hierarchy of clusters. It can be classified into two main types:

1. **Agglomerative Hierarchical Clustering** (bottom-up approach): 
   - Starts with each data point as its own cluster.
   - The two closest clusters are merged step by step, based on a distance (dissimilarity) measure such as Euclidean distance, until all points belong to a single cluster or a stopping criterion is reached.
   - A **dendrogram** is used to represent the hierarchy, where each merge corresponds to a node (junction) in the tree.

2. **Divisive Hierarchical Clustering** (top-down approach):
   - Starts with all data points in a single cluster.
   - The cluster is recursively split until each point is in its own cluster, or a predefined stopping condition is met (a short R sketch of both approaches follows this list).
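
A rough sketch of both approaches in R, using the built-in `USArrests` data purely for illustration: agglomerative clustering via base R's `hclust()`, and divisive clustering via `diana()` from the `cluster` package (one of R's recommended packages):

```r
library(cluster)  # provides diana() for divisive clustering

# Agglomerative (bottom-up): merge the closest clusters step by step
d <- dist(USArrests)                   # pairwise Euclidean distances
agglo <- hclust(d, method = "average")

# Divisive (top-down): start from one cluster and split recursively
divi <- diana(USArrests, metric = "euclidean")

# Both results can be displayed as dendrograms
plot(agglo, main = "Agglomerative (hclust)")
plot(divi, which.plots = 2, main = "Divisive (diana)")
```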

### Key Steps:
- **Distance Matrix Calculation**: A matrix of pairwise distances between data points is computed.
- **Linkage Criteria**: Determines how distances between clusters are calculated. Common methods include:
   - **Single Linkage**: Distance between the closest pair of points in two clusters.
   - **Complete Linkage**: Distance between the farthest pair of points in the two clusters.
   - **Average Linkage**: Average distance between all pairs of points, one from each cluster (see the sketch after this list for how the linkage method is chosen in `hclust()`).
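
In base R's `hclust()`, the linkage criterion is selected with the `method` argument. A rough sketch comparing the three criteria on the same distance matrix (again using the built-in `USArrests` data purely for illustration):

```r
d <- dist(USArrests)  # same pairwise Euclidean distances for all three

hc_single   <- hclust(d, method = "single")    # nearest-pair distance
hc_complete <- hclust(d, method = "complete")  # farthest-pair distance
hc_average  <- hclust(d, method = "average")   # mean pairwise distance

# The merge heights (and often the cluster shapes) differ by linkage
par(mfrow = c(1, 3))
plot(hc_single,   main = "Single")
plot(hc_complete, main = "Complete")
plot(hc_average,  main = "Average")
par(mfrow = c(1, 1))
```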

### Pros and Cons:
- **Advantages**:
  - Does not require the number of clusters to be specified in advance.
  - Works well for small datasets and can reveal the underlying structure of the data.
- **Disadvantages**:
  - Computationally expensive for large datasets: the full distance matrix alone requires O(n²) memory, and standard agglomerative algorithms take roughly O(n³) time (about O(n² log n) with optimized implementations).
  - Sensitive to noise and outliers.

Hierarchical clustering is commonly used in bioinformatics, market segmentation, and social network analysis, where nested or hierarchical relationships between data points are of particular interest.
------
Here’s an example of performing **Hierarchical Clustering** using **base R**, including how to visualize the results with a dendrogram:

### 1. **Step-by-Step Example**: 

We'll use the `iris` dataset, which is included in R.

#### Step 1: Load the dataset
```r
# Load the iris dataset
data(iris)
# Remove the species column for clustering
iris_data <- iris[, -5]
```

#### Step 2: Compute the distance matrix
We compute the Euclidean distance between data points.
```r
# Compute Euclidean distance
dist_matrix <- dist(iris_data, method = "euclidean")
```
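
The four iris measurements are all in centimetres, so scaling is not strictly required here; in general, though, it is common to standardize variables before computing distances when features are on different scales. A minimal sketch using base R's `scale()`:

```r
# Standardize each column to mean 0 and standard deviation 1
iris_scaled <- scale(iris_data)
dist_matrix_scaled <- dist(iris_scaled, method = "euclidean")
```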

#### Step 3: Perform Hierarchical Clustering
Use the `hclust()` function to apply agglomerative hierarchical clustering.
```r
# Perform hierarchical clustering using complete linkage
hclust_result <- hclust(dist_matrix, method = "complete")
```

#### Step 4: Plot the Dendrogram
A dendrogram is a tree-like diagram showing the hierarchy of clusters.
```r
# Plot the dendrogram
plot(hclust_result, main = "Dendrogram of Iris Data", xlab = "", sub = "")
```
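
To make a chosen grouping easier to see, base R's `rect.hclust()` can draw boxes around the clusters directly on the dendrogram (here around 3 clusters, matching the cut used in the next step):

```r
# Re-draw the dendrogram and highlight 3 clusters with red boxes
plot(hclust_result, main = "Dendrogram of Iris Data", xlab = "", sub = "")
rect.hclust(hclust_result, k = 3, border = "red")
```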

#### Step 5: Cut the Tree to Form Clusters
To form a specific number of clusters (e.g., 3), you can cut the dendrogram:
```r
# Cut tree into 3 clusters
clusters <- cutree(hclust_result, k = 3)
# Add the cluster assignments to the dataset
iris$Cluster <- clusters
head(iris)
```
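
Because the iris data also carries species labels (which were not used in the clustering), a quick cross-tabulation gives a sense of how well the three clusters recover the species:

```r
# Compare cluster assignments with the known species labels
table(Cluster = clusters, Species = iris$Species)
```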

### 2. **Explanation of Output**:
- The dendrogram plot shows the hierarchical structure of the clusters. You can decide the number of clusters by cutting the tree at a desired height.
- The `cutree()` function assigns cluster labels to each observation based on the number of clusters you specify.

### 3. **Interpretation**:
- By visualizing the dendrogram, you can explore how data points are grouped together.
- By cutting the tree at a specific height, or into a chosen number of clusters with `k`, you can decide how many groups you want and assign each data point to a cluster accordingly (a sketch of cutting by height follows this list).
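
Instead of fixing `k`, `cutree()` also accepts a cut height `h`, which corresponds to drawing a horizontal line across the dendrogram; the height used below is illustrative and should be read off the dendrogram's vertical axis:

```r
# Cut the tree at a height of 4 (illustrative value; check the dendrogram axis)
clusters_by_height <- cutree(hclust_result, h = 4)
# How many observations fall into each resulting cluster
table(clusters_by_height)
```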

This method uses **complete linkage**, but you can change it to other types of linkage, such as `single`, `average`, or `ward.D2`.
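
Swapping the linkage only means changing the `method` argument of `hclust()`; for example, Ward's method (`ward.D2`), which tends to produce compact, similarly sized clusters:

```r
# Same distance matrix, different linkage criterion
hclust_ward <- hclust(dist_matrix, method = "ward.D2")
plot(hclust_ward, main = "Dendrogram of Iris Data (Ward linkage)", xlab = "", sub = "")
```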
