Monday, September 23, 2024

Hierarchical clustering

Hierarchical clustering is a method of cluster analysis in machine learning that builds a hierarchy of clusters. It can be classified into two main types:

1. **Agglomerative Hierarchical Clustering** (bottom-up approach): 
   - Starts with each data point as its own cluster.
   - The two closest clusters are merged step by step, based on a distance (dissimilarity) measure such as Euclidean distance, until all points belong to a single cluster or a stopping criterion is reached.
   - A **dendrogram** is used to represent the hierarchy, where each merge corresponds to a node (junction) in the tree.

2. **Divisive Hierarchical Clustering** (top-down approach):
   - Starts with all data points in a single cluster.
   - The cluster is recursively split until each point is in its own cluster, or a predefined stopping condition is met (a short R sketch of both approaches follows this list).
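
A rough sketch of both approaches in R, using the built-in `USArrests` data purely for illustration: agglomerative clustering via base R's `hclust()`, and divisive clustering via `diana()` from the `cluster` package (one of R's recommended packages):

```r
library(cluster)  # provides diana() for divisive clustering

# Agglomerative (bottom-up): merge the closest clusters step by step
d <- dist(USArrests)                   # pairwise Euclidean distances
agglo <- hclust(d, method = "average")

# Divisive (top-down): start from one cluster and split recursively
divi <- diana(USArrests, metric = "euclidean")

# Both results can be displayed as dendrograms
plot(agglo, main = "Agglomerative (hclust)")
plot(divi, which.plots = 2, main = "Divisive (diana)")
```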

### Key Steps:
- **Distance Matrix Calculation**: A matrix of pairwise distances between data points is computed.
- **Linkage Criteria**: Determines how distances between clusters are calculated. Common methods include:
   - **Single Linkage**: Distance between the closest pair of points in two clusters.
   - **Complete Linkage**: Distance between the farthest pair of points in the two clusters.
   - **Average Linkage**: Average distance between all pairs of points, one from each cluster (see the sketch after this list for how the linkage method is chosen in `hclust()`).
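
In base R's `hclust()`, the linkage criterion is selected with the `method` argument. A rough sketch comparing the three criteria on the same distance matrix (again using the built-in `USArrests` data purely for illustration):

```r
d <- dist(USArrests)  # same pairwise Euclidean distances for all three

hc_single   <- hclust(d, method = "single")    # nearest-pair distance
hc_complete <- hclust(d, method = "complete")  # farthest-pair distance
hc_average  <- hclust(d, method = "average")   # mean pairwise distance

# The merge heights (and often the cluster shapes) differ by linkage
par(mfrow = c(1, 3))
plot(hc_single,   main = "Single")
plot(hc_complete, main = "Complete")
plot(hc_average,  main = "Average")
par(mfrow = c(1, 1))
```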

### Pros and Cons:
- **Advantages**:
  - Does not require the number of clusters to be specified in advance.
  - Works well for small datasets and can reveal the underlying structure of the data.
- **Disadvantages**:
  - Computationally expensive for large datasets: the full distance matrix alone requires O(n²) memory, and standard agglomerative algorithms take roughly O(n³) time (about O(n² log n) with optimized implementations).
  - Sensitive to noise and outliers.

Hierarchical clustering is commonly used in bioinformatics, market segmentation, and social network analysis, where nested or hierarchical relationships between data points are of particular interest.
------
Here’s an example of performing **Hierarchical Clustering** using **base R**, including how to visualize the results with a dendrogram:

### 1. **Step-by-Step Example**: 

We'll use the `iris` dataset, which is included in R.

#### Step 1: Load the dataset
```r
# Load the iris dataset
data(iris)
# Remove the species column for clustering
iris_data <- iris[, -5]
```

#### Step 2: Compute the distance matrix
We compute the Euclidean distance between data points.
```r
# Compute Euclidean distance
dist_matrix <- dist(iris_data, method = "euclidean")
```
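
The four iris measurements are all in centimetres, so scaling is not strictly required here; in general, though, it is common to standardize variables before computing distances when features are on different scales. A minimal sketch using base R's `scale()`:

```r
# Standardize each column to mean 0 and standard deviation 1
iris_scaled <- scale(iris_data)
dist_matrix_scaled <- dist(iris_scaled, method = "euclidean")
```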

#### Step 3: Perform Hierarchical Clustering
Use the `hclust()` function to apply agglomerative hierarchical clustering.
```r
# Perform hierarchical clustering using complete linkage
hclust_result <- hclust(dist_matrix, method = "complete")
```

#### Step 4: Plot the Dendrogram
A dendrogram is a tree-like diagram showing the hierarchy of clusters.
```r
# Plot the dendrogram
plot(hclust_result, main = "Dendrogram of Iris Data", xlab = "", sub = "")
```
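
To make a chosen grouping easier to see, base R's `rect.hclust()` can draw boxes around the clusters directly on the dendrogram (here around 3 clusters, matching the cut used in the next step):

```r
# Re-draw the dendrogram and highlight 3 clusters with red boxes
plot(hclust_result, main = "Dendrogram of Iris Data", xlab = "", sub = "")
rect.hclust(hclust_result, k = 3, border = "red")
```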

#### Step 5: Cut the Tree to Form Clusters
To form a specific number of clusters (e.g., 3), you can cut the dendrogram:
```r
# Cut tree into 3 clusters
clusters <- cutree(hclust_result, k = 3)
# Add the cluster assignments to the dataset
iris$Cluster <- clusters
head(iris)
```
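
Because the iris data also carries species labels (which were not used in the clustering), a quick cross-tabulation gives a sense of how well the three clusters recover the species:

```r
# Compare cluster assignments with the known species labels
table(Cluster = clusters, Species = iris$Species)
```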

### 2. **Explanation of Output**:
- The dendrogram plot shows the hierarchical structure of the clusters. You can decide the number of clusters by cutting the tree at a desired height.
- The `cutree()` function assigns cluster labels to each observation based on the number of clusters you specify.

### 3. **Interpretation**:
- By visualizing the dendrogram, you can explore how data points are grouped together.
- By cutting the tree at a specific height, or into a chosen number of clusters with `k`, you can decide how many groups you want and assign each data point to a cluster accordingly (a sketch of cutting by height follows this list).
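
Instead of fixing `k`, `cutree()` also accepts a cut height `h`, which corresponds to drawing a horizontal line across the dendrogram; the height used below is illustrative and should be read off the dendrogram's vertical axis:

```r
# Cut the tree at a height of 4 (illustrative value; check the dendrogram axis)
clusters_by_height <- cutree(hclust_result, h = 4)
# How many observations fall into each resulting cluster
table(clusters_by_height)
```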

This method uses **complete linkage**, but you can change it to other types of linkage, such as `single`, `average`, or `ward.D2`.
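
Swapping the linkage only means changing the `method` argument of `hclust()`; for example, Ward's method (`ward.D2`), which tends to produce compact, similarly sized clusters:

```r
# Same distance matrix, different linkage criterion
hclust_ward <- hclust(dist_matrix, method = "ward.D2")
plot(hclust_ward, main = "Dendrogram of Iris Data (Ward linkage)", xlab = "", sub = "")
```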
