Connectivity Based Clustering

Description

Connectivity Based Clustering builds clusters on the notion that data points lying close to each other in vector space are more similar to one another than to data points lying farther away.

Why to use

To form clusters of textual data.

When to use

When the number of clusters is not known.

When not to use

  • When data is labeled.
  • When the number of clusters is known.
  • When the dataset is very large.

Prerequisites

Input data should be of text type and should not contain special characters or numbers.
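As an illustration of this prerequisite, the following minimal sketch strips special characters and digits from raw text before clustering. The function name `clean_text` and the sample reviews are hypothetical, and the sketch assumes that only ASCII letters need to be kept; it is not part of Rubiscape.

```python
import re

def clean_text(text: str) -> str:
    """Keep only ASCII letters and spaces (hypothetical helper, not a Rubiscape API)."""
    # Replace every character that is not a letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", letters_only).strip()

# Made-up sample rows, for illustration only.
reviews = ["Great guitar!!! 10/10", "Strings broke after 2 weeks :("]
cleaned = [clean_text(r) for r in reviews]
print(cleaned)
```

If the data contains accented or non-Latin letters, the character class would need to be widened accordingly.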

Input

Textual Data

Output

Data divided into clusters

Statistical Methods used

  • Linkage Metric
  • Linkage Criterion

Limitations

  • Does not scale well to very large datasets.
  • Does not work with missing data.
  • The time complexity of the clustering can lead to much longer computation times than more efficient algorithms such as k-means.

Connectivity Based Clustering is located under rubitext in Clustering, in the left task pane. Drag and drop the algorithm onto the canvas to use it. Click the algorithm to view and select the properties for analysis. Refer to Properties of Connectivity Based Clustering.

Figure: Connectivity Based Clustering

Connectivity-based clustering is also called hierarchical clustering because it builds clusters in a hierarchy. In clustering, data points that are closer to each other exhibit more similarity than those farther away from each other.

The algorithm starts by assigning each data point to a cluster of its own. The two nearest clusters are then merged to form a single cluster. This merging step repeats, and the algorithm terminates when only one cluster remains.
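The merging procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Rubiscape's implementation: the function names, the choice of average linkage with Euclidean distance, the sample points, and the `n_clusters` stopping parameter (which stops merging early instead of continuing down to one cluster) are all assumptions made for the example.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(c1, c2, points):
    """Mean distance over all pairs with one point in each cluster."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, n_clusters=1):
    """Merge the two nearest clusters until n_clusters remain (hypothetical helper)."""
    # Every point starts in a cluster of its own; clusters hold point indices.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        # a < b always holds, so removing cluster b does not shift index a.
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters

# Made-up 2-D points: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerate(points, n_clusters=3))
print(agglomerate(points, n_clusters=1))
```

Each pass recomputes every pairwise linkage, which keeps the sketch short but makes it slow; this repeated pairwise work is also why hierarchical clustering is costly on large datasets.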

There are two approaches to this model. In the first approach (agglomerative, or bottom-up), each data point starts in its own cluster, and the clusters are merged step by step, nearest clusters first.

In the second approach (divisive, or top-down), all data points start in a single large cluster, which is then split step by step as the distance between the points increases. Rubiscape uses this approach.

Properties of Connectivity Based Clustering

The available properties of Connectivity Based Clustering are shown in the figure below.

Figure: Properties of Connectivity Based Clustering

The table below describes the fields available in the properties of Connectivity Based Clustering.

Table: Description of Fields present on the Properties of Connectivity Based Clustering

| Field | Description | Remark |
|---|---|---|
| Task Name | It is the name of the task selected on the workbook canvas. | You can click the text field to edit or modify the name of the task as required. |
| Text | It allows you to select independent variables. | You can select more than one variable. You can select any type of variable. |
| Number of Clusters | It allows you to enter the number of clusters you want to create. | The default value is 8. |
| Advanced | | |
| Linkage Metric | It allows you to select the metric used to compute the linkage. | The available options are Euclidean, L1, L2, Manhattan, Cosine, and Precomputed. If Linkage Criterion is Ward, only Euclidean is accepted. For Precomputed, a distance matrix (instead of a similarity matrix) is needed as input for the fit method. |
| Linkage Criterion | It allows you to select the criterion used for the merge strategy, that is, which distance between two clusters is minimized when they are merged. | The available options are: Ward, which minimizes the sum of squared differences within all clusters; Complete, which minimizes the maximum distance between data points of pairs of clusters; Average, which minimizes the average distance between all observations of pairs of clusters. |
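To make the Linkage Metric and Linkage Criterion options above more concrete, the following sketch implements the textbook definitions of three metrics and three criteria in plain Python. All function names and the sample clusters are hypothetical, and the code illustrates the general definitions only; how Rubiscape computes these internally is not documented here.

```python
import math

def euclidean(p, q):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norms

def complete_linkage(c1, c2, metric):
    """Largest distance between any point of c1 and any point of c2."""
    return max(metric(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2, metric):
    """Mean distance over all pairs with one point in each cluster."""
    return sum(metric(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def ward_cost(c1, c2):
    """Increase in within-cluster sum of squares caused by merging c1 and c2."""
    def sum_of_squares(cluster):
        dims = range(len(cluster[0]))
        centroid = [sum(p[d] for p in cluster) / len(cluster) for d in dims]
        return sum((p[d] - centroid[d]) ** 2 for p in cluster for d in dims)
    return sum_of_squares(c1 + c2) - sum_of_squares(c1) - sum_of_squares(c2)

# Two made-up clusters of 2-D points.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (3.0, 2.0)]

print(complete_linkage(left, right, euclidean))
print(average_linkage(left, right, manhattan))
print(ward_cost(left, right))
```

At each step, the pair of clusters with the smallest value of the chosen criterion is merged. The Ward cost is defined through squared Euclidean distances to cluster centroids, which is consistent with Ward accepting only the Euclidean metric.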

Example of Connectivity Based Clustering

Consider a dataset of musical instrument reviews. A snippet of the input data is shown in the figure given below.

Figure: Input Data Snippet

After running Connectivity Based Clustering, the following results are displayed.

Figure: Output of Connectivity Based Clustering


As seen in the above figure, the number of clusters and each cluster's size are displayed along with the Silhouette Score.
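For readers unfamiliar with the Silhouette Score, the sketch below computes it from its standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The overall score is the mean over all points; values near 1 indicate compact, well-separated clusters. The function name, the use of Euclidean distance, and the sample data are assumptions for illustration, and Rubiscape's exact computation may differ.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Group the distances from p to every other point by cluster label.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(euclidean(p, q))
        own = by_cluster.pop(label, [])
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        # Guard against division by zero when all distances are 0.
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Made-up points forming two tight pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # pairs grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # pairs split apart
print(good, bad)
```

In this example the labeling that keeps each tight pair together scores higher than the labeling that splits them, which is how the score can be used to compare clusterings.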

