Connectivity Based Clustering
Connectivity Based Clustering builds the clusters based on the notion that the vectors of data points in space exhibit more similarity to each other than the data points lying farther away.
Why to use
To form clusters of textual data.
When to use
When the number of clusters is not known.
When not to use
Input data should be of text type and should not contain special characters and numbers.
Data divided into clusters
Statistical Methods used
Connectivity Based Clustering is located under rubitext ( ) in Clustering, in the left task pane. Use the drag-and-drop method to use the algorithm in the canvas. Click the algorithm to view and select different properties for analysis. Refer to Properties of Connectivity Based Clustering.
Figure: Connectivity Based Clustering
Connectivity-based clustering is also called hierarchical clustering because it builds clusters in a hierarchy. In clustering, the data points closer to each other exhibit more similarity than those away from each other.
The algorithm starts with assigning data points to a cluster of their own. Then two nearest clusters are merged to form a single cluster. In the end, the algorithm terminates with only one cluster remaining.
There are two approaches to this model. In the first approach, data points are classified into separate clusters and then aggregated as the distance between them decreases.
In the second approach, data points are distributed into a single large cluster and then segregated as the distance between them increases. Rubiscape uses this approach.
Properties of Connectivity Based Clustering
The available properties of Connectivity Based Clustering are as shown in the figure given below.
Figure: Properties of Connectivity Based Clustering
The table given below describes the different fields present on the properties of Connectivity Based Clustering.
Table: Description of Fields present on the Properties of Connectivity Based Clustering
It is the name of the task selected on the workbook canvas.
You can click the text field to edit or modify the name of the task as required.
It allows you to select Independent variables.
Number of Clusters
It allows you to enter the number of clusters you want to create.
The default value is 8.
It allows you to select the metric used to compute the linkage.
It allows you to select the metric used for the merge strategy.
The available options are –
Example of Connectivity Based Clustering
Consider a dataset of musical instruments review. A snippet of input data is shown in the figure given below.
Figure: Input Data Snippet
After using the Connectivity Based Clustering, the following results are displayed.
Figure: Output of Connectivity Based Clustering
As seen in the above figure, the number of clusters and each cluster's size are displayed along with the Silhouette Score.