Choose Cluster Analysis Method
This topic provides a brief overview of the available clustering methods in Statistics and Machine Learning Toolbox™.
Clustering Methods
Cluster analysis, also called segmentation analysis or taxonomy analysis, is a common unsupervised learning method. Unsupervised learning is used to draw inferences from data sets consisting of input data without labeled responses. For example, you can use cluster analysis for exploratory data analysis to find hidden patterns or groupings in unlabeled data.
Cluster analysis creates groups, or clusters, of data. Objects that belong to the same cluster are similar to one another and distinct from objects that belong to different clusters. To quantify "similar" and "distinct," you can use a dissimilarity measure (or distance metric) that is specific to the domain of your application and your data set. Also, depending on your application, you might consider scaling (or standardizing) the variables in your data to give them equal importance during clustering.
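For example, a minimal sketch of standardizing variables before clustering (the data matrix X and its scales are illustrative, not from the documentation):

```matlab
% Minimal sketch: give variables equal importance before clustering.
% X is an illustrative n-by-2 data matrix with variables on very different scales.
rng('default')
X = [randn(100,1)*100, randn(100,1)*0.01];

Xs = normalize(X);   % z-score each column: zero mean, unit variance
```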
Statistics and Machine Learning Toolbox provides functionality for these clustering methods:
Hierarchical Clustering
Hierarchical clustering groups data over a variety of scales by creating a cluster tree, or dendrogram. The tree is not a single set of clusters, but rather a multilevel hierarchy, where clusters at one level combine to form clusters at the next level. This multilevel hierarchy allows you to choose the level, or scale, of clustering that is most appropriate for your application. Hierarchical clustering assigns every point in your data to a cluster.
Use clusterdata to perform hierarchical clustering on input data. clusterdata incorporates the pdist, linkage, and cluster functions, which you can use separately for more detailed analysis. The dendrogram function plots the cluster tree. For more information, see Introduction to Hierarchical Clustering.
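For example, a minimal sketch of both workflows on synthetic data (the three-blob data set and the choice of three clusters are illustrative):

```matlab
% Minimal sketch (illustrative data: three Gaussian blobs)
rng('default')
X = [randn(50,2); randn(50,2)+4; randn(50,2)+[4 -4]];

% One-step hierarchical clustering into 3 clusters
T = clusterdata(X,'Linkage','ward','Maxclust',3);

% Equivalent step-by-step workflow for finer control
Z = linkage(X,'ward');          % build the cluster tree
dendrogram(Z)                   % visualize the multilevel hierarchy
T2 = cluster(Z,'Maxclust',3);   % cut the tree into 3 clusters
```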
k-Means and k-Medoids Clustering
k-means clustering and k-medoids clustering partition data into k mutually exclusive clusters. These clustering methods require that you specify the number of clusters k. Both k-means and k-medoids clustering assign every point in your data to a cluster; however, unlike hierarchical clustering, these methods operate on actual observations (rather than dissimilarity measures), and create a single level of clusters. Therefore, k-means or k-medoids clustering is often more suitable than hierarchical clustering for large amounts of data.
Use kmeans and kmedoids to implement k-means clustering and k-medoids clustering, respectively. For more information, see Introduction to k-Means Clustering and k-Medoids Clustering.
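For example, a minimal sketch on synthetic two-cluster data (the data and the choice k = 2 are illustrative):

```matlab
% Minimal sketch (illustrative data: two Gaussian blobs)
rng('default')
X = [randn(100,2); randn(100,2)+3];

k = 2;                    % the number of clusters must be specified
[idx,C] = kmeans(X,k);    % idx: cluster index per observation, C: centroids
idxMed = kmedoids(X,k);   % k-medoids uses actual observations as cluster centers
```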
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is a density-based algorithm that identifies arbitrarily shaped clusters and outliers (noise) in data. During clustering, DBSCAN identifies points that do not belong to any cluster, which makes this method useful for density-based outlier detection. Unlike k-means and k-medoids clustering, DBSCAN does not require prior knowledge of the number of clusters.
Use dbscan to perform clustering on an input data matrix or on pairwise distances between observations. For more information, see Introduction to DBSCAN.
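For example, a minimal sketch on synthetic data with added outliers (the epsilon and minpts values are illustrative and typically need tuning for your data):

```matlab
% Minimal sketch (illustrative data: two blobs plus scattered outliers)
rng('default')
X = [randn(100,2); randn(100,2)+5; 10*rand(10,2)-5];

epsilon = 1;    % neighborhood radius (tune for your data)
minpts  = 5;    % minimum neighbors required for a core point
idx = dbscan(X,epsilon,minpts);   % idx == -1 marks noise (outliers)
```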
Gaussian Mixture Model
A Gaussian mixture model (GMM) forms clusters as a mixture of multivariate normal density components. For a given observation, the GMM assigns a posterior probability to each component density (or cluster); these probabilities indicate how likely the observation is to belong to each cluster. A GMM can perform hard clustering by assigning the observation to the component that maximizes its posterior probability. You can also use a GMM to perform soft, or fuzzy, clustering by assigning the observation to multiple clusters based on its scores or posterior probabilities for those clusters. A GMM can be a more appropriate method than k-means clustering when clusters have different sizes and different correlation structures within them.
Use fitgmdist to fit a gmdistribution object to your data. You can also use gmdistribution to create a GMM object by specifying the distribution parameters. When you have a fitted GMM, you can cluster query data by using the cluster function. For more information, see Cluster Using Gaussian Mixture Model.
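For example, a minimal sketch of hard and soft clustering with a fitted GMM (the two-component synthetic data is illustrative):

```matlab
% Minimal sketch (illustrative data: two overlapping Gaussian components)
rng('default')
X = [mvnrnd([0 0],eye(2),200); mvnrnd([3 3],[1 0.5; 0.5 1],200)];

gm = fitgmdist(X,2);    % fit a 2-component Gaussian mixture
idx = cluster(gm,X);    % hard clustering: most probable component per observation
P = posterior(gm,X);    % soft clustering: per-component posterior probabilities
```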
k-Nearest Neighbor Search and Radius Search
k-nearest neighbor search finds the k closest points in your data to a query point or set of query points. In contrast, radius search finds all points in your data that are within a specified distance from a query point or set of query points. The results of these methods depend on the distance metric that you specify.
Use the knnsearch function to find k-nearest neighbors or the rangesearch function to find all neighbors within a specified distance of your input data. You can also create a searcher object using a training data set, and pass the object and query data sets to the object functions (knnsearch and rangesearch). For more information, see Classification Using Nearest Neighbors.
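For example, a minimal sketch of both searches (the training data X, query points Y, neighbor count, and radius are illustrative):

```matlab
% Minimal sketch (illustrative training data X and query points Y)
rng('default')
X = randn(200,2);
Y = [0 0; 2 2];

idx = knnsearch(X,Y,'K',5);     % indices of the 5 nearest neighbors per query point
nbrs = rangesearch(X,Y,1.0);    % all neighbors within radius 1.0 (cell array)

% Reusable searcher object for repeated queries against the same data
Mdl = createns(X);              % builds a KD-tree by default for low-dimensional data
idx2 = knnsearch(Mdl,Y,'K',5);
```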
Spectral Clustering
Spectral clustering is a graph-based algorithm for finding k arbitrarily shaped clusters in data. The technique involves representing the data in a low-dimensional space. In this low-dimensional space, clusters in the data are more widely separated, enabling you to use algorithms such as k-means or k-medoids clustering. The low-dimensional space is based on eigenvectors of a Laplacian matrix. A Laplacian matrix is one way of representing a similarity graph that models the local neighborhood relationships between data points as an undirected graph.
Use spectralcluster to perform spectral clustering on an input data matrix or on a similarity matrix of a similarity graph. spectralcluster requires that you specify the number of clusters. However, the algorithm for spectral clustering also provides a way to estimate the number of clusters in your data. For more information, see Partition Data Using Spectral Clustering.
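For example, a minimal sketch on synthetic concentric-ring data, a shape that distance-to-centroid methods handle poorly (the data and the default similarity-graph settings are illustrative; real data may need tuning):

```matlab
% Minimal sketch (illustrative data: two noisy concentric rings)
rng('default')
theta = linspace(0,2*pi,200)';
ring = [cos(theta) sin(theta)];
X = [ring + 0.05*randn(200,2); 0.3*ring + 0.05*randn(200,2)];

% Spectral clustering with k = 2; V and D are the eigenvectors and
% eigenvalues of the Laplacian matrix. A gap after the first few
% near-zero eigenvalues in D suggests an estimate of the cluster count.
[idx,V,D] = spectralcluster(X,2);
```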
Comparison of Clustering Methods
This table compares the features of available clustering methods in Statistics and Machine Learning Toolbox.
Method | Basis of Algorithm | Input to Algorithm | Requires Specified Number of Clusters | Cluster Shapes Identified | Useful for Outlier Detection |
---|---|---|---|---|---|
Hierarchical Clustering | Distance between objects | Pairwise distances between observations | No | Arbitrarily shaped clusters, depending on the specified 'Linkage' algorithm | No |
k-Means Clustering and k-Medoids Clustering | Distance between objects and centroids | Actual observations | Yes | Spheroidal clusters with equal diagonal covariance | No |
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | Density of regions in the data | Actual observations or pairwise distances between observations | No | Arbitrarily shaped clusters | Yes |
Gaussian Mixture Models | Mixture of Gaussian distributions | Actual observations | Yes | Spheroidal clusters with different covariance structures | Yes |
Nearest Neighbors | Distance between objects | Actual observations | No | Arbitrarily shaped clusters | Yes, depending on the specified number of neighbors |
Spectral Clustering | Graph representing connections between data points | Actual observations or similarity matrix | Yes, but the algorithm also provides a way to estimate the number of clusters | Arbitrarily shaped clusters | No |