K-Means is a widely used unsupervised learning algorithm for partitioning data into distinct clusters. It's a foundational tool in machine learning and data science, especially valuable when you need to identify inherent groupings within unlabeled datasets. Scikit-learn, Python's premier machine learning library, provides a robust and efficient implementation of K-Means in its `sklearn.cluster` module. This guide delves into the `KMeans` class in scikit-learn, offering a detailed look at its parameters, attributes, and methods for effective clustering.
Understanding KMeans in Scikit-learn
The `KMeans` class in scikit-learn aims to group data points into k clusters, where each point belongs to the cluster with the nearest mean (cluster center). The algorithm iteratively refines the cluster assignments and centroid locations to minimize the within-cluster sum of squares (inertia). Scikit-learn's implementation offers flexibility and control through various parameters, allowing users to tailor the algorithm to their specific datasets and requirements.
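In standard notation (a restatement of the objective, not a quotation from the scikit-learn docs), the inertia that K-means minimizes over a set of cluster centers $C = \{\mu_1, \dots, \mu_k\}$ is

$$\sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2,$$

where $x_1, \dots, x_n$ are the samples. Each point contributes the squared distance to its nearest center.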
Key Parameters of sklearn.cluster.KMeans
Let's explore the parameters that govern the behavior of the `KMeans` algorithm in scikit-learn:
`n_clusters`: This fundamental parameter dictates the number of clusters to be formed. It's an integer value, and choosing the optimal `n_clusters` is crucial for effective clustering. Methods like silhouette analysis, as demonstrated in Selecting the number of clusters with silhouette analysis on KMeans clustering, can aid in determining the appropriate number of clusters for your data (see the sketch below).
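As a minimal sketch of that idea, the loop below scores a few candidate values of `n_clusters` with the mean silhouette coefficient; the synthetic blob data and the candidate range 2..6 are assumptions for illustration:

```python
# A minimal sketch of silhouette-based model selection for n_clusters.
# The synthetic blobs and the candidate range 2..6 are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(X)
    # A higher mean silhouette coefficient suggests better-separated clusters.
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```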
`init`: This parameter controls the method for initializing the cluster centroids. Scikit-learn offers several options:

- `'k-means++'` (default): This smart initialization technique significantly speeds up convergence. It selects initial centroids based on a probability distribution of the points' contribution to overall inertia, ensuring better starting points and often leading to faster and more stable clustering. The implementation is a "greedy k-means++" variant, performing multiple trials at each sampling step and selecting the best centroid.
- `'random'`: A simpler approach in which `n_clusters` observations are randomly chosen from the dataset to serve as initial centroids.
- array-like: You can provide a NumPy array of shape `(n_clusters, n_features)` to directly specify your own initial centroid locations.
- callable: For advanced customization, you can pass a callable that takes `X`, `n_clusters`, and a `random_state` and returns the initial centroids.
Refer to A demo of K-Means clustering on the handwritten digits data for practical examples of using different `init` strategies. For an in-depth evaluation of initialization impact, see Empirical evaluation of the impact of k-means initialization.
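The following short sketch contrasts the two string options with an explicit centroid array; the toy data and seed values are assumptions:

```python
# A short sketch contrasting init strategies; toy data and seeds are assumptions.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

km_pp = KMeans(n_clusters=2, init="k-means++", n_init="auto", random_state=0).fit(X)
km_rand = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

# Explicit centroids: an array of shape (n_clusters, n_features).
seeds = np.array([[1.0, 2.0], [10.0, 2.0]])
km_fixed = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)

print(km_pp.inertia_, km_rand.inertia_, km_fixed.inertia_)
```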
`n_init`: This parameter defines the number of times the K-means algorithm is run with different centroid seeds; the best result by inertia across these runs is returned. For datasets with high dimensionality or sparsity, multiple runs (`n_init` > 1) are highly recommended to mitigate the risk of converging to a poor local minimum.

- `'auto'` (default since version 1.4): Scikit-learn determines the number of runs based on the `init` method. If `init='random'` or a callable is used, `n_init` defaults to 10; if `init='k-means++'` or an array-like is provided, it defaults to 1.
- int: You can explicitly set the number of runs as an integer.
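A small sketch of why extra runs help with random initialization (the high-dimensional random data here is an assumption):

```python
# Sketch: explicitly setting n_init versus a single run (random data assumed).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 20)  # higher-dimensional toy data

best_of_25 = KMeans(n_clusters=5, init="random", n_init=25, random_state=0).fit(X)
single_run = KMeans(n_clusters=5, init="random", n_init=1, random_state=0).fit(X)

# Keeping the best of 25 seeds can only match or improve on one seed.
print(best_of_25.inertia_ <= single_run.inertia_)  # True
```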
`max_iter`: This integer parameter sets the maximum number of iterations allowed within a single K-means run. The algorithm stops iterating when convergence is reached (as defined by `tol`) or when `max_iter` iterations have been performed. The default value is 300.
`tol`: This float parameter specifies the tolerance for convergence: the relative tolerance, with respect to the Frobenius norm of the difference in cluster centers between consecutive iterations. If the change in cluster centers falls below `tol`, the algorithm declares convergence and stops. The default value is `1e-4`.
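A quick sketch of checking whether a run stopped on `tol` or on the iteration cap (the uniform random data is an assumption; `n_iter_` is covered in the attributes section below):

```python
# Sketch: did the best run converge via tol, or hit the max_iter cap?
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(300, 2)

km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init="auto",
            random_state=0).fit(X)
print(km.n_iter_)  # iterations used by the best run
if km.n_iter_ == km.max_iter:
    print("Hit the iteration cap; consider raising max_iter or loosening tol.")
```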
`verbose`: An integer parameter that controls the verbosity level during the algorithm's execution. `verbose=0` (the default) means no output, while higher values provide more detailed progress information.
`random_state`: This parameter manages the random number generation used for centroid initialization. Providing an integer `random_state` ensures deterministic behavior across multiple runs, which is crucial for reproducibility. See the Glossary in the scikit-learn documentation for more details on `random_state`.
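A minimal sketch of that determinism (the toy data is an assumption):

```python
# Sketch: an integer random_state makes centroid seeding reproducible.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(42).rand(100, 2)

a = KMeans(n_clusters=3, random_state=7, n_init="auto").fit(X)
b = KMeans(n_clusters=3, random_state=7, n_init="auto").fit(X)
print(np.allclose(a.cluster_centers_, b.cluster_centers_))  # True
```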
`copy_x`: A boolean parameter that determines whether to copy the input data `X`. `copy_x=True` (the default) preserves the original data. `copy_x=False` may save memory, but it modifies the original data in place (the changes are reversed before the function returns). Be mindful of potential small numerical differences when modifying data in place. If the input data is not C-contiguous, or is a sparse matrix not in CSR format, a copy is made regardless of `copy_x`.
`algorithm`: This parameter allows you to choose the K-means variant to use:

- `"lloyd"` (default): The classic Expectation-Maximization (EM) style algorithm, also known as Lloyd's algorithm.
- `"elkan"`: An alternative that can be more efficient on datasets with well-separated clusters by leveraging the triangle inequality. However, it is more memory-intensive because it allocates an additional array of shape `(n_samples, n_clusters)`.

The Elkan algorithm was added in version 0.18, and `"lloyd"` was renamed from `"full"` in version 1.1.
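A short sketch comparing the two solvers; on data like this they should reach equivalent solutions (the well-separated synthetic blobs are an assumption):

```python
# Sketch: "lloyd" vs. "elkan" on well-separated synthetic blobs.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=0.5, random_state=0)

lloyd = KMeans(n_clusters=5, algorithm="lloyd", n_init="auto", random_state=0).fit(X)
elkan = KMeans(n_clusters=5, algorithm="elkan", n_init="auto", random_state=0).fit(X)
print(lloyd.inertia_, elkan.inertia_)  # typically (near-)identical
```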
Key Attributes of a Fitted KMeans Object
After fitting the `KMeans` estimator to your data using the `fit()` method, several important attributes become available:
`cluster_centers_`: A NumPy array of shape `(n_clusters, n_features)` containing the coordinates of the calculated cluster centers. If the algorithm stops before fully converging (due to `tol` or `max_iter`), these centers might not perfectly align with `labels_`.
`labels_`: A NumPy array of shape `(n_samples,)` providing the cluster label for each data point in the training data.
`inertia_`: A float value representing the sum of squared distances of samples to their closest cluster center. This is the objective function that K-means aims to minimize, and it is often used to evaluate model quality.
`n_iter_`: An integer indicating the number of iterations run during the fitting process.
`n_features_in_`: An integer representing the number of features seen during `fit`. Added in version 0.24.
`feature_names_in_`: A NumPy array of shape `(n_features_in_,)` containing the names of the features seen during `fit`. This attribute is only defined when the input data `X` has feature names that are all strings. Added in version 1.0.
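The sketch below shows these attributes after fitting; the small pandas DataFrame is an assumption, chosen so that `feature_names_in_` is populated:

```python
# Sketch: inspecting fitted attributes on a tiny, assumed pandas DataFrame.
import pandas as pd
from sklearn.cluster import KMeans

X = pd.DataFrame({"height": [1.0, 1.2, 9.8, 10.1],
                  "weight": [2.0, 1.8, 5.1, 4.9]})
km = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)

print(km.cluster_centers_)   # shape (2, 2)
print(km.labels_)            # one label per training sample
print(km.inertia_)           # within-cluster sum of squares
print(km.n_iter_)            # iterations of the best run
print(km.n_features_in_)     # 2
print(km.feature_names_in_)  # ['height' 'weight']
```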
Methods of the KMeans Class
The `KMeans` class provides several methods for fitting, predicting, and transforming data:
- `fit(X, y=None, sample_weight=None)`: Computes K-means clustering on the input data `X`. `y` is ignored and present only for API consistency. `sample_weight` allows weighting individual samples (added in version 0.20). Returns `self`.
- `fit_predict(X, y=None, sample_weight=None)`: A convenience method that performs `fit(X)` followed by `predict(X)`, returning the cluster labels for each sample in `X`.
- `fit_transform(X, y=None, sample_weight=None)`: Efficiently computes clustering and transforms `X` into cluster-distance space. Equivalent to `fit(X).transform(X)`.
- `get_feature_names_out(input_features=None)`: Returns output feature names for transformation, prefixed with the class name.
- `get_metadata_routing()`: Provides metadata routing information for the object (refer to the User Guide).
- `get_params(deep=True)`: Returns a dictionary of estimator parameters.
- `predict(X)`: Predicts the closest cluster for each sample in new data `X`, based on the fitted cluster centers.
- `score(X, y=None, sample_weight=None)`: Returns the negative of the K-means objective function value (inertia) on the data `X`.
- `set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$')`: Allows requesting metadata to be passed to the `fit` method (relevant for meta-estimators; added in version 1.3).
- `set_output(*, transform=None)`: Configures the output container for the `transform` and `fit_transform` methods (the `"polars"` option was added in version 1.4).
- `set_params(**params)`: Sets the parameters of the estimator.
- `set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$')`: Allows requesting metadata to be passed to the `score` method (relevant for meta-estimators; added in version 1.3).
- `transform(X)`: Transforms `X` into cluster-distance space, where each dimension represents the distance to one cluster center. The output is typically a dense array even if `X` is sparse.
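A short sketch exercising the main methods on toy data (the data values are assumptions):

```python
# Sketch: fit_predict, transform, score, and get_feature_names_out on toy data.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, random_state=0, n_init="auto")

labels = km.fit_predict(X)         # fit, then return training labels
dists = km.transform(X)            # shape (6, 2): distance to each center
print(km.score(X))                 # negative inertia on X
print(km.get_feature_names_out())  # e.g. ['kmeans0' 'kmeans1']
```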
Practical Examples and Further Exploration
Scikit-learn provides numerous examples showcasing the application of `KMeans`:
- Basic Example:

```python
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
```
- Addressing Common Problems: See Demonstration of k-means assumptions for insights into common issues and solutions.
- Text Document Clustering: Explore Clustering text documents using k-means for text-based applications.
- Comparison with MiniBatchKMeans: Refer to Comparison of the K-Means and MiniBatchKMeans clustering algorithms for large datasets (see the sketch after this list).
- Comparison with BisectingKMeans: Investigate Bisecting K-Means and Regular K-Means Performance Comparison for performance considerations.
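As a rough sketch of the MiniBatchKMeans trade-off, the sample size and batch size below are illustrative assumptions:

```python
# Sketch: MiniBatchKMeans as a faster, approximate alternative on larger data.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)

full = KMeans(n_clusters=8, n_init="auto", random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init="auto",
                       random_state=0).fit(X)
# MiniBatchKMeans trades a small increase in inertia for a much faster fit.
print(full.inertia_, mini.inertia_)
```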
Conclusion
Scikit-learn's `KMeans` class is a powerful and versatile tool for unsupervised clustering. By understanding its parameters, attributes, and methods, data scientists and machine learning practitioners can effectively apply K-means to uncover hidden structures and patterns within their data. Remember to carefully consider parameter tuning, especially `n_clusters` and `init`, and explore the provided examples to master K-means clustering with scikit-learn. For large datasets, consider alternatives like `MiniBatchKMeans` for improved performance.