kmeans

The k-means clustering algorithm identifies clusters in numerical data where individual data points are assigned to the cluster with the nearest mean (cluster centroid).

The number of clusters to detect is specified as an optional parameter (default is 2).

The kmeans command adds a new column to the dataset storing the cluster label for the corresponding datapoint on each row.

Clustering does not work with missing values, so rows with missing (NA) values in the clustered columns are automatically removed before clustering.

Usage

gurita kmeans [-h] [-c COLUMN [COLUMN ...]] [--name NAME] [-n NCLUSTERS]

Arguments

Argument

Description

Reference

  • -h

  • --help

display help for this command

help

  • -c COLUMN [COLUMN...]

  • --col COLUMN [COLUMN...]

apply clustering to specified numerical columns

kmeans columns

  • -n NCLUSTERS

  • --nclusters NCLUSTERS

identify this many clusters in the data

number of clusters

--name NAME

choose a column name for the new cluster label column (default: cluster)

new column name

See also

Gaussian Mixture Models provide another way to cluster data.

Simple example

The following command clusters the numerical columns in the iris.csv file:

gurita kmeans < iris.csv

The output is quite long so we can adjust the command to look at only the first few rows using the head command:

gurita kmeans + head < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species,cluster
5.1,3.5,1.4,0.2,setosa,1
4.9,3.0,1.4,0.2,setosa,1
4.7,3.2,1.3,0.2,setosa,1
4.6,3.1,1.5,0.2,setosa,1
5.0,3.6,1.4,0.2,setosa,1

A new categorical column called cluster is added to the dataset, this holds the cluster labels for the datapoint on each row.

Each cluster is labelled using a natural number (0,1,2 …).

We can get an overview of the new cluster column by using the describe command after clustering:

gurita kmeans + describe -c cluster < iris.csv

The output of the above command is shown below:

        cluster
count       150
unique        2
top           0
freq         97

We can see that there are 150 data points (150 rows) and 2 unique values in the cluster column (these are the labels 0 and 1). The most frequent label is 0 which occurs 97 times (and thus the label 1 must occur 150-97=53 times).

Note

Despite the use of numbers for cluster labels, Gurita treats them as categorical values.

This is beneficial when it comes to plotting data using cluster labels because it means that the plots will correctly interpret the labels as catergorical values and render them accordingly.

For example we might like to make a box plot comparing the petal_length across the two clusters:

gurita kmeans + box -x cluster -y petal_length < iris.csv

The output of the above command is written to box.cluster.petal_length.png:

Box plot comparing petal length across two k-means clusters in the iris.csv dataset

Getting help

The full set of command line arguments for kmeans can be obtained with the -h or --help arguments:

gurita kmeans -h

Cluster data from specified numerical columns

-c NAME [NAME ...], --col NAME [NAME ...]

By default, if no column names are specified, clustering is performed on all of the numerical columns in the dataset.

However it is possible to perform clustering on a specific subset of columns via the -c/--col argument.

For example, the following command performs k-means clustering on just the columns sepal_length, sepal_width, and petal_length (and hence ignores the petal_width column):

gurita kmeans -c sepal_length sepal_width petal_length < iris.csv

Note

Non-numeric columns will be ignored by kmeans even if they are specified as arguments to -c/--col.

Choose number of clusters to identify

-n NCLUSTERS, --nclusters NCLUSTERS

By default kmeans identifies two clusters in the data. However, this can be changed with the -n/--nclusters argument.

For example, the following command finds three clusters in the iris.csv file:

gurita kmeans -n 3 < iris.csv

We can check the number of values in each cluster using the grouby command:

gurita kmeans -n 3 + groupby -k cluster < iris.csv

The output of the above command is shown below:

cluster,size
0,62
1,50
2,38

We can observe three clusters labelled 0,1,2 with 39,61,50 members respectively.

Choose a name for the new cluster label column

--name NAME

The kmeans command adds an extra categorical column called cluster to the dataset to store the cluster labels for each row.

The cluster labels are natural numbers (non-negative integers) from 0 upwards (0, 1, 2, …).

The name of the extra column can be changed with the --name argument.

The following command specifies that grouping should be used as the prefix for the newly added columns:

gurita kmeans --name grouping < iris.csv

By chaining this command with head we can inspect the first few rows of the output:

gurita kmeans --name grouping + head < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species,grouping
5.1,3.5,1.4,0.2,setosa,0
4.9,3.0,1.4,0.2,setosa,0
4.7,3.2,1.3,0.2,setosa,0
4.6,3.1,1.5,0.2,setosa,0
5.0,3.6,1.4,0.2,setosa,0

Observe that the new cluster label column is called grouping.