gmm (Gaussian mixture model)

The Gaussian mixture model (GMM) identifies clusters in numerical data by finding a mixture of Gaussian probability distributions that best model the data.

The number of clusters to detect is specified as an optional parameter (default is 2).

The gmm command adds a new column to the dataset storing the cluster label for the corresponding datapoint on each row.

Clustering does not work with missing values, so rows with missing (NA) values in the clustered columns are automatically removed before clustering.
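The row-removal step described above can be pictured with a short Python sketch. It is an illustration only, not Gurita's implementation: the helper name `drop_incomplete_rows`, the small sample table, and the choice to treat empty cells and the string `NA` as missing are all assumptions made for the example.

```python
import csv
import io

# Assumption: which cell values count as "missing" is chosen for this example.
MISSING = {"", "NA"}

def drop_incomplete_rows(rows, columns):
    """Keep only rows that have a value in every column used for clustering."""
    return [row for row in rows if all(row[col] not in MISSING for col in columns)]

# Small made-up sample in the same shape as iris.csv.
sample = """sepal_length,petal_length,species
5.1,1.4,setosa
NA,1.4,setosa
6.3,,virginica
5.9,5.1,NA
"""

rows = list(csv.DictReader(io.StringIO(sample)))
kept = drop_incomplete_rows(rows, ["sepal_length", "petal_length"])
for row in kept:
    print(row)
```

Only the clustered columns are checked, so the last sample row is kept even though its species value is missing.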

Usage

gurita gmm [-h] [-c COLUMN [COLUMN ...]] [--name NAME] [-n NCLUSTERS] [--maxiter MAXITER]

Arguments

Argument	Description	Reference
`-h` `--help`	display help for this command	help
`-c COLUMN [COLUMN...]` `--col COLUMN [COLUMN...]`	apply clustering to specified numerical columns	gmm columns
`-n NCLUSTERS` `--nclusters NCLUSTERS`	identify this many clusters in the data (default: 2)	number of clusters
`--name NAME`	choose a column name for the new cluster label column (default: cluster)	new column name
`--maxiter MAXITER`	maximum number of expectation maximisation iterations (default: 100)	EM iterations

Simple example

The following command clusters the numerical columns in the iris.csv file:

gurita gmm < iris.csv

The output is quite long so we can adjust the command to look at only the first few rows using the head command:

gurita gmm + head < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species,cluster
5.1,3.5,1.4,0.2,setosa,1
4.9,3.0,1.4,0.2,setosa,1
4.7,3.2,1.3,0.2,setosa,1
4.6,3.1,1.5,0.2,setosa,1
5.0,3.6,1.4,0.2,setosa,1

A new categorical column called cluster is added to the dataset. It holds the cluster label for the datapoint on each row.

Each cluster is labelled using a natural number (0, 1, 2, …).

We can get an overview of the new cluster column by using the describe command after clustering:

gurita gmm + describe -c cluster < iris.csv

The output of the above command is shown below:

        cluster
count       150
unique        2
top           0
freq        100

We can see that there are 150 data points (150 rows) and 2 unique values in the cluster column (these are the labels 0 and 1). The most frequent label is 0, which occurs 100 times (and thus the label 1 must occur 150-100=50 times).
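The relationship between the four summary values can be reproduced with a few lines of Python. This is a sketch for illustration, not how the describe command is implemented; the label list is constructed by hand to match the describe output shown above (top 0, freq 100).

```python
from collections import Counter

# Labels built to match the describe output: 100 rows in cluster 0, 50 in cluster 1.
labels = ["0"] * 100 + ["1"] * 50

counts = Counter(labels)
top, freq = counts.most_common(1)[0]

summary = {
    "count": len(labels),    # number of rows that have a label
    "unique": len(counts),   # number of distinct labels
    "top": top,              # most frequent label
    "freq": freq,            # how often the most frequent label occurs
}
print(summary)

# With only two labels, the size of the other cluster follows by subtraction.
other = summary["count"] - summary["freq"]
print("rows in the other cluster:", other)
```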

Note

Despite the use of numbers for cluster labels, Gurita treats them as categorical values.

This is beneficial when plotting data using cluster labels, because the plots will correctly interpret the labels as categorical values and render them accordingly.

For example, we might like to make a box plot comparing the petal_length across the two clusters:

gurita gmm + box -x cluster -y petal_length < iris.csv

The output of the above command is written to box.cluster.petal_length.png:

Box plot comparing petal length across two GMM clusters in the iris.csv dataset

Getting help

The full set of command line arguments for gmm can be obtained with the -h or --help arguments:

gurita gmm -h

Cluster data from specified numerical columns

-c COLUMN [COLUMN ...], --col COLUMN [COLUMN ...]

By default, if no column names are specified, clustering is performed on all of the numerical columns in the dataset.

However it is possible to perform clustering on a specific subset of columns via the -c/--col argument.

For example, the following command performs GMM clustering on just the columns sepal_length, sepal_width, and petal_length (and hence ignores the petal_width column):

gurita gmm -c sepal_length sepal_width petal_length < iris.csv

Note

Non-numeric columns will be ignored by gmm even if they are specified as arguments to -c/--col.
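How Gurita decides which columns are numeric is not described here. The Python sketch below shows one plausible way to filter a requested column list, under the assumption that a column counts as numeric when every non-empty value parses as a float. The helper name `is_numeric_column` and the two-row sample are hypothetical.

```python
import csv
import io

def is_numeric_column(rows, column):
    """Assumed rule: a column is numeric if every non-empty value parses as a float."""
    values = [row[column] for row in rows if row[column] != ""]
    if not values:
        return False
    try:
        for value in values:
            float(value)
    except ValueError:
        return False
    return True

# Two-row sample in the same shape as iris.csv.
sample = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
"""

rows = list(csv.DictReader(io.StringIO(sample)))
requested = ["sepal_length", "species"]  # as if passed to -c/--col
used = [col for col in requested if is_numeric_column(rows, col)]
print(used)  # ['sepal_length']
```

The species column is requested but dropped from the list, which mirrors the behaviour described in the note.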

Choose number of clusters to identify

-n NCLUSTERS, --nclusters NCLUSTERS

By default gmm identifies two clusters in the data. However, this can be changed with the -n/--nclusters argument.

For example, the following command finds three clusters in the iris.csv file:

gurita gmm -n 3 < iris.csv

We can check the number of values in each cluster using the groupby command:

gurita gmm -n 3 + groupby -k cluster < iris.csv

The output of the above command is shown below:

cluster,size
0,45
1,50
2,55

We can observe three clusters labelled 0, 1 and 2 with 45, 50 and 55 members respectively.

Choose a name for the new cluster label column

--name NAME

The gmm command adds an extra categorical column called cluster to the dataset to store the cluster labels for each row.

The cluster labels are natural numbers (non-negative integers) from 0 upwards (0, 1, 2, …).

The name of the extra column can be changed with the --name argument.

The following command specifies that grouping should be used as the name of the newly added column:

gurita gmm --name grouping < iris.csv

By chaining this command with head we can inspect the first few rows of the output:

gurita gmm --name grouping + head < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species,grouping
5.1,3.5,1.4,0.2,setosa,1
4.9,3.0,1.4,0.2,setosa,1
4.7,3.2,1.3,0.2,setosa,1
4.6,3.1,1.5,0.2,setosa,1
5.0,3.6,1.4,0.2,setosa,1

Observe that the new cluster label column is called grouping.

Choose the maximum number of expectation maximisation iterations to perform

--maxiter MAXITER

The maximum number of expectation maximisation iterations defaults to 100. It can be changed with the --maxiter argument.

For example, the following command sets the maximum number of expectation maximisation iterations to 1000:

gurita gmm --maxiter 1000 < iris.csv
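To show what an iteration limit means in practice, here is a minimal expectation maximisation loop for a two-component, one-dimensional Gaussian mixture in plain Python. It is a sketch under stated assumptions, not Gurita's implementation: the function name `fit_gmm_1d`, the initialisation, the convergence tolerance `tol`, and the early stop on convergence are all choices made for the example. The first five data values are the petal lengths from the head output earlier on this page; the remaining five are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, max_iter=100, tol=1e-6):
    """Fit a two-component 1D mixture with EM; return (labels, iterations used)."""
    # Assumed initialisation: means at the data extremes, shared overall variance.
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    prev_ll = -math.inf
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # E-step: responsibility of each component for each point.
        resp = []
        ll = 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = (sum(r[k] * (x - means[k]) ** 2
                                for r, x in zip(resp, data)) / nk) + 1e-6
        # Assumed stopping rule: quit early once the log-likelihood settles.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    # Label each point with the component that has the larger weighted density.
    labels = []
    for x in data:
        dens = [w * normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        labels.append(0 if dens[0] >= dens[1] else 1)
    return labels, iterations

petal_lengths = [1.4, 1.4, 1.3, 1.5, 1.4, 4.7, 4.5, 4.9, 4.0, 4.6]
labels, used = fit_gmm_1d(petal_lengths, max_iter=100)
print(labels, used)

# A limit of 1 stops the loop after a single E-step and M-step.
_, used_one = fit_gmm_1d(petal_lengths, max_iter=1)
print(used_one)
```

In this sketch the limit is an upper bound: the loop may finish earlier if the log-likelihood stops changing, and raising the limit only matters when convergence has not been reached within it.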