pca

Perform principal component analysis (PCA) on numerical columns in a dataset.

PCA is a popular dimensionality reduction technique. It transforms data into new coordinates such that most of the variation in the data is described with fewer dimensions than the original data.

The pca command adds new columns to the dataset representing the computed principal components.

PCA does not work with missing values, so rows with missing (NA) values in the analysed columns are automatically removed before PCA is applied.

Usage

gurita pca [-h] [-c NAME [NAME ...]] [--prefix PREFIX] [-n COMPONENTS]

Arguments

Argument

Description

Reference

  • -h

  • --help

display help for this command

help

  • -c COLUMN [COLUMN...]

  • --col COLUMN [COLUMN...]

apply PCA to these columns

PCA columns

  • -n COMPONENTS

  • --ncomps COMPONENTS

number of principal components to generate (default: 2)

number of components

--prefix PREFIX

choose a column name prefix for new PCA columns (default: pc)

new column name prefix

Simple example

The following command computes the first two principal components of the numerical columns in the iris.csv file:

gurita pca < iris.csv

The output is quite long so we can adjust the command to look at only the first few rows using the head command:

gurita pca + head < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species,pc1,pc2
5.1,3.5,1.4,0.2,setosa,-2.2645417283948928,0.5057039027737834
4.9,3.0,1.4,0.2,setosa,-2.0864255006161576,-0.6554047293691353
4.7,3.2,1.3,0.2,setosa,-2.367950449062523,-0.3184773108472487
4.6,3.1,1.5,0.2,setosa,-2.3041971611520085,-0.5753677125331953
5.0,3.6,1.4,0.2,setosa,-2.388777493505642,0.6747673967025161

Two new columns are added to the output dataset, called pc1 and pc2, representing the first and second principal components of the numerical columns in the input data.

It is useful to visualise the principal components in a scatter plot, which can be achieved by chaining with the scatter command:

gurita pca + scatter -x pc1 -y pc2 < iris.csv

The output of the above command is written to scatter.pc1.pc2.png:

Scatter plot showing comparing principal components pc1 and pc2

Getting help

The full set of command line arguments for pca can be obtained with the -h or --help arguments:

gurita pca -h

Perform PCA on specified columns

-c NAME [NAME ...], --col NAME [NAME ...]

By default, if no column names are specified, PCA is performed on all of the numerical columns in the dataset.

However it is possible to perform PCA on a specific subset of columns via the -c/--col argument.

For example, the following command performs PCA on just the columns sepal_length, sepal_width, and petal_length (and hence ignores the petal_width column):

gurita pca -c sepal_length sepal_width petal_length < iris.csv

Note

Non-numeric columns will be ignored by pca even if they are specified as arguments to -c/--col.

Choose the number of principal components to generate

-n COMPONENTS, --ncomps COMPONENTS

By default the pca command will computer the first two principal components on the input dataset.

This can be adjusted using the -n/--ncomps argument.

For example, the following command generates the first three principal components of all the numerical columns in the iris.csv file:

gurita pca -n 3 < iris.csv

The output is quite long so we can adjust the command to look at only the first few rows using the head command:

gurita pca -n 3 + head < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species,pc1,pc2,pc3
5.1,3.5,1.4,0.2,setosa,-2.2645417283948928,0.5057039027737834,-0.12194334778175248
4.9,3.0,1.4,0.2,setosa,-2.0864255006161576,-0.6554047293691353,-0.2272508323992485
4.7,3.2,1.3,0.2,setosa,-2.367950449062523,-0.3184773108472487,0.05147962364496831
4.6,3.1,1.5,0.2,setosa,-2.3041971611520085,-0.5753677125331953,0.09886044443740284
5.0,3.6,1.4,0.2,setosa,-2.388777493505642,0.6747673967025161,0.021427848973115345

Note

Let N be the number of rows and C be the number of numerical columns considered in a PCA.

Let M = minimum(N, C).

The number of principal components must be <= M.

An error message will be generated if this condition is not met.

For example, there are 150 rows and 4 numerical columns in the iris.csv. Therefore a PCA applied to all the numerical columns must not request more than 4 principal components, because 4 is the minimum(150, 4). Thus the following command will generate an error:

gurita pca -n 5 < iris.csv

Choose a column name prefix for new PCA columns

--prefix PREFIX

The pca command adds extra numerical columns to the dataset to store the values for the computed principal components.

The principal components are integers from 1 upwards (1, 2, 3, …). The names of these extra columns are constructed by adding the preix pca on to the number of the component, for example pc1 for the first component, pc2 for the second component and so on.

This can be changed with the --prefix argument.

The following command specifies that comp should be used as the prefix for the newly added columns:

gurita pca --prefix comp < iris.csv

By chaining this command with head we can inspect the first few rows of the output:

gurita pca --prefix comp + head < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species,comp1,comp2
5.1,3.5,1.4,0.2,setosa,-2.2645417283948928,0.5057039027737834
4.9,3.0,1.4,0.2,setosa,-2.0864255006161576,-0.6554047293691353
4.7,3.2,1.3,0.2,setosa,-2.367950449062523,-0.3184773108472487
4.6,3.1,1.5,0.2,setosa,-2.3041971611520085,-0.5753677125331953
5.0,3.6,1.4,0.2,setosa,-2.388777493505642,0.6747673967025161

Observe that the two new principal component columns are called comp1 and comp2 respectively.