box

Box (or box-and-whisker) plots show the distribution of values in a numerical column, optionally grouped by categorical columns. The distribution is summarised by the median and the inter-quartile range, with outliers shown as separate diamond-shaped points.
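
To make the terms above concrete, the following sketch computes the values a box plot typically draws for one group of numbers. It assumes the common convention, used by default in Seaborn and Matplotlib, that whiskers extend to the furthest data points within 1.5 times the inter-quartile range of the box, and that anything beyond is an outlier. The function name box_summary and the sample ages are made up for illustration; they are not part of gurita.

```python
import statistics


def box_summary(values, whis=1.5):
    """Return the quantities a box plot draws for one group of numbers.

    Assumption: whiskers reach the furthest data points that lie within
    `whis` times the inter-quartile range of the box edges.
    """
    data = sorted(values)
    # "inclusive" interpolates linearly between neighbouring data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical ages, not taken from titanic.csv.
ages = [2, 19, 22, 24, 26, 28, 30, 35, 38, 80]
summary = box_summary(ages)
print(summary["q1"], summary["median"], summary["q3"])  # 22.5 27.0 33.75
print(summary["outliers"])  # [2, 80]
```

In this sample the box spans 22.5 to 33.75, the whiskers stop at 19 and 38, and the values 2 and 80 would be drawn as separate outlier points.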

Usage

gurita box [-h] [-x COLUMN] [-y COLUMN] ... other arguments ...

Arguments

Argument | Description | Reference
-h | display help | help
-x COLUMN, --xaxis COLUMN | select column for the X axis | X axis
-y COLUMN, --yaxis COLUMN | select column for the Y axis | Y axis
--orient {v,h} | orientation of plot. v = vertical, h = horizontal. Default: v. | orient
--nooutliers | do not display outlier data | no outliers
--strip | overlay data points using a strip plot | strip
--order VALUE [VALUE ...] | control the order of the plotted boxes | order
--hue COLUMN | colour and/or group columns by hue | hue
--hueorder VALUE [VALUE ...] | order of hue columns | hue order
--logx | log scale X axis | log X axis
--logy | log scale Y axis | log Y axis
--xlim BOUND BOUND | range limit X axis | limit X axis
--ylim BOUND BOUND | range limit Y axis | limit Y axis
--frow COLUMN | column to use for facet rows | facet rows
--fcol COLUMN | column to use for facet columns | facet columns
--fcolwrap INT | wrap the facet column at this width, to span multiple rows | facet wrap

See also

Similar functionality to box plots is provided by:

Box plots are based on Seaborn’s catplot library function, using the kind="box" option.
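
As a rough illustration of that relationship, the sketch below shows one way the box plot options could be translated into keyword arguments for Seaborn's catplot. This is an assumption for illustration only, not gurita's actual source code. The function name catplot_kwargs is hypothetical; the keyword names (kind, x, y, orient, hue, order, hue_order, row, col, col_wrap) are catplot parameters, and showfliers is assumed to be passed through to the underlying box plot.

```python
def catplot_kwargs(x=None, y=None, orient="v", hue=None, order=None,
                   hue_order=None, frow=None, fcol=None, fcolwrap=None,
                   nooutliers=False):
    """Build keyword arguments for seaborn.catplot(kind="box").

    Hypothetical mapping from the options documented on this page;
    gurita's real implementation may differ.
    """
    kwargs = {
        "kind": "box",
        "x": x,
        "y": y,
        "orient": orient,
        "hue": hue,
        "order": order,
        "hue_order": hue_order,
        "row": frow,
        "col": fcol,
        "col_wrap": fcolwrap,
        # Assumption: outlier markers are switched off via showfliers.
        "showfliers": not nooutliers,
    }
    # Drop options that were not given so Seaborn's defaults apply.
    return {key: value for key, value in kwargs.items() if value is not None}


example = catplot_kwargs(x="class", y="age", hue="sex", nooutliers=True)
print(example)
# With pandas and seaborn installed, the result could be used as:
#   import pandas as pd, seaborn as sns
#   sns.catplot(data=pd.read_csv("titanic.csv"), **example)
```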

Simple example

Box plot of the age numerical column from the titanic.csv input file:

gurita box -y age < titanic.csv

The output of the above command is written to box.age.png:

Box plot showing the distribution of age for the titanic data set

The plotted numerical column can be divided into groups based on a categorical column. In the following example the distribution of age is shown for each value in the class column:

gurita box -y age -x class < titanic.csv

The output of the above command is written to box.class.age.png:

Box plot showing the distribution of age for each class in the titanic data set

Getting help

The full set of command line arguments for box plots can be obtained with the -h or --help arguments:

gurita box -h

Selecting columns to plot

-x COLUMN, --xaxis COLUMN
-y COLUMN, --yaxis COLUMN

Box plots can be plotted for numerical columns and optionally grouped by categorical columns.

If no categorical column is specified, a single column box plot will be generated showing the distribution of the numerical column.

Note

By default the orientation of the box plot is vertical. In this scenario the numerical column is specified by -y, and the (optional) categorical column is specified by -x.

However, the orientation of the box plot can be made horizontal using the --orient h argument. In this case the roles of the X and Y axes are swapped relative to the default: the numerical column is specified by -x, and the (optional) categorical column is specified by -y.

In the following example the distribution of age is shown for each value in the class column, where the boxes are plotted horizontally:

gurita box -x age -y class --orient h < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, shown horizontally

Turn off display of outlier points

Outlier data points are shown in box plots by default as small diamonds. This can be turned off with the --nooutliers option.

This is particularly useful in conjunction with --strip: the strip plot draws every data point, including the outliers, as a circular dot, so displaying the diamond markers as well shows the same points twice and can be confusing.

gurita box -y age -x class --nooutliers < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, with display of outlier points turned off

Overlay data points using a strip plot

Individual data points can be overlaid on top of the box plot using the --strip option.

gurita box -y age -x class --strip --nooutliers < titanic.csv

Note that in the example above we also turn off the display of outlier points with --nooutliers.

Box plot showing the distribution of age for each class in the titanic data set, with data points overlaid on top as a strip plot, and outliers turned off

Controlling the order of the boxes

--order VALUE [VALUE ...]

By default, the order of the boxes is determined by the order in which the categorical values occur in the input data. This can be overridden with the --order argument, which lets you specify the exact ordering of the boxes by listing the values.

In the following example the boxes for the class column are displayed in the order First, Second, Third:

gurita box -y age -x class --order First Second Third < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, shown in a specified order
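
The difference between the default and an explicit ordering can be sketched in a few lines of Python. This assumes that "occurrence in the input data" means the order in which each value first appears; the function name grouped and the sample rows are made up for illustration and are not taken from gurita or titanic.csv.

```python
def grouped(rows, order=None):
    """Group numerical values by category, in plotting order.

    Without `order`, categories keep the order of their first appearance
    (an assumption about the default). With `order`, the listed sequence
    is used instead.
    """
    groups = {}
    for category, value in rows:
        groups.setdefault(category, []).append(value)
    if order is None:
        return groups
    return {category: groups.get(category, []) for category in order}


# Hypothetical (class, age) pairs.
rows = [("Third", 22), ("First", 38), ("Third", 26), ("Second", 14), ("First", 35)]

print(list(grouped(rows)))                                    # ['Third', 'First', 'Second']
print(list(grouped(rows, order=["First", "Second", "Third"])))  # ['First', 'Second', 'Third']
```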

Colour and/or group columns with hue

--hue COLUMN

Each box can be coloured and optionally subdivided into additional categories with the --hue argument.

The following example generates a box plot showing the distribution of the age of titanic passengers across the three different ticket classes, where each class is coloured differently:

gurita box -y age -x class --hue class < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, grouped by class and coloured by class

In the following example the distribution of age is shown for each value in the class column, and further sub-divided by the sex column:

gurita box -y age -x class --hue sex < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, grouped by class and sex

By default, the order of the boxes within each hue group is determined by the order in which the hue values occur in the input data. This can be overridden with the --hueorder argument, which lets you specify the exact ordering within each hue group by listing the values.

In the following example the sex values are displayed in the order of female, male:

gurita box -y age -x class --hue sex --hueorder female male < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, grouped by class and sex, with ordering specified for sex

It is also possible to use both --order and --hueorder in the same command. For example, the following command controls the order of both the class and sex categorical columns:

gurita box -y age -x class --order First Second Third --hue sex --hueorder female male < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, grouped by class and sex, with ordering specified for class and sex

Log scale

--logx
--logy

The distribution of numerical values can be displayed in log (base 10) scale with --logx and --logy.

It only makes sense to log-scale the numerical axis (and not the categorical axis). Therefore, --logx should be used when numerical columns are selected with -x, and conversely, --logy should be used when numerical columns are selected with -y.

For example, the following command displays a log scale box plot of the age column grouped by class. Because the numerical data is displayed on the Y axis (-y), the --logy argument is used to log-scale the distribution:

gurita box -y age -x class --logy < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, with Y axis in log scale
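
As a small aside on what a base 10 log scale does, the sketch below computes where values land on such an axis: each tenfold increase moves a value the same distance along the axis. The function name log10_positions is made up for illustration. The check for non-positive values reflects the mathematics of logarithms, not any documented gurita behaviour.

```python
import math


def log10_positions(values):
    """Return where positive values land on a base 10 log axis."""
    if any(v <= 0 for v in values):
        raise ValueError("a log scale can only show values greater than zero")
    return [math.log10(v) for v in values]


positions = log10_positions([1, 10, 100])
print(positions)  # [0.0, 1.0, 2.0]
```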

Axis range limits

--xlim LOW HIGH
--ylim LOW HIGH

The range of displayed numerical distributions can be restricted with --xlim and --ylim. Each of these flags takes two numerical values as arguments that represent the lower and upper bounds of the range to be displayed.

It only makes sense to range-limit the numerical axis (and not the categorical axis). Therefore, --xlim should be used when numerical columns are selected with -x, and conversely, --ylim should be used when numerical columns are selected with -y.

For example, the following command displays a range-limited box plot of the age column grouped by class. Because the numerical data is displayed on the Y axis (-y), the --ylim argument is used to restrict the displayed range:

gurita box -y age -x class --ylim 10 30 < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, with the Y axis limited to the range 10 to 30
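
The rule shared by the orientation, log scale and range limit sections is that the flags must follow the numerical axis. The sketch below captures this in a hypothetical helper, box_command, which assembles a gurita command line from the options documented on this page. The helper is not part of gurita; it only builds the argument list and does not run anything.

```python
import shlex


def box_command(numerical, categorical=None, orient="v", log=False, limits=None):
    """Assemble arguments for `gurita box`, keeping flags on the numerical axis.

    Hypothetical helper for illustration only.
    """
    if orient not in ("v", "h"):
        raise ValueError("orient must be 'v' or 'h'")
    # Vertical plots put numbers on Y; horizontal plots put them on X.
    num_axis, cat_axis = ("y", "x") if orient == "v" else ("x", "y")
    args = ["gurita", "box", f"-{num_axis}", numerical]
    if categorical is not None:
        args += [f"-{cat_axis}", categorical]
    if orient == "h":
        args += ["--orient", "h"]
    if log:
        args.append(f"--log{num_axis}")
    if limits is not None:
        low, high = limits
        args += [f"--{num_axis}lim", str(low), str(high)]
    return args


vertical = shlex.join(box_command("age", "class", log=True))
horizontal = shlex.join(box_command("age", "class", orient="h", limits=(10, 30)))
print(vertical)    # gurita box -y age -x class --logy
print(horizontal)  # gurita box -x age -y class --orient h --xlim 10 30
```

The input file is still supplied on standard input when the command is run, for example by appending < titanic.csv.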

Facets

--frow COLUMN
--fcol COLUMN
--fcolwrap INT

Box plots can be further divided into facets, generating a matrix of box plots, where a numerical value is further categorised by up to 2 more categorical columns.

See the facet documentation for more information on this feature.

The following command creates a faceted box plot where the sex column is used to determine the facet columns:

gurita box -y age -x class --fcol sex < titanic.csv
Box plot showing the distribution of age for each class in the titanic data set, using sex to determine the plot facets
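
The layout of the resulting matrix of plots can be sketched as follows. The function facet_grid_shape is hypothetical and only illustrates the arithmetic: one panel per combination of facet row and facet column value, with --fcolwrap spreading the facet columns over several rows. It assumes, following Seaborn's documentation for col_wrap, that wrapping cannot be combined with facet rows; how gurita reports that combination is not covered here.

```python
import math


def facet_grid_shape(n_col_values, n_row_values=1, col_wrap=None):
    """Return (rows, columns) of the facet grid.

    Hypothetical helper; assumes column wrapping and facet rows are
    mutually exclusive, as Seaborn documents for col_wrap.
    """
    if col_wrap is not None:
        if n_row_values != 1:
            raise ValueError("column wrapping cannot be combined with facet rows")
        return (math.ceil(n_col_values / col_wrap), min(col_wrap, n_col_values))
    return (n_row_values, n_col_values)


print(facet_grid_shape(2))                  # (1, 2): e.g. --fcol with two values
print(facet_grid_shape(3, n_row_values=2))  # (2, 3): --fcol and --frow together
print(facet_grid_shape(5, col_wrap=3))      # (2, 3): five facets wrapped at width 3
```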