hist (histogram)

Plot distributions of selected numerical or categorical columns as histograms.

Usage

gurita hist [-h] [-x COLUMN] [-y COLUMN] ... other arguments ...

Arguments

Argument	Description	Reference
`-h`	display help	help
`-x COLUMN` `--xaxis COLUMN`	select column for the X axis	X axis
`-y COLUMN` `--yaxis COLUMN`	select column for the Y axis	Y axis
`--bins NUM`	number of bins	bins
`--binwidth NUM`	width of bins, overrides `--bins`	bin width
`--cumulative`	plot a cumulative histogram	cumulative
`--hue COLUMN`	group columns by hue	hue
`--multiple {layer,stack,dodge,fill}`	how to show overlapping histograms (default: layer)	hue
`--stat {count,frequency,probability,proportion,percent,density}`	statistic to use for each bin (default: count)	stat
`--indnorm`	normalise each histogram in the plot independently	independent normalisation
`--kde`	overlay a kernel density estimate (kde) as a line	kernel density estimation
`--nofill`	use unfilled histogram bars instead of solid coloured bars	no fill
`--element {bars,step,poly}`	style of histogram bars (default is bars)	element
`--logx`	log scale X axis	log X axis
`--logy`	log scale Y axis	log Y axis
`--xlim BOUND BOUND`	range limit X axis	limit X axis
`--ylim BOUND BOUND`	range limit Y axis	limit Y axis
`--frow COLUMN`	column to use for facet rows	facet rows
`--fcol COLUMN`	column to use for facet columns	facet columns
`--fcolwrap INT`	wrap the facet column at this width, to span multiple rows	facet wrap

Simple examples

Plot a histogram of the tip amount from the tips.csv input file:

gurita hist -x tip < tips.csv

The output of the above command is written to hist.tip.png:

Histogram plot showing the distribution of tip amounts for the tips data set

Plot a count of the different categorical values in the day column:

gurita hist -x day < tips.csv

The output of the above command is written to hist.day.png:

Histogram plot showing the count of the different categorical values in the day column

Getting help

The full set of command line arguments for histograms can be obtained with the -h or --help arguments:

gurita hist -h

Selecting columns to plot

-x COLUMN, --xaxis COLUMN
                      Feature to plot along the X axis
-y COLUMN, --yaxis COLUMN
                      Feature to plot along the Y axis

Histograms can be plotted for both numerical and categorical columns. Numerical data is binned and the histogram shows the count of data points in each bin. Categorical data is shown as a count plot with a bar for each categorical value in the specified column.
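As a rough illustration of the difference between the two cases, the following Python sketch bins a numerical list into equal-width bins and counts the distinct values of a categorical list. It is not gurita's implementation: the helper `bin_counts`, the choice of equal-width bins, and the sample values are all assumptions made for the example.

```python
from collections import Counter


def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    counts = [0] * num_bins
    if high == low:
        # Degenerate case: every value is identical, so use the first bin.
        counts[0] = len(values)
        return counts
    width = (high - low) / num_bins
    for v in values:
        # The maximum value is placed in the last bin (inclusive right edge).
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical illustrative values, not taken from tips.csv.
tips = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 5.0]
days = ["Sun", "Sat", "Sun", "Thur", "Sat", "Sun", "Fri"]

numeric_histogram = bin_counts(tips, num_bins=4)  # one count per bin
category_counts = Counter(days)                   # one count per category
```

The numerical case needs a binning rule before anything can be counted, whereas the categorical case counts each distinct value directly.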

You can select the column that you want to plot as a histogram using the -x/--xaxis or -y/--yaxis arguments.

If -x is chosen the histogram columns will be plotted vertically.

If -y is chosen the histogram columns will be plotted horizontally.

If both -x and -y are specified then a heatmap will be plotted.

See the example above for a vertical axis plot. For comparison, the following command uses -y tip to plot a histogram of tip horizontally:

gurita hist -y tip < tips.csv

Histogram of two columns (bivariate heatmaps)

Bivariate histograms (two columns) can be plotted by specifying both -x and -y.

In the following example the distribution of tip is compared to the distribution of total_bill. The result is shown as a heatmap:

gurita hist -x tip -y total_bill < tips.csv

Bivariate histogram plot showing the distribution of tip against total_bill

Bivariate histograms also work with categorical variables and combinations of numerical and categorical variables.
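Conceptually, a bivariate histogram counts data points in a two-dimensional grid of bins, and the heatmap colours each cell by its count. The Python sketch below shows that idea for two numerical columns. The helper names, the equal-width binning, and the sample values are assumptions for illustration and do not describe gurita's internals.

```python
def bin_index(value, low, high, bins):
    """Index of the equal-width bin that contains value."""
    if high == low:
        return 0
    # The maximum value is placed in the last bin.
    return min(int((value - low) / (high - low) * bins), bins - 1)


def bivariate_counts(xs, ys, x_bins, y_bins):
    """Return a grid where grid[row][col] counts points in that 2D bin."""
    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    grid = [[0] * x_bins for _ in range(y_bins)]
    for x, y in zip(xs, ys):
        col = bin_index(x, x_low, x_high, x_bins)
        row = bin_index(y, y_low, y_high, y_bins)
        grid[row][col] += 1
    return grid


# Hypothetical illustrative values, not taken from tips.csv.
tip = [1.0, 2.0, 3.0, 5.0]
total_bill = [10.0, 20.0, 20.0, 50.0]

grid = bivariate_counts(tip, total_bill, x_bins=2, y_bins=2)
```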

Number of bins

For numerical columns, by default gurita will try to automatically pick an appropriate number of bins for the selected column.

However, this can be overridden by specifying the required number of bins to use with the --bins argument like so:

gurita hist -x tip --bins 5 < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set, using 5 bins

Width of bins

For numerical columns, by default gurita will try to automatically pick an appropriate bin width for the selected column.

However, this can be overridden by specifying the required bin width to use with the --binwidth argument like so:

gurita hist -x tip --binwidth 3 < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set, using bins of width 3

Note that --binwidth overrides the --bins parameter.
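The relationship between the two options can be sketched in Python as follows. The function `choose_edges` is hypothetical and only mirrors the documented precedence of --binwidth over --bins. How gurita aligns the first and last bin edges is not stated here, so the alignment used below (starting at the minimum and extending past the maximum if needed) is an assumption.

```python
import math


def choose_edges(low, high, bins=None, binwidth=None):
    """Return bin edges covering the range from low to high.

    When binwidth is given it takes precedence over bins.
    """
    if binwidth is not None:
        # Use as many fixed-width bins as are needed to cover the range.
        bins = max(1, math.ceil((high - low) / binwidth))
    elif bins is not None:
        binwidth = (high - low) / bins
    else:
        raise ValueError("specify bins or binwidth")
    return [low + i * binwidth for i in range(bins + 1)]


edges_by_count = choose_edges(0, 10, bins=5)
edges_by_width = choose_edges(0, 10, bins=5, binwidth=3)  # bins is ignored
```

In the second call the requested number of bins has no effect, because the bin width alone determines how many bins are needed.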

Cumulative histograms

Cumulative histograms can be plotted with the --cumulative argument.

gurita hist -x tip --cumulative < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set in cumulative style
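In a cumulative histogram each bar shows the total of its own bin plus all bins to its left. A minimal Python sketch of that running total, using hypothetical per-bin counts rather than values from tips.csv:

```python
from itertools import accumulate

# Hypothetical per-bin counts, ordered from the lowest bin to the highest.
counts = [2, 2, 2, 1]

# Running total: each entry adds the current bin to everything before it.
cumulative_counts = list(accumulate(counts))
```

The final entry always equals the total number of data points.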

Show distributions of categorical subsets using hue

--hue COLUMN

The distribution of categorical subsets of the data can be shown with the --hue argument.

In the following example the distribution of the tip column is divided into two subsets based on the categorical smoker column. Each subset is plotted as its own histogram, and the histograms are layered on top of each other:

gurita hist -x tip --hue smoker < tips.csv

Histogram showing the distribution of tip divided into subsets based on the smoker column

The default behaviour is to layer overlapping histograms on top of each other, as demonstrated in the above plot.

The --multiple parameter lets you choose alternative ways to show overlapping histograms. The example below shows the two histograms stacked on top of each other:

gurita hist -x tip --hue smoker --multiple stack < tips.csv

Histogram showing the distribution of tip divided into subsets based on the smoker column, with overlapping histograms stacked

The --multiple parameter supports the following values: layer (default), stack, dodge, and fill.

The following example shows the effect of --multiple dodge, where the bars for each category are shown next to each other:

gurita hist -x tip --hue smoker --multiple dodge < tips.csv

Histogram showing the distribution of tip divided into subsets based on the smoker column, with overlapping histograms side-by-side

The following example shows the effect of --multiple fill, where counts are normalised to a proportion, and bars are filled so that all categories sum to 1:

gurita hist -x tip --hue smoker --multiple fill < tips.csv

Histogram showing the distribution of tip divided into subsets based on the smoker column, with overlapping histograms filled to proportions
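The normalisation performed by fill can be sketched in Python: within each bin, every category's count is divided by the bin's total across all categories. The function name, the handling of empty bins, and the sample counts are assumptions for illustration only.

```python
def fill_proportions(group_counts):
    """Convert per-bin counts to per-bin proportions across groups.

    group_counts maps each hue value to a list of counts, one per bin,
    with all groups sharing the same bins.
    """
    num_bins = len(next(iter(group_counts.values())))
    totals = [sum(c[i] for c in group_counts.values()) for i in range(num_bins)]
    return {
        group: [
            # Assumption: an empty bin contributes 0 rather than an error.
            counts[i] / totals[i] if totals[i] else 0.0
            for i in range(num_bins)
        ]
        for group, counts in group_counts.items()
    }


# Hypothetical per-bin counts for each value of smoker.
proportions = fill_proportions({"Yes": [1, 3, 2], "No": [3, 1, 2]})
```

For every non-empty bin the proportions of all categories add up to 1, which is why the bars fill the full height of the plot.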

Histogram statistic

By default, histograms show a count of the number of values in each bin. This can be changed with the --stat {count,frequency,probability,proportion,percent,density} argument:

gurita hist -x tip --stat proportion < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set showing the proportion statistic for each bin
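The sketch below shows one way the statistics could be derived from raw bin counts. It assumes the definitions used by seaborn's `histplot` (frequency divides by the bin width, density makes the total bar area equal 1), which may not match gurita in every detail. The function `bin_statistic` and the sample counts are hypothetical.

```python
def bin_statistic(counts, bin_width, stat="count"):
    """Convert raw per-bin counts into the requested statistic."""
    total = sum(counts)
    if stat == "count":
        return list(counts)
    if stat == "frequency":
        # Observations per unit of the X axis.
        return [c / bin_width for c in counts]
    if stat in ("probability", "proportion"):
        # Bar heights sum to 1.
        return [c / total for c in counts]
    if stat == "percent":
        # Bar heights sum to 100.
        return [100 * c / total for c in counts]
    if stat == "density":
        # Bar areas (height times width) sum to 1.
        return [c / (total * bin_width) for c in counts]
    raise ValueError(f"unknown stat: {stat}")


# Hypothetical counts for four bins that are each 2 units wide.
counts = [1, 3, 4, 2]
proportion = bin_statistic(counts, bin_width=2, stat="proportion")
percent = bin_statistic(counts, bin_width=2, stat="percent")
density = bin_statistic(counts, bin_width=2, stat="density")
```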

Independent normalised statistics

The --stat argument allows the use of the following normalising statistics:

probability
proportion (same as probability)
percent
density

In plots with multiple histograms for categorical subsets using --hue, by default these statistics are normalised across the entire dataset. This behaviour can be changed with --indnorm so that the normalisation happens within each categorical subset.

Compare the following plots, which show histograms of the tip column for each value of smoker using proportion as the statistic.

In the example below the default normalisation occurs, across the entire dataset:

gurita hist -x tip --hue smoker --stat proportion --multiple dodge < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set showing the proportion statistic for each bin and global normalisation

And now the same command as above, but with the --indnorm argument supplied, so that each value of smoker is normalised independently:

gurita hist -x tip --hue smoker --stat proportion --multiple dodge --indnorm < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set showing the proportion statistic for each bin and independent normalisation
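The difference between the two normalisation modes comes down to the denominator. The Python sketch below illustrates this for the proportion statistic. The function `normalise` and the sample counts are hypothetical and are not taken from gurita or from tips.csv.

```python
def normalise(group_counts, independent=False):
    """Convert per-bin counts to proportions for each hue group.

    With independent=False every group is divided by the size of the whole
    dataset. With independent=True each group is divided by its own size.
    """
    grand_total = sum(sum(counts) for counts in group_counts.values())
    result = {}
    for group, counts in group_counts.items():
        denominator = sum(counts) if independent else grand_total
        result[group] = [c / denominator for c in counts]
    return result


# Hypothetical per-bin counts for each value of smoker.
group_counts = {"Yes": [1, 1], "No": [4, 2]}

common = normalise(group_counts)                         # default behaviour
independent = normalise(group_counts, independent=True)  # like --indnorm
```

With common normalisation the bars of all groups together sum to 1, so a small group has short bars. With independent normalisation each group sums to 1 on its own, which makes the shapes of groups of different sizes easier to compare.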

Kernel density estimate

A kernel density estimate can be plotted with the --kde argument.

gurita hist -x tip --kde < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set with a kernel density overlaid as a line
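For readers unfamiliar with kernel density estimation, the following Python sketch shows the general idea with a Gaussian kernel: a small bell curve is centred on every data point and the curves are averaged into one smooth line. The fixed bandwidth of 0.5 and the sample values are assumptions for illustration. How gurita chooses the bandwidth is not covered here.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    n = len(samples)
    # Normalising constant so that the estimated density integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical illustrative values, not taken from tips.csv.
samples = [1.0, 2.0, 2.5, 4.0]
density = gaussian_kde(samples, bandwidth=0.5)

# Evaluate the smooth curve at a few points along the X axis.
curve = [density(x) for x in (1.0, 2.0, 3.0)]
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual data points more closely.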

Unfilled histogram bars

By default histogram bars are drawn with a solid fill. This can be changed with --nofill, which uses unfilled bars instead:

gurita hist -x tip --nofill < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set with unfilled bars

Visual style of univariate histograms

By default univariate histograms are visualised as bars. This can be changed with --element {bars,step,poly} which allows alternative renderings.

The example below shows the step visual style.

gurita hist -x tip --element step < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set using a step visualisation style

The example below shows the poly (polygon) visual style, with vertices in the centre of each bin.

gurita hist -x tip --element poly < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set using a polygon visualisation style

Log scale

--logx
--logy

The distribution of numerical values can be displayed in log (base 10) scale with --logx and --logy.

gurita hist -x tip --logy < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set with log scale on the Y axis

Axis range limits

--xlim LOW HIGH
--ylim LOW HIGH

The range of displayed numerical distributions can be restricted with --xlim and --ylim. Each of these flags takes two numerical values as arguments that represent the lower and upper bounds of the range to be displayed.

gurita hist -x tip --xlim 3 8 < tips.csv

Histogram plot showing the distribution of tip amounts for the tips data set with the X axis limited in range from 3 to 8