normtest

Test whether numerical columns differ from a normal distribution.

Specifically this command tests a null hypothesis that data comes from a normal distribution and computes a p-value for that test.

If the p-value is less than or equal to some threshold for significance (e.g. p <= 0.05) then the null-hypothesis can be rejected, concluding that the data is not likely to come from a normal distribution.

Otherwise, if the p-value is greater than the threshold of significance (e.g. p > 0.05) then we cannot reject the null-hypothesis and we may conclude that the data likely comes from a normal distribution.

Usage

gurita normtest [-h] [-c [COLUMN ...]] [-m {dagostino,shapiro}] [-a ALPHA] [-p] [-s]

Arguments

Argument

Description

Reference

  • -h

  • --help

display help for this command

help

  • -c COLUMN [COLUMN...]

  • --col COLUMN [COLUMN...]

test these numerical columns

test specific columns

  • -a ALPHA

  • --alpha ALPHA

threshold of significance

threshold

  • -p

  • --pvalue

report p-value in output

p-value

  • -s

  • --stat

report test statistic in output

statistic

  • -m {dagostino,shapiro}

  • --method {dagostino,shapiro}

method for testing normality

method

By default D’Agostino’s K^2 test is applied. However, it is also possible to use the Shapiro-Wilk test instead.

If no columns are specified Gurita will test all numerical columns in the dataset one at a time.

By default the p-value and test statistic are not included in the output, but this can be overridden with the -p/--pvalue and -s/--stat arguments.

Simple example

The following example tests all the numerical columns in iris.csv independently to see if they each differ from the normal distribution.

gurita normtest < iris.csv

The output of the above command is shown below:

column,is_normal
sepal_length,True
sepal_width,True
petal_length,False
petal_width,False

The output contains two columns:

  1. column: this is the name of the numerical column of data being tested

  2. is_normal: a boolean value, True if the corresponding column is likely from a normal distribution and False otherwise.

Looking at the above results we see that, under the default parameters of the test, sepal_length and sepal_width appear to be normal, but petal_length and petal_width do not appear to be normal. Indeed, plotting the histogram for petal_length or petal_width suggests that they are multi-modal.

Note that iris.csv also contains a categoritcal column called species. The normality test only applies to numerical columns, therefore non-numerical columns are ignored.

Getting help

The full set of command line arguments for normtest can be obtained with the -h or --help arguments:

gurita normtest -h

Choose columns to test for normality

-c COLUMN [COLUMN...]
--col COLUMN [COLUMN...]

By default normtest will test all numerical columns for normality, however the --col/-c option allows you to test only specific columns.

For example the following command tests only the sepal_length column from the iris.csv dataset:

gurita normtest -c sepal_length < iris.csv

The output of the above command is as follows:

column,is_normal
sepal_length,True

Multiple specific columns can be tested as the following example demonstrates:

gurita normtest -c sepal_length petal_length < iris.csv

The output of the above command is as follows:

column,is_normal
sepal_length,True
petal_length,False

Note that non-numerical columns are ignored, even if they are specified using --col/-c. For example the following command tries to test of the categorical species column is normal:

gurita normtest -c species < iris.csv

It is not logical to make this test, so the output of the command is just an empty dataset, with only a header row and no data rows:

column,is_normal

Threshold for the p-value

-a ALPHA
--alpha ALPHA

The normtest command computes a p-value for the null-hypothesis that the data comes from a normal distribution.

A threshold called alpha is applied to the p-value to decide whether or not the data differs from a normal distrbution.

By default alpha is set to 0.05, as is often done by convention.

If the p-value is less than or equal to alpha, the null-hypothesis is rejected and we conclude that the data differs from a normal distribution (i.e. the data is not normally distributed). Alternatively, if the p-value is greater than alpha then we cannot reject the null-hypothesis, and therefore we conclude that the data does not differ from a normal distrbution.

In summary:

  • p-value <= alpha: data is not normally distributed

  • p-value > alpha: data is normally distributed

The value of the alpha threshold can be adjusted using the -a/--alpha argument:

gurita normtest --alpha 0.06 < iris.csv

The above command produces the following output:

column,is_normal
sepal_length,False
sepal_width,True
petal_length,False
petal_width,False

Note that the sepal_length column is no longer considered normally distributed when alpha is 0.06 (previously it was considered normal when alpha was equal to the defualt of 0.05). This suggests that the our confidence of sepal_length being normally distributed is not as strong as it is for sepal_width.

Report the p-value and/or the test statistic in the output

-p, --pvalue
-s, --stat

By default Gurita does not show the p-value or the test statistic in the output data.

This behaviour can be changed with -p/--pvalue, which causes the p-value to be included in the output, and -s/--stat which causes the test statistic to be included in the output.

gurita normtest --pvalue --stat < iris.csv

The output of the above command is shown below:

column,is_normal,p_value,stat
sepal_length,True,0.05682424941067306,5.735584236235733
sepal_width,True,0.1672407178723714,3.576642160069695
petal_length,False,8.677871269019617e-49,221.33178660723647
petal_width,False,1.991810150572055e-30,136.77701788227716

Choose the method for testing normality

-m {dagostino,shapiro}, --method {dagostino,shapiro}

By default D’Agostino’s K^2 test is applied. However, it is also possible to use the Shapiro-Wilk test instead.

The -m/--method option allows the method to be specified, and can be either dagostino (the default) or shapiro.

The following example illustrates the use of the Shapiro-Wilk test:

gurita normtest --method shapiro < iris.csv

The output of the above command is shown below:

column,is_normal
sepal_length,False
sepal_width,True
petal_length,False
petal_width,False

In this case we can see that under default parameters the sepal_length column is not considered normally distributed, whereas it was using the D’Agostino’s K^2 test.