normtest
Test whether numerical columns differ from a normal distribution.
Specifically this command tests a null hypothesis that data comes from a normal distribution and computes a p-value for that test.
If the p-value is less than or equal to some threshold for significance (e.g. p <= 0.05) then the null-hypothesis can be rejected, concluding that the data is not likely to come from a normal distribution.
Otherwise, if the p-value is greater than the threshold of significance (e.g. p > 0.05) then we cannot reject the null-hypothesis and we may conclude that the data likely comes from a normal distribution.
Usage
gurita normtest [-h] [-c [COLUMN ...]] [-m {dagostino,shapiro}] [-a ALPHA] [-p] [-s]
Arguments
Argument |
Description |
Reference |
---|---|---|
|
display help for this command |
|
|
test these numerical columns |
|
|
threshold of significance |
|
|
report p-value in output |
|
|
report test statistic in output |
|
|
method for testing normality |
By default D’Agostino’s K^2 test is applied. However, it is also possible to use the Shapiro-Wilk test instead.
If no columns are specified Gurita will test all numerical columns in the dataset one at a time.
By default the p-value and test statistic are not included in the output, but this can be
overridden with the -p/--pvalue
and -s/--stat
arguments.
Simple example
The following example tests all the numerical columns in iris.csv
independently to see if they each differ from the normal distribution.
gurita normtest < iris.csv
The output of the above command is shown below:
column,is_normal
sepal_length,True
sepal_width,True
petal_length,False
petal_width,False
The output contains two columns:
column
: this is the name of the numerical column of data being testedis_normal
: a boolean value, True if the corresponding column is likely from a normal distribution and False otherwise.
Looking at the above results we see that, under the default parameters of the test, sepal_length
and sepal_width
appear to be normal, but petal_length
and petal_width
do not appear to be normal. Indeed, plotting the histogram for petal_length
or petal_width
suggests that they are multi-modal.
Note that iris.csv
also contains a categoritcal column called species
. The normality test
only applies to numerical columns, therefore non-numerical columns are ignored.
Getting help
The full set of command line arguments for normtest
can be obtained with the -h
or --help
arguments:
gurita normtest -h
Choose columns to test for normality
-c COLUMN [COLUMN...]
--col COLUMN [COLUMN...]
By default normtest
will test all numerical columns for normality, however the --col/-c
option allows you to test only specific columns.
For example the following command tests only the sepal_length
column from the iris.csv
dataset:
gurita normtest -c sepal_length < iris.csv
The output of the above command is as follows:
column,is_normal
sepal_length,True
Multiple specific columns can be tested as the following example demonstrates:
gurita normtest -c sepal_length petal_length < iris.csv
The output of the above command is as follows:
column,is_normal
sepal_length,True
petal_length,False
Note that non-numerical columns are ignored, even if they are specified using --col/-c
.
For example the following command tries to test of the categorical species
column is normal:
gurita normtest -c species < iris.csv
It is not logical to make this test, so the output of the command is just an empty dataset, with only a header row and no data rows:
column,is_normal
Threshold for the p-value
-a ALPHA
--alpha ALPHA
The normtest
command computes a p-value for the null-hypothesis that the data comes from
a normal distribution.
A threshold called alpha
is applied to the p-value to decide whether or not the
data differs from a normal distrbution.
By default alpha is set to 0.05, as is often done by convention.
If the p-value is less than or equal to alpha, the null-hypothesis is rejected and we conclude that the data differs from a normal distribution (i.e. the data is not normally distributed). Alternatively, if the p-value is greater than alpha then we cannot reject the null-hypothesis, and therefore we conclude that the data does not differ from a normal distrbution.
In summary:
p-value <= alpha: data is not normally distributed
p-value > alpha: data is normally distributed
The value of the alpha
threshold can be adjusted using the -a/--alpha
argument:
gurita normtest --alpha 0.06 < iris.csv
The above command produces the following output:
column,is_normal
sepal_length,False
sepal_width,True
petal_length,False
petal_width,False
Note that the sepal_length
column is no longer considered normally distributed when alpha is 0.06 (previously it was considered normal when alpha was equal to the defualt of 0.05). This suggests that the our confidence of sepal_length
being normally distributed is not as strong as it is for
sepal_width
.
Report the p-value and/or the test statistic in the output
-p, --pvalue
-s, --stat
By default Gurita does not show the p-value or the test statistic in the output data.
This behaviour can be changed with -p/--pvalue
, which causes the p-value to be included in the output, and -s/--stat
which causes the test statistic to be included in the output.
gurita normtest --pvalue --stat < iris.csv
The output of the above command is shown below:
column,is_normal,p_value,stat
sepal_length,True,0.05682424941067306,5.735584236235733
sepal_width,True,0.1672407178723714,3.576642160069695
petal_length,False,8.677871269019617e-49,221.33178660723647
petal_width,False,1.991810150572055e-30,136.77701788227716
Choose the method for testing normality
-m {dagostino,shapiro}, --method {dagostino,shapiro}
By default D’Agostino’s K^2 test is applied. However, it is also possible to use the Shapiro-Wilk test instead.
The -m/--method
option allows the method to be specified, and can be either dagostino
(the default) or shapiro
.
The following example illustrates the use of the Shapiro-Wilk test:
gurita normtest --method shapiro < iris.csv
The output of the above command is shown below:
column,is_normal
sepal_length,False
sepal_width,True
petal_length,False
petal_width,False
In this case we can see that under default parameters the sepal_length
column is not considered normally distributed, whereas it was
using the D’Agostino’s K^2 test.