.. _normtest:

normtest
========

Test whether numerical columns differ from a normal distribution.

Specifically, this command tests the null hypothesis that the data comes from a normal distribution and computes a p-value for that test.
If the p-value is less than or equal to some threshold of significance (e.g. p <= 0.05) then the null hypothesis can be rejected, concluding that the data is not likely to come from a normal distribution. Otherwise, if the p-value is greater than the threshold of significance (e.g. p > 0.05) then we cannot reject the null hypothesis, and the data is treated as likely coming from a normal distribution.

Usage
-----

.. code-block:: text

    gurita normtest [-h] [-c [COLUMN ...]] [-m {dagostino,shapiro}] [-a ALPHA] [-p] [-s]

Arguments
---------

.. list-table::
    :widths: 25 20 10
    :header-rows: 1
    :class: tight-table

    * - Argument
      - Description
      - Reference
    * - * ``-h``
        * ``--help``
      - display help for this command
      - :ref:`help <normtest_help>`
    * - * ``-c COLUMN [COLUMN...]``
        * ``--col COLUMN [COLUMN...]``
      - test these numerical columns
      - :ref:`test specific columns <normtest_columns>`
    * - * ``-a ALPHA``
        * ``--alpha ALPHA``
      - threshold of significance
      - :ref:`threshold <normtest_threshold>`
    * - * ``-p``
        * ``--pvalue``
      - report p-value in output
      - :ref:`p-value <normtest_pvalue>`
    * - * ``-s``
        * ``--stat``
      - report test statistic in output
      - :ref:`statistic <normtest_statistic>`
    * - * ``-m {dagostino,shapiro}``
        * ``--method {dagostino,shapiro}``
      - method for testing normality
      - :ref:`method <normtest_method>`

By default `D'Agostino's K^2 test `_ is applied. However, it is also possible to use the `Shapiro-Wilk test `_ instead.

If no columns are specified, Gurita will test all numerical columns in the dataset one at a time.

By default the p-value and test statistic are not included in the output, but this can be overridden with the ``-p/--pvalue`` and ``-s/--stat`` arguments.

Simple example
--------------

The following example tests each numerical column in ``iris.csv`` independently to see whether it differs from a normal distribution.

.. code-block:: text

    gurita normtest < iris.csv

The output of the above command is shown below:

.. literalinclude:: example_outputs/iris.normtest.txt
    :language: none

The output contains two columns:

1. ``column``: the name of the numerical column of data being tested
2. ``is_normal``: a boolean value, True if the corresponding column is likely from a normal distribution and False otherwise.

Looking at the above results we see that, under the default parameters of the test, ``sepal_length`` and ``sepal_width`` appear to be normal, but ``petal_length`` and ``petal_width`` do not appear to be normal. Indeed, plotting the histogram for ``petal_length`` or ``petal_width`` suggests that they are multi-modal.

Note that ``iris.csv`` also contains a categorical column called ``species``. The normality test only applies to numerical columns, so non-numerical columns are ignored.

.. _normtest_help:

Getting help
------------

The full set of command line arguments for ``normtest`` can be obtained with the ``-h`` or ``--help`` arguments:

.. code-block:: text

    gurita normtest -h

.. _normtest_columns:

Choose columns to test for normality
------------------------------------

.. code-block:: text

    -c COLUMN [COLUMN...]
    --col COLUMN [COLUMN...]

By default ``normtest`` will test all numerical columns for normality; however, the ``-c/--col`` option allows you to test only specific columns.

For example, the following command tests only the ``sepal_length`` column from the ``iris.csv`` dataset:

.. code-block:: text

    gurita normtest -c sepal_length < iris.csv

The output of the above command is as follows:

.. literalinclude:: example_outputs/iris.normtest.sepal_length.txt
    :language: none

Multiple specific columns can be tested, as the following example demonstrates:

.. code-block:: text

    gurita normtest -c sepal_length petal_length < iris.csv

The output of the above command is as follows:

.. literalinclude:: example_outputs/iris.normtest.sepal_length.petal_length.txt
    :language: none

Note that non-numerical columns are ignored, even if they are specified using ``-c/--col``.

For example, the following command tries to test whether the categorical ``species`` column is normal:

.. code-block:: text

    gurita normtest -c species < iris.csv

A normality test is not meaningful for categorical data, so the output of the command is just an empty dataset, with only a header row and no data rows:

.. literalinclude:: example_outputs/iris.normtest.species.txt
    :language: none

.. _normtest_threshold:

Threshold for the p-value
-------------------------

.. code-block:: text

    -a ALPHA
    --alpha ALPHA

The ``normtest`` command computes a p-value for the null hypothesis that the data comes from a normal distribution.

A threshold called ``alpha`` is applied to the p-value to decide whether or not the data differs from a normal distribution.

By default alpha is set to 0.05, as is often done by convention.

If the p-value is less than or equal to alpha, the null hypothesis is rejected and we conclude that the data differs from a normal distribution (i.e. the data is not normally distributed).

Alternatively, if the p-value is greater than alpha then we cannot reject the null hypothesis, and the data is treated as not differing from a normal distribution.

In summary:

* p-value <= alpha: data *is not* considered normally distributed
* p-value > alpha: data *is* considered normally distributed

The value of the ``alpha`` threshold can be adjusted using the ``-a/--alpha`` argument:

.. code-block:: text

    gurita normtest --alpha 0.06 < iris.csv

The above command produces the following output:

.. literalinclude:: example_outputs/iris.normtest.alpha.txt
    :language: none

Note that the ``sepal_length`` column is no longer considered normally distributed when alpha is 0.06 (previously it was considered normal when alpha was equal to the default of 0.05).
This suggests that our confidence in ``sepal_length`` being normally distributed is not as strong as it is for ``sepal_width``.

.. _normtest_pvalue:

.. _normtest_statistic:

Report the p-value and/or the test statistic in the output
----------------------------------------------------------

.. code-block:: text

    -p, --pvalue
    -s, --stat

By default Gurita does not show the p-value or the test statistic in the output data. This behaviour can be changed with ``-p/--pvalue``, which causes the p-value to be included in the output, and ``-s/--stat``, which causes the test statistic to be included in the output.

.. code-block:: text

    gurita normtest --pvalue --stat < iris.csv

The output of the above command is shown below:

.. literalinclude:: example_outputs/iris.normtest.pvalue.stat.txt
    :language: none

.. _normtest_method:

Choose the method for testing normality
---------------------------------------

.. code-block:: text

    -m {dagostino,shapiro}, --method {dagostino,shapiro}

By default `D'Agostino's K^2 test `_ is applied. However, it is also possible to use the `Shapiro-Wilk test `_ instead.

The ``-m/--method`` option allows the method to be specified, and can be either ``dagostino`` (the default) or ``shapiro``.

The following example illustrates the use of the Shapiro-Wilk test:

.. code-block:: text

    gurita normtest --method shapiro < iris.csv

The output of the above command is shown below:

.. literalinclude:: example_outputs/iris.normtest.method.shapiro.txt
    :language: none

In this case we can see that, under default parameters, the ``sepal_length`` column is not considered normally distributed by the Shapiro-Wilk test, whereas it was considered normal under D'Agostino's K^2 test.
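
Whichever method is chosen, the ``is_normal`` value comes down to comparing that method's p-value with alpha. The following is a minimal Python sketch of that comparison, not Gurita's actual implementation. The function name ``is_normal`` and the p-value ``0.055`` are hypothetical, chosen only to illustrate how a borderline column can change classification when alpha moves from 0.05 to 0.06.

.. code-block:: python

    def is_normal(p_value, alpha=0.05):
        """Apply the decision rule described in this document.

        p_value <= alpha: reject the null hypothesis (treated as not normal).
        p_value >  alpha: cannot reject the null hypothesis (treated as normal).
        """
        return p_value > alpha


    # Hypothetical p-value lying between the two thresholds.
    borderline_p = 0.055

    print(is_normal(borderline_p))              # True: 0.055 > 0.05
    print(is_normal(borderline_p, alpha=0.06))  # False: 0.055 <= 0.06

Because the comparison is ``p_value > alpha``, a p-value exactly equal to alpha is treated as not normal, matching the "less than or equal to" wording used for rejection.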