.. _groupby: groupby ======= Group data using one or more columns as keys and apply one or more aggregating functions to selected columns within each group. Usage ----- .. code-block:: text gurita groupby [-h] [-f FUNCTION [FUNCTION ...]] [-v COLUMN [COLUMN ...]] -k COLUMN [COLUMN ...] Arguments --------- .. list-table:: :widths: 25 20 10 :header-rows: 1 :class: tight-table * - Argument - Description - Reference * - * ``-h`` * ``--help`` - display help for this command - :ref:`help ` * - * ``-k COLUMN [COLUMN...]`` * ``--key COLUMN [COLUMN...]`` - group data using these columns as the key - :ref:`group key ` * - * ``-v COLUMN [COLUMN...]`` * ``--val COLUMN [COLUMN...]`` - apply aggregating functions to each of these numerical columns - :ref:`aggregate values ` * - * ``-f FUNCTION [FUNCTION ...]`` * ``--fun FUNCTION [FUNCTION ...]`` - aggregation function(s) to apply to selected columns in the group - :ref:`aggregating function ` The ``--key`` (or ``-k``) argument is required. All others are optional. Both ``--val`` (or ``-v``) and ``--fun`` (or ``-f``) must be used together, if they are used at all. Simple example -------------- In the following example the ``titanic.csv`` dataset is grouped using the ``embark_town`` column as the key. .. code-block:: text gurita groupby --key embark_town < titanic.csv All rows in the dataset that have the same value in the ``embark_town`` column are grouped together. There are 3 unique towns in that column, so the output has 3 corresponding data rows: .. literalinclude:: example_outputs/titanic.groupby.embark_town.txt :language: none By default the size of the groups is computed and included in the output as an extra column. It is also possible to specify an :ref:`aggregating function ` to apply to the groups. .. _groupby_help: Getting help ------------ The full set of command line arguments for ``groupby`` can be obtained with the ``-h`` or ``--help`` arguments: .. code-block:: text gurita groupby -h .. _groupby_key: Choose columns as the key for grouping data ------------------------------------------- .. code-block:: text -k COLUMN [COLUMN...] --key COLUMN [COLUMN...] One or more columns can be selected as the keys for grouping data using the ``--key`` (or ``-k``) argument. Rows will be grouped together using equality on the values in the key columns. .. note:: Groups are based on rows sharing a unique set of values from the ``--key`` columns. Thus key values are compared and grouped based on their *equality*. It is possible to have a mixture of categorical, numerical and boolean columns as keys. As an example, the ``titanic.csv`` dataset has a ``embark_town`` column with three possible values: ``Cherbourg`` and ``Queenstown`` and ``Southhampton``. The simple example above showed how to compute the size of each of those groups. We can extend this example by grouping on another column. For example, we might like to know how the number of people in each ticket class for each of the towns: .. code-block:: text gurita groupby --key embark_town class < titanic.csv The output of the above command is as follows: .. literalinclude:: example_outputs/titanic.groupby.embark_town.class.txt :language: none Now there are nine groups in the result - we have three unique towns, and each of those towns boarded passengers from three unique ticket classes. We can extend this even further by using ``sex`` as another key column: .. code-block:: text gurita groupby --key embark_town class sex < titanic.csv The output of the above command is as follows: .. literalinclude:: example_outputs/titanic.groupby.embark_town.class.sex.txt :language: none Missing values in the key columns ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If the ``--key`` columns contain missing (NA) values then the corresponding group will be dropped. .. _groupby_val: .. _groupby_fun: Aggregate data within groups ---------------------------- .. code-block:: text -f FUNCTION [FUNCTION...] --fun FUNCTION [FUNCTION...] -v COLUMN [COLUMN...] --val COLUMN [COLUMN...] By default Gurita computes the size (number of rows) of each group. However it is also possible to aggregate over particular columns within each group using one or more functions. This is achieved by selecting columns using ``--val`` (or ``-v``) and specifying aggregating functions using ``--fun`` (or ``-f``). .. note:: If you want to aggregate data within groups you must supply both of the ``--val`` and ``--fun`` options in the command. It is not possible to supply only one of these without the other. For example, we could group rows in ``titanic.csv`` by the ``embark_town`` column and compute the mean age of passengers for each group like so: .. code-block:: text gurita groupby --key embark_town --val age --fun mean < titanic.csv The output of the above command is as follows, with the new aggregated column called ``age_mean``: .. literalinclude:: example_outputs/titanic.groupby.embark_town.age.mean.txt :language: none As demonstrated in this example, when aggregating data within groups, new column names are created using the name of the ``--val`` column concatenated with the ``--fun`` name, and separated by an underscore. Extending this further, we could also compute the maximum and minimum age of passengers per town by adding extra functions to the ``--fun`` argument: .. code-block:: text gurita groupby --key embark_town --val age --fun mean max min < titanic.csv The output of the above command is as follows, with extra columns ``age_max`` and ``age_min``: .. literalinclude:: example_outputs/titanic.groupby.embark_town.age.mean.max.min.txt :language: none And yes, there are some unusual fractional ages in the dataset - the minimum age of Cherbourg passengers really is 0.42. It is also possible to specify multiple columns for aggregation: .. code-block:: text gurita groupby --key embark_town --val age fare --fun mean max min < titanic.csv In this scenario each ``--val`` column is aggregated by each ``--fun`` function. In the above example 2 columns are aggregated by 3 functions, yielding 6 additional output columns for each group: .. literalinclude:: example_outputs/titanic.groupby.embark_town.age.fare.mean.max.min.txt :language: none It is also possible to group on multiple columns and aggregate on multiple other columns at the same time, for example: .. code-block:: text gurita groupby --key embark_town class --val age fare --fun mean max min < titanic.csv The above command yields the following output: .. literalinclude:: example_outputs/titanic.groupby.embark_town.class.age.fare.mean.max.min.txt :language: none Allowed aggregation functions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The following aggregating functions can be used with ``--fun``: * sample (randomly choose one of the possible values) * size (size of the group) * sum * mean * mad (mean absolute deviation) * median * min * max * prod * std (standard deviation) * var (variance) * sem (standard error of the mean) * skew * quantile (50% quantile)