.. _filter: filter ====== Filter rows using a logical expression that refers to the values in the columns. Rows that evaluate to ``True`` in the expression are retained and all other rows are discarded. Usage ----- .. code-block:: text gurita filter [-h] EXPR Arguments --------- .. list-table:: :widths: 25 20 10 :header-rows: 1 :class: tight-table * - Argument - Description - Reference * - * ``-h`` * ``--help`` - display help for this command - :ref:`help ` * - ``expression`` - the expression to evaluate - :ref:`expression ` The expression argument is required. If it contains spaces then it ought to be surrounded in quotes on the command line. Simple example -------------- The following example reads data from ``titanic.csv`` and retains only rows where ``embark_town`` is equal to ``Cherbourg``, all other rows are discarded: .. code-block:: text gurita filter 'embark_town == "Cherbourg"' < titanic.csv The string ``'embark_town == "Cherbourg"'`` specifies the filtering expression. Note that the whole expression is inside quotes; this is necessary to ensure that the whole expression is passed as a single entity to Gurita. Columns can be referred to by their name, such as ``embark_town``. Literal categorical values are written as strings inside quotation marks, such as ``"Cherbourg"``. .. _filter_help: Getting help ------------ The full set of command line arguments for ``filter`` can be obtained with the ``-h`` or ``--help`` arguments: .. code-block:: text gurita filter -h .. _filter_expression: Expressions ----------- Row filtering uses the same syntax as the `Pandas data frame query method `_, and generally resembles Python notation. You can refer to column headings (column names) in the input data by name, join multiple sub-expressions together using logical operators ``and`` ``or`` and ``not``, and group sub-expressions with parentheses. Quoting column names with spaces --------------------------------- Features (column names) that contain spaces must be surrounded in back-tick quotes: .. code-block:: text gurita hist --filter '`top score` > 1000' -x time game.csv In the above example, the column (column name) ``top score`` must be surrounded in back-tick quotes because it contains a whitespace character. Filtering on numerical columns ------------------------------- Numerical columns can be compared for equality (``==``), inequality (``!=``) and ordering (less than ``<``, greater than ``>``, less than or equals ``<=``, greater than or equals ``>=``). Numerical literals are written using Python syntax. In the example below rows where the ``fare`` column is less than or equal to ``100`` are retained and all others are discarded: .. code-block:: text gurita filter 'fare <= 100' < titanic.csv Filtering on categorical columns --------------------------------- Categorical literals (i.e. strings but not booleans) are written as quoted strings. In the example below all rows where ``who`` is not equal to ``child`` are retained. Or, in other words, all rows relating to children are discareded and only adults are retained. .. code-block:: text gurita filter 'who != "child"' < titanic.csv Note that the categorical value ``"child"`` is written inside quotes. Filtering on boolean columns ----------------------------- Boolean literals are written with a capital first letter, as they are done in Python. Note that boolean literals are not quoted. .. code-block:: text gurita filter 'adult_male == True' < titanic.csv In the example above, only rows where ``adult_male`` is ``True`` are retained. .. note:: It is redundant to compare boolean columns to literal truth values. Therefore the following two commands have the same behaviour: .. code-block:: text gurita filter 'adult_male == True' < titanic.csv .. code-block:: text gurita filter 'adult_male' < titanic.csv Boolean columns can be negated with ``not``: .. code-block:: text gurita filter 'not adult_male' < titanic.csv In the example above only rows where ``adult_male`` is ``False`` will be retained. Comparing columns ------------------ Filter expressions can compare values from different columns, assuming they have a compatible type (for example, numerical columns may only be compared to other numerical columns, and so forth). .. code-block:: text gurita filter 'sepal_length > petal_length' < iris.csv In the example above only rows in ``iris.csv`` where ``sepal_length`` is greater than ``petal_length`` will be retained. Compound filter expressions --------------------------- Multiple filtering crtieria can be combined into one filter expression by combining sub-expressions with boolean operators ``and`` and ``or``. .. code-block:: text gurita filter 'smoker == "No" and total_bill > 10' < tips.csv In the example above only rows in ``tips.csv`` where the column ``smoker`` is ``"No"`` and the numerical column ``total_bill`` is greater than 10 will be retained. If needed, parentheses can be used to group sub-expressions: .. code-block:: text gurita filter 'smoker == "No" and (total_bill > 10 or day == "Sun")' < tips.csv In the above example, the sub-expression inside the parentheses is evaluated first, before the outer sub-expression.