filter
Filter rows using a logical expression that refers to the values in the columns.
Rows for which the expression evaluates to True are retained and all other rows are discarded.
Usage
gurita filter [-h] EXPR
Arguments
Argument | Description | Reference |
---|---|---|
-h | display help for this command | |
EXPR | the expression to evaluate | |
The expression argument is required. If it contains spaces then it must be surrounded by quotes on the command line.
Simple example
The following example reads data from titanic.csv and retains only rows where embark_town is equal to Cherbourg; all other rows are discarded:
gurita filter 'embark_town == "Cherbourg"' < titanic.csv
The string 'embark_town == "Cherbourg"' specifies the filtering expression.
Note that the whole expression is inside quotes; this is necessary to ensure that the whole expression is passed as a single entity to Gurita.
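The effect of the outer quotes can be sketched with Python's standard shlex module, which tokenises a string using POSIX shell rules. This is only an illustration of shell word splitting, not part of Gurita itself:

```python
import shlex

# Tokenise each command line the way a POSIX shell would.
quoted = shlex.split("""gurita filter 'embark_town == "Cherbourg"'""")
unquoted = shlex.split("gurita filter embark_town == Cherbourg")

# With outer quotes the expression arrives as one argument,
# and the inner double quotes around Cherbourg are preserved.
print(quoted)

# Without quotes the expression is split into three separate arguments.
print(unquoted)
```

In the quoted case the program receives the expression as a single argument after the command name; in the unquoted case it receives three.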
Columns can be referred to by their name, such as embark_town. Literal categorical values are written as strings inside quotation marks, such as "Cherbourg".
Getting help
The full set of command line arguments for filter can be obtained with the -h or --help arguments:
gurita filter -h
Expressions
Row filtering uses the same syntax as the Pandas data frame query method, and generally resembles Python notation.
You can refer to column headings (column names) in the input data by name, join multiple sub-expressions together using the logical operators and, or and not, and group sub-expressions with parentheses.
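Because the syntax matches the Pandas query method, an expression can be tried out directly in Python. The sketch below assumes pandas is installed and uses a small made-up data frame in place of titanic.csv; whether Gurita passes the expression to DataFrame.query unchanged is an assumption here, not something this page states.

```python
import pandas as pd

# Tiny stand-in for titanic.csv; the values are made up for illustration.
df = pd.DataFrame({
    "embark_town": ["Cherbourg", "Southampton", "Cherbourg", "Queenstown"],
    "fare": [71.3, 7.25, 512.3, 8.05],
    "adult_male": [False, True, True, False],
})

# One expression using column names, string and numeric literals,
# the operators and/or/not, and parentheses for grouping.
expr = 'embark_town == "Cherbourg" and (fare <= 100 or not adult_male)'
kept = df.query(expr)
print(kept)
```

Only the first row satisfies the whole expression in this sample data.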
Quoting column names with spaces
Column names that contain spaces must be surrounded by back-tick quotes:
gurita filter '`top score` > 1000' < game.csv
In the above example, the column name top score must be surrounded by back-tick quotes because it contains a whitespace character.
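The same back-tick quoting can be checked with the Pandas query method. This is a minimal sketch assuming pandas is installed; the data frame is a hypothetical stand-in for game.csv.

```python
import pandas as pd

# Hypothetical contents for game.csv; only the column names matter here.
df = pd.DataFrame({
    "top score": [500, 1500, 2500],
    "time": [12.0, 30.5, 45.1],
})

# Back-ticks make the parser treat "top score" as a single column name.
kept = df.query("`top score` > 1000")
print(kept)
```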
Filtering on numerical columns
Numerical columns can be compared for equality (==), inequality (!=) and ordering (less than <, greater than >, less than or equal to <=, greater than or equal to >=). Numerical literals are written using Python syntax.
In the example below, rows where the fare column is less than or equal to 100 are retained and all others are discarded:
gurita filter 'fare <= 100' < titanic.csv
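As a sketch of what "Python syntax" for numerical literals allows, the Pandas query method accepts both integer and scientific notation. This assumes pandas is installed and uses made-up fare values; it shows Pandas behaviour, on which Gurita's expression syntax is modelled.

```python
import pandas as pd

# Made-up fares, including one exactly on the boundary.
df = pd.DataFrame({"fare": [7.25, 100.0, 512.33]})

# The same threshold written as an integer and in scientific notation.
kept_int = df.query("fare <= 100")
kept_sci = df.query("fare <= 1e2")

print(kept_int)
print(kept_sci.equals(kept_int))
```

Both spellings of the literal select the same rows, and the boundary value 100.0 is retained because the comparison is inclusive.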
Filtering on categorical columns
Categorical literals (i.e. strings but not booleans) are written as quoted strings.
In the example below, all rows where who is not equal to child are retained. In other words, all rows relating to children are discarded and only adults are retained.
gurita filter 'who != "child"' < titanic.csv
Note that the categorical value "child" is written inside quotes.
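The reason the quotes matter can be sketched with the Pandas query method: an unquoted word is read as a column or variable name rather than a value. The example assumes pandas is installed and uses made-up rows; the error Gurita itself reports for this mistake may differ.

```python
import pandas as pd

# Made-up rows using the same categories as the who column.
df = pd.DataFrame({"who": ["man", "woman", "child"], "age": [30, 28, 4]})

# Quoted: "child" is a string literal compared against the column values.
kept = df.query('who != "child"')
print(kept)

# Unquoted: child is looked up as a name, and no such column exists.
rejected = False
try:
    df.query("who != child")
except NameError as err:  # pandas raises a NameError subclass here
    rejected = True
    print("unquoted value rejected:", err)
```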
Filtering on boolean columns
Boolean literals are written with a capital first letter (True and False), as in Python. Note that boolean literals are not quoted.
gurita filter 'adult_male == True' < titanic.csv
In the example above, only rows where adult_male is True are retained.
Note
It is redundant to compare boolean columns to literal truth values.
Therefore the following two commands have the same behaviour:
gurita filter 'adult_male == True' < titanic.csv
gurita filter 'adult_male' < titanic.csv
Boolean columns can be negated with not:
gurita filter 'not adult_male' < titanic.csv
In the example above, only rows where adult_male is False will be retained.
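The three forms of boolean filter can be compared side by side with the Pandas query method. This sketch assumes pandas is installed and uses a made-up data frame.

```python
import pandas as pd

# Made-up rows with a boolean column.
df = pd.DataFrame({"adult_male": [True, False, True], "age": [22, 38, 54]})

explicit = df.query("adult_male == True")  # compare to a literal
bare = df.query("adult_male")              # the column used directly
negated = df.query("not adult_male")       # the column negated

print(explicit.equals(bare))
print(negated)
```

The explicit and bare forms select the same rows, and the negated form selects exactly the remaining ones.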
Comparing columns
Filter expressions can compare values from different columns, provided the columns have compatible types (for example, numerical columns may only be compared to other numerical columns).
gurita filter 'sepal_length > petal_length' < iris.csv
In the example above, only rows in iris.csv where sepal_length is greater than petal_length will be retained.
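A column-to-column comparison is evaluated row by row, which can be seen in a small Pandas sketch. It assumes pandas is installed; the measurements are made up and are not taken from iris.csv.

```python
import pandas as pd

# Made-up measurements: one row greater, one smaller, one equal.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 1.0],
    "petal_length": [1.4, 5.5, 1.0],
})

# Each row is kept only if its own sepal_length exceeds its own petal_length.
kept = df.query("sepal_length > petal_length")
print(kept)
```

The row with equal values is discarded because > is a strict comparison.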
Compound filter expressions
Multiple filtering criteria can be combined into one filter expression by joining sub-expressions with the boolean operators and and or.
gurita filter 'smoker == "No" and total_bill > 10' < tips.csv
In the example above, only rows in tips.csv where the column smoker is "No" and the numerical column total_bill is greater than 10 will be retained.
If needed, parentheses can be used to group sub-expressions:
gurita filter 'smoker == "No" and (total_bill > 10 or day == "Sun")' < tips.csv
In the above example, the sub-expression inside the parentheses is evaluated first, before the outer sub-expression.
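Parentheses matter because, as in Python, and binds more tightly than or. The sketch below uses the Pandas query method on a made-up stand-in for tips.csv (pandas is assumed to be installed) and compares the grouped expression with the same expression written without parentheses.

```python
import pandas as pd

# Made-up stand-in for tips.csv.
df = pd.DataFrame({
    "smoker": ["No", "Yes", "No", "Yes"],
    "total_bill": [8.0, 25.0, 30.0, 6.0],
    "day": ["Sun", "Sun", "Sat", "Sun"],
})

# Grouped: non-smokers whose bill exceeds 10 or who dined on Sunday.
grouped = df.query('smoker == "No" and (total_bill > 10 or day == "Sun")')

# Ungrouped: read as (smoker == "No" and total_bill > 10) or day == "Sun".
ungrouped = df.query('smoker == "No" and total_bill > 10 or day == "Sun"')

print(list(grouped.index))
print(list(ungrouped.index))
```

Without the parentheses, every Sunday row is retained regardless of the smoker column, so the two expressions select different rows.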