eval
Add new columns to the dataset by evaluating an expression.
On each row the value in the new column is computed using a simple expression language that can refer to the values in other columns on the same row.
The expression language uses a simple syntax that resembles Python.
Warning
Eval is based on the Pandas eval method on dataframes.
The documentation on that method says:
“This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.”
This is a security risk!
Do not use eval if you do not trust the expression supplied as an argument. For example, do not use eval on strings that come from untrusted sources, such as input submitted through web pages.
Usage
gurita eval [-h] EXPR [EXPR ...]
Arguments
Argument | Description | Reference
---|---|---
-h, --help | display help for this command | Getting help
EXPR [EXPR ...] | the expression to evaluate | Expressions
Simple example
The iris.csv dataset contains information about the lengths and widths of iris sepals and petals for a variety of species.
We can look at the first few data rows using the head command:
gurita head < iris.csv
The output is as follows:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
Suppose that we want to add a new column to the dataset called sepal_area that is computed on each row as sepal_length * sepal_width * 0.5 (where * means multiplication).
This can be achieved with the eval command like so:
gurita eval 'sepal_area = sepal_length * sepal_width * 0.5' < iris.csv
Note that the whole expression is inside quotation marks.
In this example we use an expression to compute the value of the new column:
sepal_area = sepal_length * sepal_width * 0.5
The syntax of the expression closely resembles that of the Python programming language.
Note in particular that a new column is made simply by assigning to it on the left-hand side of the expression. On the right-hand side we can refer to the values of other existing columns by using their names, such as sepal_length. We can also use numerical constants such as 0.5.
We can see the effect of the above example eval statement by viewing the first few rows of the output like so:
gurita eval 'sepal_area = sepal_length * sepal_width * 0.5' + head < iris.csv
The output of the above command is shown below:
sepal_length,sepal_width,petal_length,petal_width,species,sepal_area
5.1,3.5,1.4,0.2,setosa,8.924999999999999
4.9,3.0,1.4,0.2,setosa,7.3500000000000005
4.7,3.2,1.3,0.2,setosa,7.5200000000000005
4.6,3.1,1.5,0.2,setosa,7.13
5.0,3.6,1.4,0.2,setosa,9.0
As you can see, a new column called sepal_area has been added to the data, with the value on each row computed from the supplied expression.
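Because eval is based on the Pandas eval method, the same computation can be sketched directly in Python. This is only an illustration of the underlying Pandas behaviour, not Gurita's actual implementation; the inline IRIS_HEAD string is an assumption that stands in for the first rows of iris.csv shown above.

```python
import io

import pandas as pd

# First five data rows of iris.csv, copied from the head output above.
IRIS_HEAD = """\
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
"""

df = pd.read_csv(io.StringIO(IRIS_HEAD))

# An assignment expression makes DataFrame.eval return a new dataframe
# with the extra column appended; df itself is left unchanged.
result = df.eval("sepal_area = sepal_length * sepal_width * 0.5")

# Write the result as CSV, similar in shape to the Gurita output.
print(result.to_csv(index=False))
```

The new column appears last, after the existing columns, which matches the output shown above.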
Complex example
Suppose we have a dataset in a file called points.csv with numerical columns x1, y1, x2, y2, representing pairs of points in the Cartesian plane: (x1, y1) and (x2, y2).
x1,y1,x2,y2
0,0,3,4
10,0,10,0
18,12,-4,55
A new column called dist representing the Euclidean distance between each pair of points can be created with eval like so:
gurita eval 'dist = sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)' < points.csv
Some notable features of this example expression include the use of the mathematical function sqrt, parentheses for grouping sub-expressions, and the mathematical operators +, - and ** (exponentiation).
The output of the above command is the new table below, which has an extra column called dist:
x1,y1,x2,y2,dist
0,0,3,4,5.0
10,0,10,0,0.0
18,12,-4,55,48.30113870293329
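The dist values can be cross-checked with a few lines of plain Python that apply the same formula row by row. This is a minimal sketch using only the standard library; the inline POINTS string is an assumption that stands in for points.csv, and the code does not call Gurita or Pandas.

```python
import csv
import io
import math

# Contents of points.csv as shown above.
POINTS = """\
x1,y1,x2,y2
0,0,3,4
10,0,10,0
18,12,-4,55
"""

dists = []
for row in csv.DictReader(io.StringIO(POINTS)):
    x1, y1 = float(row["x1"]), float(row["y1"])
    x2, y2 = float(row["x2"]), float(row["y2"])
    # Same formula as the eval expression: sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    dists.append(math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2))

for d in dists:
    print(d)
```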
A more detailed description of the expression syntax is provided below.
Getting help
The full set of command line arguments for eval can be obtained with the -h or --help arguments:
gurita eval -h
Expressions
eval EXPR [EXPR ...]
The eval command accepts one or more expression arguments.
Each expression creates a new column in the dataset, so it must specify the name of the new column in the form of an assignment statement:
new_column_name = expression_right_hand_side
If more than one expression is provided, multiple columns will be added, one for each expression. An expression can even refer to new columns created by expressions to its left on the command line.
In the example below the first expression adds a new column called new1 (assuming the existence of a column old in the input data). The second expression adds another column called new2 that is computed from the row-wise addition of old and new1.
gurita eval 'new1 = old + 5' 'new2 = old + new1' < example.csv
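The way a later expression can use a column created by an earlier one can be sketched in Pandas by applying the expressions one after another. The loop below is an assumption about how such chaining could work, not a description of Gurita's internals, and the small dataframe with an old column is hypothetical example data.

```python
import pandas as pd

# Hypothetical input data with a single column called old.
df = pd.DataFrame({"old": [1, 2, 3]})

expressions = ["new1 = old + 5", "new2 = old + new1"]

# Apply the expressions in order, so that each one sees the columns
# added by the expressions before it.
result = df
for expr in expressions:
    result = result.eval(expr)

print(result.to_csv(index=False))
```

Reversing the order of the two expressions would fail in this sketch, because new1 would not yet exist when new2 is computed.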
Expression syntax
The eval command is implemented using the Pandas eval method on dataframes, so the syntax of eval in Gurita is the same as the syntax of the Pandas eval method.
In essence, the eval expression syntax is a simplified form of Python, the details of which are described in the Pandas documentation. We give a summary below.
Warning
If the new column name already exists in the dataset it will be overwritten in the output.
Referring to column names
On each row the value in the new column is computed using a simple expression language that can refer to the values in other columns on the same row.
Column names can be written as if they are ordinary variables inside the expression.
In the example from earlier, sepal_length and sepal_width are the names of existing columns in the data, and sepal_area is the name given to the new column added to the data:
gurita eval 'sepal_area = sepal_length * sepal_width * 0.5' < iris.csv
Column names that cannot be written like ordinary Python variables must be written inside back-quotes. For instance, column names can have spaces in them, but Python variable names cannot.
For example, a column named average height (note the space in the name) would have to be written as `average height`.
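The back-quote rule comes from the Pandas eval method, so it can be illustrated with a small Pandas sketch. The dataframe, the column n and the new column total_height are hypothetical names chosen for this example, and back-quoted names need a reasonably recent Pandas version.

```python
import pandas as pd

# Hypothetical data with a column name that contains a space.
df = pd.DataFrame({"average height": [1.5, 2.0], "n": [2, 4]})

# The column with a space must be back-quoted inside the expression;
# the ordinary name n can be written as is.
result = df.eval("total_height = `average height` * n")

print(result.to_csv(index=False))
```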
Allowed expressions
- Parentheses (round brackets) for grouping sub-expressions
- Constants:
  - strings
  - floating point numbers
  - integers
  - booleans (True, False)
- Arithmetic operations:
  - multiplication: *
  - addition: +
  - subtraction: -
  - division: /
  - exponentiation: ** (raising to a power)
- Comparison operations:
  - equality: ==
  - less-than: <
  - less-than-equals: <=
  - greater-than: >
  - greater-than-equals: >=
- Logical operations:
  - conjunction: and
  - disjunction: or
  - negation: not
- Mathematical functions (written as fun(arg)): sin, cos, exp, log (natural log), expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs, arctan2, log10
- Mathematical functions (written as