eval

Add new columns to the dataset by evaluating an expression.

On each row the value in the new column is computed using a simple expression language that can refer to the values in other columns on the same row.

The expression language uses a simple syntax that resembles Python.

Warning

Eval is based on the Pandas eval method on dataframes.

The documentation on that method says:

“This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.”

This is a security risk!

Do not use eval if you do not trust the expression supplied as an argument. For example, do not use eval on strings that are input from untrusted sources, such as input to web pages.

Usage

gurita eval [-h] EXPR [EXPR ...]

Arguments

Argument      Description                      Reference
-h, --help    display help for this command    help
expression    the expression to evaluate       expression

Simple example

The iris.csv dataset contains information about the lengths and widths of iris sepals and petals for a variety of species.

We can look at the first few data rows using the head command:

gurita head < iris.csv

The output is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa

Suppose that we want to add a new column to the dataset called sepal_area that is computed on each row from sepal_length * sepal_width * 0.5 (where * means multiplication).

This can be achieved with the eval command like so:

gurita eval 'sepal_area = sepal_length * sepal_width * 0.5' < iris.csv

Note that the whole expression is inside quotation marks.

In this example we use an expression to compute the value of the new column:

sepal_area = sepal_length * sepal_width * 0.5

The syntax of the expression closely resembles that of the Python programming language.

Note in particular that a new column is made simply by assigning to it on the left-hand side of the expression. On the right-hand side we can refer to the values of other existing columns by using their names, such as sepal_length. We can also refer to numerical constants such as 0.5.

We can see the effect of the above example eval statement by viewing the first few rows of the output like so:

gurita eval 'sepal_area = sepal_length * sepal_width * 0.5' + head < iris.csv

The output of the above command is shown below:

sepal_length,sepal_width,petal_length,petal_width,species,sepal_area
5.1,3.5,1.4,0.2,setosa,8.924999999999999
4.9,3.0,1.4,0.2,setosa,7.3500000000000005
4.7,3.2,1.3,0.2,setosa,7.5200000000000005
4.6,3.1,1.5,0.2,setosa,7.13
5.0,3.6,1.4,0.2,setosa,9.0

As you can see a new column called sepal_area has been added to the data, such that the value on each row is computed from the supplied expression.
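The row-wise meaning of the expression can be sketched in plain Python. This is only an illustration of what is computed for each row, not how Gurita or Pandas implement it; the helper name add_sepal_area and the inline copy of the first two data rows are assumptions made for this example.

```python
import csv
import io

# First two data rows of iris.csv, copied from the head output above.
IRIS_SAMPLE = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
"""


def add_sepal_area(rows):
    """Return copies of the rows with an extra sepal_area value (hypothetical helper)."""
    result = []
    for row in rows:
        new_row = dict(row)
        # Same arithmetic as the eval expression, applied to one row at a time.
        new_row["sepal_area"] = (
            float(row["sepal_length"]) * float(row["sepal_width"]) * 0.5
        )
        result.append(new_row)
    return result


rows = add_sepal_area(csv.DictReader(io.StringIO(IRIS_SAMPLE)))
for row in rows:
    print(row["species"], row["sepal_area"])
```

Each output row keeps its original values and gains one new value, which mirrors how the new column is appended after the existing ones.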

Complex example

Suppose we have a dataset in a file called points.csv with numerical columns x1, y1, x2, y2, representing pairs of points in the Cartesian plane: (x1, y1) and (x2, y2).

x1,y1,x2,y2
0,0,3,4
10,0,10,0
18,12,-4,55

A new column called dist representing the Euclidean distance between each pair of points can be created with eval like so:

gurita eval 'dist = sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)' < points.csv

Some notable features of this example expression include the use of a mathematical function sqrt, parentheses for grouping sub-expressions, and the use of various mathematical operators +, - and ** (exponentiation).

The output of the above command is the new table below, which has an extra column called dist:

x1,y1,x2,y2,dist
0,0,3,4,5.0
10,0,10,0,0.0
18,12,-4,55,48.30113870293329
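As a cross-check, the same formula can be written in plain Python using the standard math module. This is a sketch for illustration only; the function name dist and the POINTS list (a copy of the three rows of points.csv shown above) are assumptions of this example rather than part of Gurita.

```python
import math

# The three rows of points.csv shown above, as (x1, y1, x2, y2) tuples.
POINTS = [(0, 0, 3, 4), (10, 0, 10, 0), (18, 12, -4, 55)]


def dist(x1, y1, x2, y2):
    """Euclidean distance, mirroring sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)."""
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


# One distance per row, in the same order as the input rows.
distances = [dist(*point) for point in POINTS]
print(distances)
```

The first row is the familiar 3-4-5 right triangle, which makes it a convenient sanity check for the expression.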

A more detailed description of the expression syntax is provided below.

Getting help

The full set of command line arguments for eval can be obtained with the -h or --help arguments:

gurita eval -h

Expressions

eval EXPR [EXPR ...]

The eval command accepts one or more expression arguments. Each of these specifies how to create a new column in the data.

Each expression must therefore specify the name of its new column in the form of an assignment statement:

new_column_name = expression_right_hand_side

If more than one expression is provided, multiple columns will be added, one for each expression. An expression can even refer to new columns added by expressions to its left.

In the example below the first expression adds a new column called new1 (assuming the existence of a column old in the input data). The second expression adds another column called new2 that is computed from the row-wise addition of old and new1.

gurita eval 'new1 = old + 5' 'new2 = old + new1' < example.csv
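The left-to-right behaviour can be pictured with a small plain-Python sketch. It is not Gurita's implementation: the helper apply_assignments, the lambda stand-ins for the two expressions, and the sample value of old are all assumptions made for illustration, since the contents of example.csv are not given here.

```python
def apply_assignments(row, assignments):
    """Apply (name, function) pairs to one row, left to right (hypothetical helper)."""
    out = dict(row)
    for name, compute in assignments:
        # Each function sees any columns created by earlier assignments.
        out[name] = compute(out)
    return out


# Stand-ins for the expressions 'new1 = old + 5' and 'new2 = old + new1'.
ASSIGNMENTS = [
    ("new1", lambda r: r["old"] + 5),
    ("new2", lambda r: r["old"] + r["new1"]),
]

# A made-up row with a single existing column called old.
result = apply_assignments({"old": 10}, ASSIGNMENTS)
print(result)
```

Swapping the order of the two assignments in this sketch would raise a KeyError, because new2 would be computed before new1 exists.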

Expression syntax

The eval command is implemented using the Pandas eval method on dataframes, so the expression syntax in Gurita is the same as that of the Pandas eval method.

In essence, the eval expression syntax is a simplified form of Python, the details of which are described in the Pandas documentation. We give a summary below.

Warning

If the new column name already exists in the dataset it will be overwritten in the output.

Referring to column names

On each row the value in the new column is computed using a simple expression language that can refer to the values in other columns on the same row.

Column names can be written as if they are ordinary variables inside the expression.

In the example from earlier sepal_length and sepal_width are the names of existing columns in the data, and sepal_area is the name given to the new column added to the data:

gurita eval 'sepal_area = sepal_length * sepal_width * 0.5' < iris.csv

Column names that cannot be written like ordinary Python variables must be written inside back-quotes. For instance, column names can have spaces in them, but Python variable names cannot.

For example, a column named average height (note the space in the name) would have to be written as `average height`.
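A rough way to decide whether a column name needs back-quotes is to ask Python whether it is a valid variable name. The sketch below does this with the standard library; it only approximates the rule described above (the exact rules are those of Pandas), and the helper names needs_backquotes and quote_column are hypothetical.

```python
import keyword


def needs_backquotes(column_name):
    """Rough check: True if the name is not a valid plain Python variable name."""
    return not column_name.isidentifier() or keyword.iskeyword(column_name)


def quote_column(column_name):
    """Wrap the name in back-quotes only when the rough check says so."""
    if needs_backquotes(column_name):
        return f"`{column_name}`"
    return column_name


for name in ["sepal_length", "average height"]:
    print(quote_column(name))
```

A helper like this could be useful when building expressions for files whose column names are not known in advance, provided those names come from a trusted source.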

Allowed expressions

  • Parentheses (round brackets) for grouping sub-expressions

  • Constants:
    • strings

    • floating point numbers

    • integers

    • booleans (True, False)

  • Arithmetic operations:
    • multiplication: *

    • addition: +

    • subtraction: -

    • division: /

    • exponentiation: ** (raising to a power)

  • Comparison operations:
    • equality: ==

    • less-than: <

    • less-than-equals: <=

    • greater-than: >

    • greater-than-equals: >=

  • Logical operations:
    • conjunction: and

    • disjunction: or

    • negation: not

  • Mathematical functions (written as fun(arg)):
    • sin

    • cos

    • exp

    • log (natural log)

    • expm1

    • log1p

    • sqrt

    • sinh

    • cosh

    • tanh

    • arcsin

    • arccos

    • arctan

    • arccosh

    • arcsinh

    • arctanh

    • abs

    • arctan2

    • log10
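The function names in the list above use NumPy-style spellings (for example arcsin rather than asin). For readers more familiar with Python's standard math module, the sketch below pairs a few of them with what we assume are their scalar counterparts; the MATH_EQUIVALENTS table is an illustration only and is not part of Gurita or Pandas.

```python
import math

# Assumed correspondence between some eval function names and Python's math module.
MATH_EQUIVALENTS = {
    "sqrt": math.sqrt,
    "log": math.log,        # natural log
    "log10": math.log10,    # base-10 log
    "log1p": math.log1p,    # log(1 + x)
    "expm1": math.expm1,    # exp(x) - 1
    "arcsin": math.asin,
    "arctan2": math.atan2,  # takes two arguments: arctan2(y, x)
}

print(MATH_EQUIVALENTS["sqrt"](16.0))
print(MATH_EQUIVALENTS["log10"](1000.0))
print(MATH_EQUIVALENTS["arctan2"](1.0, 1.0))
```

Inside an eval expression these functions are applied to whole columns, one value per row, whereas the math module functions shown here work on single numbers.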