Writing output data

Gurita can write data to a named output file or standard output (stdout).

Output to a named file

The out command allows you to specify an output file by name:

gurita ... + out newfile.csv

As before, we use ... to indicate that part of the example Gurita command is omitted for the sake of simplifying the discussion.

You should imagine that ... would be replaced by more text to complete the command. For example, the following command reads the file titanic.csv from standard input and then saves the header row and first ten data rows to an output file called newfile.csv:

cat titanic.csv | gurita head 10 + out newfile.csv

Again we see command chaining in action, where the first command head 10 transforms the input data before it is passed along to the out newfile.csv command.

When writing output to a named file (and not to standard output) Gurita will look at the file extension and assume CSV format if the extension is .csv and TSV format if the extension is .tsv. This behaviour can be overridden with the --sep <str> option as noted below.
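The extension rule can be pictured as a simple lookup. The sketch below is only an illustration written in shell, not Gurita's implementation. The function name describe_format is hypothetical, and the fallback branch exists because this section describes only the .csv and .tsv cases.

```shell
# Hypothetical helper mirroring the rule described above; not part of Gurita.
describe_format() {
    case "$1" in
        *.csv) echo "CSV (comma separated)" ;;
        *.tsv) echo "TSV (tab separated)" ;;
        *)     echo "extension rule does not apply" ;;
    esac
}

describe_format newfile.csv   # prints: CSV (comma separated)
describe_format newfile.tsv   # prints: TSV (tab separated)
```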

Warning

When writing to a named file, if the file already exists, the out command will overwrite its contents. The original contents of the file will be lost.
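If overwriting is a concern, one option is to check for the file in the shell before Gurita runs. The following is a minimal sketch of that idea. The wrapper name write_new is hypothetical and is not a Gurita feature, and the check is not atomic, so it only protects against accidental reuse of a file name.

```shell
# write_new TARGET COMMAND...: run COMMAND only if TARGET does not exist yet.
# Hypothetical wrapper; Gurita itself does not provide this check.
write_new() {
    target="$1"
    shift
    if [ -e "$target" ]; then
        echo "refusing to overwrite existing file: $target" >&2
        return 1
    fi
    "$@"
}
```

It could then be used like this, assuming titanic.csv is in the current directory:

write_new newfile.csv gurita in titanic.csv + head 10 + out newfile.csv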

Specifying the field separator explicitly

The out command lets you specify the field separator of the output file explicitly. This will override any default behaviour that Gurita would otherwise have when determining the output file type. This mimics the behaviour of the in command for reading input from files, as discussed previously.

Request for CSV file format:

gurita ... out --sep ',' ...

Note: comma is the default separator used by Gurita, so you don’t need to specify it explicitly. Therefore the above example is redundant, and only shown for the sake of illustration.

Request for TSV file format:

gurita ... out --sep '\t' ...

Be sure to enclose the \t in quote characters to ensure that it is treated as a single string.

Elaborating the above example to a full command:

cat titanic.csv | gurita head 10 + out --sep '\t' example_filename
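To see why the quotes matter, it helps to look at what a POSIX shell hands to a program in each case. In the short sketch below, printf stands in for Gurita and simply echoes back the argument it received. The variable names are for illustration only.

```shell
# Single quotes keep the backslash, so the program receives the two characters \ and t.
quoted=$(printf '%s' '\t')

# Without quotes the shell consumes the backslash, so the program receives just t.
unquoted=$(printf '%s' \t)

printf 'quoted:   %s\n' "$quoted"     # prints: quoted:   \t
printf 'unquoted: %s\n' "$unquoted"   # prints: unquoted: t
```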

Using a vertical bar character as the field separator:

gurita ... out --sep '|' ...

Note

out --sep <str> forces Gurita to use the specified field separator regardless of the file name extension.

Output to standard output (stdout)

If you don’t specify a file name when using the out command, Gurita will write the output to standard output (stdout):

gurita ... out

For example:

cat titanic.csv | gurita head 10 + out

Writing to standard output is particularly useful when you want to use Gurita as part of a command pipeline:

gurita ... + out | example_command

Here example_command is supposed to represent an arbitrary command, possibly itself a series of commands piped together, whose input comes from the standard output of Gurita.
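As a small illustration of what example_command might be, the sketch below defines a helper that counts the data rows in CSV text arriving on standard input. The name count_data_rows is hypothetical, and the helper counts lines, so it assumes that no field contains an embedded newline.

```shell
# count_data_rows: read CSV text on standard input, skip the header line,
# and print the number of remaining lines. Hypothetical helper, not part of Gurita.
count_data_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Stand-in data so the sketch runs without Gurita installed.
printf 'name,age\nann,29\nbob,41\n' | count_data_rows   # prints: 2
```

With Gurita it would sit at the end of the pipeline:

cat titanic.csv | gurita head 10 + out | count_data_rows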

Note

Standard output defaults to CSV format

When writing to standard output, unless otherwise specified, Gurita writes the data in CSV format.

This can be overridden by out --sep <str>.

As previously noted, when writing to a named file Gurita will try to use the file name extension to determine the file format, avoiding the need to specify --sep.

Implicit CSV output to standard output (stdout)

In some circumstances, for convenience, Gurita will implicitly write the final state of the data to standard output. It does this in precisely two circumstances, namely when the last command in a chain is either:

  • a data transformation

  • an input command (including implicitly reading from standard input)

However, Gurita will not implicitly write the final state of the data to standard output when the last command in a chain is either:

  • a plotting command

  • a data summary command

  • an out command

The logic for this behaviour is as follows.

If the last command in a chain is a transformation or just an input command, Gurita assumes that you must have read or transformed the data for a reason, and that you probably want to save or use the result. Without an out command at the end of the chain, the final state of the data would otherwise be lost, so Gurita writes it to standard output in CSV format for you.

If the last command in a chain is a plotting command, then Gurita assumes that your main purpose must have been to generate the plot, and therefore you are not interested in saving/using the final state of the data. Similarly for situations when the last command shows summary information about the data, such as pretty. If you want to make a plot or see summary information and save the final state of the data you can always achieve this by ending a chain with an explicit out command.

Therefore:

gurita <transformation or input command> + out

can be simplified to:

gurita <transformation or input command>

As a concrete example, the following command:

cat titanic.csv | gurita head 10 + out

can be simplified to:

cat titanic.csv | gurita head 10

or, of course, you could achieve the same result with input redirection, again dropping the + out from the original command:

gurita head 10 < titanic.csv

Note carefully that when implicitly writing to standard output Gurita will always write the data in CSV format. If you want to write a different format to standard output you must end the chain with an explicit out command and specify the separator using: out --sep <str> ...

Writing output to more than one file in a command chain

You may write output to more than one file in a command chain, both to named files and standard output.

For example you can do something like this:

gurita ... out newfile1.csv + ... + out newfile2.tsv + ...

In the above example, output is written to newfile1.csv in CSV format, then some unspecified commands are run, and later output is written to newfile2.tsv in TSV format, and some more unspecified commands are run.

Each invocation of out causes the current data set to be written to a file or standard output. When used in a chain of commands out also passes the current data set along unmodified to the next command in the chain. This allows the data to be passed along from left to right in the chain with further processing of the data occurring after the out command has been executed.

This is most useful when you want to save different states of the data as it undergoes various transformations in a command chain.
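Readers familiar with the Unix tee utility may find it a helpful analogy: tee also saves a copy of the data passing through it and forwards the same data unchanged to the next stage. The sketch below shows that pattern with standard tools only. It is an analogy, not a description of how Gurita implements out, and the file names are placeholders.

```shell
# Scratch directory so the sketch does not touch existing files.
workdir=$(mktemp -d)

# tee saves the full stream to stage1.txt and passes it along unchanged;
# sed then keeps only the first two lines, which are saved to stage2.txt.
printf 'a\nb\nc\n' | tee "$workdir/stage1.txt" | sed -n '1,2p' > "$workdir/stage2.txt"

wc -l < "$workdir/stage1.txt"   # 3 lines: the data before the later step
wc -l < "$workdir/stage2.txt"   # 2 lines: the data after the later step
```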

Warning: if you have two separate out commands that write to the same named output file, the last occurrence will overwrite any earlier output to that file.

However, if multiple different writes to standard output are used, they will form a single concatenated stream of data.

The following command is a more concrete example:

gurita in iris.csv + sample 0.6 + out samp.csv + cut -c sepal_length + out len.tsv

There are five parts to the above command chain:

  1. input is read from the iris.csv file and becomes the current data set

  2. 60% of the data rows in the current data set are randomly sampled, and the remaining 40% of the rows are discarded

  3. the current (sampled) data set is written to the output file samp.csv in CSV format

  4. the sepal_length column is selected from the current (sampled) data set and the remaining columns are discarded

  5. the final (cut and sampled) data set is written to the output file len.tsv in TSV format

Converting between TSV and CSV formats

Gurita can read and write data in both CSV and TSV formats. Therefore, one simple but useful thing it can easily do is convert data files between those formats. Notably, in such conversions it will handle corner cases correctly, such as proper quotation of data values, and appropriate formatting of missing (NA) values.

For example, the following commands all convert the iris.csv file (in CSV format) into TSV format, and save the result in a file called iris.tsv:

gurita in iris.csv + out iris.tsv
cat iris.csv | gurita out iris.tsv
cat iris.csv | gurita out --sep '\t' > iris.tsv

Conversely, the following commands all convert the iris.tsv file (in TSV format) into CSV format, and save the result in a file called iris.csv:

gurita in iris.tsv + out iris.csv
cat iris.tsv | gurita in --sep '\t' + out iris.csv
cat iris.tsv | gurita in --sep '\t' + out > iris.csv

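Note that in the above examples there is no need to specify the output format with --sep. In the first two commands the .csv extension of the named file selects CSV format, and in the last command CSV is the default format when out writes to standard output.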
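When many files need converting, the first form above can be wrapped in a loop. The following is a minimal sketch under the assumption that every .csv file in a directory should get a .tsv file beside it. The function name csv_to_tsv_all is hypothetical, and it uses only the in and out commands shown above. As warned earlier, out overwrites any existing file with the same name.

```shell
# csv_to_tsv_all DIR: convert every .csv file in DIR to a .tsv file beside it.
# Hypothetical helper; it relies on out choosing TSV from the .tsv extension.
csv_to_tsv_all() {
    for src in "$1"/*.csv; do
        [ -e "$src" ] || continue      # no matches: the glob pattern stays literal
        dest="${src%.csv}.tsv"         # iris.csv -> iris.tsv
        gurita in "$src" + out "$dest"
    done
}
```

For example, to convert the files in the current directory:

csv_to_tsv_all .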