sort

Sort rows in the dataset based on the values in specified columns.

Usage

gurita sort [-h] -c COLUMN [COLUMN ...] [--napos {first,last}]
            [--order {a,d} [{a,d} ...]]
            [--alg {quicksort,mergesort,heapsort,stable}]

Arguments

Argument

Description

Reference

  • -h

  • --help

display help for this command

help

  • -c COLUMN [COLUMN...]

  • --col COLUMN [COLUMN...]

sort data by these columns

sort columns

  • -o ORDER [ORDER ...]

  • --order ORDER [ORDER ...]

ascending or descending sort order (default: ascending)

sort order

--alg {quicksort,mergesort,heapsort,stable}

algorithm to use for sort (default: quicksort)

sort algorithm

--napos {first,last}

ordering for missing (NA) values (default: first)

NA order

The -c (--col) argument is required, and all other arguments are optional.

Simple example

Suppose we are working with the following small data set with 10 data rows that is stored in a file called example.csv:

sepal_length,sepal_width,petal_length,petal_width,species
6.3,3.4,5.6,2.4,virginica
6.3,2.5,5.0,1.9,virginica
4.8,3.4,1.9,0.2,setosa
6.3,3.3,4.7,1.6,versicolor
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
6.4,2.8,5.6,2.1,virginica
5.4,3.9,1.7,0.4,setosa
5.9,3.0,4.2,1.5,versicolor
5.2,3.5,1.5,0.2,setosa

The command below sorts the rows of the data based on the values in the sepal_width row in ascending order:

gurita sort -c sepal_width < example.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
6.3,2.5,5.0,1.9,virginica
6.4,2.8,5.6,2.1,virginica
5.9,3.0,4.2,1.5,versicolor
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
6.3,3.3,4.7,1.6,versicolor
6.3,3.4,5.6,2.4,virginica
4.8,3.4,1.9,0.2,setosa
5.2,3.5,1.5,0.2,setosa
5.4,3.9,1.7,0.4,setosa

After sorting all rows are in ascending order according to their value in the sepal_width column. Categorical columns are sorted in alpha-numeric order, as is common in most applications.

For example, sorting on the categorical species column can be done like so:

gurita sort -c species < example.csv

producing the following output:

sepal_length,sepal_width,petal_length,petal_width,species
4.8,3.4,1.9,0.2,setosa
4.7,3.2,1.3,0.2,setosa
5.4,3.9,1.7,0.4,setosa
5.2,3.5,1.5,0.2,setosa
6.3,3.3,4.7,1.6,versicolor
6.4,3.2,4.5,1.5,versicolor
5.9,3.0,4.2,1.5,versicolor
6.3,3.4,5.6,2.4,virginica
6.3,2.5,5.0,1.9,virginica
6.4,2.8,5.6,2.1,virginica

Getting help

The full set of command line arguments for sort can be obtained with the -h or --help arguments:

gurita sort -h

Selecting columns to use for sorting

-c COLUMN [COLUMN ...], --col COLUMN [COLUMN ...]

The sort command requires one or more columns to be specified.

Multi-column sorting behaves like most spreadsheet applications, where precedence goes from left to right in the order of the specified columns.

The following example sorts first on sepal_width and then on species:

gurita sort -c sepal_width species < example.csv

The relative order of rows that are tied on equal values of sepal_width will be determined by the corresponding values in the species column.

The result of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
6.3,2.5,5.0,1.9,virginica
6.4,2.8,5.6,2.1,virginica
5.9,3.0,4.2,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,4.7,1.6,versicolor
4.8,3.4,1.9,0.2,setosa
6.3,3.4,5.6,2.4,virginica
5.2,3.5,1.5,0.2,setosa
5.4,3.9,1.7,0.4,setosa

Observe that the two rows with a sepal_length of 3.2 are sorted based on species such that setosa comes before versicolor. Likewise, the two rows with a sepal_length of 3.4 are sorted based on species such that setosa comes before virginica.

Sorting in ascending or descending order

-o {a,d} [{a,d} ...], --order {a,d} [{a,d} ...]

By default rows are sorted in ascending order according to the values in the specified columns. However, this can be changed to descending order using the -o (or --order) argument. The direction of the ordering is given by the characters a (ascending) and d (descending).

The following command sorts the rows in descending order based on the values in the sepal_width column:

gurita sort -c sepal_width --order d < example.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
5.4,3.9,1.7,0.4,setosa
5.2,3.5,1.5,0.2,setosa
6.3,3.4,5.6,2.4,virginica
4.8,3.4,1.9,0.2,setosa
6.3,3.3,4.7,1.6,versicolor
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
5.9,3.0,4.2,1.5,versicolor
6.4,2.8,5.6,2.1,virginica
6.3,2.5,5.0,1.9,virginica

When sorting on multiple columns the ordering can be specified on a per-column basis.

The following command sorts rows in descending order based on the values in the sepal_width column and, when there are equal ties, it sorts in ascending order on values in the species column:

gurita sort -c sepal_width species --order d a < example.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
5.4,3.9,1.7,0.4,setosa
5.2,3.5,1.5,0.2,setosa
4.8,3.4,1.9,0.2,setosa
6.3,3.4,5.6,2.4,virginica
6.3,3.3,4.7,1.6,versicolor
4.7,3.2,1.3,0.2,setosa
6.4,3.2,4.5,1.5,versicolor
5.9,3.0,4.2,1.5,versicolor
6.4,2.8,5.6,2.1,virginica
6.3,2.5,5.0,1.9,virginica

Note

The columns specified with -c (or --col) and the ordering specified by -o (or --order) match up pairwise.

For example, in the scheme below, the order for column C1 is given by O1, and C2 is ordered by O2, and so forth:

gurita sort -c C1 C2 C3 -o O1 O2 O3

Because of this notation and the fact that ordering defaults to ascending, it is only necessary to specify all orderings to the left of column CN, where CN is the rightmost column with non-default ordering. Or in other words, it is redunant to specify ascending orderings on the rightmost columns.

For example the following command:

gurita sort -c C1 C2 C3 -o a d a

can be simplified to:

gurita sort -c C1 C2 C3 -o a d

because the rightmost column defaults to ascending, so there is no need to specify that explicitly.

Sorting algorithm

--alg {quicksort,mergesort,heapsort,stable}

By default data is sorted using the quicksort algorithm, however it is possible to choose from alternatives:

  • quicksort (default)

  • mergesort

  • heapsort

  • stable

For most use cases quicksort is likely to be the best choice, though do note that it can have poor runtime performance on data that is already mostly sorted. Also note that quicksort is not stable, as noted below.

Note

Quicksort is not a stable sorting algorithm. This means that when there are ties for equal values, the output rows may not retain the same relative order as the input data.

However, the sorting algorithm can be changed using the --alg option, and the mergesort and stable algorithms are stable.

Ordering of missing (NA) values

--napos {first,last}

Suppose that the example input data is modifed such that the first and last rows have missing values for sepal_length and stored in a file called missing.csv:

sepal_length,sepal_width,petal_length,petal_width,species
,3.4,5.6,2.4,virginica
6.3,2.5,5.0,1.9,virginica
4.8,3.4,1.9,0.2,setosa
6.3,3.3,4.7,1.6,versicolor
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
6.4,2.8,5.6,2.1,virginica
5.4,3.9,1.7,0.4,setosa
5.9,3.0,4.2,1.5,versicolor
,3.5,1.5,0.2,setosa

To sort the data on the speal_length column, we will need to decide what to do with the two rows with missing values.

By default Gurita will place rows with missing values at the end of the sorted data, like so:

gurita sort -c sepal_length < missing.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
4.7,3.2,1.3,0.2,setosa
4.8,3.4,1.9,0.2,setosa
5.4,3.9,1.7,0.4,setosa
5.9,3.0,4.2,1.5,versicolor
6.3,2.5,5.0,1.9,virginica
6.3,3.3,4.7,1.6,versicolor
6.4,3.2,4.5,1.5,versicolor
6.4,2.8,5.6,2.1,virginica
,3.4,5.6,2.4,virginica
,3.5,1.5,0.2,setosa

Observe that the two rows with missing sepal_length values appear last in the sorted data.

The --napos argument tells Gurita how to handle rows with missing values in the sorted columns. Setting it to first will cause rows with missing values in the sorted columns to appear first in the sorted output.

gurita sort -c sepal_length --napos first < missing.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
,3.4,5.6,2.4,virginica
,3.5,1.5,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.8,3.4,1.9,0.2,setosa
5.4,3.9,1.7,0.4,setosa
5.9,3.0,4.2,1.5,versicolor
6.3,2.5,5.0,1.9,virginica
6.3,3.3,4.7,1.6,versicolor
6.4,3.2,4.5,1.5,versicolor
6.4,2.8,5.6,2.1,virginica

Observe that the two rows with missing sepal_length values appear first in the sorted data.

It is also possible to remove rows or columns that contain missing values based on certain criteria using the dropna command.