.. gurita documentation master file, created by sphinx-quickstart on Wed Oct 14 18:01:41 2020. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Gurita: a command line data analytics and plotting tool ******************************************************* Gurita is a command line tool for transforming, analysing, and visualising tabular data in CSV or TSV format. At its core Gurita provides a :ref:`suite of commands `, each of which carries out a common data analytics or plotting task. Additionally, Gurita allows commands to be :ref:`chained together ` into flexible analysis pipelines. It is designed to be fast and convenient, and is particularly suited to data exploration tasks. Input files with large numbers of rows (> millions) are readily supported. Gurita commands are highly customisable, however sensible defaults are applied. Therefore simple tasks are easy to express and complex tasks are possible. Gurita is implemented in `Python `_ and makes extensive use of the `Pandas `_, `Seaborn `_, and `Scikit-learn `_ libraries for data processing and plot generation. Simple example -------------- The following examples use the `iris.csv `_ dataset as input. The file contains 150 data rows (plus one heading row) and 5 columns. A pretty display of the first and last 5 rows of the data can be viewed using the following command: .. code-block:: bash cat iris.csv | gurita pretty The command generates the following output that is displayed on the terminal: .. literalinclude:: example_outputs/iris_pretty.txt :language: none The following command generates a box plot from ``iris.csv``, such that the Y-axis represents the ``sepal_length`` numerical column, and the X-axis is grouped by the ``species`` categorical column. The goal of this plot is to show the distribution of sepal length of the three different species of iris flowers contained in the data set. .. code-block:: bash cat iris.csv | gurita box -x species -y sepal_length The above command generates an output file called ``box.species.sepal_length.png`` containing the following figure: .. image:: ../docs/_images/box.species.sepal_length.png :width: 600px :height: 600px :align: center :alt: Box plot showing the distribution of sepal length by species for the iris data set | If we wanted to see the median value of ``sepal_length`` for each ``species`` we can do this with the following ``groupby`` command: .. code-block:: text cat iris.csv | gurita groupby --key species --val sepal_length --fun median .. literalinclude:: example_outputs/iris_groupby_species_sepal_length_median.txt :language: none You can see that the corresponding median values match up with the values shown in the box plot. Advanced example ---------------- The next example illustrates Gurita's ability to :ref:`chain commands ` together: .. code-block:: bash cat iris.csv | gurita filter 'species != "virginica"' \ + sample 0.9 \ + pca \ + scatter -x pc1 -y pc2 --hue species The above command has been written on mulitple lines for clarity. Note that the backslash ``\`` indicates a line break in the command line syntax. Equivalently, the same command can be written in a single line, like so (where backslashes are no longer required): .. code-block:: bash cat iris.csv | gurita filter 'species != "virginica"' + sample 0.9 + pca + scatter -x pc1 -y pc2 --hue species In the command line syntax supported by Gurita, the plus sign ``+`` separates each command in a chain. This notation is fairly novel to Gurita and not something commonly seen in other command line programs. Input is read from the file called ``iris.csv`` on standard input and data is passed from left to right in the chain. Commands can modify the data as it is passed along. In this example there are four commands that are executed in the following order: 1. The ``filter 'species != "virginica"'`` command selects all rows where ``species`` is not equal to ``virginica``. 2. The filtered rows are then passed to the ``sample 0.9`` command which randomly selects 90% of the remaining rows. 3. The sampled rows are then passed to the ``pca`` command which performs principal component analysis (PCA), yielding two extra columns in the data called ``pc1`` and ``pc2``. 4. Finally the pca-transformed data is passed to the ``scatter -x pc1 -y pc2 --hue species`` command which generates a scatter plot of ``pc1`` and ``pc2`` (the first two principal components), and colours the points in the plot based on the categorical ``species`` column. The output of this entire Gurita command is the scatter plot below that is saved in a file called ``scatter.pc1.pc2.species.png`` .. image:: ../docs/_images/scatter.pc1.pc2.species.png :width: 700px :height: 600px :align: center :alt: Scatter plot comparing principal components pc1 and pc2 from a filtered iris dataset | An interesting aspect of this example is that additional columns are added to the data in the ``pca`` step, and these extra columns are then used to define the axes of the following scatter plot. Command chaining allows complex transformations to be combined together and makes Gurita a very powerful tool. .. toctree:: :caption: Overview :hidden: self license installation example_input_data .. toctree:: :caption: General behaviour :hidden: command_line_syntax input_output missing plotting list_of_commands .. toctree:: :hidden: :maxdepth: 1 :caption: Development contributing bug_reports feature_requests