Missing values
Gurita supports data sets with missing values.
Gurita uses Pandas to read, write and manipulate tabular data, and therefore inherits its behaviour for handling missing values from that library.
The working with missing data page in the Pandas documentation contains useful background information on this topic.
One of the most important consequences is that missing values are stored internally as NaN (short for "not a number"), a special code used to encode undefined or unrepresentable values.
Missing values in input data
Here are the contents of a small example CSV file that has two missing values:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
The first data row is missing a categorical value in the species column. This is indicated by the comma at the end of the line: there is no final value in the last column of the first data row.
The third data row is missing a numerical value in the sepal_width column. This is indicated by two adjacent commas without an intervening value between them.
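Because Gurita relies on Pandas to read data, the effect of these empty fields can be previewed in Pandas itself. The following is a minimal sketch, assuming Pandas is installed; it calls pandas.read_csv directly, the variable names are illustrative, and it is not Gurita's own code.

```python
import io
import pandas as pd

# The example CSV from above, with two empty fields.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
"""

df = pd.read_csv(io.StringIO(csv_text))

# isna() marks each cell that was read as a missing value.
missing_mask = df.isna()
print(missing_mask)
print("total missing:", int(missing_mask.sum().sum()))
```

The mask should flag the species cell of the first data row and the sepal_width cell of the third data row.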
Default symbols for missing data in input files
By default the following values are interpreted as missing values in input data: the empty string, #N/A, #N/A N/A, #NA, -1.#IND, -1.#QNAN, -NaN, -nan, 1.#IND, 1.#QNAN, <NA>, N/A, NA, NULL, NaN, n/a, nan, null.
Using N/A and null to represent missing values, the above data can equivalently be represented in the input file in the following way:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,N/A
4.9,3.0,1.4,0.2,virginica
4.7,null,1.3,0.2,setosa
In this case N/A is used on the first data row to indicate a missing species value, and null is used on the third data row to indicate a missing sepal_width value. It is not common to mix different representations of missing values in one data set, so this example is a bit unusual. However, it does show that it is possible.
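The equivalence of the two files can be checked in Pandas. This is a small sketch under the assumption that Pandas' default missing-value symbols apply; it uses pandas.read_csv directly rather than Gurita, and the variable names are illustrative.

```python
import io
import pandas as pd

header = "sepal_length,sepal_width,petal_length,petal_width,species\n"

# Version 1: missing values written as empty fields.
empty_fields = header + (
    "5.1,3.5,1.4,0.2,\n"
    "4.9,3.0,1.4,0.2,virginica\n"
    "4.7,,1.3,0.2,setosa\n"
)

# Version 2: missing values written as N/A and null.
symbols = header + (
    "5.1,3.5,1.4,0.2,N/A\n"
    "4.9,3.0,1.4,0.2,virginica\n"
    "4.7,null,1.3,0.2,setosa\n"
)

df_empty = pd.read_csv(io.StringIO(empty_fields))
df_symbols = pd.read_csv(io.StringIO(symbols))

# Compare which cells are missing in each version.
same_missing = df_empty.isna().equals(df_symbols.isna())
print(same_missing)
```

Note that sepal_width remains a numeric column in both versions, because null is read as a missing value rather than as text.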
Specifying symbols to use for missing data in input files
You can override the default symbols used for representing missing data in input files using the --navalues argument to the in command.
For example, suppose you want to use - (a single dash), NA and the empty string as symbols for missing values. You can specify this as follows:
cat example.csv | gurita in --navalues '-' '' 'NA'
Note that when --navalues is used the default missing value symbols no longer apply, and only those symbols given as arguments to --navalues will be used to represent missing values.
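The same replace-the-defaults behaviour can be reproduced in Pandas by combining the na_values and keep_default_na parameters of pandas.read_csv. The sketch below is only an illustration of that behaviour: whether Gurita passes exactly these parameters internally is an assumption, and the sample data is made up for the example.

```python
import io
import pandas as pd

# Hypothetical sample data using "-", "NA" and "null" in different cells.
csv_text = """sepal_length,sepal_width,species
5.1,-,setosa
4.9,3.0,NA
4.7,3.2,null
"""

# Only the listed symbols count as missing; the defaults are switched off.
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["-", "", "NA"],
    keep_default_na=False,
)

print(df)
print("total missing:", int(df.isna().sum().sum()))
```

Here the dash and NA are read as missing, while null is kept as an ordinary string because it is no longer in the list of missing-value symbols.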
Missing values in output data
When writing a data set as output, by default Gurita will use empty fields to indicate missing values. However, this can be overridden with the --na <str> argument, where <str> will be used to indicate a missing value.
For example, suppose that the previous small data set with missing values is stored in a file called missing.csv. The following command will feed the contents of that file into the standard input of Gurita, which will then write the data back out to standard output with missing values indicated by underscores:
cat missing.csv | gurita out --na '_'
The resulting data will be represented like so:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,_
4.9,3.0,1.4,0.2,virginica
4.7,_,1.3,0.2,setosa
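For comparison, Pandas offers the same choice through the na_rep parameter of DataFrame.to_csv, which defaults to an empty string. The following sketch assumes Pandas is installed and works on the example data directly; it illustrates the Pandas behaviour and is not a description of Gurita's internals.

```python
import io
import pandas as pd

csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
"""

df = pd.read_csv(io.StringIO(csv_text))

# With no path argument, to_csv returns the CSV text as a string.
# na_rep sets the symbol written in place of each missing value.
output_lines = df.to_csv(index=False, na_rep="_").splitlines()
for line in output_lines:
    print(line)
```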
Dropping rows or columns containing missing values
The dropna command allows you to remove rows or columns that contain missing values based on certain criteria.
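The options of Gurita's dropna command are not listed here, so the sketch below shows only the Pandas method of the same name, DataFrame.dropna, applied to the example data. It assumes Pandas is installed, and any correspondence between these parameters and Gurita's command-line arguments is an assumption.

```python
import io
import pandas as pd

csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
"""

df = pd.read_csv(io.StringIO(csv_text))

# Remove every row that contains at least one missing value.
complete_rows = df.dropna(axis="index", how="any")

# Remove every column that contains at least one missing value.
complete_columns = df.dropna(axis="columns", how="any")

print(complete_rows)
print(list(complete_columns.columns))
```

In this data set only the second data row is complete, and only the sepal_length, petal_length and petal_width columns are free of missing values.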