dropna

Drop rows or columns from the data that contain missing (NA) values.

Usage

gurita dropna [-h] [--axis AXIS] [--how METHOD] [--thresh N] [-c COLUMN [COLUMN ...]]

Arguments

Argument                             Description                                                       Reference
-h, --help                           display help for this command                                     help
--axis {rows,columns}                drop either rows or columns (default: rows)                       axis
--thresh N                           keep only those rows/columns with at least N non-missing values   thresh
-c [COLUMN ...], --col [COLUMN ...]  consider only specified columns when dropping rows                columns
--how {any,all}                      drop rows/columns containing any or all missing values            how

Note

For information about how missing data is represented see the documentation on missing (NA) values.

Simple example

Consider the following contents of a CSV file that has two missing values. In the following examples we will assume that this data is stored in a file called missing.csv.

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa

The first data row is missing a categorical value in the species column. The third data row is missing a numerical value in the sepal_width column.

The following command drops every row that has a missing value in at least one column:

gurita dropna < missing.csv

The result of the above command is shown below, where only the middle row of the input data remains:

sepal_length,sepal_width,petal_length,petal_width,species
4.9,3.0,1.4,0.2,virginica
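
The default rule can be illustrated with a minimal Python sketch that uses only the standard library. It is not Gurita's implementation. The function names are hypothetical, and the sketch assumes that an empty field is the only representation of a missing value.

```python
import csv
import io

# The contents of missing.csv from the example above.
MISSING_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
"""

def is_missing(value):
    # Assumption: only an empty field counts as missing in this sketch.
    return value == ""

def drop_rows_with_missing(text):
    """Return (header, rows), keeping only rows with no missing values."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    kept = [row for row in reader if not any(is_missing(v) for v in row)]
    return header, kept

header, kept = drop_rows_with_missing(MISSING_CSV)
print(",".join(header))
for row in kept:
    print(",".join(row))
```

Run on the example data, the sketch keeps only the middle data row, matching the output shown above.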

Getting help

The full set of command line arguments for dropna can be obtained with the -h or --help arguments:

gurita dropna -h

Choose between dropping rows or columns

--axis {rows,columns}

By default dropna removes rows from the dataset, but it can drop columns instead.

You can choose between dropping rows or columns with the --axis argument.

The following command drops every column that has a missing value in at least one row:

gurita dropna --axis columns < missing.csv

The result of the above command is shown below, where the species and sepal_width columns have been removed because each contains a missing value:

sepal_length,petal_length,petal_width
5.1,1.4,0.2
4.9,1.4,0.2
4.7,1.3,0.2
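
The same rule applied along the other axis can be sketched in plain Python as follows. This is an illustration of the column-wise behaviour described above, not Gurita's own code. The function name is hypothetical, and an empty field is assumed to be the only kind of missing value.

```python
import csv
import io

# The contents of missing.csv from the simple example.
MISSING_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
"""

def drop_columns_with_missing(text):
    """Return all rows (header first) with incomplete columns removed."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    # Keep a column index only if no data row has an empty field there.
    keep = [i for i in range(len(header))
            if all(row[i] != "" for row in data)]
    return [[row[i] for i in keep] for row in rows]

result = drop_columns_with_missing(MISSING_CSV)
for row in result:
    print(",".join(row))
```

All three data rows survive. Only the columns that contain an empty field are removed.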

Set a minimum number of non-missing values

--thresh N

By default dropna drops rows or columns that contain at least one missing value. In other words, it retains only rows or columns that have no missing values.

The --thresh N argument sets a threshold N, such that only rows or columns with at least N non-missing values in them will be retained. This can be useful when you want to ensure that a minimum number of data values are present.

The following example requires at least 5 non-missing values across columns to be present for a row to be retained:

gurita dropna --thresh 5 < missing.csv

And the following example requires at least 3 non-missing values across rows to be present for a column to be retained:

gurita dropna --thresh 3 --axis columns < missing.csv
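
To see what these thresholds would select in missing.csv, the counting rule can be sketched in plain Python. This is a hedged illustration of the rule as described above, not Gurita's implementation. The helper names are hypothetical, and an empty field is assumed to be the only kind of missing value.

```python
import csv
import io

# The contents of missing.csv from the simple example.
MISSING_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
"""

def count_present(values):
    # Assumption: an empty field is the only kind of missing value here.
    return sum(1 for v in values if v != "")

def keep_rows(rows, thresh):
    """Keep rows that have at least `thresh` non-missing values."""
    return [row for row in rows if count_present(row) >= thresh]

def keep_columns(header, rows, thresh):
    """Keep column names that have at least `thresh` non-missing values."""
    return [name for i, name in enumerate(header)
            if count_present(row[i] for row in rows) >= thresh]

reader = csv.reader(io.StringIO(MISSING_CSV))
header = next(reader)
rows = list(reader)

rows_thresh_5 = keep_rows(rows, 5)  # only the complete row has 5 values
rows_thresh_4 = keep_rows(rows, 4)  # every row has at least 4 values
cols_thresh_3 = keep_columns(header, rows, 3)

print(rows_thresh_5)
print(len(rows_thresh_4))
print(cols_thresh_3)
```

Under this rule a threshold of 5 keeps only the complete middle row, while a threshold of 4 keeps all three rows because each is missing at most one value. A column threshold of 3 keeps only the three columns that have a value in every row.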

Consider only specified columns when dropping rows

-c [COLUMN ...], --col [COLUMN ...]

By default, when dropping rows, dropna will look for missing values in all columns. The --col option lets you specify a subset of columns to consider for missing values.

Note

This option does not apply when --axis columns is also used.

The following command drops only those rows that have a missing value in the species column:

gurita dropna --col species < missing.csv

The output of the above command is shown below:

sepal_length,sepal_width,petal_length,petal_width,species
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa

Only the first row from the input data has a missing value in the species column, so only that row is dropped in the above example. All other rows are retained. Note that the third row from the input data is retained even though it contains a missing value, because that missing value does not occur in the species column.
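
Restricting the check to a subset of columns can be sketched in plain Python as shown below. This is an illustration only, not Gurita's implementation. The function name and its `columns` parameter are hypothetical, and an empty field is assumed to be the only kind of missing value.

```python
import csv
import io

# The contents of missing.csv from the simple example.
MISSING_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
"""

def drop_rows_missing_in(text, columns):
    """Keep rows that have a value in every one of the named columns."""
    reader = csv.DictReader(io.StringIO(text))
    # Fields outside `columns` are ignored, so they may be empty.
    return [row for row in reader if all(row[c] != "" for c in columns)]

by_species = drop_rows_missing_in(MISSING_CSV, ["species"])
by_sepal_width = drop_rows_missing_in(MISSING_CSV, ["sepal_width"])

print([row["species"] for row in by_species])
print([row["species"] for row in by_sepal_width])
```

Checking only species removes the first row, whereas checking only sepal_width removes the third row instead.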

Drop rows/columns containing any or all missing values

--how {any,all}

By default dropna will drop rows or columns that have any (at least one) missing value. However, this behaviour can be changed with the --how all option, which only drops rows or columns where all the values are missing.

The following example drops rows that are missing all their values:

gurita dropna --how all < missing.csv
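
No row in missing.csv is missing all of its values, so under the rule described above this command would be expected to leave the data unchanged. The difference between any and all is easier to see with a row that is entirely empty. The plain Python sketch below adds one such hypothetical row to the example data. It is not Gurita's implementation, the function name is hypothetical, and an empty field is assumed to be the only kind of missing value.

```python
import csv
import io

# missing.csv plus one hypothetical extra row in which every field is empty.
EXTENDED_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,
4.9,3.0,1.4,0.2,virginica
4.7,,1.3,0.2,setosa
,,,,
"""

def drop_rows(rows, how="any"):
    # "any": drop a row if at least one field is missing.
    # "all": drop a row only if every field is missing.
    condition = any if how == "any" else all
    return [row for row in rows if not condition(v == "" for v in row)]

reader = csv.reader(io.StringIO(EXTENDED_CSV))
header = next(reader)
rows = list(reader)

kept_any = drop_rows(rows, how="any")
kept_all = drop_rows(rows, how="all")

print(len(kept_any))
print(len(kept_all))
```

With any, only the complete row survives. With all, the three original rows survive and only the entirely empty row is dropped.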