describe
Print summary statistics of the columns in a dataset to standard output (stdout).
The summary includes the following information:
count: the number of non-empty data values observed for the column
For categorical columns:
unique: the number of unique values observed for the column
top: the most frequently observed value
freq: the frequency (count) of the most frequently observed value
For numerical columns:
mean: the mean (average)
std: the standard deviation
min: the minimum observed value
25%: the 25th percentile
50%: the 50th percentile
75%: the 75th percentile
max: the maximum observed value
Usage
gurita describe [-h] [-c [COLUMN ...]]
Arguments
Argument |
Description |
Reference |
---|---|---|
|
display help for this command |
|
|
select columns to describe (default is all columns) |
See also
The pretty command prints a fragment of the dataset in an aligned tabular format to standard output.
Simple example
Describe all the columns in the titanic.csv
file:
gurita describe < titanic.csv
The output for the above command is as follows:
survived pclass sex age sibsp parch \
count 891.000000 891.000000 891 714.000000 891.000000 891.000000
unique NaN NaN 2 NaN NaN NaN
top NaN NaN male NaN NaN NaN
freq NaN NaN 577 NaN NaN NaN
mean 0.383838 2.308642 NaN 29.699118 0.523008 0.381594
std 0.486592 0.836071 NaN 14.526497 1.102743 0.806057
min 0.000000 1.000000 NaN 0.420000 0.000000 0.000000
25% 0.000000 2.000000 NaN 20.125000 0.000000 0.000000
50% 0.000000 3.000000 NaN 28.000000 0.000000 0.000000
75% 1.000000 3.000000 NaN 38.000000 1.000000 0.000000
max 1.000000 3.000000 NaN 80.000000 8.000000 6.000000
fare embarked class who adult_male deck embark_town alive \
count 891.000000 889 891 891 891 203 889 891
unique NaN 3 3 3 2 7 3 2
top NaN S Third man True C Southampton no
freq NaN 644 491 537 537 59 644 549
mean 32.204208 NaN NaN NaN NaN NaN NaN NaN
std 49.693429 NaN NaN NaN NaN NaN NaN NaN
min 0.000000 NaN NaN NaN NaN NaN NaN NaN
25% 7.910400 NaN NaN NaN NaN NaN NaN NaN
50% 14.454200 NaN NaN NaN NaN NaN NaN NaN
75% 31.000000 NaN NaN NaN NaN NaN NaN NaN
max 512.329200 NaN NaN NaN NaN NaN NaN NaN
alone
count 891
unique 2
top True
freq 537
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Getting help
The full set of command line arguments for describe
can be obtained with the -h
or --help
arguments:
gurita describe -h
Select specific columns to describe
-c [COLUMN ...], --col [COLUMN ...]
By default describe
prints information about all columns in a dataset.
Alternatively, a subset of the columns can be selected using the -c/--col
argument.
As an example, The following commmand only shows summary information for the age
and class
columns in the file titanic.csv
:
gurita describe --col age class < titanic.csv
The output of the above command is as follows:
age class
count 714.000000 891
unique NaN 3
top NaN Third
freq NaN 491
mean 29.699118 NaN
std 14.526497 NaN
min 0.420000 NaN
25% 20.125000 NaN
50% 28.000000 NaN
75% 38.000000 NaN
max 80.000000 NaN
Usage in a command chain
When used in a command chain the describe
command passes on its input data to the rest of the chain unchanged.
For example, the following command shows describe
followed by a box
plot:
gurita describe + box -x sex -y age < titanic.csv
This command will first run describe
to display a summary of the data on the output, and then it will run box
to generate a plot on the same input data.
Because describe
just passes the data along from left to right the box
command receives the same data as its input that was read from the file.
It is also worth noting that describe
can be used after other transformations have been applied to the data. For example, the data can be filtered first, and then the result of filtering can be fed into describe
:
gurita filter 'age >= 30' + describe < titanic.csv
The output of the above command is as follows:
survived pclass sex age sibsp parch \
count 330.000000 330.000000 330 330.000000 330.000000 330.000000
unique NaN NaN 2 NaN NaN NaN
top NaN NaN male NaN NaN NaN
freq NaN NaN 216 NaN NaN NaN
mean 0.406061 1.948485 NaN 41.948485 0.339394 0.357576
std 0.491842 0.861405 NaN 10.259568 0.545743 0.918910
min 0.000000 1.000000 NaN 30.000000 0.000000 0.000000
25% 0.000000 1.000000 NaN 34.000000 0.000000 0.000000
50% 0.000000 2.000000 NaN 39.000000 0.000000 0.000000
75% 1.000000 3.000000 NaN 48.000000 1.000000 0.000000
max 1.000000 3.000000 NaN 80.000000 3.000000 6.000000
fare embarked class who adult_male deck embark_town alive \
count 330.000000 328 330 330 330 119 328 330
unique NaN 3 3 2 2 6 3 2
top NaN S First man True C Southampton no
freq NaN 252 131 216 216 34 252 196
mean 41.079331 NaN NaN NaN NaN NaN NaN NaN
std 61.533893 NaN NaN NaN NaN NaN NaN NaN
min 0.000000 NaN NaN NaN NaN NaN NaN NaN
25% 9.521875 NaN NaN NaN NaN NaN NaN NaN
50% 25.929200 NaN NaN NaN NaN NaN NaN NaN
75% 52.000000 NaN NaN NaN NaN NaN NaN NaN
max 512.329200 NaN NaN NaN NaN NaN NaN NaN
alone
count 330
unique 2
top True
freq 202
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Notice that the minimal value of age
is now 30.