zscore
Compute the z-score (sometimes called standard score) for values in numerical columns.
The z-score is the number of standard deviations a value is above or below the mean of a set of values. Values above the mean have a positive z-score and values below the mean have a negative z-score. The z-score of a value equal to the mean is zero.
Given a set of values, if M is the mean of the set and S is the standard deviation of the set then the z-score of some value x is defined to be:
z-score = (x - M) / S
The z-score is commonly used for normalization and is particularly useful when the data comes from a normal distrbution.
The zscore
command adds a new column to the dataset storing the computed z-scores for the values in the corresponding input column.
Z-scores for multiple input columns can be computed (separately) by the same command.
Usage
gurita zscore [-h] [-c NAME [NAME ...]] [--suffix SUFFIX]
Arguments
Argument |
Description |
Reference |
---|---|---|
|
display help for this command |
|
|
determine z-scores in specified numerical columns |
|
|
choose a column name suffix for new z-score columns (default: zscore) |
See also
Outliers can be detected using the outlier
command.
Simple example
The following command computes z-scores for the sepal_width
column in the iris.csv
dataset:
gurita zscore -c sepal_width < iris.csv
The output is quite long so we can adjust the command to look at only the first few rows using the head command:
gurita zscore -c sepal_width + head < iris.csv
The output of the above command is as follows:
sepal_length,sepal_width,petal_length,petal_width,species,sepal_width_zscore
5.1,3.5,1.4,0.2,setosa,1.0320572244889565
4.9,3.0,1.4,0.2,setosa,-0.12495760117130933
4.7,3.2,1.3,0.2,setosa,0.3378483290927974
4.6,3.1,1.5,0.2,setosa,0.10644536396074403
5.0,3.6,1.4,0.2,setosa,1.2634601896210098
A new numerical column called sepal_width_zscore
is added to the dataset, this holds the computed z-scores for the corresponding values in the sepal_width
column.
Getting help
The full set of command line arguments for zscore
can be obtained with the -h
or --help
arguments:
gurita zscore -h
Compute z-scores in specified numerical columns
-c NAME [NAME ...], --col NAME [NAME ...]
By default, if no column names are specified, z-scores will be computed in all of the numerical columns in the dataset, one at a time.
For example, the following command computes z-scores in each of the numerical columns in the iris.csv
dataset separately (these are: sepal_length
, sepal_width
, petal_length
, petal_width
).
gurita zscore < iris.csv
Sometimes it is useful to specify a subset of columns in which to compute z-scores. This can be achieved with the -c/--col
argument.
In the following example z-scores are computed in only the sepal_length
and petal_width
columns:
gurita zscore -c sepal_length petal_width < iris.csv
By chaining this command with head
we can inspect the first few rows of the output:
gurita zscore -c sepal_length petal_width + head < iris.csv
The output of the above command is as follows:
sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_zscore,petal_width_zscore
5.1,3.5,1.4,0.2,setosa,-0.9006811702978088,-1.3129767272601454
4.9,3.0,1.4,0.2,setosa,-1.1430169111851105,-1.3129767272601454
4.7,3.2,1.3,0.2,setosa,-1.3853526520724133,-1.3129767272601454
4.6,3.1,1.5,0.2,setosa,-1.5065205225160652,-1.3129767272601454
5.0,3.6,1.4,0.2,setosa,-1.0218490407414595,-1.3129767272601454
In the above example we can see that z-scores are computed in just sepal_length
and petal_width
. Two new numerical columns called
sepal_length_zscore
and petal_width_zscore
are added to the dataset.
Note that in the sample of data shown in the output above all rows have the same petal_width
value and hence the corresponding values in petal_width_zscore
are also all identical.
Note
Non-numeric columns will be ignored by zscore
even if they are specified as arguments to -c/--col
.
Choose a column name suffix for new z-score columns
--suffix SUFFIX
The zscore
command adds extra numerical columns to the dataset to store the z-score values for the corresponding input columns.
The names of these extra columns are constructed by adding the suffix zscore
on to the end of the input column names, separated by an underscore.
This can be changed with the --suffix
argument.
The following command specifies that z
should be used as the suffix for the newly added columns:
gurita zscore --suffix z < iris.csv
By chaining this command with head
we can inspect the first few rows of the output:
gurita zscore --suffix z + head < iris.csv
The output of the above command is as follows:
sepal_length,sepal_width,petal_length,petal_width,species,sepal_length_z,sepal_width_z,petal_length_z,petal_width_z
5.1,3.5,1.4,0.2,setosa,-0.9006811702978088,1.0320572244889565,-1.3412724047598314,-1.3129767272601454
4.9,3.0,1.4,0.2,setosa,-1.1430169111851105,-0.12495760117130933,-1.3412724047598314,-1.3129767272601454
4.7,3.2,1.3,0.2,setosa,-1.3853526520724133,0.3378483290927974,-1.3981381087490836,-1.3129767272601454
4.6,3.1,1.5,0.2,setosa,-1.5065205225160652,0.10644536396074403,-1.284406700770579,-1.3129767272601454
5.0,3.6,1.4,0.2,setosa,-1.0218490407414595,1.2634601896210098,-1.3412724047598314,-1.3129767272601454