pivot
Reshape data from narrow format to wide format.
Sometimes wide format is called stacked and narrow format is called unstacked or long.
For example here is a small table in narrow format representing working hours for two employees on each weekday, where employees are also associated with a level:
person,feature,val
Alice,level,A1
Bob,level,B3
Alice,mon,8
Bob,mon,0
Alice,tue,8
Bob,tue,0
Alice,wed,4
Bob,wed,4
Alice,thu,1
Bob,thu,6
Alice,fri,4
Bob,fri,0
And here is the same data in wide format:
person,level,mon,tue,wed,thu,fri
Alice,A1,8,8,4,1,4
Bob,B3,0,0,4,6,0
In this particular example the person
column has been used as an index for the output rows, and the values in the feature
column have been unstacked into column headings, with values from the val
column populating the rows.
An alternative wide representation of the same dataset is shown below, where the feature
column
has been used as an index of the output rows, and the values in the person
column have been
unstacked into new column headings. As before the values from the val
column populate the rows.
feature,Alice,Bob
fri,4,0
level,A1,B3
mon,8,0
thu,1,6
tue,8,0
wed,4,4
Usage
gurita pivot [-h] -c COLUMN [COLUMN ...] -i COLUMN [COLUMN ...]
[-v COLUMN [COLUMN ...]] [-f FUNCTION [FUNCTION ...]]
Arguments
Argument |
Description |
Reference |
---|---|---|
|
display help for this command |
|
|
use these columns as the index (required) |
|
|
unstack these columns (required) |
|
|
populate the rows from these columns |
|
|
Aggregation function |
See also
Getting help
The full set of command line arguments for pivot
can be obtained with the -h
or --help
arguments:
gurita pivot -h
Example
Suppose the following data is stored in a file called example.csv
:
person,feature,val
Alice,level,A1
Bob,level,B3
Alice,mon,8
Bob,mon,0
Alice,tue,8
Bob,tue,0
Alice,wed,4
Bob,wed,4
Alice,thu,1
Bob,thu,6
Alice,fri,4
Bob,fri,0
This is an example of data in “narrow format” (or “unstacked” or “long”).
The pivot
command can convert the data into “wide format”, as demonstrated in the following example:
gurita pivot -i person -c feature -v val < example.csv
The output of the above command is as follows:
person,fri,level,mon,thu,tue,wed
Alice,4,A1,8,1,8,4
Bob,0,B3,0,6,0,4
In this example the person
column is used as the index of the output data; it acts like a key for the new rows. There are two distinct values for person
(Alice
and Bob
), so there are two rows in the output data.
The remaining new columns in the output data are derived from the values in the feature
column (fri
, level
, mon
, thu
, tue
, wed
). You’ll note that the new columns have been generated in an arbitrary order, however,
usually this is not important because the order of the columns in the data rarely matters.
The values in the corresponding rows are determined from the val
column. For instance, in the original data set, given Alice
as a key, level
is associated uniqely with the value A1
. Therefore, in the output data, the row for Alice
in the level
column contains A1
.
As an alternative example, we could pivot the data in a different way, by using the feature
column as the index, and unstacking the person
column:
gurita pivot -i feature -c person -v val < example.csv
The output of the above command is as follows:
feature,Alice,Bob
fri,4,0
level,A1,B3
mon,8,0
thu,1,6
tue,8,0
wed,4,4
In this example there are six values in the feature
column, so there are correspondingly six rows in the output. Also, the two unique values in the person
column have been unstacked into two new columns in the output.
These two examples show that pivot
provides a mechanism to structure and view the same dataset in multiple ways.
The inverse of pivot
is melt
. For example the following commands show how to invert the
pivoted data in wide format back to an equivalent of the input long format:
gurita pivot -i person -c feature -v val + melt -i person -v mon tue wed thu fri level --varname feature --valname val < example.csv
The output of the above command is as follows:
person,feature,val
Alice,mon,8
Bob,mon,0
Alice,tue,8
Bob,tue,0
Alice,wed,4
Bob,wed,4
Alice,thu,1
Bob,thu,6
Alice,fri,4
Bob,fri,0
Alice,level,A1
Bob,level,B3
gurita pivot -i feature -c person -v val + melt -i feature -v Alice Bob --varname person --valname val < example.csv
The output of the above command is as follows:
feature,person,val
fri,Alice,4
level,Alice,A1
mon,Alice,8
thu,Alice,1
tue,Alice,8
wed,Alice,4
fri,Bob,0
level,Bob,B3
mon,Bob,0
thu,Bob,6
tue,Bob,0
wed,Bob,4
Note that in both examples using pivot
followed by melt
the output data is not in
exactly the same order as the origial input data. Some of the rows and columns have been
reordered. However, the data is semantically equivalent to the original data because row and column
ordering does not normally matter. Regardless of structure, the data represents the same information.
Missing data
Consider the following modified version of the data set from above, where the row Alice,level,A1
has been removed. In this new dataset Bob
has an associated level
by Alice
does not.
person,feature,val
Bob,level,B3
Alice,mon,8
Bob,mon,0
Alice,tue,8
Bob,tue,0
Alice,wed,4
Bob,wed,4
Alice,thu,1
Bob,thu,6
Alice,fri,4
Bob,fri,0
We can pivot the data as before with the following command (suppose the input file is now called missing.csv
):
gurita pivot -i person -c feature -v val < missing.csv
The output of the above command is as follows:
person,fri,level,mon,thu,tue,wed
Alice,4,,8,1,8,4
Bob,0,B3,0,6,0,4
The level
column exists, as before, in the output. This is because level
appears in the input feature
column.
Even though level
is only defined for Bob
, all output rows must have this column. Since level
is not defined for Alice
in the input data, the only
reasonable thing to do is make this cell empty in the output.
There are many other ways that the dataset could have a missing level
value for Alice
.
For example, the data could contain the following row:
Alice,level,
In all such cases the output of the pivot
command would be the same thing, to account for
the missing level
associated with Alice
. Similarly for other missing data.
Missing data can be removed from the dataset using the dropna command.
Reshaping data
-i COLUMN [COLUMN ...]
--index COLUMN [COLUMN ...]
-c COLUMN [COLUMN ...]
--cols COLUMN [COLUMN ...]
-v COLUMN [COLUMN ...]
--vals COLUMN [COLUMN ...]
When unstacking a dataset the pivot
command groups data together into output rows
using an index (or a key), computed from one or more input columns. This is a required
argument.
Suppose we have the following dataset in long format stored in a file called example.csv
:
person,level,variable,value
Alice,A1,sun,0
Bob,B3,sun,4
Wei,B1,sun,0
Imani,A2,sun,0
Diego,C2,sun,3
Alice,A1,mon,8
Bob,B3,mon,0
Wei,B1,mon,0
Imani,A2,mon,8
Diego,C2,mon,7
Alice,A1,tue,8
Bob,B3,tue,0
Wei,B1,tue,8
Imani,A2,tue,8
Diego,C2,tue,7
One way to convert the data into wide format is to group data into output rows based on an index from the person
and level
columns, with the variable
column unstacked into new output columns, and the output rows populated by the value
column:
gurita pivot -i person level -c variable -v value < example.csv
The output of the above command is as follows:
person,level,mon,sun,tue
Alice,A1,8,0,8
Bob,B3,0,4,0
Diego,C2,7,3,7
Imani,A2,8,0,8
Wei,B1,0,0,8
In this case we can see that the output rows are indexed by a key formed from the person
and level
input columns. For instance all entries for Alice
and A1
are grouped together.
In a more contrived example, we could form an index from the level
and variable
input columns and unstack the person
column, like so:
gurita pivot -i level variable -c person -v value < example.csv
The output of the above command is shown below:
level,variable,Alice,Bob,Diego,Imani,Wei
A1,mon,8.0,,,,
A1,sun,0.0,,,,
A1,tue,8.0,,,,
A2,mon,,,,8.0,
A2,sun,,,,0.0,
A2,tue,,,,8.0,
B1,mon,,,,,0.0
B1,sun,,,,,0.0
B1,tue,,,,,8.0
B3,mon,,0.0,,,
B3,sun,,4.0,,,
B3,tue,,0.0,,,
C2,mon,,,7.0,,
C2,sun,,,3.0,,
C2,tue,,,7.0,,
Aggregating multiple values
Suppose Alice
has two values associated with level
(B2
and A1
), where previously
there was only one:
person,feature,val
Alice,level,B2
Alice,level,A1
Bob,level,B3
Alice,mon,8
Bob,mon,0
Alice,tue,8
Bob,tue,0
Alice,wed,4
Bob,wed,4
Alice,thu,1
Bob,thu,6
Alice,fri,4
Bob,fri,0
We can try to pivot the data as before:
gurita pivot -i person -c feature -v val < example.csv
However, this generates an error:
gurita ERROR: Error: Index contains duplicate entries, cannot reshape; exiting
The problem is that the output data must have a single cell corresponding to the level
associated with Alice
. However, in this data set there are two level
values for Alice
, and pivot
does not know how to resolve this multiplicity.
This situation can be resolved by specifying an aggregating function to map multiple values into a single result.
For example, we could use -f sample
to pick a random value from the collection of possibilities:
gurita pivot -i person -c feature -v val -f sample < example.csv
The output of the above command is as follows:
person,fri,level,mon,thu,tue,wed
Alice,4,B2,8,1,8,4
Bob,0,B3,0,6,0,4
Note that the output may differ each time the command is run because it chooses a level
value for Alice
at random.
The behaviour of the aggregating function for pivot
is similar to that of the groupby command.
Allowed aggregation functions
The following aggregating functions can be used with -f/--fun
:
sample (randomly choose one of the possible values)
size (size of the group)
sum
mean
mad (mean absolute deviation)
median
min
max
prod
std (standard deviation)
var (variance)
sem (standard error of the mean)
skew
quantile (50% quantile)