sample

Randomly sample rows, either a fixed number or a fraction of the rows in the dataset.

Usage

gurita sample [-h] NUM

Arguments

Argument	Description	Reference
`-h` `--help`	display help for this command	help
`NUM`	number or fraction of rows to sample	sample number or fraction

Simple example

The example below randomly samples 10 rows from the iris.csv dataset:

gurita sample 10 < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
6.5,3.0,5.2,2.0,virginica
6.8,3.2,5.9,2.3,virginica
5.0,3.2,1.2,0.2,setosa
5.0,3.4,1.6,0.4,setosa
6.2,2.8,4.8,1.8,virginica
6.7,3.0,5.2,2.3,virginica
6.2,3.4,5.4,2.3,virginica
5.3,3.7,1.5,0.2,setosa
5.8,4.0,1.2,0.2,setosa
6.3,3.3,4.7,1.6,versicolor

The example below randomly samples a fraction of 0.05 (5%) of the rows from the iris.csv dataset:

gurita sample 0.05 < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
7.7,2.8,6.7,2.0,virginica
5.5,2.4,3.8,1.1,versicolor
4.6,3.4,1.4,0.3,setosa
5.7,2.9,4.2,1.3,versicolor
6.6,2.9,4.6,1.3,versicolor
5.4,3.9,1.7,0.4,setosa
5.1,3.8,1.5,0.3,setosa
5.1,3.8,1.6,0.2,setosa

Note

Because sampling is random, multiple runs of the same command may pick different rows.

Getting help

The full set of command line arguments for sample can be obtained with the -h or --help arguments:

gurita sample -h

Number or fraction of rows to sample

The sample command takes one non-negative numeric argument, NUM. Its value determines whether a fixed number of rows or a fraction of the rows is sampled:

NUM >= 1: the argument specifies a fixed number of rows to sample
0 <= NUM < 1: the argument specifies the fraction of the input rows to sample
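
The rule above can be sketched in Python. This is an illustration only, not gurita's actual implementation: the function name `rows_requested` is hypothetical, dropping the fractional part of a NUM of 1 or more is an assumption, and rounding a fractional row count to the nearest whole row is also an assumption (it is consistent with the 8 rows returned for 0.05 in the earlier example, given that the standard iris dataset has 150 rows).

```python
def rows_requested(num: float, total_rows: int) -> int:
    """Return how many rows a given NUM asks for (hypothetical helper)."""
    if num < 0:
        raise ValueError("NUM must be non-negative")
    if num >= 1:
        # NUM >= 1: a fixed number of rows.
        # Assumption: any fractional part is dropped.
        return int(num)
    # 0 <= NUM < 1: a fraction of the input rows.
    # Assumption: the result is rounded to the nearest whole row.
    return round(num * total_rows)


print(rows_requested(10, 150))    # 10
print(rows_requested(0.05, 150))  # 8
```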

Note

The maximum number of rows you can randomly sample is bounded by the total number of rows in the input file. If you ask to randomly sample more than the total number of rows, no sampling will be done, and all the input rows will be used.
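
The bound described in this note might look like the following Python sketch. It is not taken from gurita's source: `sample_capped` is a hypothetical name, and sampling without replacement and keeping the input order when every row is returned are both assumptions.

```python
import random


def sample_capped(rows: list, count: int) -> list:
    """Sample `count` rows, or return every row if `count` is too large."""
    if count >= len(rows):
        # At least as many rows requested as exist: no sampling is done
        # and all input rows are used (assumed to keep their order).
        return list(rows)
    # Otherwise pick `count` distinct rows at random
    # (assumption: sampling is without replacement).
    return random.sample(rows, count)


rows = ["a", "b", "c"]
print(sample_capped(rows, 5))       # ['a', 'b', 'c']
print(len(sample_capped(rows, 2)))  # 2
```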