sample
Randomly sample rows, either a fixed number or a fraction of the rows in the dataset.
Usage
gurita sample [-h] NUM
Arguments
Argument | Description | Reference |
---|---|---|
-h | display help for this command | Getting help |
NUM | number or fraction of rows to sample | Number or fraction of rows to sample |
Simple example
The example below randomly samples 10 rows from the iris.csv dataset:
gurita sample 10 < iris.csv
The output of the above command is as follows:
sepal_length,sepal_width,petal_length,petal_width,species
6.5,3.0,5.2,2.0,virginica
6.8,3.2,5.9,2.3,virginica
5.0,3.2,1.2,0.2,setosa
5.0,3.4,1.6,0.4,setosa
6.2,2.8,4.8,1.8,virginica
6.7,3.0,5.2,2.3,virginica
6.2,3.4,5.4,2.3,virginica
5.3,3.7,1.5,0.2,setosa
5.8,4.0,1.2,0.2,setosa
6.3,3.3,4.7,1.6,versicolor
The example below randomly samples a fraction of 0.05 (5%) of the rows from the iris.csv dataset:
gurita sample 0.05 < iris.csv
The output of the above command is as follows:
sepal_length,sepal_width,petal_length,petal_width,species
7.7,2.8,6.7,2.0,virginica
5.5,2.4,3.8,1.1,versicolor
4.6,3.4,1.4,0.3,setosa
5.7,2.9,4.2,1.3,versicolor
6.6,2.9,4.6,1.3,versicolor
5.4,3.9,1.7,0.4,setosa
5.1,3.8,1.5,0.3,setosa
5.1,3.8,1.6,0.2,setosa
Note
Because sampling is random, multiple runs of the same command may pick different rows.
Getting help
The full set of command line arguments for sample can be obtained with the -h or --help arguments:
gurita sample -h
Number or fraction of rows to sample
The sample command takes one non-negative numerical argument NUM. Either a fixed number of rows or a fraction of rows will be sampled:
NUM >= 1: the argument specifies a fixed number of rows to sample
0 <= NUM < 1: the argument specifies a fraction of the number of rows from the input data to be sampled
Note
The maximum number of rows you can randomly sample is bounded by the total number of rows in the input file. If you ask to randomly sample more than the total number of rows, no sampling will be done, and all the input rows will be used.
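The rules for NUM described above can be illustrated with a small Python sketch. This is not Gurita's implementation. The function names resolve_sample_size and sample_rows are hypothetical, and both the rounding of fractional row counts and the truncation of non-integer values of NUM >= 1 are assumptions made only for illustration.

```python
import random


def resolve_sample_size(num, total_rows):
    """Translate NUM into a row count (hypothetical helper, not Gurita code)."""
    if num < 0:
        raise ValueError("NUM must be non-negative")
    if num >= 1:
        # Fixed number of rows, bounded by the number of input rows.
        # Truncating a non-integer NUM here is an assumption.
        return min(int(num), total_rows)
    # 0 <= NUM < 1: a fraction of the input rows.
    # Rounding to the nearest whole row is an assumption.
    return round(num * total_rows)


def sample_rows(rows, num, rng=None):
    """Return a random sample of rows according to NUM."""
    if rng is None:
        rng = random.Random()
    size = resolve_sample_size(num, len(rows))
    if size >= len(rows):
        # Asking for at least as many rows as exist: no sampling, keep everything.
        return list(rows)
    # Sample without replacement, so no row is picked twice.
    return rng.sample(rows, size)
```

For example, sample_rows(rows, 10) picks 10 rows, sample_rows(rows, 0.5) picks about half of them, and sample_rows(rows, 500) on a 150-row input returns all 150 rows unchanged.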