sample

Randomly sample rows, either a fixed number or a fraction of the rows in the dataset.

Usage

gurita sample [-h] NUM

Arguments

Argument      Description                            Reference
-h, --help    display help for this command          help
NUM           number or fraction of rows to sample   sample number or fraction

Simple example

The example below randomly samples 10 rows from the iris.csv dataset:

gurita sample 10 < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
6.5,3.0,5.2,2.0,virginica
6.8,3.2,5.9,2.3,virginica
5.0,3.2,1.2,0.2,setosa
5.0,3.4,1.6,0.4,setosa
6.2,2.8,4.8,1.8,virginica
6.7,3.0,5.2,2.3,virginica
6.2,3.4,5.4,2.3,virginica
5.3,3.7,1.5,0.2,setosa
5.8,4.0,1.2,0.2,setosa
6.3,3.3,4.7,1.6,versicolor

The example below randomly samples a fraction of 0.05 (5%) of the rows from the iris.csv dataset:

gurita sample 0.05 < iris.csv

The output of the above command is as follows:

sepal_length,sepal_width,petal_length,petal_width,species
7.7,2.8,6.7,2.0,virginica
5.5,2.4,3.8,1.1,versicolor
4.6,3.4,1.4,0.3,setosa
5.7,2.9,4.2,1.3,versicolor
6.6,2.9,4.6,1.3,versicolor
5.4,3.9,1.7,0.4,setosa
5.1,3.8,1.5,0.3,setosa
5.1,3.8,1.6,0.2,setosa

Note

Because sampling is random, multiple runs of the same command may pick different rows.

Getting help

The full set of command line arguments for sample can be obtained with the -h or --help arguments:

gurita sample -h

Number or fraction of rows to sample

The sample command takes one non-negative numerical argument NUM. Either a fixed number of rows or a fraction of rows will be sampled:

  • NUM >= 1: the argument specifies a fixed number of rows to sample

  • 0 <= NUM < 1: the argument specifies the fraction of the input rows to sample
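
The rule above can be sketched in Python. This is only an illustration, not Gurita's implementation: the function name resolve_sample_size is hypothetical, and both the truncation of a non-integer NUM >= 1 and the rounding of a fractional row count to the nearest integer are assumptions.

```python
def resolve_sample_size(num: float, total_rows: int) -> int:
    """Translate NUM into a row count (hypothetical helper, not part of Gurita)."""
    if num < 0:
        raise ValueError("NUM must be non-negative")
    if num >= 1:
        # NUM >= 1: treat it as a fixed number of rows.
        # Truncating a non-integer value is an assumption.
        return int(num)
    # 0 <= NUM < 1: treat it as a fraction of the input rows.
    # Rounding to the nearest integer is an assumption.
    return round(num * total_rows)


print(resolve_sample_size(10, 150))   # fixed number of rows: 10
print(resolve_sample_size(0.2, 150))  # 20% of 150 rows: 30
```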

Note

The maximum number of rows you can randomly sample is bounded by the total number of rows in the input file. If you ask to randomly sample more than the total number of rows, no sampling will be done, and all the input rows will be used.
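
The capping behaviour described in this note can be illustrated with a short Python sketch that uses only the standard library. It is not Gurita's code: sample_rows is a hypothetical name, and a plain list stands in for the rows of the input data.

```python
import random


def sample_rows(rows: list, count: int) -> list:
    """Randomly pick `count` rows without replacement (hypothetical helper)."""
    if count >= len(rows):
        # More rows requested than are available: skip sampling
        # and return every input row in its original order.
        return list(rows)
    return random.sample(rows, count)


rows = list(range(150))             # stand-in for 150 data rows
print(len(sample_rows(rows, 10)))   # 10
print(len(sample_rows(rows, 500)))  # 150, capped at the input size
```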