B. Benchmark Scripts

The benchmark scripts are in the paratext git repo. The data files are available as an EBS snapshot (ID snap-e96b3609), except for the 1TB+ files, which can be generated with the programs provided.

Benchmark Configuration

Each benchmark has a JSON file that describes the benchmark. It includes:

the method tested,
the parameters for the method,
the input data file,
the statistics to record,
the state of the disk,
and the filename of the log to store the results.

Here is an example:

{
 "log": "numpy/run-7e0abc55.log", 
 "cmd": "numpy", 
 "no_header": true, 
 "filename": "mnist8m.csv", 
 "disk_state": "warm", 
 "sum_after": true
}
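
As a rough illustration of how such a file can be consumed, the sketch below parses the example configuration with Python's standard json module. The load_config helper is hypothetical and is not part of the benchmark scripts. The check that a "cmd" key is present is an assumption about what a valid configuration requires.

```python
import json

# Example configuration copied from the text above.
CONFIG_TEXT = """
{
 "log": "numpy/run-7e0abc55.log",
 "cmd": "numpy",
 "no_header": true,
 "filename": "mnist8m.csv",
 "disk_state": "warm",
 "sum_after": true
}
"""


def load_config(text):
    """Parse a benchmark configuration (hypothetical helper)."""
    config = json.loads(text)
    # Assumption: every configuration names its method under "cmd".
    if "cmd" not in config:
        raise ValueError("configuration must name a method under 'cmd'")
    return config


config = load_config(CONFIG_TEXT)
print(config["cmd"], config["filename"], config["disk_state"])
```

JSON booleans such as true become Python True, so flags like no_header and sum_after can be tested directly.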

Benchmark Generator Script

The generate_experiments.py file generates all the benchmark configuration files. A folder is created for each method:

avgcols: tests out-of-core CSV parsing and summing/averaging
feather: tests the Feather binary format
numpy: tests the np.loadtxt text reader
paratext: tests the ParaText multi-threaded CSV loader
pyspark: tests Spark DataFrames and SparkCSV
sframe: tests SFrames and SFrame.read_csv
countnl: tests parallel newline counting
disk-to-mem: tests disk-to-memory performance
hdf5: tests HDF5 binary format
npy: tests NPY binary format
pandas: tests Pandas DataFrames and CSV loading
pickle: tests Pickle
R-fread: tests the data.table package's fread function
R-readcsv: tests the R read.csv function
R-readr: tests the readr package's read_csv function

Each benchmark has a unique hash that is used as part of the filename, e.g., experiments/pandas/run-7ea6cdc0.json.
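
The text does not say how the hash is computed. The sketch below shows one way a short, stable identifier could be derived from a configuration, purely as an assumption and not necessarily what generate_experiments.py does. The config_hash and config_path helpers, the choice of SHA-1, and the 8-character length are all hypothetical. Only the experiments/<method>/run-<hash>.json layout comes from the example above.

```python
import hashlib
import json


def config_hash(config, length=8):
    """Derive a short hash from a configuration (hypothetical scheme)."""
    # Exclude the log path, since it embeds the hash itself.
    settings = {k: v for k, v in config.items() if k != "log"}
    # Sorting the keys makes the hash independent of key order.
    canonical = json.dumps(settings, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:length]


def config_path(config, root="experiments"):
    """Build a path of the form experiments/<method>/run-<hash>.json."""
    return "%s/%s/run-%s.json" % (root, config["cmd"], config_hash(config))


warm = {"cmd": "numpy", "filename": "mnist8m.csv", "disk_state": "warm"}
cold = dict(warm, disk_state="cold")
print(config_path(warm))
print(config_path(cold))
```

Under this scheme, two configurations that differ in any setting get different filenames, while regenerating the same configuration yields the same filename.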

Benchmark Run Script

The run_experiment.py script takes in a JSON benchmark file and runs a single trial of the benchmark. For example:

python run_experiment.py experiments/pandas/run-7ea6cdc0.json log_dir=/bench-logs/

runs the Pandas benchmark with hash 7ea6cdc0.

The experiment script never overwrites the log file; it always appends to it. Simply run the command again to conduct another trial. We performed 3 trials for every benchmark.

A key=value pair in the arguments overrides the key of the same name in the JSON configuration file. Passing a dash (-) in place of the filename runs a benchmark defined solely by key=value command-line arguments:

python run_experiment.py - cmd=countnl filename=myfile.csv

This will count newlines in the myfile.csv file. These commands can be used to test your environment and installation. The log is sent to standard output by default.
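
The override rule described above can be sketched as follows. This is a minimal illustration, not the actual code of run_experiment.py: the parse_overrides and build_config helpers are hypothetical, and keeping every overridden value as a string is an assumption (the real script may convert types).

```python
import json


def parse_overrides(pairs):
    """Turn ["key=value", ...] into a dict (hypothetical helper)."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % pair)
        # Assumption: values are kept as plain strings.
        overrides[key] = value
    return overrides


def build_config(args):
    """Merge a JSON file (or "-" for none) with key=value overrides."""
    source, pairs = args[0], args[1:]
    config = {}
    if source != "-":
        with open(source) as f:
            config = json.load(f)
    # Command-line pairs win over keys of the same name in the file.
    config.update(parse_overrides(pairs))
    return config


print(build_config(["-", "cmd=countnl", "filename=myfile.csv"]))
# prints {'cmd': 'countnl', 'filename': 'myfile.csv'}
```

Applying the overrides last is what gives command-line arguments precedence over the file.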

The compile_log_files.py script compiles a collection of log files into CSV files:

python compile_log_files.py logs/J/raid0-2/ logs/J/raid0-8/

It generates a CSV file for each method:

log-avgcols.csv
log-feather.csv
log-numpy.csv
log-paratext.csv
log-pyspark.csv
log-sframe.csv
log-countnl.csv
log-hdf5.csv
log-npy.csv
log-pandas.csv
log-pickle.csv
log-R-fread.csv
log-R-readcsv.csv
log-R-readr.csv

Here is an example of the contents of a compiled log file:

$ head -7 log-numpy.csv 
cmd,did,disk_state,filename,filesize,mem,no_header,runtime,sum_after,sum_time,tput_MB,tput_MiB,tput_bytes,log_key,ds
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243205.75,1.0,6603.34097195,1.0,206.93819809,2.26558582399,2.16063101195,2265585.82399,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243184.558594,1.0,6586.10460901,1.0,192.314244032,2.27151504343,2.16628555625,2271515.04343,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243174.839844,1.0,6599.698488,1.0,223.564539194,2.26683623869,2.16182350033,2266836.23869,run-80065715,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243246.957031,1.0,6522.92135906,1.0,219.32184577,2.29351771599,2.18726893996,2293517.71599,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243179.617188,1.0,7426.95527506,1.0,679.885360003,2.01434304408,1.92102722557,2014343.04408,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243262.519531,1.0,7516.90489602,1.0,1008.10578609,1.99023878896,1.89803961655,1990238.78896,run-7e0abc55,mnist8m
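
A compiled log can be summarized with a few lines of Python. The sketch below embeds the sample rows shown above and averages the runtime per disk state using only the standard library. The helper names are hypothetical. The throughput formula (filesize divided by runtime, then by 10^6) is inferred from the sample rows rather than stated by the scripts, so treat it as an assumption.

```python
import csv
import io
from collections import defaultdict

# Header and rows copied from the sample output above.
LOG_TEXT = """\
cmd,did,disk_state,filename,filesize,mem,no_header,runtime,sum_after,sum_time,tput_MB,tput_MiB,tput_bytes,log_key,ds
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243205.75,1.0,6603.34097195,1.0,206.93819809,2.26558582399,2.16063101195,2265585.82399,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243184.558594,1.0,6586.10460901,1.0,192.314244032,2.27151504343,2.16628555625,2271515.04343,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243174.839844,1.0,6599.698488,1.0,223.564539194,2.26683623869,2.16182350033,2266836.23869,run-80065715,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243246.957031,1.0,6522.92135906,1.0,219.32184577,2.29351771599,2.18726893996,2293517.71599,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243179.617188,1.0,7426.95527506,1.0,679.885360003,2.01434304408,1.92102722557,2014343.04408,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243262.519531,1.0,7516.90489602,1.0,1008.10578609,1.99023878896,1.89803961655,1990238.78896,run-7e0abc55,mnist8m
"""


def load_rows(text):
    """Read compiled log text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def mean_runtime_by_disk_state(rows):
    """Average the runtime column for each disk_state value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["disk_state"]].append(float(row["runtime"]))
    return {state: sum(v) / len(v) for state, v in groups.items()}


def recomputed_tput_mb(row):
    """Recompute tput_MB; the formula is inferred from the sample rows."""
    return float(row["filesize"]) / float(row["runtime"]) / 1e6


rows = load_rows(LOG_TEXT)
means = mean_runtime_by_disk_state(rows)
for state in sorted(means):
    print("%s: mean runtime %.2f" % (state, means[state]))
for row in rows:
    print(row["log_key"], row["tput_MB"], "%.5f" % recomputed_tput_mb(row))
```

To analyze a real compiled log, the same functions can be applied to the contents of a file such as log-numpy.csv instead of the embedded text.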
