B. Benchmark Scripts
The benchmark scripts are in the paratext git repo. The data files used are available as an EBS snapshot with ID snap-e96b3609, except for the 1TB+ files, which are easily generated with the programs provided.
Benchmark Configuration
Each benchmark has a JSON file that describes the benchmark. It includes:
- the method tested,
- the parameters for the method,
- the input data file,
- the statistics to record,
- the state of the disk,
- and the filename of the log to store the results.
Here is an example:
{
  "log": "numpy/run-7e0abc55.log",
  "cmd": "numpy",
  "no_header": true,
  "filename": "mnist8m.csv",
  "disk_state": "warm",
  "sum_after": true
}
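As a rough illustration, a configuration like the one above could be loaded and sanity-checked as follows. The `load_config` helper and the choice of required keys (`cmd` and `filename`) are assumptions made for this sketch; they are not taken from the benchmark scripts.

```python
import json

# The example configuration from above, embedded as a string for illustration.
EXAMPLE_CONFIG = """
{
  "log": "numpy/run-7e0abc55.log",
  "cmd": "numpy",
  "no_header": true,
  "filename": "mnist8m.csv",
  "disk_state": "warm",
  "sum_after": true
}
"""

# Assumption: every benchmark needs at least a method and an input file.
REQUIRED_KEYS = ("cmd", "filename")


def load_config(text):
    """Parse a JSON benchmark description and check the assumed required keys."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("missing configuration keys: " + ", ".join(missing))
    return config


if __name__ == "__main__":
    config = load_config(EXAMPLE_CONFIG)
    print(config["cmd"], config["filename"], config["disk_state"])
```

JSON booleans such as `no_header` arrive as Python `True`/`False`, so they can be tested directly.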
Benchmark Generator Script
The generate_experiments.py file generates all the benchmark configuration files. A folder is created for each method:
- avgcols: tests out-of-core CSV parsing and summing/averaging
- feather: tests the Feather binary format
- numpy: tests the np.loadtxt text reader
- paratext: tests the ParaText multi-threaded CSV loader
- pyspark: tests Spark Data Frames and SparkCSV
- sframe: tests SFrames and SFrame.read_csv
- countnl: tests parallel newline counting
- disk-to-mem: tests disk-to-memory performance
- hdf5: tests HDF5 binary format
- npy: tests NPY binary format
- pandas: tests Pandas DataFrames and CSV loading
- pickle: tests Pickle
- R-fread: tests the data.table package's fread function
- R-readcsv: tests the R read.csv function
- R-readr: tests the readr package's read_csv function
Each benchmark has a unique hash that is used as part of the filename, e.g., experiments/pandas/run-7ea6cdc0.json.
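The text does not say how the hash is computed. The sketch below shows one way a stable 8-digit identifier could be derived from a benchmark's parameters and turned into filenames of the shape shown above. The hashing scheme (SHA-1 over key-sorted JSON, truncated to 8 hex digits), the helper names, and the use of `cmd` as the folder name are all assumptions for illustration, so the digests it produces will not match the real ones.

```python
import hashlib
import json


def config_hash(params):
    """Return an 8-hex-digit digest of the parameters (hypothetical scheme)."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]


def config_paths(params, root="experiments"):
    """Build the config and log filenames for one benchmark."""
    run_id = "run-" + config_hash(params)
    method = params["cmd"]  # assumption: the folder is named after the method
    config_path = "{}/{}/{}.json".format(root, method, run_id)
    log_path = "{}/{}.log".format(method, run_id)
    return config_path, log_path


if __name__ == "__main__":
    params = {"cmd": "pandas", "filename": "mnist8m.csv", "disk_state": "cold"}
    print(config_paths(params))
```

Hashing the parameters before the `log` key is filled in lets the log filename carry the same identifier as the configuration file.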
Benchmark Run Script
The run_experiment.py script takes in a JSON benchmark file and runs a single trial of the benchmark. For example:
python run_experiment.py experiments/pandas/run-7ea6cdc0.json log_dir=/bench-logs/
runs the Pandas benchmark with hash 7ea6cdc0.
The experiment script never overwrites the log file; it always appends to it. Simply run the command again to conduct another trial. We performed 3 trials for every benchmark.
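Because each invocation appends one trial to the log, repeating trials amounts to running the same command several times. A minimal sketch of a driver for this is shown below. The `trial_commands` and `run_trials` helpers are hypothetical, and the sketch assumes run_experiment.py is in the current working directory.

```python
import subprocess
import sys


def trial_commands(config_path, trials=3, overrides=()):
    """Return one argument list per trial; every trial uses the same command."""
    command = [sys.executable, "run_experiment.py", config_path, *overrides]
    return [list(command) for _ in range(trials)]


def run_trials(config_path, trials=3, overrides=()):
    """Run the trials one after another, stopping at the first failure."""
    for command in trial_commands(config_path, trials, overrides):
        # check=True raises CalledProcessError if a trial exits with an error.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_trials(...) to execute them.
    for command in trial_commands("experiments/pandas/run-7ea6cdc0.json",
                                  overrides=["log_dir=/bench-logs/"]):
        print(" ".join(command))
```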
A key=value pair in the arguments overrides the key of the same name in the JSON configuration file. A dash (-) in place of the filename runs a benchmark defined solely by key=value command line arguments.
python run_experiment.py - cmd=countnl filename=myfile.csv
This will count newlines in the myfile.csv file. These commands can be used to test your environment and installation. The log is sent to standard output by default.
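The override behaviour can be pictured as a dictionary merge in which command-line pairs win over the JSON file, and a dash corresponds to starting from an empty configuration. The sketch below is an illustration only: `parse_overrides` and `merge_config` are hypothetical names, and interpreting values as JSON literals where possible (so that `true` becomes a boolean) is an assumption about how types are handled.

```python
import json


def parse_overrides(args):
    """Turn ['key=value', ...] into a dictionary of overrides."""
    overrides = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got: " + repr(arg))
        try:
            # Assumption: values such as true or 3 are read as JSON literals.
            overrides[key] = json.loads(value)
        except json.JSONDecodeError:
            # Anything else (filenames, method names) is kept as a string.
            overrides[key] = value
    return overrides


def merge_config(base, args):
    """Return a new configuration in which command-line pairs take priority."""
    merged = dict(base)
    merged.update(parse_overrides(args))
    return merged


if __name__ == "__main__":
    # A dash on the command line corresponds to an empty base configuration.
    print(merge_config({}, ["cmd=countnl", "filename=myfile.csv"]))
```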
The compile_log_files.py script compiles a collection of log files into CSV files:
python compile_log_files.py logs/J/raid0-2/ logs/J/raid0-8/
It generates a CSV file for each method:
log-avgcols.csv
log-feather.csv
log-numpy.csv
log-paratext.csv
log-pyspark.csv
log-sframe.csv
log-countnl.csv
log-hdf5.csv
log-npy.csv
log-pandas.csv
log-pickle.csv
log-R-fread.csv
log-R-readcsv.csv
log-R-readr.csv
Here is an example of the contents of a compiled log file:
$ head -7 log-numpy.csv
cmd,did,disk_state,filename,filesize,mem,no_header,runtime,sum_after,sum_time,tput_MB,tput_MiB,tput_bytes,log_key,ds
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243205.75,1.0,6603.34097195,1.0,206.93819809,2.26558582399,2.16063101195,2265585.82399,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243184.558594,1.0,6586.10460901,1.0,192.314244032,2.27151504343,2.16628555625,2271515.04343,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243174.839844,1.0,6599.698488,1.0,223.564539194,2.26683623869,2.16182350033,2266836.23869,run-80065715,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243246.957031,1.0,6522.92135906,1.0,219.32184577,2.29351771599,2.18726893996,2293517.71599,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243179.617188,1.0,7426.95527506,1.0,679.885360003,2.01434304408,1.92102722557,2014343.04408,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243262.519531,1.0,7516.90489602,1.0,1008.10578609,1.99023878896,1.89803961655,1990238.78896,run-7e0abc55,mnist8m
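In the rows above, the throughput columns are consistent with tput_bytes = filesize / runtime, tput_MB = tput_bytes / 10^6, and tput_MiB = tput_bytes / 2^20. The sketch below checks that relationship on two of the sample rows and averages the runtime per disk state, using only the standard csv module. The helper names are hypothetical, and the derivation is inferred from the sample data rather than from the compile script itself.

```python
import csv
import io
from collections import defaultdict

# Header plus the first cold and first warm rows from the sample above.
SAMPLE = """cmd,did,disk_state,filename,filesize,mem,no_header,runtime,sum_after,sum_time,tput_MB,tput_MiB,tput_bytes,log_key,ds
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243205.75,1.0,6603.34097195,1.0,206.93819809,2.26558582399,2.16063101195,2265585.82399,run-80065715,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243246.957031,1.0,6522.92135906,1.0,219.32184577,2.29351771599,2.18726893996,2293517.71599,run-7e0abc55,mnist8m
"""


def load_rows(text):
    """Read compiled log text into a list of dictionaries keyed by column."""
    return list(csv.DictReader(io.StringIO(text)))


def derived_throughput(row):
    """Recompute (bytes/s, MB/s, MiB/s) from the filesize and runtime columns."""
    bytes_per_s = float(row["filesize"]) / float(row["runtime"])
    return bytes_per_s, bytes_per_s / 1e6, bytes_per_s / 2 ** 20


def mean_runtime_by_disk_state(rows):
    """Average the runtime column separately for each disk_state value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["disk_state"]].append(float(row["runtime"]))
    return {state: sum(times) / len(times) for state, times in groups.items()}


if __name__ == "__main__":
    rows = load_rows(SAMPLE)
    for row in rows:
        print(row["disk_state"], derived_throughput(row), row["tput_bytes"])
    print(mean_runtime_by_disk_state(rows))
```

To analyze a real compiled file, the same functions can be fed the file's contents, for example `load_rows(open("log-numpy.csv").read())`.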