B. Benchmark Scripts

The benchmark scripts are in the paratext git repository. The data files are available as an EBS snapshot with ID snap-e96b3609, except for the files of 1 TB or more, which can be generated with the programs provided.

Benchmark Configuration

Each benchmark is described by a JSON file. The file specifies:

  • the method tested,
  • the parameters for the method,
  • the input data file,
  • the statistics to record,
  • the state of the disk,
  • and the filename of the log to store the results.

Here is an example:

{
 "log": "numpy/run-7e0abc55.log", 
 "cmd": "numpy", 
 "no_header": true, 
 "filename": "mnist8m.csv", 
 "disk_state": "warm", 
 "sum_after": true
}
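
As a rough illustration, a configuration like the one above can be read with Python's standard json module. The helper name describe_config is hypothetical and is not part of the benchmark scripts; the sketch only assumes the six keys shown in the example.

```python
import json

# The example configuration shown above, embedded as a string so that
# the sketch runs without any files on disk.
EXAMPLE_CONFIG = """
{
 "log": "numpy/run-7e0abc55.log",
 "cmd": "numpy",
 "no_header": true,
 "filename": "mnist8m.csv",
 "disk_state": "warm",
 "sum_after": true
}
"""


def describe_config(text):
    """Parse a benchmark configuration and return a one-line summary.

    Hypothetical helper: the real scripts may read the file differently.
    """
    config = json.loads(text)
    # Extra keys (no_header, sum_after) are simply ignored by format().
    return "{cmd} on {filename} ({disk_state} disk) -> {log}".format(**config)


config = json.loads(EXAMPLE_CONFIG)
summary = describe_config(EXAMPLE_CONFIG)
print(summary)
```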

Benchmark Generator Script

The generate_experiments.py script generates all the benchmark configuration files. It creates one folder per method:

  • avgcols: tests out-of-core CSV parsing and summing/averaging
  • feather: tests the Feather binary format
  • numpy: tests the np.loadtxt text reader
  • paratext: tests the ParaText multi-threaded CSV loader
  • pyspark: tests Spark Data Frames and SparkCSV
  • sframe: tests SFrames and SFrame.read_csv
  • countnl: tests parallel newline counting
  • disk-to-mem: tests disk-to-memory performance
  • hdf5: tests HDF5 binary format
  • npy: tests NPY binary format
  • pandas: tests Pandas DataFrames and CSV loading
  • pickle: tests Pickle
  • R-fread: tests the data.table package's fread function
  • R-readcsv: tests the R read.csv function
  • R-readr: tests the readr package's read_csv function

Each benchmark has a unique hash that is used as part of the filename, e.g., experiments/pandas/run-7ea6cdc0.json.
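
The naming scheme can be taken apart again when post-processing results. The sketch below assumes the layout experiments/<method>/run-<hash>.json and a lowercase hexadecimal hash, both inferred only from the example path; split_experiment_path is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import PurePosixPath

# Assumed layout, inferred from the example path only:
#   experiments/<method>/run-<hash>.json
RUN_NAME = re.compile(r"^run-([0-9a-f]+)$")


def split_experiment_path(path):
    """Return (method, run_hash) for a benchmark configuration path."""
    p = PurePosixPath(path)
    match = RUN_NAME.match(p.stem)
    if match is None or p.suffix != ".json":
        raise ValueError("not a benchmark configuration path: %s" % path)
    # The method is the name of the folder that contains the file.
    return p.parent.name, match.group(1)


method, run_hash = split_experiment_path("experiments/pandas/run-7ea6cdc0.json")
print(method, run_hash)
```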

Benchmark Run Script

The run_experiment.py script takes in a JSON benchmark file and runs a single trial of the benchmark. For example:

python run_experiment.py experiments/pandas/run-7ea6cdc0.json log_dir=/bench-logs/

runs the Pandas benchmark with hash 7ea6cdc0.

The experiment script never overwrites the log file; it always appends to it. To conduct another trial, run the command again. We performed 3 trials for every benchmark.
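
The append-only behaviour can be pictured with a short sketch. The log format used here (one JSON object per line) and the helper names append_trial and count_trials are assumptions made for illustration; the actual log format written by run_experiment.py is not described in this appendix.

```python
import json
import os
import tempfile


def append_trial(log_path, record):
    """Append one trial's results to log_path, one JSON object per line.

    The on-disk log format is an assumption for this sketch.
    """
    # Mode "a" creates the file if needed and never truncates existing data.
    with open(log_path, "a") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")


def count_trials(log_path):
    """Count the non-empty lines, i.e. the trials recorded so far."""
    with open(log_path) as log:
        return sum(1 for line in log if line.strip())


# Three placeholder "trials" written to a temporary log file.
log_path = os.path.join(tempfile.mkdtemp(), "run-demo.log")
for trial in range(3):
    append_trial(log_path, {"cmd": "countnl", "trial": trial})
trial_count = count_trials(log_path)
print(trial_count)
```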

A key=value pair on the command line overrides the value of the same key in the JSON configuration file. Passing a dash (-) in place of the filename runs a benchmark defined solely by key=value command-line arguments. For example:

python run_experiment.py - cmd=countnl filename=myfile.csv

This counts the newlines in myfile.csv. Commands like this are useful for testing your environment and installation. The log is sent to standard output by default.
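
The override rule can be sketched in a few lines of Python. This is not the code of run_experiment.py: build_config is a hypothetical helper, and it keeps every override as a string, whereas the real script may convert values to other types.

```python
import json
import os
import tempfile


def build_config(config_path, overrides):
    """Combine a JSON configuration with key=value command-line overrides.

    A config_path of "-" starts from an empty configuration. Values are
    kept as strings here; the real script may convert types.
    """
    config = {}
    if config_path != "-":
        with open(config_path) as f:
            config = json.load(f)
    for arg in overrides:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % arg)
        config[key] = value  # the override wins over the JSON value
    return config


# Dash case: the benchmark is defined only by the key=value arguments.
dash_config = build_config("-", ["cmd=countnl", "filename=myfile.csv"])
print(dash_config)

# File case: write a small placeholder configuration, then override one key.
demo_path = os.path.join(tempfile.mkdtemp(), "run-demo.json")
with open(demo_path, "w") as f:
    json.dump({"cmd": "pandas", "disk_state": "warm"}, f)
file_config = build_config(demo_path, ["disk_state=cold"])
print(file_config)
```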

The compile_log_files.py script compiles a collection of log files into CSV files:

python compile_log_files.py logs/J/raid0-2/ logs/J/raid0-8/

It generates a CSV file for each method:

  • log-avgcols.csv
  • log-feather.csv
  • log-numpy.csv
  • log-paratext.csv
  • log-pyspark.csv
  • log-sframe.csv
  • log-countnl.csv
  • log-hdf5.csv
  • log-npy.csv
  • log-pandas.csv
  • log-pickle.csv
  • log-R-fread.csv
  • log-R-readcsv.csv
  • log-R-readr.csv

Here is an example of the contents of a compiled log file:

$ head -7 log-numpy.csv 
cmd,did,disk_state,filename,filesize,mem,no_header,runtime,sum_after,sum_time,tput_MB,tput_MiB,tput_bytes,log_key,ds
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243205.75,1.0,6603.34097195,1.0,206.93819809,2.26558582399,2.16063101195,2265585.82399,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243184.558594,1.0,6586.10460901,1.0,192.314244032,2.27151504343,2.16628555625,2271515.04343,run-80065715,mnist8m
numpy,raid0-2,cold,mnist8m.csv,14960435697.0,243174.839844,1.0,6599.698488,1.0,223.564539194,2.26683623869,2.16182350033,2266836.23869,run-80065715,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243246.957031,1.0,6522.92135906,1.0,219.32184577,2.29351771599,2.18726893996,2293517.71599,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243179.617188,1.0,7426.95527506,1.0,679.885360003,2.01434304408,1.92102722557,2014343.04408,run-7e0abc55,mnist8m
numpy,raid0-2,warm,mnist8m.csv,14960435697.0,243262.519531,1.0,7516.90489602,1.0,1008.10578609,1.99023878896,1.89803961655,1990238.78896,run-7e0abc55,mnist8m
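
A compiled log can be analysed with Python's standard csv module. The sketch below embeds the six rows shown above, trimmed to the columns it uses, and does two things: it averages the runtime per disk state, and it checks a relationship that the sample rows appear to satisfy, namely tput_bytes = filesize / runtime, tput_MB = tput_bytes / 10^6 and tput_MiB = tput_bytes / 2^20. That reading of the columns is an inference from the numbers, not something stated by the scripts, and the helper names are hypothetical.

```python
import csv
import io
import math
from collections import defaultdict

# The six sample rows shown above, trimmed to the columns used here.
SAMPLE = """disk_state,filesize,runtime,tput_MB,tput_MiB,tput_bytes
cold,14960435697.0,6603.34097195,2.26558582399,2.16063101195,2265585.82399
cold,14960435697.0,6586.10460901,2.27151504343,2.16628555625,2271515.04343
cold,14960435697.0,6599.698488,2.26683623869,2.16182350033,2266836.23869
warm,14960435697.0,6522.92135906,2.29351771599,2.18726893996,2293517.71599
warm,14960435697.0,7426.95527506,2.01434304408,1.92102722557,2014343.04408
warm,14960435697.0,7516.90489602,1.99023878896,1.89803961655,1990238.78896
"""


def throughput_is_consistent(row):
    """Check the assumed relationship between the throughput columns."""
    per_unit_time = float(row["filesize"]) / float(row["runtime"])
    return (
        math.isclose(per_unit_time, float(row["tput_bytes"]), rel_tol=1e-6)
        and math.isclose(per_unit_time / 1e6, float(row["tput_MB"]), rel_tol=1e-6)
        and math.isclose(per_unit_time / 2**20, float(row["tput_MiB"]), rel_tol=1e-6)
    )


def mean_runtime_by_disk_state(rows):
    """Average the runtime column separately for each disk_state value."""
    runtimes = defaultdict(list)
    for row in rows:
        runtimes[row["disk_state"]].append(float(row["runtime"]))
    return {state: sum(v) / len(v) for state, v in runtimes.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
all_consistent = all(throughput_is_consistent(row) for row in rows)
means = mean_runtime_by_disk_state(rows)
print(all_consistent)
for state in sorted(means):
    print(state, round(means[state], 2))
```

To run the same analysis on a real compiled log, replace io.StringIO(SAMPLE) with an open file handle for a file such as log-numpy.csv.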
