A. CSV Throughput

We now compare the CSV loaders in cold and warm filesystem-cache states. The plots are drawn in log scale to highlight the relative differences between the methods.

Wise ParaText achieves the highest load throughput on all data sets in both cold and warm states.

Throughput drops when the columns are subsequently summed. The drop is especially large for Spark, even when the DataFrame is cached, and is more pronounced still on text-heavy and categorical-heavy CSV files.
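To make the load-versus-load-and-sum comparison concrete, here is a minimal timing sketch of how such throughput numbers can be measured. It assumes pandas and a local CSV file; it is an illustration of the methodology, not the benchmark's actual harness.

```python
import os
import time

import pandas as pd

def throughput_mb_s(path, work):
    """Return MB/sec for work(path), based on the on-disk file size."""
    size_mb = os.path.getsize(path) / 1e6
    start = time.perf_counter()
    work(path)
    return size_mb / (time.perf_counter() - start)

def load_only(path):
    pd.read_csv(path)

def load_and_sum(path):
    df = pd.read_csv(path)
    df.sum(numeric_only=True)  # the extra pass over the data lowers MB/sec
```

Comparing `throughput_mb_s(path, load_only)` against `throughput_mb_s(path, load_and_sum)` reproduces the kind of drop described above: the file size is fixed, so any extra compute after parsing lowers the effective rate.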

Dato and Spark show no appreciable improvement on a warm filesystem, while Wise ParaText realizes as much as 200 MB/sec of additional throughput on float and text files.

Dato SFrame read_csv, NumPy loadtxt, SparkCSV, and Wise ParaText have substantially higher throughput on CSV files containing only floats. ParaText's higher efficiency lets it exploit increases in I/O bandwidth (scaling from 2 to 8 disks) better than the other methods.

Some methods perform better on narrow data than on wide data. ParaText's throughput more than doubled with the additional bandwidth; no other method improved by more than 10%.

Benchmark Notes

  1. AWS does not publish the make and model of their SSDs. The I/O bandwidth shown for reference assumes AWS uses datacenter-grade SSD drives. As of June 2016, the lowest throughput among all such devices sold by Intel is 500 MB/sec (see the Intel DC S3700 SSD Data Sheet). The reference bandwidth will be updated if AWS discloses the actual model.
  2. R read.csv hung on the floats2, floats3, and floats4 CSV files for over 10 hours; the process was killed manually in each case.
  3. NumPy exhausted all of the physical memory on mnist8m, so we added 100 GB of swap to allow it to finish.
  4. SparkCSV crashed on messy.csv and messy2.csv because it does not support newlines embedded in quoted fields.
  5. Spark DataFrames crashed when summing the columns of car.csv; SparkCSV has difficulty with punctuation in column names. We reran Spark on a copy of the file with the punctuation removed from the header.
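Both messy-CSV failure modes in notes 4 and 5 — newlines inside quoted fields and punctuation in column names — can be illustrated with the Python standard library. This is a hypothetical sketch (the `sanitize_header` helper is ours, not part of any loader; the benchmark simply removed the punctuation by hand):

```python
import csv
import io
import re

# A record whose second field contains a quoted newline: valid CSV per
# RFC 4180, but the kind of input that crashed SparkCSV.
raw = 'id,comment\n1,"line one\nline two"\n'

# Python's csv module parses this as two rows: the header and one record.
rows = list(csv.reader(io.StringIO(raw)))
assert rows[1][1] == "line one\nline two"

def sanitize_header(names):
    """Replace punctuation in column names with underscores (hypothetical
    helper, one way to work around SparkCSV's header restriction)."""
    return [re.sub(r"[^0-9a-zA-Z_]", "_", n) for n in names]

print(sanitize_header(["price($)", "mpg.city"]))  # ['price___', 'mpg_city']
```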
