A. CSV Throughput
We now compare the CSV loaders in cold and warm states. The plots are drawn on a log scale to highlight the relative differences between the methods.
Wise ParaText achieves the highest load throughput on all data sets in both states.
Throughput drops when the columns are subsequently summed. For Spark the drop is large even when the DataFrame is cached, and it is more pronounced still on text-heavy and categorical-heavy CSV files.
Dato and Spark show no appreciable improvement on a warm filesystem, whereas Wise ParaText achieves as much as 200 MB/sec higher throughput on float and text files.
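The cold/warm distinction can be made concrete with a minimal timing sketch. This is an illustration, not the benchmark harness used here: the stdlib csv module stands in for the loaders under test, and the generated file is a placeholder for the real data sets.

```python
import csv
import os
import tempfile
import time

def load_throughput(path):
    """Parse a CSV file once and return throughput in MB/sec."""
    size_mb = os.path.getsize(path) / 1e6
    start = time.perf_counter()
    with open(path, newline="") as f:
        for _ in csv.reader(f):  # stand-in for any CSV loader
            pass
    return size_mb / (time.perf_counter() - start)

# Demo on a small generated file; real benchmarks use the data sets above.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    for i in range(1000):
        f.write(f"{i},{i * 0.5},{i * 2}\n")
    path = f.name

# For a true cold measurement, evict the OS page cache first (root only):
#   sync; echo 3 > /proc/sys/vm/drop_caches
cold = load_throughput(path)  # first read
warm = load_throughput(path)  # second read, served from the page cache
```

On a machine where the page cache was actually dropped beforehand, the first call measures disk-bound throughput and the second measures the warm state.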
Dato SFrame read_csv, NumPy loadtxt, SparkCSV, and Wise ParaText have substantially higher throughput on CSV files containing only floats. ParaText's higher efficiency enables it to exploit increases in I/O bandwidth (from 2 to 8 disks) better than the other methods: its throughput more than doubled with the additional bandwidth, while no other method improved by more than 10%. Some methods also perform better on narrow data than on wide data.
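To illustrate what "narrow" versus "wide" means here, a hypothetical generator can produce two files with the same number of cells but different shapes; the row and column counts below are arbitrary, not the benchmark's actual dimensions.

```python
import csv

def write_csv(path, rows, cols):
    """Write a rows-by-cols CSV of integers (hypothetical test-data helper)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for r in range(rows):
            writer.writerow([r * cols + c for c in range(cols)])

# Same total cell count, different shapes.
write_csv("narrow.csv", rows=10000, cols=10)  # narrow: many rows, few columns
write_csv("wide.csv", rows=10, cols=10000)    # wide: few rows, many columns
```

A parser's per-row and per-field overheads weigh differently on the two shapes, which is one reason throughput can diverge between narrow and wide data.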
Benchmark Notes
- AWS does not publish the make and model of their SSDs. The I/O bandwidth shown for reference purposes assumes AWS uses datacenter-grade SSD drives. As of June 2016, the lowest throughput among all such devices sold by Intel is 500 MB/sec. See the Intel DC S3700 SSD Data Sheet. The reference bandwidth shown will be updated if the actual model is provided by AWS.
- R read.csv hung on the floats2, floats3, and floats4 CSV files for over 10 hours. The process was manually killed in each case.
- NumPy exhausted all of the physical memory on mnist8m, so we added 100 GB of swap so it could finish.
- SparkCSV crashed on messy.csv and messy2.csv because it does not support quoted newlines.
- Spark DataFrames crashed when summing the columns of car.csv. SparkCSV has difficulty with punctuation in column names, so we reran Spark on a new file with the punctuation removed from the header.
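The header rewrite for the last case can be sketched as follows. The helper and the exact character class it keeps are our assumptions, not the paper's actual preprocessing code.

```python
import re

def sanitize_header(line):
    """Strip punctuation from CSV header fields, keeping only letters,
    digits, and underscores (hypothetical helper for the Spark rerun)."""
    fields = line.rstrip("\n").split(",")
    return ",".join(re.sub(r"[^0-9A-Za-z_]", "", field) for field in fields)

# Example: a header with punctuation of the kind that trips up SparkCSV.
print(sanitize_header("price($),mpg.city,top-speed"))  # price,mpgcity,topspeed
```

Only the first line of the file needs rewriting; the data rows are passed through unchanged.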