G. Baselines

Our goal is to make the most effective use of the available I/O bandwidth and parallelism. We defined two baseline tasks to establish upper bounds on the throughput of CSV loading:

  1. Newline counting: count the number of newline characters in the file in parallel, then sum each worker's count.
  2. Out-of-core CSV column summing: parse the CSV with type conversion but no error checking, summing each column without storing the data.

The first task has a small, constant memory footprint, so it can make full use of the cache; most realistic workloads should be slower than newline counting. The second task helps evaluate the overhead of loading CSV into memory versus computing out-of-core. Its memory footprint does not grow with the number of rows.

Overall Throughput

As expected, newline counting is the fastest in all cases, and out-of-core CSV parsing comes second. However, ParaText is not far behind.

All three tasks benefit from a warm filesystem cache. The warm throughput of newline counting is roughly four times its cold throughput, at ~20 GB/sec.

Scaling Curves

The scaling curves show how throughput changes with the number of worker threads and the available bandwidth. When a workload becomes I/O-bound, its scaling curve plateaus: at that point, additional threads give little benefit.
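A scaling curve of this kind can be produced by timing the same workload at increasing thread counts. The sketch below is a hedged, generic illustration (the `task` and `work_items` arguments are placeholders, not names from the benchmarks):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def scaling_curve(task, work_items, thread_counts):
    """Measure throughput (items/sec) at each thread count. The curve
    plateaus once the workload becomes I/O-bound and extra threads
    stop helping."""
    curve = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            # Consume the iterator so all work actually completes.
            list(pool.map(task, work_items))
        curve[n] = len(work_items) / (time.perf_counter() - start)
    return curve
```

Plotting `curve` against the thread counts makes the plateau visible: throughput rises roughly linearly, then flattens at the point where the workload is bound by I/O rather than CPU.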

Newline counting plateaus at 8 threads, while the other tasks plateau much later, at 28 threads.

For cold loads of Big MNIST, newline counting becomes I/O-bound at around 4 threads, out-of-core CSV parsing plateaus at 12 threads, and full CSV loading with ParaText does not plateau until around 20 threads.

The scaling curves no longer plateau when we run the same benchmarks on a machine with higher I/O bandwidth (an 8-disk RAID-0 array).

The out-of-core CSV parser does not use the auto-widening vector, so it is less CPU-intensive. This could explain its much higher cold-state throughput.
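To make the cost concrete, here is a toy illustration of what an auto-widening vector might look like. This is an assumption-laden sketch, not ParaText's actual data structure: values are stored in the narrowest array type seen so far and the whole buffer is promoted when a wider value arrives, which is per-value bookkeeping the out-of-core parser skips entirely.

```python
import array

# Hypothetical widening ladder: int8 -> signed long -> double.
_WIDENING = ["b", "l", "d"]


class AutoWideningVector:
    """Toy auto-widening column buffer (illustrative only)."""

    def __init__(self):
        self.data = array.array("b")  # start with the narrowest type

    def append(self, value):
        needed = self._typecode_for(value)
        if _WIDENING.index(needed) > _WIDENING.index(self.data.typecode):
            # Promote: copy every existing value into a wider array.
            self.data = array.array(needed, self.data)
        self.data.append(value)

    @staticmethod
    def _typecode_for(value):
        if isinstance(value, float):
            return "d"
        return "b" if -128 <= value <= 127 else "l"
```

Every append pays a type check, and a promotion copies the entire buffer, so on CPU-bound cold loads this overhead plausibly accounts for the gap the paragraph above describes.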
