B. ParaText vs Binary Readers

A comparison between CSV and binary readers is like comparing bicycles and motorcycles. Unlike CSV readers, binary data readers do not need to do any parsing or type inference. Error checking is minimal. The columns are often laid out on disk as they would be in memory. Moreover, a binary file is usually much more compact than a CSV file. Therefore, we expect a binary reader to finish before ParaText on many data sets, and a comparison of raw run times offers little insight.

It is useful to evaluate how well binary readers perform relative to the I/O bandwidth. For this reason, we compare the throughputs rather than the run times.
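To make the metric concrete, here is a minimal sketch of how a throughput figure could be computed for any reader. The helper names `throughput_mb_per_s` and `time_reader` are hypothetical, and dividing the on-disk file size by wall-clock time is an assumption about the definition, not a description of the benchmark harness used here.

```python
import os
import time


def throughput_mb_per_s(num_bytes, seconds):
    """Return throughput in MB/s (1 MB = 10**6 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / 1e6 / seconds


def time_reader(path, reader):
    """Call reader(path) once and return (result, throughput in MB/s).

    Assumption: throughput is measured against the size of the file
    on disk, so it can be compared directly with I/O bandwidth.
    """
    size = os.path.getsize(path)
    start = time.perf_counter()
    result = reader(path)
    elapsed = time.perf_counter() - start
    return result, throughput_mb_per_s(size, elapsed)
```

Any callable that takes a path works as `reader`, for example `numpy.load` on an NPY file. Because the result is normalized by file size, readers of formats with very different on-disk footprints can be judged against the same bandwidth ceiling.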

Since it is very common to use serialization to save data sets, we also benchmarked Pickle for completeness.

Feather has almost twice the cold throughput of ParaText on messy2, a file that is rich in text data. Text fields require minimal parsing, so ParaText's strategies for mapping strings to categories may be to blame for its lower throughput.
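For readers unfamiliar with the idea, mapping strings to categories generally means replacing each distinct string with a small integer code plus a lookup table of levels. The sketch below shows generic dictionary encoding in plain Python. It is an illustration of the technique only, not ParaText's implementation, and `encode_categories` is a hypothetical name.

```python
def encode_categories(values):
    """Dictionary-encode strings into integer codes.

    Returns (codes, levels), where levels[codes[i]] == values[i].
    Codes are assigned in order of first appearance.
    """
    level_to_code = {}
    codes = []
    for value in values:
        # A new string gets the next unused code; a repeated string
        # reuses the code it was given the first time.
        code = level_to_code.setdefault(value, len(level_to_code))
        codes.append(code)
    # Dicts preserve insertion order, so this matches code order.
    levels = list(level_to_code)
    return codes, levels
```

Every value requires a hash lookup, which is work a binary reader avoids when the encoded column is already stored on disk.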

ParaText has higher cold throughput than all other methods on 3 of the 4 float-heavy files. NPY reigns supreme on the remaining one, the smaller 1.2 GB floats file. NPY has two to three times higher warm throughput than cold throughput on float-heavy data sets. Feather also does better on a warm filesystem.

The throughput gains are not significant for warm versus cold reads on text and categorical data sets. Millions of Python strings must be constructed in a single thread to populate a string column, so these reads are likely limited by CPU work rather than by I/O. This could explain the drop in throughput.
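The cost of string construction can be pictured with a small sketch. It assumes a hypothetical layout in which all string bytes sit in one contiguous UTF-8 buffer with an offsets array marking the boundaries. This layout and the name `build_string_column` are illustrative assumptions, not the actual format of any reader benchmarked here.

```python
def build_string_column(buffer, offsets):
    """Materialize Python str objects from a contiguous UTF-8 buffer.

    offsets has n + 1 entries; string i occupies
    buffer[offsets[i]:offsets[i + 1]].
    """
    column = []
    for i in range(len(offsets) - 1):
        # Each element needs its own slice, decode, and object
        # allocation, performed one at a time in this thread.
        column.append(buffer[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return column
```

Even when the buffer itself is already in the page cache, this loop still runs once per value, which is consistent with warm reads gaining little on text-heavy files.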

Experiment Notes

  1. HDF5 and NPY are designed specifically for numeric files, so we did not benchmark them on text or categorical data.
