Data Sets

We evaluate the CSV readers in this benchmark on four different categories of data sets:

  • categorical-heavy: some numeric columns, but the string columns repeat their values heavily, with hundreds or thousands of unique strings spread over millions of rows
  • mixed: a mix of categorical and numeric data, including several text columns whose strings span multiple lines
  • integer-heavy: mostly integer data
  • float-heavy: mostly floating-point data

We used 11 CSV files in total. The first nine are available for download by creating an EBS volume from our public EBS snapshot snap-e96b3609. Appendix E explains how to generate medium1 and medium2; a generic sketch of producing a file of this shape appears after the table below. Appendix F explains in detail how the CSV files were converted to binary formats (NPY, Feather, HDF5, Pickle). The characteristics of these data sets are shown in the following table.

Data Set   Rows            Columns  Type               CSV Size  Machine
messy      1,000           1,000    mixed              21 MB     A
messy2     100,000         1,000    mixed              2.10 GB   A
car        13,051,349      67       categorical-heavy  6.71 GB   A
mnist      64,000          785      integer-heavy      55 MB     A
mnist8m    8,100,000       785      integer-heavy      14.96 GB  A
floats     40,000          2,000    float-heavy        1.21 GB   A
floats2    1,000,000       1,000    float-heavy        25.50 GB  A
floats3    10,000,000      100      float-heavy        25.50 GB  B
floats4    100,000,000     10      float-heavy        25.50 GB  B
medium1    7,000,000,000   5        float-heavy        1.015 TB  B
medium2    35,000,000,000  5        float-heavy        5.076 TB  B
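
As a rough illustration of what a float-heavy file of this shape looks like (this is not the Appendix E procedure; the random values, column names, and chunk size are assumptions), one can stream rows of five floating-point columns to disk:

```python
# Generic illustration (not the Appendix E generator): stream a large
# float-heavy CSV with five columns to disk without holding it in memory.
import numpy as np

def generate_float_csv(path, n_rows, n_cols=5, chunk_rows=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    with open(path, "w") as f:
        f.write(",".join(f"col{i}" for i in range(n_cols)) + "\n")
        remaining = n_rows
        while remaining > 0:
            rows = min(chunk_rows, remaining)
            # savetxt appends formatted rows to the already-open file handle
            np.savetxt(f, rng.random((rows, n_cols)), fmt="%.17g", delimiter=",")
            remaining -= rows

# medium1 and medium2 have 7 billion and 35 billion rows; a small test file:
generate_float_csv("floats_sample.csv", n_rows=1_000_000)
```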

We loaded each CSV file with ParaText and then saved the result to NPY, HDF5, Pickle, and Feather formats. This gives the binary readers a further advantage over ParaText: because ParaText chooses the narrowest type for numeric data, the resulting binary files are much more compact than their CSV counterparts.
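
The following is a minimal sketch of that conversion. It assumes the load_csv_to_pandas entry point shown in the ParaText README and uses standard pandas, NumPy, and pyarrow writers; it does not reproduce the exact settings from Appendix F.

```python
# Minimal conversion sketch (assumptions noted above): load a CSV with
# ParaText, then write it out in the binary formats used in the benchmark.
import numpy as np
import paratext
import pyarrow.feather as feather

def convert(csv_path, stem):
    # ParaText infers the narrowest numeric dtype for each column.
    df = paratext.load_csv_to_pandas(csv_path)

    df.to_pickle(stem + ".pkl")                    # Pickle
    df.to_hdf(stem + ".h5", key="data", mode="w")  # HDF5 (needs PyTables)
    feather.write_feather(df, stem + ".feather")   # Feather

    # NPY stores a single homogeneous array, so it is only written for the
    # all-numeric data sets (see the exclusions below).
    if all(np.issubdtype(dtype, np.number) for dtype in df.dtypes):
        np.save(stem + ".npy", df.to_numpy())

convert("floats.csv", "floats")
```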

A comparison of runtimes does not give much insight into how each reader exploits the storage system. First, the input file sizes are drastically different. Second, runtimes do not measure how effectively a method makes use of the I/O bandwidth.
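
One way to account for both issues, sketched below, is to report effective throughput (input bytes divided by wall-clock load time) alongside raw runtime. The helper below is illustrative only and is not the harness used for this benchmark.

```python
# Illustrative only (not the benchmark harness): normalize a reader's
# runtime by the size of its input file to get an effective throughput.
import os
import time

def effective_throughput(load_fn, path):
    """Return (elapsed seconds, MB/s) for loading `path` with `load_fn`."""
    size_mb = os.path.getsize(path) / 1e6
    start = time.perf_counter()
    load_fn(path)                      # e.g. pandas.read_csv, numpy.load, ...
    elapsed = time.perf_counter() - start
    return elapsed, size_mb / elapsed

# Hypothetical usage:
# import pandas as pd
# secs, mbps = effective_throughput(pd.read_csv, "floats.csv")
```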

Exclusions

It was not possible to run every method on every data set. We list the cases below.

Method         Data Sets           Reason
numpy.loadtxt  car, messy, messy2  It is designed primarily for numeric data.
NPY            car, messy, messy2  Same as numpy.loadtxt.
HDF5           car, messy, messy2  h5py is difficult to use for mixed column types.
SparkCSV       messy, messy2       SparkCSV cannot handle quoted newlines.

Only Spark and Wise ParaText successfully completed on medium1.csv (~1 TB). We ran only Wise ParaText on medium2.csv (~5 TB).
