Data Sets
We evaluate the CSV readers in this benchmark on four different categories of data sets:
- categorical-heavy: contains some numeric columns, but most columns are strings with many repeated values. These data sets have string columns with hundreds or thousands of unique values across millions of rows.
- mixed: a mixture of categorical and numeric data, plus several text columns whose values span multiple lines
- integer-heavy: mostly integer data
- floating-point-heavy (float-heavy in the table below): mostly floating-point data
We used 11 CSV files in total. The first 9 files are available for download by creating an EBS volume from our public EBS snapshot snap-e96b3609. Appendix E explains how to generate medium1 and medium2. The CSV files were converted to binary files (NPY, Feather, HDF5, Pickle) as Appendix F explains in detail. The characteristics of these data sets are shown in the following table.
Data Set | Rows | Columns | Type | CSV Size | Machine |
---|---|---|---|---|---|
messy | 1,000 | 1,000 | mixed | 21 MB | A |
messy2 | 100,000 | 1,000 | mixed | 2.10 GB | A |
car | 13,051,349 | 67 | categorical-heavy | 6.71 GB | A |
mnist | 64,000 | 785 | integer-heavy | 55 MB | A |
mnist8m | 8,100,000 | 785 | integer-heavy | 14.96 GB | A |
floats | 40,000 | 2,000 | float-heavy | 1.21 GB | A |
floats2 | 1,000,000 | 1,000 | float-heavy | 25.50 GB | A |
floats3 | 10,000,000 | 100 | float-heavy | 25.50 GB | B |
floats4 | 100,000,000 | 10 | float-heavy | 25.50 GB | B |
medium1 | 7,000,000,000 | 5 | float-heavy | 1.015 TB | B |
medium2 | 35,000,000,000 | 5 | float-heavy | 5.076 TB | B |
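Appendix E gives the actual procedure used to generate medium1 and medium2. Purely as an illustration of what producing a float-heavy CSV of this shape involves, a chunked writer along the following lines could be used; the function and its parameters are hypothetical and are not the script from Appendix E.

```python
import numpy as np

def generate_float_csv(path, n_rows, n_cols=5, chunk_rows=1000000, seed=0):
    """Illustrative sketch: write a float-heavy CSV of a given shape in chunks,
    so files far larger than memory (e.g. billions of rows) can be produced."""
    rng = np.random.RandomState(seed)
    with open(path, "w") as f:
        # Header row, then the data in fixed-size chunks.
        f.write(",".join("col%d" % i for i in range(n_cols)) + "\n")
        written = 0
        while written < n_rows:
            rows = min(chunk_rows, n_rows - written)
            chunk = rng.uniform(size=(rows, n_cols))
            f.writelines(",".join("%.6f" % v for v in row) + "\n" for row in chunk)
            written += rows
```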
We loaded each CSV file with ParaText and then saved the result to the NPY, HDF5, Pickle, and Feather formats. This puts the binary readers at a further advantage over ParaText: because ParaText chooses the narrowest type for each numeric column, the binary files are much more compact than their CSV counterparts.
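The conversion scripts themselves are described in Appendix F; the following is only a minimal sketch of the pattern, assuming ParaText's load_csv_to_pandas entry point and the feather-format package (both are assumptions about the exact APIs used).

```python
import numpy as np
import paratext   # assumed entry point: load_csv_to_pandas
import feather    # feather-format package

def convert_csv_to_binary(csv_path, out_prefix):
    # Load with ParaText, which infers the narrowest type for each numeric column.
    df = paratext.load_csv_to_pandas(csv_path)

    # NPY: only for purely numeric data sets (see the exclusions below).
    np.save(out_prefix + ".npy", df.values)

    # HDF5 via pandas/PyTables.
    df.to_hdf(out_prefix + ".h5", key="data", mode="w")

    # Pickle.
    df.to_pickle(out_prefix + ".pkl")

    # Feather.
    feather.write_dataframe(df, out_prefix + ".feather")
```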
A comparison of runtimes does not give much insight into how each reader exploits the storage system. First, the input file sizes are drastically different. Second, runtimes do not measure how effectively a method makes use of the I/O bandwidth.
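A more comparable figure is effective throughput, i.e. input bytes consumed per second of wall-clock time. A minimal sketch of that normalization follows; the helper name is ours, not part of the benchmark harness.

```python
import os

def effective_throughput_mb_s(csv_path, load_seconds):
    """Bytes of input consumed per second of wall-clock load time, in MB/s."""
    return os.path.getsize(csv_path) / 1e6 / load_seconds

# Example: loading the 6.71 GB 'car' data set in 30 s corresponds to ~224 MB/s.
```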
Exclusions
It was not possible to run every method on every data set. We list the cases below.
Method | Data Sets | Reason |
---|---|---|
numpy.loadtxt | car, messy, messy2 | It is designed primarily for numeric data. |
NPY | car, messy, messy2 | Same. |
HDF5 | car, messy, messy2 | H5PY is difficult to use for mixed column types. |
SparkCSV | messy, messy2 | SparkCSV cannot handle quoted newlines. |
Only Spark and Wise ParaText successfully completed on medium1.csv (~1 TB). We ran only Wise ParaText on medium2.csv (~5 TB).