Data Sets
We evaluate the CSV readers in this benchmark on four different categories of data sets:
- categorical-heavy: contains some numeric columns, but most columns are strings with many repeated values. These data sets have string columns with hundreds or thousands of unique values across millions of rows.
- mixed: a mixture of categorical and numeric data, plus several text columns whose values span multiple lines
- integer-heavy: mostly integer data
- floating-point-heavy (float-heavy in the table below): mostly floating-point data
We used 11 CSV files in total. The first 9 files are available for download by creating an EBS volume from our public EBS snapshot snap-e96b3609. Appendix E explains how to generate medium1 and medium2. The CSV files were converted to binary files (NPY, Feather, HDF5, Pickle) as Appendix F explains in detail. The characteristics of these data sets are shown in the following table.
Data Set | Rows | Columns | Type | CSV Size | Machine |
---|---|---|---|---|---|
messy | 1,000 | 1,000 | mixed | 21 MB | A |
messy2 | 100,000 | 1,000 | mixed | 2.10 GB | A |
car | 13,051,349 | 67 | categorical-heavy | 6.71 GB | A |
mnist | 64,000 | 785 | integer-heavy | 55 MB | A |
mnist8m | 8,100,000 | 785 | integer-heavy | 14.96 GB | A |
floats | 40,000 | 2,000 | float-heavy | 1.21 GB | A |
floats2 | 1,000,000 | 1,000 | float-heavy | 25.50 GB | A |
floats3 | 10,000,000 | 100 | float-heavy | 25.50 GB | B |
floats4 | 100,000,000 | 10 | float-heavy | 25.50 GB | B |
medium1 | 7,000,000,000 | 5 | float-heavy | 1.015 TB | B |
medium2 | 35,000,000,000 | 5 | float-heavy | 5.076 TB | B |
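Appendix E gives the actual procedure used to generate medium1 and medium2. Purely as an illustration of what producing a float-heavy CSV of this shape involves, a chunked writer along the following lines could be used; the function and its parameters are hypothetical and are not the script from Appendix E.

```python
import numpy as np

def generate_float_csv(path, n_rows, n_cols=5, chunk_rows=1000000, seed=0):
    """Illustrative sketch: write a float-heavy CSV of a given shape in chunks,
    so files far larger than memory (e.g. billions of rows) can be produced."""
    rng = np.random.RandomState(seed)
    with open(path, "w") as f:
        # Header row, then the data in fixed-size chunks.
        f.write(",".join("col%d" % i for i in range(n_cols)) + "\n")
        written = 0
        while written < n_rows:
            rows = min(chunk_rows, n_rows - written)
            chunk = rng.uniform(size=(rows, n_cols))
            f.writelines(",".join("%.6f" % v for v in row) + "\n" for row in chunk)
            written += rows
```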
We loaded each CSV file with ParaText and then saved the result to the NPY, HDF5, Pickle, and Feather formats. This puts the binary readers at a further advantage over ParaText: because ParaText chooses the narrowest type for each numeric column, the binary files are much more compact than their CSV counterparts.
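The conversion scripts themselves are described in Appendix F; the following is only a minimal sketch of the pattern, assuming ParaText's load_csv_to_pandas entry point and the feather-format package (both are assumptions about the exact APIs used).

```python
import numpy as np
import paratext   # assumed entry point: load_csv_to_pandas
import feather    # feather-format package

def convert_csv_to_binary(csv_path, out_prefix):
    # Load with ParaText, which infers the narrowest type for each numeric column.
    df = paratext.load_csv_to_pandas(csv_path)

    # NPY: only for purely numeric data sets (see the exclusions below).
    np.save(out_prefix + ".npy", df.values)

    # HDF5 via pandas/PyTables.
    df.to_hdf(out_prefix + ".h5", key="data", mode="w")

    # Pickle.
    df.to_pickle(out_prefix + ".pkl")

    # Feather.
    feather.write_dataframe(df, out_prefix + ".feather")
```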
A comparison of runtimes does not give much insight into how each reader exploits the storage system. First, the input file sizes are drastically different. Second, runtimes do not measure how effectively a method makes use of the I/O bandwidth.
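A more comparable figure is effective throughput, i.e. input bytes consumed per second of wall-clock time. A minimal sketch of that normalization follows; the helper name is ours, not part of the benchmark harness.

```python
import os

def effective_throughput_mb_s(csv_path, load_seconds):
    """Bytes of input consumed per second of wall-clock load time, in MB/s."""
    return os.path.getsize(csv_path) / 1e6 / load_seconds

# Example: loading the 6.71 GB 'car' data set in 30 s corresponds to ~224 MB/s.
```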
Exclusions
It was not possible to run every method on every data set. We list the cases below.
Method | Data Sets | Reason |
---|---|---|
numpy.loadtxt | car, messy, messy2 | It is designed primarily for numeric data. |
NPY | car, messy, messy2 | Same. |
HDF5 | car, messy, messy2 | H5PY is difficult to use for mixed column types. |
SparkCSV | messy, messy2 | SparkCSV cannot handle quoted newlines. |
Only Spark and Wise ParaText successfully completed on medium1.csv (~1 TB). We ran only Wise ParaText on medium2.csv (~5 TB).