TL;DR

ParaText is fast

ParaText has a higher throughput than any of the other CSV readers tested, on every dataset tried.

Note: most figures are plotted on a log scale. A missing bar means that the reader crashed, raised an error, or was incompatible with the dataset.

ParaText is memory-efficient

Wise ParaText has the lowest overall memory footprint after loading a CSV file into Python.

ParaText vs. binary: still fast!

Binary files can often be ten times smaller than their CSV counterparts, and binary readers need to do little if any parsing, type inference, or error checking. Even so, ParaText often makes much more effective use of the available I/O bandwidth than the binary readers we tested.

ParaText interacts well with Python

Wise ParaText is orders of magnitude faster when converting from an internal data frame (Dato SFrame, a Spark Data Frame, or Wise) to a Python data structure. ParaText can convert multi-gigabyte data sets to Python in seconds while Spark and Dato take minutes.

ParaText is "Medium Data"

ParaText can handle multi-TB files with ease on a properly configured machine.

medium1.csv (1.015 TB)

| Method | Load Time | Sum Time | Transfer Time | Amazon Cost |
|---|---|---|---|---|
| Databricks SparkCSV | 11 hours, 36 minutes, 59 seconds | 11 hours, 16 minutes, 42 seconds | N/A | $156.14 |
| Wise ParaText | 7 minutes, 10 seconds | 43.7 seconds | 5 minutes, 52 seconds | $1.57 |
| Wise ParaText Out-of-core | 5 minutes, 35 seconds | included | N/A | $0.64 |

medium2.csv (5.076 TB)

| Method | Total Time to Read, Parse, and Sum | Peak Memory Footprint | Amazon Cost |
|---|---|---|---|
| Wise ParaText Out-of-core | 26 minutes, 19 seconds | 140 MB | $2.99 |
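
The timings in the two tables above can be turned into rough throughput figures by dividing file size by elapsed time. The sketch below does only that arithmetic. It assumes that "TB" means 10**12 bytes, which the tables do not state, and the function names are illustrative. For the out-of-core rows the elapsed time also covers the sum, so those figures describe read, parse, and sum together.

```python
# Sizes and times are copied from the tables above.
# Assumption: 1 TB = 10**12 bytes (the source does not say which unit it uses).
TB = 10**12


def seconds(hours=0, minutes=0, secs=0.0):
    """Convert an hours/minutes/seconds triple to seconds."""
    return hours * 3600 + minutes * 60 + secs


def throughput_gb_per_s(size_bytes, elapsed_s):
    """Average throughput in decimal gigabytes per second."""
    return size_bytes / elapsed_s / 1e9


# (file size in bytes, elapsed seconds) for each measured run.
runs = {
    "SparkCSV load, medium1.csv": (1.015 * TB, seconds(11, 36, 59)),
    "ParaText load, medium1.csv": (1.015 * TB, seconds(0, 7, 10)),
    "ParaText out-of-core, medium1.csv": (1.015 * TB, seconds(0, 5, 35)),
    "ParaText out-of-core, medium2.csv": (5.076 * TB, seconds(0, 26, 19)),
}

for name, (size_bytes, elapsed_s) in runs.items():
    print(f"{name}: {throughput_gb_per_s(size_bytes, elapsed_s):.3f} GB/s")
```

If the files were measured in binary terabytes (2**40 bytes) instead, every figure would be about 10% higher.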

ParaText is cheaper

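The prorated Amazon Web Services (AWS) costs for Wise ParaText are much lower than those of the other methods tested. On medium1.csv, ParaText cost $1.57 ($0.64 out-of-core), compared with $156.14 for Databricks SparkCSV.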
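
As a rough illustration of what "prorated" means here, a cost can be modeled as an hourly instance price multiplied by the hours a job runs. The sketch below works backwards from the tables above to the hourly price each row implies. It assumes that the billed time is the sum of the phases listed for each method, which the source does not confirm, and the function names are hypothetical.

```python
def prorated_cost(hourly_rate_usd, elapsed_seconds):
    """Cost of running an instance billed at hourly_rate_usd for elapsed_seconds."""
    return hourly_rate_usd * elapsed_seconds / 3600.0


def implied_hourly_rate(cost_usd, elapsed_seconds):
    """Hourly price implied by a reported cost and a running time."""
    return cost_usd * 3600.0 / elapsed_seconds


# (reported cost in USD, total seconds), taken from the tables above.
# Assumption: billed time = sum of the listed phases for that row.
rows = {
    "SparkCSV, medium1.csv": (156.14, 41819 + 40602),          # load + sum
    "ParaText, medium1.csv": (1.57, 430 + 43.7 + 352),         # load + sum + transfer
    "ParaText out-of-core, medium1.csv": (0.64, 335),
    "ParaText out-of-core, medium2.csv": (2.99, 1579),
}

for name, (cost_usd, elapsed_s) in rows.items():
    rate = implied_hourly_rate(cost_usd, elapsed_s)
    print(f"{name}: about ${rate:.2f} per hour")
```

Under that assumption the rows imply similar hourly prices, which suggests the cost gap comes from running time rather than from a different price per hour.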

ParaText approaches the limits of the hardware

ParaText's out-of-core and line-counting modes achieve throughput close to the estimated I/O bandwidth.
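
To make the idea of a line-counting throughput measurement concrete, here is a minimal single-threaded sketch in plain Python. It is not ParaText's implementation. It only shows the general technique of scanning a file in fixed-size binary chunks, counting newline bytes, and dividing the file size by the elapsed time. The function names and the 1 MiB chunk size are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def measure_throughput(path):
    """Return (line count, throughput in MB/s) for one pass over the file."""
    size_bytes = os.path.getsize(path)
    start = time.perf_counter()
    lines = count_lines(path)
    elapsed = time.perf_counter() - start
    mb_per_s = size_bytes / elapsed / 1e6 if elapsed > 0 else float("inf")
    return lines, mb_per_s


# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n" + "1,2\n" * 100_000)
    sample_path = tmp.name
try:
    lines, rate = measure_throughput(sample_path)
    print(f"{lines} lines at {rate:.0f} MB/s")
finally:
    os.remove(sample_path)
```

A file this small is served from the operating system's page cache, so the printed rate says little about disk bandwidth. A meaningful comparison against I/O bandwidth needs a file much larger than memory, or a cold cache.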

ParaText is open source

ParaText is available for download on GitHub under the Apache License.
