TL;DR

ParaText is fast

ParaText has a higher throughput than any of the other CSV readers tested, on every dataset tried.

Note: most figures are plotted on a log scale. A missing bar means that the reader crashed, raised an error, or was incompatible with the dataset.

ParaText is memory-efficient

Wise ParaText has the lowest overall memory footprint after loading a CSV file into Python.

ParaText vs. binary: still fast!

Binary files can often be ten times smaller than their CSV counterparts, and binary readers need to do little if any parsing, type inference, or error checking. Even so, ParaText often makes much more effective use of the available I/O bandwidth than the binary readers we tested.

ParaText interacts well with Python

Wise ParaText is orders of magnitude faster when converting from an internal data frame (Dato SFrame, a Spark Data Frame, or Wise) to a Python data structure. ParaText can convert multi-gigabyte data sets to Python in seconds while Spark and Dato take minutes.

ParaText is "Medium Data"

ParaText can handle multi-TB files with ease on a properly configured machine.

medium1.csv (1.015 TB)

Method	Load Time	Sum Time	Transfer Time	Amazon Cost
Databricks SparkCSV	11 hours, 36 minutes, 59 seconds	11 hours, 16 minutes, 42 seconds	N/A	$156.14
Wise ParaText	7 minutes, 10 seconds	43.7 seconds	5 minutes, 52 seconds	$1.57
Wise ParaText Out-of-core	5 minutes, 35 seconds	included	N/A	$0.64

medium2.csv (5.076 TB)

Method	Total Time to Read, Parse, and Sum	Peak Memory Footprint	Amazon Cost
Wise ParaText Out-of-core	26 minutes, 19 seconds	140 MB	$2.99

ParaText is cheaper

The prorated Amazon Web Services (AWS) costs for Wise ParaText are much lower than those of the other methods tested.
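A prorated cost is elapsed time multiplied by an hourly instance price. This summary does not state the hourly price, so the sketch below only back-solves the rate implied by each row of the medium1.csv and medium2.csv tables above. It assumes that the reported cost covers the sum of all timed phases in a row (load, sum, and transfer where given); the helper names are hypothetical and used only for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def implied_hourly_rate(elapsed_seconds, cost_usd):
    """Hourly price (USD) that would yield cost_usd when prorated over elapsed_seconds."""
    return cost_usd / (elapsed_seconds / 3600.0)


# (billed seconds, reported cost in USD), copied from the tables above.
# Assumption: every timed phase in a row is billed; "N/A" and "included" add nothing.
runs = {
    "Databricks SparkCSV (medium1)": (
        to_seconds(11, 36, 59) + to_seconds(11, 16, 42), 156.14),
    "Wise ParaText (medium1)": (
        to_seconds(0, 7, 10) + 43.7 + to_seconds(0, 5, 52), 1.57),
    "Wise ParaText Out-of-core (medium1)": (to_seconds(0, 5, 35), 0.64),
    "Wise ParaText Out-of-core (medium2)": (to_seconds(0, 26, 19), 2.99),
}

rates = {name: implied_hourly_rate(t, cost) for name, (t, cost) in runs.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} USD/hour")
```

If the assumption holds, the implied rates should come out close to one another, which would mean the cost gap between methods comes from elapsed time rather than from a different instance price.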

ParaText approaches the limits of the hardware

ParaText out-of-core and line counting achieve a throughput close to the estimated I/O bandwidth.
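To make the throughput claim concrete, the effective rate for each run can be recomputed from the file sizes and times in the tables above. This is a minimal sketch that assumes decimal units (1 TB = 1000 GB); the estimated I/O bandwidth of the test machine is not given in this summary, so the sketch reports only the achieved figures. The function name is hypothetical.

```python
def throughput_gb_per_s(size_tb, elapsed_seconds):
    """Effective throughput in GB/s, assuming 1 TB = 1000 GB (decimal units)."""
    return size_tb * 1000.0 / elapsed_seconds


# (file size in TB, elapsed seconds), copied from the tables above.
runs = {
    "SparkCSV load, medium1.csv": (1.015, 11 * 3600 + 36 * 60 + 59),
    "ParaText load, medium1.csv": (1.015, 7 * 60 + 10),
    "ParaText out-of-core, medium1.csv": (1.015, 5 * 60 + 35),
    "ParaText out-of-core, medium2.csv": (5.076, 26 * 60 + 19),
}

for name, (size_tb, seconds) in runs.items():
    print(f"{name}: {throughput_gb_per_s(size_tb, seconds):.3f} GB/s")
```

Under the decimal-unit assumption, both out-of-core runs work out to a little over 3 GB/s, while the SparkCSV load is a few tens of MB/s.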

ParaText is open source

ParaText is available for download on GitHub under the Apache License.
