E. Medium Data (1 TB+) on a Single Machine
We now evaluate each method on files of 1 TB or more. There are much better ways to store terascale data, and we certainly do not recommend CSV files at this scale. However, we wanted to see whether ParaText and the other tools could still cut the mustard.
medium1: 1.015 TB CSV File
CSV files are often much larger than available memory. We therefore tried to load a ~1 TB CSV file. It seemed reasonable to expect that at least a few of the methods would succeed in loading the data.
Method | Load Time | Sum Time | Transfer Time | Amazon Cost |
---|---|---|---|---|
Databricks SparkCSV | 11 hours, 36 minutes, 59 seconds | 11 hours, 16 minutes, 42 seconds | N/A | $156.14 |
Wise ParaText | 7 minutes, 10 seconds | 43.7 seconds | 5 minutes, 52 seconds | $1.57 |
Wise ParaText Out-of-core | 5 minutes, 35 seconds | included in load time | N/A | $0.64 |
Only SparkCSV and ParaText loaded this file to completion. However, Spark ran out of memory when converting its DataFrame to Pandas, so no transfer time is reported for it. The Amazon Cost column gives the prorated AWS cost of each run.
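For reference, the ParaText numbers above correspond to a workflow along these lines: load the CSV into a Pandas DataFrame, then sum each numeric column. The sketch below uses ParaText's `load_csv_to_pandas` entry point and a placeholder file name; it is an illustration of the workflow, not the exact benchmark harness.

```python
import paratext  # Python bindings to the ParaText C++ reader

# Load the CSV with ParaText's multithreaded reader into a Pandas DataFrame.
# "medium1.csv" is a placeholder for the ~1 TB benchmark file.
df = paratext.load_csv_to_pandas("medium1.csv")

# Sum every numeric column; this corresponds to the "Sum Time" column above.
column_sums = df.sum(numeric_only=True)
print(column_sums)
```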
medium2: 5.076 TB CSV File
The ~5 TB CSV file was far too large to fit into memory. ParaText in out-of-core mode can parse, type-check, and sum its columns in under half an hour.
Method | Total Time to Read, Parse, and Sum | Peak Memory Footprint | Amazon Cost |
---|---|---|---|
Wise ParaText Out-of-core | 26 minutes, 19 seconds | 140 MB | $2.99 |
Note: The memory footprint is not a typo. Each worker thread stores just the sum for each column and the number of lines parsed. Most of the memory consumed is the bare minimum overhead needed to launch Python.
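To make the constant-memory behavior concrete, here is a minimal single-threaded analogue in plain Python/Pandas: it streams the file in chunks and keeps only per-column running sums and a line count, never materializing the full table. This is a conceptual sketch of the out-of-core strategy, not ParaText's multithreaded C++ implementation, and the file name is a placeholder.

```python
import pandas as pd

def out_of_core_column_sums(path, chunksize=1_000_000):
    """Stream a CSV and accumulate per-column sums with O(columns) memory."""
    sums = None
    lines = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        lines += len(chunk)
        partial = chunk.sum(numeric_only=True)
        # Keep only the running totals; each chunk is discarded after this iteration.
        sums = partial if sums is None else sums.add(partial, fill_value=0)
    return sums, lines

sums, lines = out_of_core_column_sums("medium2.csv")
print(lines, sums)
```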