E. Medium Data (1 TB+) on a Single Machine
We now evaluate each method on files of 1 TB or more. There are much better ways to store terascale data, and we certainly do not recommend CSV files at this scale. However, we wanted to see whether ParaText and the other tools could still cut the mustard.
medium1: 1.015 TB CSV File
CSV files are often much larger than available memory. We therefore tried to load a ~1 TB CSV file. It seemed reasonable to expect that at least a few of the methods would succeed in loading the data.
Method | Load Time | Sum Time | Transfer Time | Amazon Cost |
---|---|---|---|---|
Databricks SparkCSV | 11 hours, 36 minutes, 59 seconds | 11 hours, 16 minutes, 42 seconds | N/A | $156.14 |
Wise ParaText | 7 minutes, 10 seconds | 43.7 seconds | 5 minutes, 52 seconds | $1.57 |
Wise ParaText Out-of-core | 5 minutes, 35 seconds | included in load time | N/A | $0.64 |
Only SparkCSV and ParaText loaded this file to completion. However, Spark ran out of memory when converting its DataFrame to Pandas, so no transfer time is reported for it. The Amazon Cost column gives the prorated AWS cost of each run.
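For reference, the ParaText numbers above correspond to a workflow along these lines: load the CSV into a Pandas DataFrame, then sum each numeric column. The sketch below uses ParaText's `load_csv_to_pandas` entry point and a placeholder file name; it is an illustration of the workflow, not the exact benchmark harness.

```python
import paratext  # Python bindings to the ParaText C++ reader

# Load the CSV with ParaText's multithreaded reader into a Pandas DataFrame.
# "medium1.csv" is a placeholder for the ~1 TB benchmark file.
df = paratext.load_csv_to_pandas("medium1.csv")

# Sum every numeric column; this corresponds to the "Sum Time" column above.
column_sums = df.sum(numeric_only=True)
print(column_sums)
```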
medium2: 5.076 TB CSV File
The ~5 TB CSV file was far too large to fit into memory. ParaText in out-of-core mode can parse, type-check, and sum its columns in under half an hour.
Method | Total Time to Read, Parse, and Sum | Peak Memory Footprint | Amazon Cost |
---|---|---|---|
Wise ParaText Out-of-core | 26 minutes, 19 seconds | 140 MB | $2.99 |
Note: The memory footprint is not a typo. Each worker thread stores just the sum for each column and the number of lines parsed. Most of the memory consumed is the bare minimum overhead needed to launch Python.
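To make the constant-memory behavior concrete, here is a minimal single-threaded analogue in plain Python/Pandas: it streams the file in chunks and keeps only per-column running sums and a line count, never materializing the full table. This is a conceptual sketch of the out-of-core strategy, not ParaText's multithreaded C++ implementation, and the file name is a placeholder.

```python
import pandas as pd

def out_of_core_column_sums(path, chunksize=1_000_000):
    """Stream a CSV and accumulate per-column sums with O(columns) memory."""
    sums = None
    lines = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        lines += len(chunk)
        partial = chunk.sum(numeric_only=True)
        # Keep only the running totals; each chunk is discarded after this iteration.
        sums = partial if sums is None else sums.add(partial, fill_value=0)
    return sums, lines

sums, lines = out_of_core_column_sums("medium2.csv")
print(lines, sums)
```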