E. Medium Data (1 TB+) on a Single Machine

We now evaluate each method on files of 1 TB or more. There are much better ways to store terascale data, and we certainly do not recommend CSV files at this scale, but we wanted to see whether ParaText or any of the other tools could still cut the mustard.

medium1: 1.015 TB CSV File

CSV files are often much larger than a machine's memory. We therefore tried to load a ~1 TB CSV file, expecting that at least a few of the methods might succeed in loading it.

| Method | Load Time | Sum Time | Transfer Time | Amazon Cost |
|---|---|---|---|---|
| Databricks SparkCSV | 11 hours, 36 minutes, 59 seconds | 11 hours, 16 minutes, 42 seconds | N/A | $156.14 |
| Wise ParaText | 7 minutes, 10 seconds | 43.7 seconds | 5 minutes, 52 seconds | $1.57 |
| Wise ParaText Out-of-core | 5 minutes, 35 seconds | included | N/A | $0.64 |

Only SparkCSV and ParaText loaded this file to completion; Spark, however, ran out of memory when converting its DataFrame to Pandas. We also report the prorated AWS cost of each run.
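For context, the in-memory ParaText runs boil down to a load step followed by a column sum. Below is a minimal sketch, assuming the `load_csv_to_pandas` entry point from the ParaText Python bindings; the file name is a placeholder and exact keyword arguments may differ.

```python
# Minimal load-then-sum sketch (assumes paratext.load_csv_to_pandas;
# "medium1.csv" is a placeholder for the ~1 TB benchmark file).
import paratext

# Load: ParaText parses the CSV in parallel and returns a Pandas DataFrame.
df = paratext.load_csv_to_pandas("medium1.csv")

# Sum: total each numeric column once the load has finished.
column_sums = df.sum(numeric_only=True)
print(column_sums)
```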

medium2: 5.076 TB CSV File

The ~5 TB CSV file was too big to fit into memory. ParaText in out-of-core mode parsed, type-checked, and summed the columns in under half an hour.

| Method | Total Time to Read, Parse, and Sum | Peak Memory Footprint | Amazon Cost |
|---|---|---|---|
| Wise ParaText Out-of-core | 26 minutes, 19 seconds | 140 MB | $2.99 |

Note: The memory footprint is not a typo. Each worker thread stores just the sum for each column and the number of lines parsed. Most of the memory consumed is the bare minimum overhead needed to launch Python.
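ParaText's out-of-core reader is implemented in C++, but the reason the footprint stays small can be illustrated with a plain chunked loop in pandas. This is a generic sketch of the streaming-sum idea, not ParaText's own API; the file name and chunk size are placeholders.

```python
# Generic out-of-core column sum: only the running per-column sums and a
# row counter persist between chunks, so memory stays roughly constant.
import pandas as pd

sums = None
rows = 0
for chunk in pd.read_csv("medium2.csv", chunksize=1_000_000):
    chunk_sums = chunk.sum(numeric_only=True)
    sums = chunk_sums if sums is None else sums.add(chunk_sums, fill_value=0)
    rows += len(chunk)

print(rows)
print(sums)
```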
