F. Costs

We calculated the prorated Amazon EC2 costs (in USD, ec2-east-1) of the load & sum workload on all data sets. The cost of each method as a multiple of ParaText is shown.
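As a rough illustration of what "prorated" means here, the sketch below scales an hourly instance price by the fraction of an hour a run takes. The function name, the hourly rate, and the runtime are hypothetical placeholders chosen for illustration. They are not values from the benchmark, and the authors' exact accounting may differ.

```python
def prorated_cost(hourly_rate_usd: float, runtime_seconds: float) -> float:
    """Cost of occupying one instance for runtime_seconds at an hourly rate."""
    return hourly_rate_usd * runtime_seconds / 3600.0


# Hypothetical inputs for illustration only (not taken from the benchmark).
HOURLY_RATE_USD = 2.00
RUNTIME_SECONDS = 450.0

cost = prorated_cost(HOURLY_RATE_USD, RUNTIME_SECONDS)
print(f"prorated cost: ${cost:.3f}")
```

Under this model, a reader that finishes twice as fast on the same instance type costs half as much, which is why throughput translates directly into cost.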

While this simple ETL task is unlikely to be a dominant cost driver in any serious production workflow, the cost implications of inefficient readers are stark. Ultimately, as we have argued, labor, software, and hardware costs must be optimized together when constructing and running such workflows.

The prorated Amazon cost for loading and summing each CSV file is shown. The work per dollar for ParaText CSV is greater in a cold state than in a warm state. This is explained by the fact that ParaText achieved 2.36 GB/sec on the 1.05 TB medium1 in a cold state, but that experiment could not be conducted in a warm state (for obvious reasons).

Cold State

| Method | Total cost | Total processed | Work/USD |
|---|---|---|---|
| NumPy loadtxt | $52.417 | 303.9 GB | 5.8 GB/$ |
| Databricks SparkCSV | $191.355 | 1.315 TB | 6.9 GB/$ |
| R read.csv | $10.040 | 55.2 GB | 5.5 GB/$ |
| R readr | $8.155 | 303.8 GB | 37.3 GB/$ |
| R fread | $5.297 | 303.8 GB | 57.4 GB/$ |
| Pandas read_csv | $6.424 | 303.8 GB | 47.3 GB/$ |
| Dato SFrame | $3.667 | 303.8 GB | 82.8 GB/$ |
| Wise ParaText CSV | $1.352 | 1.387 TB | 1.025 TB/$ |
| Wise ParaText CSV out-of-core | $4.172 | 6.899 TB | 1.653 TB/$ |

Warm State

| Method | Total cost | Total processed | Work/USD |
|---|---|---|---|
| NumPy loadtxt | $56.892 | 354.9 GB | 6.2 GB/$ |
| R read.csv | $8.674 | 70.7 GB | 8.1 GB/$ |
| R readr | $7.968 | 303.8 GB | 38.1 GB/$ |
| R fread | $5.096 | 303.8 GB | 59.6 GB/$ |
| Databricks SparkCSV | $35.795 | 299.6 GB | 8.4 GB/$ |
| Pandas read_csv | $6.348 | 303.8 GB | 47.9 GB/$ |
| Dato SFrame | $4.939 | 363.6 GB | 73.6 GB/$ |
| Wise ParaText CSV | $0.406 | 371.8 GB | 915.8 GB/$ |
| Wise ParaText CSV out-of-core | $0.247 | 807.6 GB | 3.268 TB/$ |
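
The Work/USD column appears to be the total data processed divided by the total cost. The sketch below recomputes it for a few warm-state rows, using figures copied from the table above. The helper name `work_per_usd` is ours, and the division is an assumption about how the column was derived, although it matches the published values to the precision shown.

```python
def work_per_usd(total_processed_gb: float, total_cost_usd: float) -> float:
    """Gigabytes processed per dollar spent."""
    return total_processed_gb / total_cost_usd


# (total cost in USD, total processed in GB), copied from the warm-state table.
warm_state = {
    "R fread": (5.096, 303.8),
    "Pandas read_csv": (6.348, 303.8),
    "Dato SFrame": (4.939, 363.6),
    "Wise ParaText CSV": (0.406, 371.8),
}

for method, (cost_usd, processed_gb) in warm_state.items():
    print(f"{method}: {work_per_usd(processed_gb, cost_usd):.1f} GB/$")
```

Because the methods did not all process the same total volume, comparing Work/USD is more meaningful than comparing total cost directly.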
