F. Costs
We calculated the prorated Amazon EC2 costs (in USD, ec2-east-1) of the load & sum workload on all data sets. The cost of each method as a multiple of ParaText is shown.
While this simple ETL task is unlikely to be a dominant cost driver in any serious production workflow, the cost implications of inefficient readers are stark. Ultimately, as we have argued, the labor, software, and hardware costs must be optimized together in constructing and running such workflows.
The prorated Amazon cost for loading and summing each CSV file is shown. The work per dollar for ParaText CSV is greater in a cold state than in a warm state. This is explained by the fact that ParaText achieved 2.36 GB/sec on the 1.05 TB medium1 in a cold state, but the experiment could not be conducted in a warm state (for obvious reasons). The cold-state total therefore includes a large, high-throughput run that the warm-state total lacks.
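As a rough illustration of how a prorated cost relates to throughput, the sketch below converts the 2.36 GB/sec figure into a run time for the 1.05 TB file and then into a cost. The hourly rate is a hypothetical placeholder, not the price used in these experiments, and the function names are ours; decimal units (1 TB = 1000 GB) are also an assumption.

```python
def prorated_cost(seconds: float, hourly_rate_usd: float) -> float:
    """Cost of occupying an instance for `seconds`, billed as a fraction of an hour."""
    return hourly_rate_usd * seconds / 3600.0


def work_per_usd(gb_processed: float, cost_usd: float) -> float:
    """Gigabytes processed per dollar spent."""
    return gb_processed / cost_usd


# Figures taken from the text: 1.05 TB file read at 2.36 GB/sec (cold state).
FILE_GB = 1050.0          # assumes 1 TB = 1000 GB
THROUGHPUT_GB_PER_SEC = 2.36

# HYPOTHETICAL hourly price, for illustration only.
HOURLY_RATE_USD = 5.00

seconds = FILE_GB / THROUGHPUT_GB_PER_SEC
cost = prorated_cost(seconds, HOURLY_RATE_USD)

print(f"run time: {seconds:.1f} s")
print(f"prorated cost at the assumed rate: ${cost:.3f}")
print(f"work per dollar at the assumed rate: {work_per_usd(FILE_GB, cost):.1f} GB/$")
```

With a fixed hourly rate, work per dollar is proportional to throughput, which is why a single large, fast run can raise the aggregate figure.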
Cold State
Method | Total cost | Total processed | Work/USD |
---|---|---|---|
NumPy loadtxt | $52.417 | 303.9 GB | 5.8 GB/$ |
Databricks SparkCSV | $191.355 | 1.315 TB | 6.9 GB/$ |
R read.csv | $10.040 | 55.2 GB | 5.5 GB/$ |
R readr | $8.155 | 303.8 GB | 37.3 GB/$ |
R fread | $5.297 | 303.8 GB | 57.4 GB/$ |
Pandas read_csv | $6.424 | 303.8 GB | 47.3 GB/$ |
Dato SFrame | $3.667 | 303.8 GB | 82.8 GB/$ |
Wise ParaText CSV | $1.352 | 1.387 TB | 1.025 TB/$ |
Wise ParaText CSV out-of-core | $4.172 | 6.899 TB | 1.653 TB/$ |
Warm State
Method | Total cost | Total processed | Work/USD |
---|---|---|---|
NumPy loadtxt | $56.892 | 354.9 GB | 6.2 GB/$ |
R read.csv | $8.674 | 70.7 GB | 8.1 GB/$ |
R readr | $7.968 | 303.8 GB | 38.1 GB/$ |
R fread | $5.096 | 303.8 GB | 59.6 GB/$ |
Databricks SparkCSV | $35.795 | 299.6 GB | 8.4 GB/$ |
Pandas read_csv | $6.348 | 303.8 GB | 47.9 GB/$ |
Dato SFrame | $4.939 | 363.6 GB | 73.6 GB/$ |
Wise ParaText CSV | $0.406 | 371.8 GB | 915.8 GB/$ |
Wise ParaText CSV out-of-core | $0.247 | 807.6 GB | 3.268 TB/$ |
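The Work/USD column is the total data processed divided by the total cost. The sketch below recomputes it from the cold-state figures and also derives a cost per gigabyte relative to ParaText. Reading "cost as a multiple of ParaText" as a ratio of cost per gigabyte is our assumption, as are the helper names and the conversion 1 TB = 1000 GB.

```python
# (method, total cost in USD, total processed in GB), copied from the
# Cold State table. TB values are converted assuming 1 TB = 1000 GB.
COLD = [
    ("NumPy loadtxt", 52.417, 303.9),
    ("Databricks SparkCSV", 191.355, 1315.0),
    ("R read.csv", 10.040, 55.2),
    ("R readr", 8.155, 303.8),
    ("R fread", 5.297, 303.8),
    ("Pandas read_csv", 6.424, 303.8),
    ("Dato SFrame", 3.667, 303.8),
    ("Wise ParaText CSV", 1.352, 1387.0),
]


def work_per_usd(cost_usd: float, gb: float) -> float:
    """Gigabytes processed per dollar."""
    return gb / cost_usd


def cost_multiples(rows, baseline: str) -> dict:
    """Cost per GB of each method divided by the baseline's cost per GB."""
    per_gb = {name: cost / gb for name, cost, gb in rows}
    base = per_gb[baseline]
    return {name: value / base for name, value in per_gb.items()}


multiples = cost_multiples(COLD, "Wise ParaText CSV")
for name, cost, gb in COLD:
    print(f"{name:22s} {work_per_usd(cost, gb):8.1f} GB/$  "
          f"{multiples[name]:7.1f}x ParaText cost per GB")
```

Normalizing by data processed matters here because the methods did not all complete the same set of files, so the raw totals are not directly comparable.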