F. Costs

We calculated the prorated Amazon EC2 costs (in USD, ec2-east-1) of the load & sum workload on all data sets. The cost of each method as a multiple of ParaText is shown.
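As a minimal sketch of what "prorated" means here, the cost of a run can be taken as the instance's hourly price scaled by the run's wall-clock time. The function name and the figures below are hypothetical; the hourly rates and run times behind the reported costs are not given in this section.

```python
def prorated_cost(hourly_rate_usd: float, elapsed_seconds: float) -> float:
    """Cost of a run billed in proportion to its wall-clock time."""
    return hourly_rate_usd * elapsed_seconds / 3600.0


# Hypothetical figures for illustration only (not taken from the benchmark).
cost = prorated_cost(hourly_rate_usd=2.0, elapsed_seconds=900.0)
print(f"${cost:.3f}")
```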

While this simple ETL task is unlikely to be a dominant cost driver in any serious production workflow, the cost implications of inefficient readers are stark. Ultimately, as we have argued, the labor, software, and hardware costs must be optimized together in constructing and running such workflows.

The prorated Amazon cost for loading and summing each CSV file is shown. The work per dollar for ParaText CSV is greater in a cold state than in a warm state. This is explained by the fact that ParaText achieved 2.36 GB/sec on the 1.05 TB medium1 in a cold state, but the experiment could not be conducted in a warm state (for obvious reasons), so that run contributes only to the cold-state totals.

Cold State

Method	Total cost	Total processed	Work/USD
NumPy `loadtxt`	$52.417	303.9 GB	5.8 GB/$
Databricks SparkCSV	$191.355	1.315 TB	6.9 GB/$
R `read.csv`	$10.040	55.2 GB	5.5 GB/$
R `readr`	$8.155	303.8 GB	37.3 GB/$
R `fread`	$5.297	303.8 GB	57.4 GB/$
Pandas `read_csv`	$6.424	303.8 GB	47.3 GB/$
Dato SFrame	$3.667	303.8 GB	82.8 GB/$
Wise ParaText CSV	$1.352	1.387 TB	1.025 TB/$
Wise ParaText CSV out-of-core	$4.172	6.899 TB	1.653 TB/$

Warm State

Method	Total cost	Total processed	Work/USD
NumPy `loadtxt`	$56.892	354.9 GB	6.2 GB/$
R `read.csv`	$8.674	70.7 GB	8.1 GB/$
R `readr`	$7.968	303.8 GB	38.1 GB/$
R `fread`	$5.096	303.8 GB	59.6 GB/$
Databricks SparkCSV	$35.795	299.6 GB	8.4 GB/$
Pandas `read_csv`	$6.348	303.8 GB	47.9 GB/$
Dato SFrame	$4.939	363.6 GB	73.6 GB/$
Wise ParaText CSV	$0.406	371.8 GB	915.8 GB/$
Wise ParaText CSV out-of-core	$0.247	807.6 GB	3.268 TB/$
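The Work/USD column is the total data processed divided by the total cost. The sketch below recomputes it for two warm-state rows and derives one method's cost per gigabyte as a multiple of ParaText's. The helper names are hypothetical, and normalizing by cost per gigabyte is an assumption about how such a multiple might be formed, not necessarily the authors' exact calculation.

```python
def work_per_usd(processed_gb: float, cost_usd: float) -> float:
    """Gigabytes processed per dollar spent."""
    return processed_gb / cost_usd


def cost_multiple(method_gb: float, method_usd: float,
                  base_gb: float, base_usd: float) -> float:
    """Cost per GB of a method relative to a baseline method."""
    return (method_usd / method_gb) / (base_usd / base_gb)


# Warm-state rows from the table above: (total processed in GB, total cost in USD).
pandas_gb, pandas_usd = 303.8, 6.348
paratext_gb, paratext_usd = 371.8, 0.406

print(f"Pandas:   {work_per_usd(pandas_gb, pandas_usd):.1f} GB/$")
print(f"ParaText: {work_per_usd(paratext_gb, paratext_usd):.1f} GB/$")

# How many times more Pandas costs per GB than ParaText (assumed normalization).
multiple = cost_multiple(pandas_gb, pandas_usd, paratext_gb, paratext_usd)
print(f"Pandas costs about {multiple:.0f}x ParaText per GB")
```

Comparing per-gigabyte cost rather than raw totals matters because the methods did not all process the same amount of data.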
