C. Memory Footprint

We now compare the memory footprints of the methods. A crash caused by insufficient memory is fatal, so memory efficiency is critical for production systems. For each method, we measure the total memory footprint needed to load a CSV file, sum its columns, and convert the loaded data to Python.
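The measurement tooling is not described here, so the following is only a minimal sketch of how the peak memory of a "load a CSV, sum its columns" task might be recorded in Python. It uses the standard library's `tracemalloc` and `csv` modules; the helper names `measure_peak_bytes` and `load_and_sum` and the sample data are hypothetical and are not part of the benchmark.

```python
import csv
import io
import tracemalloc


def measure_peak_bytes(task):
    """Run task() and return (result, peak traced bytes during the call).

    Hypothetical helper: tracemalloc only sees allocations made through
    Python's memory allocators.
    """
    tracemalloc.start()
    try:
        result = task()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def load_and_sum(csv_text):
    """Load a numeric CSV with a header row and sum each column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = [[] for _ in header]
    for row in reader:
        for i, cell in enumerate(row):
            columns[i].append(float(cell))
    return {name: sum(col) for name, col in zip(header, columns)}


# Illustrative data only; a real benchmark would read a large file from disk.
sample = "a,b\n1.0,2.0\n3.0,4.0\n"
sums, peak = measure_peak_bytes(lambda: load_and_sum(sample))
print(sums, peak)
```

Because `tracemalloc` does not capture memory held by a JVM or allocations that bypass Python's allocator, comparing native or JVM-based loaders would more plausibly rely on a process-level figure such as peak resident set size.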

ParaText uses substantially less memory than the other approaches on numeric data sets. Dato SFrame has low memory usage on text data sets.

ParaText uses the least RAM for all the float-heavy CSV files in our benchmarks. The memory footprint increases when a Dato SFrame, a Spark DataFrame, or ParaText's internal DataFrame is converted to Python. ParaText uses up to 30x less memory than Spark and 10x less memory than Dato.

Spark reserves a large heap in advance and spills data to disk as needed, so it is hard to make claims about its memory efficiency. This behavior makes provisioning resources for Spark a black art if production workloads are to be safe from crashes.
