D. Copy Overhead of Spark, Dato, and ParaText

A Spark DataFrame and a Dato SFrame can each be converted from their internal representation to a pandas DataFrame in one line of code:

# Spark
spark_data_frame.toPandas()

# Dato
dato_sframe.to_dataframe()

This conversion is an important part of the data scientist's interactive experience. A 10 TB DataFrame may be too big to fit into memory, but the result of an .aggregate() on it could be as small as 1 GB. In principle, a result of that size could be available in Python within a few seconds. In practice, the conversion often takes many minutes in Spark and Dato.

On every data set we tried, ParaText was orders of magnitude faster than Dato and Spark at exposing data in Python. Spark took about 16 minutes to convert a DataFrame loaded from a 25 GB CSV file, Dato took about 5 minutes, and Wise ParaText took about 2 seconds. A large contributor to the interactive sluggishness of Spark and Dato is their heavy use of serialization.
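
To give a rough feel for why serialization adds copy overhead, here is a minimal sketch that uses only the Python standard library. It is not how Spark or Dato move data internally. It uses pickle as a stand-in for any serialize-then-deserialize hand-off and compares it with returning an existing object directly. The names time_call, roundtrip, handoff, and rows are hypothetical and exist only for this illustration.

```python
import pickle
import time


def time_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def roundtrip(rows):
    """Serialize rows to bytes and rebuild them, which copies every value."""
    payload = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(payload)


def handoff(rows):
    """Hand the existing object over without copying it."""
    return rows


# Hypothetical stand-in for a small aggregate result: 200,000 short rows.
rows = [(i, i * 0.5, "key%d" % (i % 100)) for i in range(200_000)]

copied, t_copy = time_call(roundtrip, rows)
shared, t_share = time_call(handoff, rows)

print("serialize + deserialize: %.4f s" % t_copy)
print("direct handoff:          %.6f s" % t_share)
```

The same time_call helper can wrap the one-line conversions shown earlier, for example time_call(spark_data_frame.toPandas) or time_call(dato_sframe.to_dataframe), to measure them on your own data.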
