C. Spark Setup

This benchmark focuses solely on single-machine setups. Though Spark is designed for medium to large clusters, its performance on a single, large machine is understudied. Undoubtedly, there are many workloads for which a properly tuned Spark cluster beats any single-machine implementation on runtime. This benchmark addresses how well Spark makes use of a single machine to load data.

Tuning Spark

Spark has many more knobs to tune than the other methods. We tweaked each parameter to give Spark the best possible footing on a single machine: high parallelism and a preference for in-memory caching. The disk array mount is used for local disk caching by setting export SPARK_LOCAL_DIRS=/raid0-2. The flags we passed are listed below; a programmatic sketch of the same configuration follows the list.

  • driver-memory: 300G
  • executor-memory: 300G
  • num-executors: 32
  • spark.driver.maxResultSize: 10G
  • packages: com.databricks:spark-csv_2.11:1.4.0
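The same configuration can also be expressed programmatically. The sketch below is illustrative only; the benchmark itself passes these values as spark-submit flags (shown further down), in part because driver memory generally must be fixed before the JVM starts, so setting it from inside an already-running application has no effect.

import os
from pyspark import SparkConf

# Use the RAID-0 disk array for Spark's local scratch space,
# equivalent to `export SPARK_LOCAL_DIRS=/raid0-2` before launching.
os.environ["SPARK_LOCAL_DIRS"] = "/raid0-2"

# Illustrative SparkConf mirroring the flags listed above.
conf = (SparkConf()
        .set("spark.driver.memory", "300g")
        .set("spark.executor.memory", "300g")
        .set("spark.executor.instances", "32")
        .set("spark.driver.maxResultSize", "10g")
        .set("spark.jars.packages", "com.databricks:spark-csv_2.11:1.4.0"))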

The memory settings may seem large, but we ran into two issues when driver-memory and executor-memory were set to lower values. First, the toPandas() step fails for the larger 10-30 GB data sets. Second, Spark does not take advantage of the ample RAM available. Most of the data sets in our evaluation are roughly one tenth the size of RAM, so there should be very little, if any, spillage to disk.

Before further processing, the RDDs backing the Spark Data Frame are cached in memory:

sdf.cache()
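For context, here is a minimal sketch of how a data set might be loaded with the spark-csv package, cached, and collected to the driver. The file path, option values, and variable names are illustrative assumptions, not the benchmark's actual code:

# Load a CSV into a Spark Data Frame via the spark-csv package
# (header/inferSchema options and the path are illustrative).
sdf = (sqlContext.read
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load("/path/to/medium1.csv"))

sdf.cache()           # mark the backing RDDs for in-memory caching (lazy)
sdf.count()           # force an action so the cache is actually populated
pdf = sdf.toPandas()  # collect to the driver as a pandas DataFrame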

The benchmark log files record many characteristics of each Spark run, including the number of RDD partitions in each loaded Data Frame. The number of RDD partitions for each data set is shown (except for medium1.csv).

The number of partitions was always greater than the number of cores, except on the smallest data set. The Spark Context's default parallelism (sc.defaultParallelism) was equal to the number of cores (32) in all cases.
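A quick way to confirm these numbers interactively (assuming sdf is the cached Data Frame from above):

print(sc.defaultParallelism)       # 32 on this machine: one task slot per core
print(sdf.rdd.getNumPartitions())  # partitions backing the loaded Data Frame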

We used spark-submit to run each SparkCSV benchmark:

spark-submit --driver-memory 300G --executor-memory 300G \
             --num-executors 32 --conf spark.driver.maxResultSize=10g \
             --packages com.databricks:spark-csv_2.11:1.4.0 \
             run_experiment.py ${json_benchmark_file}

The SparkContext is instructed to use all available cores:

from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local[*]", "SparkCSV Benchmark", pyFiles=[])
sqlContext = SQLContext(sc)
