Thanks to a suggestion from a Reddit comment, I added benchmarks for the Python
code running under PyPy. This makes the results even more interesting. PyPy
actually runs the join faster than Scala when more cores are present. On the
other hand, it runs the sort slower, leading to approximately equal
performance when there are more than 2 cores available. This is really good
news for people (like myself) who are more familiar with Python and don't want
to learn another language just to execute faster Spark queries.
Apart from the extra data in the charts, the rest of this post is unmodified
and thus doesn't mention PyPy.
The fantastic Apache Spark framework provides an
API for distributed data analysis and processing in three different languages:
Scala, Java and Python. Being an ardent yet somewhat impatient Python user, I
was curious if there would be a large advantage in using Scala to code my data
processing tasks, so I created a small benchmark data processing script using
Python, Scala, and SparkSQL.
The benchmark task consists of the following steps:
1. Load a tab-separated table (gene2pubmed), and convert string values to integers (map, filter)
2. Load another table (pmid_year), parse dates and convert to integers (map)
3. Join the two tables on a key (join)
4. Rearrange the keys and values (map)
5. Count the number of occurrences of a key (reduceByKey)
6. Rearrange the keys and values (map)
7. Sort by key (sortByKey)
The dataset consists of two text file tables, weighing in at 297M and 229M.
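To make the data flow of these steps concrete, here is a minimal single-machine sketch in plain Python. It is not the benchmark code and does not use Spark: each RDD operation is replaced by an ordinary list or dictionary operation. The column layouts, the sample rows, the choice of (gene, year) as the counted key and all function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Tiny in-memory stand-ins for the two tab-separated files.
# ASSUMPTION: gene2pubmed rows are (tax_id, gene_id, pmid) and pmid_year rows
# are (pmid, date). The real column layout is not given in the post.
GENE2PUBMED = (
    "#tax_id\tgene_id\tpmid\n"
    "9606\t1\t100\n"
    "9606\t1\t101\n"
    "9606\t1\t102\n"
    "9606\t2\t100\n"
)
PMID_YEAR = "100\t2001-05-01\n101\t2003-01-15\n102\t2003-07-30\n"


def load_gene2pubmed(text):
    # map + filter: split on tabs, drop the header line, convert to integers
    rows = (line.split("\t") for line in text.splitlines())
    return [(int(pmid), int(gene))
            for tax, gene, pmid in rows if not tax.startswith("#")]


def load_pmid_year(text):
    # map: parse the date and keep the year as an integer
    rows = (line.split("\t") for line in text.splitlines())
    return [(int(pmid), int(date.split("-")[0])) for pmid, date in rows]


def join(left, right):
    # join: inner join of two (key, value) lists on the key
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


def run(gene_text, year_text):
    genes = load_gene2pubmed(gene_text)        # (pmid, gene_id)
    years = load_pmid_year(year_text)          # (pmid, year)
    joined = join(genes, years)                # (pmid, (gene_id, year))
    # map: rearrange so that (gene_id, year) becomes the key (assumed choice)
    keyed = [((gene, year), 1) for _, (gene, year) in joined]
    # reduceByKey: sum the 1s for each key
    counts = Counter()
    for key, one in keyed:
        counts[key] += one
    # map: rearrange so that the count becomes the key (assumed choice)
    by_count = [(n, key) for key, n in counts.items()]
    # sortByKey
    return sorted(by_count)


result = run(GENE2PUBMED, PMID_YEAR)
print(result)
```

In Spark each of these stages would be distributed across the worker cores; the sketch only shows what goes in and what comes out of each step.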
Total Running Time
Each of the scripts was run with a collect statement at the end to ensure that
each step was executed. The time was recorded as the real time elapsed between
the start of the script and the end. The scripts were run using 1, 2, 4 and 8
worker cores (as set by the SPARK_WORKER_CORES option).
The fastest performance was achieved when using SparkSQL with Scala. The
slowest was SparkSQL with Python. The more cores used, the more equal the results.
This is likely due to the fact that parallelizable tasks start to contribute
less and less of the total time and so the running time becomes dominated by
the collection and aggregation which must be run synchronously, take a long
time and are largely language independent (i.e. possibly run by some internal
To get a clearer picture of where the differences in performance lie, a count()
action was performed after each transformation and other action. The time was
recorded after each count().
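The per-step measurement described here can be sketched as a small helper that runs an action to force evaluation and then records the elapsed wall-clock time. This is an illustration rather than the code used for the benchmark: the `StageTimer` class and its method names are hypothetical, and plain Python list operations stand in for the Spark transformations. With a real RDD, the action passed in would be something like `lambda: rdd.count()`.

```python
import time


class StageTimer:
    """Record elapsed wall-clock time after each forced evaluation."""

    def __init__(self):
        self.start = time.perf_counter()
        self.marks = []  # list of (label, seconds since start)

    def mark(self, label, action):
        """Run `action` to force the step, then record the time since start."""
        result = action()
        self.marks.append((label, time.perf_counter() - self.start))
        return result

    def durations(self):
        """Per-step durations, derived from the cumulative marks."""
        out, previous = [], 0.0
        for label, elapsed in self.marks:
            out.append((label, elapsed - previous))
            previous = elapsed
        return out


# Stand-in workload: list comprehensions instead of Spark transformations.
timer = StageTimer()
data = list(range(100_000))
n_squares = timer.mark("map", lambda: len([x * x for x in data]))
n_even = timer.mark("filter", lambda: len([x for x in data if x % 2 == 0]))

for label, seconds in timer.durations():
    print(f"{label}: {seconds:.4f}s")
```

Subtracting consecutive marks gives the time attributable to each step.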
This data indicates that just about every step in the Python implementation,
except for the final sort, benefited proportionally (~8x) from the extra cores. The
Scala implementation, in contrast, showed no large speedup in any of the steps.
The longest steps, the join and the sort, ran about 1.5 times faster when using 8
cores than when using just 1. This could be either because the dataset is too small
to benefit from parallelization, given Scala's already fast execution, or
because something strange is going on with the settings and the operations
performed by the master node are being run concurrently even when there are fewer
worker nodes available.
This doesn't appear to be the case, as running both the master and worker nodes on a
machine with only four available cores (vs 24 in the previous benchmarks) and
allowing only one worker core actually led to faster execution. A more
comprehensive test would require running the master node on a single-core
machine and placing the workers on separate, more capable computers. I'll save
that for another day though.
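One way to put the speedup figures above in perspective is Amdahl's law, which relates the speedup on n cores to the fraction p of the work that can run in parallel: S = 1 / ((1 - p) + p / n). The sketch below is my own back-of-the-envelope check, not part of the original measurements. It assumes a fixed workload and ignores scheduling and communication overhead, and the function names are made up for this example.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def parallel_fraction_from_speedup(speedup, cores):
    """Invert Amdahl's law: S = 1 / ((1 - p) + p / n)  =>  p = (1 - 1/S) / (1 - 1/n)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


# A 1.5x speedup on 8 cores, as reported for the Scala join and sort steps.
p_scala = parallel_fraction_from_speedup(1.5, 8)
print(f"implied parallel fraction: {p_scala:.2f}")  # prints 0.38

# A full 8x speedup on 8 cores implies that all of the work ran in parallel.
p_python = parallel_fraction_from_speedup(8.0, 8)
print(f"implied parallel fraction: {p_python:.2f}")  # prints 1.00
```

Under these assumptions, a 1.5x speedup on 8 cores would mean that only about 38% of the running time of those steps was spent in work that scaled with the number of cores.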
If you have fewer cores at your disposal, Scala is quite a bit faster than
Python. As more cores are added, its advantage dwindles. Having more computing
power gives you the opportunity to use alternative languages
without having to wait for your results. If computing resources are at a premium,
then it might make sense to learn a little bit of Scala, if only enough to be
able to code SparkSQL queries.
The code for each programming language is listed in the sections below:
Master and Worker Nodes
The number of worker cores was set in the SPARK_WORKER_CORES variable in conf/spark-env.sh.
The following code was pasted into its respective shell and timed.