Thanks to a suggestion from a Reddit comment, I added benchmarks for the Python code running under PyPy. This makes the results even more interesting. PyPy actually runs the join faster than Scala when more cores are present. On the other hand, it runs the sort slower, leading to approximately equal performance when more than 2 cores are available. This is really good news for people (like myself) who are more familiar with Python and don’t want to learn another language just to execute faster Spark queries.
Apart from the extra data in the charts, the rest of this post is unmodified and thus doesn’t mention PyPy.
The fantastic Apache Spark framework provides an API for distributed data analysis and processing in three different languages: Scala, Java and Python. Being an ardent yet somewhat impatient Python user, I was curious if there would be a large advantage in using Scala to code my data processing tasks, so I created a small benchmark data processing script using Python, Scala, and SparkSQL.
The benchmark task consists of the following steps:
The dataset consists of two text file tables, weighing in at 297M and 229M.
Each of the scripts was run with a collect statement at the end to ensure that each step was executed. The time was recorded as the real time elapsed between the start of the script and the end. The scripts were run using 1, 2, 4 and 8 worker cores (as set by the SPARK_WORKER_CORES variable).
The fastest performance was achieved when using SparkSQL with Scala; the slowest, when using SparkSQL with Python. The more cores used, the more equal the results. This is likely because the parallelizable tasks contribute less and less of the total time as cores are added, so the running time becomes dominated by the collection and aggregation, which must run synchronously, take a long time and are largely language independent (i.e. possibly run by some internal Spark API).
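The reasoning above is essentially Amdahl's law: once part of the work cannot be parallelized, adding cores gives diminishing returns. Here is a minimal sketch of that model. The serial fractions used are hypothetical values chosen for illustration, not numbers measured in this benchmark.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup when only (1 - serial_fraction) of the work parallelizes."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical serial fractions (assumptions, not measurements).
for serial_fraction in (0.1, 0.5):
    for cores in (1, 2, 4, 8):
        speedup = amdahl_speedup(serial_fraction, cores)
        print(f"serial={serial_fraction:.1f} cores={cores} speedup={speedup:.2f}x")
```

Under this model, a job that is half serial can never run more than twice as fast no matter how many cores are added, which would be consistent with results converging across languages as the core count grows.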
To get a clearer picture of where the differences in performance lie, an additional action was performed after each transformation and each other action. The time was recorded after each of these.
This data indicates that just about every step in the Python implementation, except for the final sort, benefitted proportionally (~8x) from the extra cores. The Scala implementation, in contrast, showed no large speedup in any of the steps. The longest steps, the join and the sort, ran about 1.5 times faster when using 8 cores than when using just 1. This could be either because the dataset is too small to benefit from parallelization, given Scala’s already fast execution, or because something strange is going on with the settings and the operations performed by the master node are being run concurrently even when there are fewer worker nodes available.
This doesn’t appear to be the case, as running both the master and worker nodes on a machine with only four available cores (vs 24 in the previous benchmarks) and allowing only one worker core actually led to faster execution. A more comprehensive test would require running the master node on a single-core machine and placing the workers on separate, more capable computers. I’ll save that for another day though.
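Per-step timings like the ones discussed above can be collected with a small helper. The following is only a sketch of the idea, not the code used for this benchmark: the `timed` helper and the stand-in workloads are hypothetical. In Spark, transformations are lazy, so each timed block would need to end with an action for the measurement to include the actual work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step, results):
    """Store the wall-clock duration of the enclosed block in results[step]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[step] = time.perf_counter() - start


timings = {}

# Stand-in workloads (assumptions); a Spark job would place a
# transformation followed by an action inside each block.
with timed("join", timings):
    joined = [(key, key * 2) for key in range(100_000)]

with timed("sort", timings):
    ordered = sorted(joined, key=lambda pair: pair[1], reverse=True)

for step, seconds in timings.items():
    print(f"{step}: {seconds:.4f} s")
```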
If you have fewer cores at your disposal, Scala is quite a bit faster than Python. As more cores are added, its advantage dwindles. Having more computing power gives you the opportunity to use alternative languages without having to wait for your results. If computing resources are at a premium, then it might make sense to learn a little bit of Scala, if only enough to be able to code SparkSQL queries.
The code for each programming language is listed in the sections below:
The number of workers was set in the
SPARK_WORKER_CORES variable in
The following code was pasted into its respective shell and timed.
The National Center for Biotechnology Information (NCBI) maintains an enormous amount of biological data and provides it all to the public at no cost as a collection of databases. One of the most popular is GenBank, which contains information about annotated genes. Consider the gene p53, which encodes a tumor suppressor protein, the absence of which allows many cancers to proliferate. By looking at its entry in GenBank, we can immediately find out its full name (tumor protein p53), which organism this entry corresponds to (Human), aliases (BCC7, LFS1, TRP53), a short description and a whole host of other technical information.
Among the information provided with each entry is a section which contains a list of papers which have referenced this gene. In a sense, each reference is a paper which has contributed some bit of knowledge about the function of this piece of DNA (or RNA). This got me wondering, which are the most studied genes? Which genes have made an appearance in the most published papers?
To answer this, I downloaded the table which contains the reference information from GenBank, performed some rudimentary analysis, and generated the following table of the top 20 most popular genes, as measured by the number of times they have been cited:
The graph above shows the number of references in PubMed to a particular gene in GenBank. The color of the bars refers to the organism that the gene is found in. It was made using d3.js and the script for generating it can be found here (github.com), while the data itself is located here (github.com).
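The counting step behind a chart like this is simple. Below is a minimal sketch, assuming a tab-separated table with `tax_id`, `gene_id` and `pubmed_id` columns. That layout, the `top_genes` helper and the sample rows are all assumptions made up for illustration; they are not the real GenBank data or the script linked above.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in the assumed three-column layout.
SAMPLE = (
    "tax_id\tgene_id\tpubmed_id\n"
    "9606\t101\t1\n"
    "9606\t101\t2\n"
    "9606\t101\t3\n"
    "10090\t202\t2\n"
    "10090\t202\t4\n"
    "9606\t303\t5\n"
)


def top_genes(table, n=20):
    """Return the n most-referenced (tax_id, gene_id) pairs with their counts."""
    reader = csv.DictReader(table, delimiter="\t")
    # Count each distinct gene/paper pair only once.
    pairs = {(row["tax_id"], row["gene_id"], row["pubmed_id"]) for row in reader}
    counts = Counter((tax_id, gene_id) for tax_id, gene_id, _ in pairs)
    return counts.most_common(n)


for (tax_id, gene_id), references in top_genes(io.StringIO(SAMPLE), n=2):
    print(f"organism {tax_id}, gene {gene_id}: {references} references")
```

Keeping the organism identifier in the key is what allows the bars to be colored by organism later on.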
The genes on the list can be broadly placed into 6 categories:
Cancer related - Tp53, Trp53 and all of the genes with ‘cancer’ (BRCA1), ‘tumor’ (TNF) or ‘growth factor’ (EGFR, VEGFA and TGFB) in their name are likely associated with cancer and are involved in either helping cells proliferate (oncogenes) or preventing them from becoming cancerous (tumor suppressor genes).
Immune system related - Interleukins, ‘nuclear factor kappa-light-chain-enhancer of activated B cells’ (also known as NF-κB) and major histocompatibility complex (MHC) are all associated with immune responses such as recognizing pathogens and mounting an attack against them.
HIV related - gp160 envelope glycoprotein (env) is one of the proteins on the surface of retroviruses which allow them to attach to and enter cells. Needless to say, it is extremely important in finding treatments and vaccines for such viruses.
Other disease - Apolipoprotein E (APOE) is involved in heart disease and Alzheimer’s disease, while Methylenetetrahydrofolate reductase (MTHFR) is associated with susceptibility to a variety of disorders including Alzheimer’s, colon cancer and others.
Regulatory - Ubiquitin (UBC) is a protein involved in the translocation and degradation (among other processes) of other proteins. Angiotensin-converting enzyme (ACE) is a regulatory enzyme which is involved in the control of blood pressure. Estrogen receptor 1 (ESR1) is a transcription factor which responds to the hormone estrogen, leading to a variety of downstream effects.
Other - The gene w (white) is popular largely due to its historical cachet. It was the first mutation to be discovered which did not display typical Mendelian inheritance, due to its location on a sex chromosome in D. melanogaster. Gene trap ROSA 26 (Gt(ROSA)26Sor) is simply a convenient place to insert genes for study in a mouse model.
Immediately evident is the overrepresentation of disease-related genes. 15 of the 20 genes are heavily involved in some human disease. The remaining entries are either regulatory (UBC, ACE and ESR1), historic (w) or just simply useful (Gt(ROSA)26Sor). The majority come from human, followed by mouse (used to express genes also found in humans: Tnf and Trp53), and finally HIV and Drosophila. This is something of a reflection of where our interests and funding lie. The two most studied genes are involved in cancer, research in which is both well-funded and heavily reliant on genetic analysis. Four on the list are associated with the immune system (IL6, IL10, NFKB1, and HLA-DRB1), two (APOE and ACE) are associated with heart disease and one with HIV. We focus the majority of our attention on the things which are likely to kill us.
Conspicuously absent from the list are any genes from plants or genes involved in metabolism. Important pathways such as differentiation, DNA replication and protein synthesis are all absent. That’s not to say that they are not studied; it’s just that they receive less attention than processes involved in our demise. Then again, the age of molecular genetics has only begun in the last century or so. Perhaps our interests will shift in the future as we find cures and treatments for existing maladies and start having to deal with others such as a changing climate, energy crises and an aging population. Biology may hold partial solutions to these problems, and the proportional amount of effort we put into finding processes to remove carbon dioxide from the air, to produce fuels from biomatter or to limit or reverse aging may grow to eclipse that put into research on the current top-20 genes.
Quite often, when I try to attach to an existing tmux session, the following error pops up:
It seems like tmux has disappeared or crashed. Fortunately, to this date that has never been the case. It’s just a simple case of a deleted socket. To cut a long story short, fixing it requires sending tmux a signal to recreate the socket.
Here’s a reference to a stackexchange question which gives slightly more information.
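The tmux manual states that if the socket is accidentally removed, the SIGUSR1 signal may be sent to the tmux server process to recreate it. Here is a small Python sketch of that fix. The helper names are hypothetical, and it assumes `pgrep` is available; note that `pgrep -x tmux` matches every process named tmux owned by the current user, not only the server.

```python
import os
import signal
import subprocess


def find_tmux_pids():
    """Return PIDs of processes named exactly 'tmux' owned by the current user."""
    result = subprocess.run(
        ["pgrep", "-x", "-u", str(os.getuid()), "tmux"],
        capture_output=True,
        text=True,
        check=False,  # pgrep exits non-zero when nothing matches
    )
    return [int(pid) for pid in result.stdout.split()]


def request_socket_recreation(pids, send=os.kill):
    """Send SIGUSR1 to each PID, asking tmux to recreate its socket."""
    for pid in pids:
        send(pid, signal.SIGUSR1)
    return len(pids)


# Usage (uncomment to actually signal the running tmux processes):
# request_socket_recreation(find_tmux_pids())
```

The `send` parameter exists only so the function can be tried out without signalling real processes. According to the same manual entry, the recreation fails if any parent directories of the socket are missing.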