A long time ago, in an alternate universe, you rescue the king’s life. To show his gratitude, the king promises you one of his jewels as a reward. He takes out his bag of treasures and tells you that you can reach into the bag and randomly choose one. If you like it, you can keep it. If you don’t, you to place it into another bag and lose the opportunity to ever keep that jewel. Judging by the size of the bag, you estimate that there are 10 jewels inside.
How should you proceed in order to maximize your chances of scoring the best jewel in the bag?
Should you just take the first one you pick or should you look at a few before deciding on one to keep? If so, how many?
There is a concrete mathematical solution, but for me it’s easier to run some simulations and see what the outcome is. The histograms below show how often a jewel with a value between 0 and 9 is picked after seeing some of the jewels and then picking the next one which is better than any seen (or the last if the best was seen and discarded).
The results are striking! If you don’t look at any jewels and always pick the first one, you’re equally likely to get any of the jewels. That is, your chances of picking the best one are 1/9. If you look at one, discard it and then pick the next jewel which is better than the one you saw, your chances of picking the best jewel increase dramatically to almost 4 / 10.! In fact, the entire distribution gets skewed right and your chances of picking the second and third best jewels also significantly increase.
You’re after the best, however. From the histogram above, it’s hard to see how many you have to see and discard before you maximize your chances of picking the best jewel. For that we can either run a simulation with lots of jewels and lots of iterations. Or we can run simulations for increasing numbers of jewels and plot how many you have to look at, to maximize your chances of getting the best:
What this chart shows is that the number of your have to look at and throw out to maximize your chances of finding the best grows linearly with the number of jewels. The slope of the line (0.36), indicats the fraction of the total number of jewels that we should examine before picking the next best.
The red line shows how many you should look at if we want to maximize the average value of the jewel you end up with. If we try and maximize the chances of bagging the best jewel, we also increase the chances that you end up with the worst. When trying to get the highest value on average, you would thus have to look at fewer jewels before deciding to look for the best yet. How much fewer, in this case, between four and five times fewer.
But what if the values of the jewels are not uniformly distributed? What if most jewels are average in value and only a few are either precious or worthless? Or what if most jewels have a low value with only a few true gems? The next two sections show the results when the jewels are normally distributed (mostly average) and exponentially distributed (mostly low-value).
As long as you have more than one option and you want to get the best reward possible, always discard the first third (or 1/e‘th, where e is Euler’s Number, to be more precise) of the options and then pick the next one which is better than any you have seen so far. If you want something better than average but not necessarily the best, discard fewer. Or, to put it more succintly:
The example below shows how to create a graph that is both aesthetically
pleasing and sensibly proportioned. I used these parameters to generate the
majority of the figures in my PhD thesis. The savefig
function can also be
used to save a pdf or an svg file for modification in your favorite image
editor (e.g. inkscape).
The code to generate this plot uses the seaborn library in an IPython notebook:
One of the truly beautiful things about d3.js is how easy it is to combine multiple layouts into one graphic. The example below incorporates three differents layout to show the data along 3 different dimensions:
RNA secondary structure. RNA is a molecule similar to DNA with the property that it folds back onto itself to form pairs with bases in its own sequence. Its secondary structure is often displayed as an eponymous diagram. The circles representing nucleotides are arranged using an rnaplot layout.
Some quantity associated with each molecule (e.g. concentration). We can display this by scaling the secondary structure diagrams using a treemap layout. Each rnaplot is scaled as a treemap
Multiples of the above quantities (e.g. from different experiments). Four sets of structures and sizes are arranged using a grid layout.
So how do we create this layout chaining? We start with the d3-grid layout:
Now we have one appropriately positioned <g> for each treemap, under which we will construct the rna-treemap layout:
Finally, the rna-plot
layout adds its own <g> and
continuous on drawing the circles associated with the RNA (not shown).
And that’s it! Create layout function. Create child nodes bound to the data. Call layout function. Rinse, repeat! Take a look at the “Towards Reusable Charts” tutorial for an excellent introduction to creating a custom layout.
Addendum
Here’s the beginning of the json file used to create the plot:
[{"name": "graph", "children": [{"structure": "((..((....)).(((....))).))", "sequence": "CGCUUCAUAUAAUCCUAAUGACCUAU", "size": 50}, {"structure": "((...........(((....))).))", "sequence": "CGCUUCAUAUAAUCCUAAUGACCUAU", "size": 40}, {"structure": "..........................", "sequence": "CGCUUCAUAUAAUCCUAAUGACCUAU", "size": 20} ] }, {"name": "graph", "children": [{"structure": "...........(((((.....)))))..", "sequence": "CGCUUCAUAUAAUCCUAAUGACCUAU", "size": 50}, {"structure": "(((((((......)))).))).......", "sequence": "CGCUUCAUAUAAUCCUAAUGACCUAU", "size": 10} ] }, ...]