Jen Hoyal Cuthill and I just had our paper published in PLoS ONE, on Heliconius butterflies and how they may be a good case of codivergence after all.
Heliconius butterflies have long been held up as a great example of coevolution because they mimic each other so closely, but this fell from favour when trees based on molecular sequence data didn’t show much similarity. We analysed some more data, used more recently developed methods, and we think the old story might indeed be the right one.
One species, H. erato, looks so much like its close relative H. melpomene that if you weren’t an expert you’d say they were the same species. Both erato and melpomene form a tree of related races within each species, and the different races look much more like their co-mimics in the other species than like their closest relatives in their own. What’s curious is that both butterflies are unpalatable to predators, so each gains what might be thought of as only a slight fitness advantage from looking like the other: a slightly stronger signal to predators that they’re not pleasant to eat. Yet the race trees look remarkably similar to each other, which suggests the two species may have been codiverging for about half a million years after all.
Pretty cool. Shiny.
TreeMap 2 and 3 use randomization tests to estimate the significance of congruence between the host and the parasite/pathogen tree.
These are Monte Carlo estimates: the same randomization test is run over and over again, and the trials are independent. The p value is the expected proportion of times a randomized parasite/pathogen tree can be mapped into the original host tree at least as well as the original parasite tree was, and it’s estimated by these repeated, independent trials.
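The logic of that estimate can be sketched in a few lines of Python. Everything here is invented for illustration (the observed score of 7, and the stand-in for “randomize the tree and remap it”); it is not TreeMap’s actual code, just the shape of the Monte Carlo calculation.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Suppose the best map of the original parasite tree achieves 7 codivergence
# events. (The number, and the scoring below, are made up for illustration.)
observed_score = 7

def score_of_randomized_tree():
    # Stand-in for: randomize the parasite tree, remap it into the host tree,
    # and count the codivergence events in the best map found.
    return random.randint(0, 8)

n_replicates = 1000
hits = sum(score_of_randomized_tree() >= observed_score
           for _ in range(n_replicates))
p = hits / n_replicates  # proportion of randomized trees mapped at least as well
```

Each replicate is an independent Bernoulli trial, which is what makes the error of the estimate straightforward to quantify.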
That in turn means the value returned gets more accurate as the sample size (number of replicates) grows, and it has an error which can be estimated and reported. That’s why you get a p and a “+/-” in TreeMap 2. In TreeMap 3 (available here) I’ve changed the method of calculating the 95% confidence interval to Wilson’s (described on Wikipedia), which I think is a better idea. So you’ll get something like “p = 0.001 [0.0008, 0.0015]” at the end of your test.
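The Wilson score interval itself is a standard formula for a binomial proportion, so it’s easy to sketch independently of TreeMap (this is not TreeMap’s code, just the textbook formula):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.
    z = 1.96 gives an approximate 95% interval."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                   + z * z / (4 * n * n))
    return centre - half, centre + half

# e.g. 1 randomized tree scored at least as well as the original,
# out of 1000 replicates
lo, hi = wilson_interval(1, 1000)
```

Unlike the simple normal (“+/-”) interval, the Wilson interval behaves sensibly near p = 0, which is exactly where interesting cophylogeny results tend to live.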
It’s important to realise that increasing the number of replicates will increase the accuracy of your estimate of p, but it’s still an estimate only. Don’t read too much into it.
Also, it’s important to know that the p value you estimate depends on the constraints you give the test. You don’t even have to find the optimal map in advance, but it does help to know what kind of value you’re looking for. You can test on the minimum number of codivergence events you require, or the maximum total number of non-codivergence events, or maximum numbers of duplications / losses / host switches in some combinations. I usually just test on the number of codivergence events, as do most people.
The last (subtle?) thing is that the searching method used must be the same for the original map and for the many random replicates. You must not do a search with (say) 1000 random starting points for the original, and then for some reason only use 10 random starts for each replicate. No!
There will be more updates on the cophylogeny site about this too.
I’ve updated TreeMap again and created a site on Google sites: sites.google.com/site/cophylogeny.
TreeMap3 does no better than TreeMap2 at some things, though at others it does much better.
One thing it can’t solve, either by clever design (and it is better inside now, trust me) or by being written in Java, is that whole NP-completeness thing: finding a minimal total cost mapping of the dependent phylogeny, say P, into the independent one, say H, belongs to a class of problems for which no fast (polynomial-time) solution is known to exist; if one were ever found, a whole mass of other hard problems would suddenly become solvable quickly. That means guaranteed optimal solutions can take a really long time to find, and it’s not something that getting a faster computer will fix.
So I’m forced to turn to heuristics to find the best possible maps. The problem is, the landscape I’ve come up with in my first attempt is really bad.
How is it bad? Oh, and what’s a landscape here?
OK, so a problem like this one is often called a combinatorial optimization problem, because it involves finding an optimal solution that’s a combination of parts. Its landscape is a kind of mathematical graph whose points are solutions, and these points are connected to each other with lines. Each point has a score: how good the solution is, like a “how well does it explain my data?” kind of score, or “the probability that under such-and-such conditions I got this observation”. The lines connect solutions that are adjacent, in the sense that some particular kind of alteration to one solution (remember these are maps we’re talking about, so it might be “move this ancestral parasite species from here to there on the host tree”) gets you to the other.
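As a toy illustration of the idea (not TreeMap’s data structures, and absurdly small compared with a real space of maps), a landscape can be written down as a graph of scored points with neighbour links:

```python
# Each point is a candidate solution (here just a label) with a score and
# the set of neighbours reachable by one perturbation. In the real problem
# the points would be cophylogeny maps and there would be astronomically
# many of them.
landscape = {
    "A": {"score": 3, "neighbours": ["B", "C"]},
    "B": {"score": 5, "neighbours": ["A", "D"]},
    "C": {"score": 4, "neighbours": ["A", "D"]},
    "D": {"score": 2, "neighbours": ["B", "C"]},
}

# The global optimum is simply the highest-scoring point...
best = max(landscape, key=lambda v: landscape[v]["score"])
```

Of course, in the real problem you can’t enumerate all the points and take the max; you can only wander the neighbour links, which is the whole difficulty.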
So we have this construct, a weighted mathematical graph, whose points are given values that I could visualise as their heights, and the links between them enable me to wander about looking for the best map, that is, the point that’s highest. It’s a bit like trying to find the highest point on a mountain range on tight-ropes, in a fog. Erk.
Well, there are a few ways one can move around such a landscape of maps. One is simply to move from one map to the next by changing a single association at a time: not the associations at the tips, but the internal associations, like this:
A single application of the perturbation above will never yield a solution that is better than the initial one on the left, so that initial solution is locally optimal and we are stuck. So what’s to be done? There are a number of options:
- use a different perturbation type (of course)
- use (multiple) randomized starting points to start the heuristic search (but these are also prone to getting caught in local optima)
- don’t do hill-climbing: use a search method that permits worse solutions on the path to finding the best one(s) (and true cophylogenetic happiness)
- apply a graduate student to the problem
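The third option above can be sketched with simulated annealing, a classic search that sometimes accepts worse solutions so it can escape local optima. Everything in this sketch is invented for illustration (the one-dimensional score function, the cooling schedule); it’s the shape of the technique, not TreeMap’s search:

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Toy landscape: integer "solutions" 0..99, with a global peak near 58 and
# many sine-induced local optima along the way.
def score(x):
    return -abs(x - 60) + 5 * math.sin(x)

def neighbour(x):
    # One perturbation: step to an adjacent solution.
    return max(0, min(99, x + random.choice([-1, 1])))

def anneal(start, steps=5000, t0=5.0):
    x, best = start, start
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9   # temperature cools to ~0
        y = neighbour(x)
        delta = score(y) - score(x)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta >= 0 or random.random() < math.exp(delta / t):
            x = y
        if score(x) > score(best):
            best = x
    return best

found = anneal(start=0)
```

Plain hill-climbing from 0 would stall at the first sine bump; allowing occasional downhill moves is what lets the search cross the valleys between local optima.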
I’ll keep you posted…
The figure was constructed for a manuscript on circoviruses, by hand, from 15 Pareto-optimal maps of the virus phylogeny into the host one. The idea is to show a load of information about the possible explanations of why the two phylogenies are this way, but I’m not entirely convinced that it’s the right way to go about it.
So how simple is this to interpret? Is it obvious what’s going on?
I want to make cophylogeny maps as clear as possible. There’s apparently been some research on the interpretation of trees showing that people find trees drawn with curved lines easier to interpret than those with angles or “elbows”, like these:
I quite like the left one with curves, but when you add in codivergence events it gets hard to see what’s going on. On the other hand: pretty!
Probably the best solution will be to have something somewhere in between angles and curves on the map. More on that later.
I’m in Glasgow working with Rod Page again, this time on TreeMap3, now written in Java. This is essentially so I won’t be embarrassed about people having to keep old Mac Classics around to run TreeMap2.
It already does some things much better than the previous version, but there remains much to do!