Tuesday 18 June 2019

bioinformatics - Sequence evolution simulation tool


I'm looking for a tool to simulate sequence evolution given a specific mutation model and birth-death model. I'm aware of tools and packages like INDELible, Seq-Gen and PhyloSim, but they simulate evolution along phylogenetic trees. What I want is to give a sequence and ask the tool to simulate the evolution of this sequence through N generations. The output should be a big bunch of descendant sequences. I've been thinking about writing such a tool myself, but it's always better to look for something already coded.



Answer



There are a number of population genetics simulation platforms, both forward-in-time and backward-in-time (coalescent). Here is a non-exhaustive list. They all differ, and you'll have to go through their manuals to see which one best suits your needs.
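To make "forward simulation" concrete, here is a minimal sketch of what the question describes: start from one sequence and push it through N generations. It is not how any of the platforms discussed here works internally. It assumes a constant population size with Wright-Fisher resampling (a simplification of a real birth-death model) and a uniform substitution model; the names `mutate` and `simulate` and all parameter values are made up for illustration.

```python
import random

BASES = "ACGT"

def mutate(seq, mu, rng):
    """Return a copy of seq in which each site mutates with probability mu."""
    out = []
    for base in seq:
        if rng.random() < mu:
            # Substitute with one of the three other bases, chosen uniformly.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(ancestor, generations, pop_size, mu, seed=None):
    """Evolve a population founded by `ancestor` for a number of generations."""
    rng = random.Random(seed)
    population = [ancestor] * pop_size
    for _ in range(generations):
        # Each offspring copies a uniformly chosen parent, then mutates.
        population = [
            mutate(rng.choice(population), mu, rng) for _ in range(pop_size)
        ]
    return population

# Hypothetical parameters, chosen only to keep the example small.
descendants = simulate("ACGTACGTAC", generations=50, pop_size=20, mu=0.01, seed=1)
```

`descendants` is then the "big bunch of descendant sequences" after 50 generations. A dedicated platform adds selection, recombination, demography and far better performance on top of this basic loop.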



Here is a long list of such platforms. The list might be a little outdated by now, though, and many of these programs might be slow or abandoned by their authors.



Some are better known than others. Personally, I have seen the following platforms used in publications: SimCoal, Nemo, SLiM and SFS_CODE.




Of course, I must mention my own simulation platform, SimBit. In my experience, SimBit is typically faster than Nemo. It is slower than SFS_CODE and SLiM for very small simulations but becomes much faster than both when you need simulations with a decent amount of genetic diversity.



You should compare programs based on



  • Availability of the author to advise and add new features

    • I won't talk about my personal experience here, out of respect for the authors!



  • Speed / RAM usage


    • I think SimBit is typically faster (and uses less RAM) than Nemo. SFS_CODE and SLiM are very fast for simulations with low genetic diversity but very slow for simulations with high genetic diversity.



  • Flexibility

    • Nemo, SLiM and SimBit are very flexible (though they do different things); SFS_CODE is much less so.



  • User interface


    • I find Nemo's error reporting pretty poor. I like SimBit's error reporting. SLiM's interface is very nice (it uses the Eidos scripting language) and it even comes with a GUI (I have never used the GUI myself).



  • Free of Bugs

    • Several people have found important bugs in SFS_CODE (personal communication and experience).






I have personally used Nemo, SFS_CODE and SLiM in the past (I now use only SimBit), so I can only talk about these four below.


All four programs are kept up to date and maintained by their authors, and of course everyone has access to the source code.


SFS_CODE and SLiM are very fast for small simulations. Their issue is that RAM usage and run time grow exponentially with genetic diversity, so some simulations can quickly become totally unmanageable. They are great tools, though, if you expect little genetic diversity.


SimBit is typically faster than Nemo, and the difference becomes particularly obvious for large genomes. One might think "Oh, I am fine waiting a few extra days to get my results", but the difference in speed between two programs can also be the difference between waiting a week and waiting 20 years, so do not dismiss running time and RAM usage without first estimating your needs.


Nemo is very flexible when it comes to the order of life-cycle events (whether migration comes before or after selection, for example), and the newest version allows for simulating age-structured populations. SimBit is very flexible and very fast, and allows for simulating multiple species and their ecological interactions. SFS_CODE is less flexible.





My only concerns about Nemo are that each individual dies after reproduction and that there can't be more than 256 alleles at the same time.




SFS_CODE limits the number of alleles to 4. Nemo limits the number of alleles to 256 (but you can also use a phenotype and have a float number at each locus). SimBit uses either bi-allelic loci, a block in which the number of mutations is counted (max: 256 mutations; this is different from 256 alleles per se), or a continuous float number at each locus for the phenotype. Do you really need more alleles? Why can't you just use several loci to represent a sequence with multiple alleles? Real biology can only have 4 alleles at its smallest locus! SimBit uses a bit-per-bit system. If you need, say, 50,000 alleles, then you just need to ask for a sequence of 16 bits (2 bytes) and you can get $2^{16} = 65,536$ alleles.
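The arithmetic behind that last point can be sketched as follows. This only illustrates how a set of bi-allelic loci can stand in for one multi-allelic locus; it is not SimBit's input format or internal representation, and the function names `loci_needed`, `encode` and `decode` are hypothetical.

```python
def loci_needed(n_alleles):
    """Smallest number of bi-allelic loci whose combinations cover n_alleles."""
    return max(1, (n_alleles - 1).bit_length())

def encode(allele, n_loci):
    """Map an allele index to a list of 0/1 locus states, highest bit first."""
    if not 0 <= allele < 2 ** n_loci:
        raise ValueError("allele index does not fit in the requested loci")
    return [(allele >> i) & 1 for i in reversed(range(n_loci))]

def decode(bits):
    """Inverse of encode: rebuild the allele index from the locus states."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

n_loci = loci_needed(50_000)     # 16 loci, since 2**15 < 50,000 <= 2**16
states = encode(49_999, n_loci)  # one distinct 0/1 pattern per allele index
```

With 16 bi-allelic loci there are `2 ** 16 = 65536` distinct patterns, which is enough to give each of 50,000 alleles its own pattern.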


If you need more alleles in a single block (which is a little surprising to me a priori), you can probably just ask the author, and he might agree to help you if you convince him that the feature is worth adding (he might also be more willing to help in exchange for authorship if you need more support).


Hope that helps. Good luck!

