{"id":6,"date":"2013-01-30T18:22:35","date_gmt":"2013-01-30T18:22:35","guid":{"rendered":"http:\/\/evolvedmicrobe.com\/blogs\/?p=6"},"modified":"2013-01-30T18:22:35","modified_gmt":"2013-01-30T18:22:35","slug":"not-all-poisson-random-variables-are-created-equally","status":"publish","type":"post","link":"http:\/\/evolvedmicrobe.com\/blogs\/?p=6","title":{"rendered":"Not All Poisson Random Variables Are Created Equally"},"content":{"rendered":"Spurred by a slow running program, I spent an afternoon researching what algorithms are available for generating Poisson random variables and figuring out which methods are used by R, Matlab, NumPy, the GNU Science Libraray and various other available packages. I learned some things that I think would be useful to have on the internet somewhere, so I thought it would be good material for a first blog post.\r\n\r\nMany meaningfully different algorithms exists to simulate from a Poisson distribution and commonly used code libraries and programs have not settled on the same method.\u00a0I found some of the available tools use inefficient methods, though figuring out which method a package used often required me to\u00a0 work through the source code.\r\n<div>\r\n\r\n<div id=\"attachment_7\" style=\"width: 310px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34.jpg\"><img aria-describedby=\"caption-attachment-7\" data-attachment-id=\"7\" data-permalink=\"http:\/\/evolvedmicrobe.com\/blogs\/?attachment_id=7\" data-orig-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34.jpg?fit=384%2C306\" data-orig-size=\"384,306\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Poisson Distribution\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;(from Wikipedia)&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34.jpg?fit=300%2C239\" data-large-file=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34.jpg?fit=384%2C306\" loading=\"lazy\" class=\"size-medium wp-image-7\" alt=\"(from Wikipedia)\" src=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34.jpg?resize=300%2C239\" width=\"300\" height=\"239\" srcset=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34.jpg?resize=300%2C239 300w, https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/01\/img34.jpg?w=384 384w\" sizes=\"(max-width: 300px) 100vw, 300px\" data-recalc-dims=\"1\" \/><\/a><p id=\"caption-attachment-7\" class=\"wp-caption-text\">(from Wikipedia)<\/p><\/div>\r\n\r\n<\/div>\r\nThis blog post explains the specific algorithms that commonly used packages implement. For those looking for the quick take away, the lesson is to <em>avoid <\/em>the default <strong>Matlab<\/strong> implementation and the <strong>GNU Science Library (GSL).<\/strong>\r\n\r\nFor context, when deciding which Poisson simulation routine to use there are essentially two important questions.\u00a0The first is if you will be repeatedly simulating values conditioned on the same mean parameter value (\u03bb).\u00a0 If so, methods that that store certain pre-computed information offer very big performance improvements.\u00a0 However, essentially no package implements these routines (with some exceptions discussed at the end).\u00a0 The second question is if you are frequently going to simulate with a \u03bb &gt; 10.\u00a0 Poisson simulation methods usually switch between two different methods depending on if \u03bb is above or below 10 (this &#8220;decision value&#8221; can change slightly but seems to always be &gt;=10).\u00a0 When \u03bb is below 10, the algorithms are all very similar and their speed scales with the value of \u03bb.\u00a0 Where the algorithms really differ is how fast they are when\u00a0 \u03bb &gt; 10.\r\n\r\nFor \u03bb &gt; 10, the <strong><a href=\"http:\/\/www.gnu.org\/software\/gsl\/\">GNU Science Library (GSL)<\/a> <\/strong>and <strong> <a href=\"http:\/\/www.mathworks.com\/products\/matlab\/\">Matlab<\/a> <\/strong>have slower algorithms that are of order log(\u03bb).\u00a0 Their methods are straightforward implementations from pages 137-138 of Knuth&#8217;s Seminumerical Algorithms book\u00a0 (specifically the 1998 version, older versions had a more inaccurate algorithm). On these pages Knuth&#8217;s book presents a 1974 algorithm by Aherns and Dieter [1] but he unfortunately only introduces the more efficient algorithm from Aherns and Dieters&#8217; 1982 paper [2] with an excercise buried at the end of the chapter.\u00a0 This was an understandable decision given the complexity of the 1982 method, but it is somewhat unfortunate for Matlab and GSL users.\r\n\r\nA major breakthrough in algorithms for Poisson random generators came with the publication of the first O(1) algorithm in 1979 by Atkinson [3] and methods from then on are signifcantly better. The <a href=\"http:\/\/www.mathdotnet.com\/\"><strong>Math.NET<\/strong><\/a>\u00a0 project uses this 1979 algorithm in their sampler.\u00a0 A still better method is the 1982 Aherns and Dieter method which is also heavily used.\u00a0 For example, the <strong><a href=\"http:\/\/www.r-project.org\/\">R Project<\/a><\/strong>\u00a0 uses this efficient algorithm.\u00a0 Importantly, the R version also checks to see if it can avoid recomputing values if the same \u03bb value is being used on repeated function calls. To me, this is just another example of how great the core R implementations often are.\u00a0 Although in my field the GSL seems to dominate when people code, I think a reason for this is that a lot of people don&#8217;t realize that many R functions can be easily incorporated into C\/C++ code using the <strong> <a href=\"https:\/\/www.google.com\/search?q=r+stand+alone+math\">R Stand Alone Math Library<\/a>.<\/strong>\u00a0 I have found simulating with the R libraries has many advantages over using the GSL.\r\n\r\nIn 1986 <a href=\"http:\/\/luc.devroye.org\/\">Luc Devroye<\/a> published a real tome on <a href=\"http:\/\/luc.devroye.org\/rnbookindex.html\">Random Variate Generation<\/a> (that is freely available online).\u00a0 Within it was a very good algorithm, which I gather is based on a 1981 paper, that is currently implemented by the <a href=\"http:\/\/research.microsoft.com\/en-us\/um\/cambridge\/projects\/infernet\/\"> <strong>Infer .NET<\/strong><\/a> package and the <a href=\"http:\/\/commons.apache.org\/math\/apidocs\/org\/apache\/commons\/math3\/distribution\/PoissonDistribution.html#sample()\"> <strong>Apache Commons Math<\/strong><\/a> Java library.\u00a0 The algorithm is given on page 511 of the book, but anyone independently implementing it should be aware that the published <a href=\"http:\/\/luc.devroye.org\/errors.pdf\">errata<\/a> corrects some errors in the original publication of this algorithm (Infer.NET has these corrections, but I didn&#8217;t check for them in the Apache Commons library).\r\n\r\nThe most modern algorithm I found implemented was a 1993 algorithm by Hormann [5], which his benchmarks showed either outperformed other algorithms at many \u03bb values, or was not that meaningfully slower over a narrower range of \u03bb values. <strong><a href=\"http:\/\/numpy.scipy.org\/\">Numpy<\/a> <\/strong> implements this routine, and it seems at least one commentator on <a href=\"http:\/\/www.johndcook.com\/blog\/2010\/06\/14\/generating-poisson-random-values\/\"> another blog<\/a>\u00a0 implemented it as well.\r\n\r\nSo which to use?\u00a0 I think one should make sure they are using a post-1979 algorithm (so not Matlab or GSL).\u00a0 Routines after this point are similar enough that if simulating Poisson variables is still the bottleneck one should benchmark available code that implements these different options.\u00a0 At this point the coding\/implementation specifics, and the ease of using that library, will likely be more important than the algorithm.\u00a0 If no available option adequately performs, the algorithms in [2], [4] and [5] can be evaluated for application to a specific usage scenario, though by then it is probably worth knowing that the algorithms that are truly the fastest have efficient pre-computed tables or a particular \u03bb\u00a0 value [6] .\u00a0 One such table lookup method from reference [6] is implemented in <strong>Matlab<\/strong> code available from an open source file exchange <a href=\"http:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/7309\">here<\/a>.\u00a0 Though I did not completely evaluate [6], the 2004 paper says it can outperform (5X-15X faster) the other algorithms I have discussed.\u00a0 However, after a quick run through of the paper, I couldn&#8217;t tell if the benchmarks in [6] were based on repeated simulations from the same \u03bb value, or with constantly changing \u03bb values (I suspect it is the former which is why this algorithm is not widely adopted).\u00a0 The algorithm in [6] strikes me as very efficient for the same \u03bb value, and C code for it is available from the publishing journal.\u00a0 Interestingly, a Matlab <a href=\"http:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/26042-compound-poisson-simulation\/content\/Compound_Poisson_Bench.html\"> benchmark test<\/a> shows this method is about 10X faster than the default <em>poissrnd<\/em> Matlab function, even though it recreates the required look up tables on each function call.\u00a0 However, I suspect the performance increase is slightly exaggerated because the original Matlab function has such a slow routine and this method switches directly to the normal approximation of the Poisson distribution when \u03bb is over 1,000.\u00a0 This shortcut could easily be incorporated to every other method mentioned, though the direct methods seem fast enough that I personally would not like to muck around evaluating how well one distribution mathematically approximates another distribution that itself is only approximating a complex reality.\r\n\r\nWhich brings me to the end of the first blog post.\u00a0 If you are still reading this you slogged through some insights gained from someone who just read Fortran code and academic papers from a time when Nirvana was still just a Buddhist goal and not a grunge rock band.\u00a0 Hopefully it prevents someone else from doing the same.\r\n<h4>\u00a0\u00a0\u00a0 References<\/h4>\r\n1. Ahrens JH, Dieter U (1974) Computer methods for sampling from gamma, beta, poisson and bionomial distributions. Computing 12: 223-246.\r\n2. Ahrens JH, Dieter U (1982) Computer generation of Poisson deviates from modified normal distributions. ACM Transactions on Mathematical Software (TOMS) 8: 163-179.\r\n3. Atkinson A (1979) The computer generation of Poisson random variables. Applied Statistics: 29-35.\r\n4. Devroye L (1986) Non-uniform random variate generation: Springer-Verlag New York.\r\n5. H\u00f6rmann W (1993) The transformed rejection method for generating Poisson random variables. Insurance: Mathematics and Economics 12: 39-45.\r\n6. Marsaglia G, Tsang WW, Wang J (2004) Fast generation of discrete random variables. Journal of Statistical Software 11: 1-11.\r\n\r\n&nbsp;","protected":false},"excerpt":{"rendered":"Spurred by a slow running program, I spent an afternoon researching what algorithms are available for generating Poisson random variables and figuring out which methods are used by R, Matlab, NumPy, the GNU Science Libraray and various other available packages. I learned some things that I think would be useful to have on the internet [&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[2],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":112,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=112","url_meta":{"origin":6,"position":0},"title":"Mono.Simd and the Mandlebrot Set.","date":"September 10, 2013","format":false,"excerpt":"C# and .NET are some of the fastest high level languages, but still cannot truly compete with C\/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. This is despite the fact that the C# compiler often gets as much information about a method as the\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/img2_thumb.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":71,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=71","url_meta":{"origin":6,"position":1},"title":"Java vs. C# Performance Comparison for Parsing VCF Files","date":"May 26, 2013","format":false,"excerpt":"Making a comparison with a reasonably complex program ported between the two languages. Update 3\/10\/2014: After writing this post I changed the C# parser to remove an extra List<> allocation in the C# code that was not in the Java code.\u00a0\u00a0After this, the Java\/C# versions are indistinguishable on speed, but\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/05\/image_thumb1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":398,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=398","url_meta":{"origin":6,"position":2},"title":".NET Bio is Significantly Faster on .Net Core 2.0","date":"November 5, 2017","format":false,"excerpt":"Summary: With the release of .NET Core 2.0, .NET Bio is able to run significantly faster (~2X) on Mac OSX due to better compilation and memory mangement. The .NET Bio\u00a0library contains libraries for genomic data processing tasks like parsing, alignment, etc. that are too computationally intense to be\u00a0undertaken with interpreted\u2026","rel":"","context":"In \".NET Bio\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2017\/11\/Benchmark-1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":153,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=153","url_meta":{"origin":6,"position":3},"title":"Using Selectome with .NET Bio, F# and R","date":"September 16, 2013","format":false,"excerpt":"The Bio.Selectome namespace has features to query\u00a0Selectome.Selectome is a database that merges data from Ensembl\u00a0and the programs in PAML used to compute the ratio of non-synonymous to synonymous (dN\/dS)\u00a0mutations along various branches of the phylogenetic tree. A low dN\/dS ratio\u00a0indicates that the protein sequence is under strong selective constraint, while\u2026","rel":"","context":"In &quot;.NET Bio&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":35,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=35","url_meta":{"origin":6,"position":4},"title":"Comparing data structure enumeration speeds in C#","date":"March 1, 2013","format":false,"excerpt":"Determining which data structure to use for storing data involves trade-offs between how much memory they require and how long different operations, such as insertions, deletions, or searches take.\u00a0 In C#, using a linear array is the fastest way to enumerate all of the objects in a collection.\u00a0 However, the\u2026","rel":"","context":"Similar post","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":359,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=359","url_meta":{"origin":6,"position":5},"title":"Profiling Rcpp package code on Windows","date":"September 3, 2016","format":false,"excerpt":"Profiling Rcpp code on Unix\/Mac is easy, but is difficult on Windows because R uses a compilation toolchain (MinGW) that produces files that are not understood by common Windows profiling programs.\u00a0 Additionally, the R build process often removes\u00a0symbols which allow profilers to produce sensible interpretations of their data. The following\u2026","rel":"","context":"In \"Optimization\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2016\/09\/assembly.png?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/6"}],"collection":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6"}],"version-history":[{"count":3,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/6\/revisions"}],"predecessor-version":[{"id":11,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/6\/revisions\/11"}],"wp:attachment":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}