<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4522943003528921495</id><updated>2012-02-17T00:53:48.094+01:00</updated><category term='Epidemiology'/><category term='Sharks'/><category term='Applied mathematics'/><category term='Statistics'/><category term='Probability'/><category term='Data visualization'/><category term='Art'/><category term='My research'/><category term='Science'/><category term='Combinatorics'/><category term='Fun stuff'/><category term='Programming'/><category term='Robyn'/><category term='R'/><title type='text'>Looking at data</title><subtitle type='html'>A blog about statistics, probability, mathematics and data visualization.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>7</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4522943003528921495.post-5608513001494021892</id><published>2011-12-14T11:46:00.000+01:00</published><updated>2011-12-14T11:46:52.079+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Science'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Are Higgs findings due to chance?</title><content type='html'>As media all over the world are &lt;a href="http://www.bbc.co.uk/news/science-environment-16158374"&gt;reporting&lt;/a&gt; that scientists at CERN may have glimpsed the &lt;a href="http://en.wikipedia.org/wiki/Higgs_boson"&gt;Higgs particle&lt;/a&gt; for the first time, journalists struggle to explain why the physicists are reluctant to say that the results are conclusive. Unfortunately, they seem to be doing so by grossly misinterpreting one of the key concepts in modern statistics: the p-value.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://public.web.cern.ch/public/features/111213.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="260" src="http://public.web.cern.ch/public/features/111213.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;First of all: what does statistics have to do with the Higgs boson findings? It seems that there would be no need for it - either you've seen the God particle, or you haven't seen the Goddamn particle. But since the Higgs boson is so minuscule, it can't be observed directly. The only chance of detecting it is to look for anomalies in the data from the &lt;a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider"&gt;Large Hadron Collider&lt;/a&gt;. There will always be some fluctuations in the data, so you have to look for anomalies that (a) are large enough and (b) are consistent with the properties of the hypothesised Higgs particle.&lt;br /&gt;&lt;br /&gt;What the two teams at CERN reported yesterday is that both teams have found largish anomalies in the same region, independently of each other. They quantify how large the anomalies are by describing them as, for instance, as being roughly "two sigma events". Events that have more sigmas, or standard deviations, associated with them are less likely to happen if there is no Higgs boson. Two sigma events are fairly rare: if there is no Higgs boson then anomalies at least as large as these would only appear in one experiment out of 20. This probability is known as the p-value of the results.&lt;br /&gt;&lt;br /&gt;This is were the misinterpretation comes into play. Virtually every single news report on the Higgs findings seem to present a two sigma event as meaning that there is a 1/20 probability of the results being due to chance. Or in other words, that there is a 19/20 probability that the Higgs boson has been found.&lt;br /&gt;&lt;br /&gt;In what I think is an excellent explanation of the difference, David Spiegelhalter &lt;a href="http://understandinguncertainty.org/why-it%E2%80%99s-important-be-pedantic-about-sigmas-and-commas"&gt;attributed the misunderstanding to a missing comma&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;The number of sigmas does not say 'how unlikely the result is due to chance': it measures 'how unlikely the result is, due to chance'.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The difference may seem subtle... but it's not. The number of sigmas, or the p-value, only tells you how often you would see this kind of results if there was no Higgs boson. That is not the same as the probability of there being such a particle - there either is or isn't a Higgs particle. Its existence is independent of the experiments and therefore the experiments don't tell us anything about the &lt;i&gt;probability&lt;/i&gt; of the existence.&lt;br /&gt;&lt;br /&gt;What we can say, then, is only that the teams at CERN have found anomalies that suspiciously large, but not so large that we feel that they can't be due simply to chance. Even if there was no Higgs boson, anomalies of this size would appear in roughly one experiment out of twenty, which means that they are slightly more common than getting five heads in a row when flipping a coin.&lt;br /&gt;&lt;br /&gt;If you flip a coin five times and got only heads, you wouldn't say that since the p-value is 0.03, there is a 97 % chance that the coin only gives heads. The coin either is or isn't unbalanced. You can't make statements about the probability of it being unbalanced without having some prior information about how often coins are balanced and unbalanced.&lt;br /&gt;&lt;br /&gt;The experiments at CERN are no different from coin flips, but there is no prior information about in how many universes the Higgs boson exists. That's why the scientists are reluctant to say that they've found it. That's why they need to make more measurement. That's why two teams are working independently from each other. That's why there isn't a 95 % chance that the Higgs boson has been found.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;David Spiegelhalter &lt;a href="http://twitter.com/#!/undunc/status/143045796319866880"&gt;tweeted&lt;/a&gt; about his Higgs post with the words "BBC wrong in describing evidence for the Higgs Boson, nerdy pedantic statistician claims". Is it pedantic to point out that people are fooling themselves?&lt;br /&gt;&lt;br /&gt;I wholeheartedly recommend the Wikipedia list of &lt;a href="http://en.wikipedia.org/wiki/P_value#Misunderstandings"&gt;seven common misunderstandings&lt;/a&gt; about the p-value.&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4522943003528921495-5608513001494021892?l=lookingatdata.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/5608513001494021892/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://lookingatdata.blogspot.com/2011/12/are-higgs-findings-due-to-chance.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/5608513001494021892'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/5608513001494021892'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/2011/12/are-higgs-findings-due-to-chance.html' title='Are Higgs findings due to chance?'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4522943003528921495.post-3663766272148339705</id><published>2011-04-06T17:42:00.000+02:00</published><updated>2011-04-06T17:42:38.135+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Programming'/><category scheme='http://www.blogger.com/atom/ns#' term='My research'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Speeding up R computations</title><content type='html'>The past few days I've been going through some &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; code that I wrote last year, when I was preparing a massive&amp;nbsp;simulation-based&amp;nbsp;power study for some tests for&amp;nbsp;multivariate&amp;nbsp;normality&amp;nbsp;that I've been working on. My goal was to reduce the time needed to run the simulation. I wasn't expecting great improvement, since&amp;nbsp;I've always believed that the most common R functions are properly vectorized and optimized for speed. Turns out I was wrong. Very wrong.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;img border="0" src="http://www.r-project.org/Rlogo.jpg" /&gt;&lt;/div&gt;&lt;br /&gt;The first thing that I did was that I replaced all parentheses ( ) by curly brackets { }. I was inspired to do so by&amp;nbsp;&lt;a href="http://radfordneal.wordpress.com/2010/08/15/two-surpising-things-about-r/"&gt;this post&lt;/a&gt;&amp;nbsp;(and &lt;a href="http://radfordneal.wordpress.com/2010/08/19/speeding-up-parentheses-and-lots-more-in-r/"&gt;this&lt;/a&gt;, via &lt;a href="http://xianblog.wordpress.com/2010/09/06/insane/"&gt;Xi'Ans Og&lt;/a&gt;) over at Radford Neal's blog. As he pointed out, code that uses parentheses is actually slower than the same code with curly brackets:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:1000000) { 1*(1+1) } )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;1.337 &amp;nbsp; 0.005 &amp;nbsp; 1.349&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:1000000) { 1*{1+1} } )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;1.072 &amp;nbsp; 0.003 &amp;nbsp; 1.076&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Similarly, you can compare&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a*a&lt;/span&gt; and&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a^2&lt;/span&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:10000000) 3^2 )&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;5.048 &amp;nbsp; 0.028 &amp;nbsp; 5.088&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:10000000) 3*3 )&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;4.721 &amp;nbsp; 0.024 &amp;nbsp; 4.748&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So,&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a^2&amp;nbsp;&lt;/span&gt;is slower than&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a*a.&lt;/span&gt;&amp;nbsp;This made me wonder, are there other built-in R functions that are slower than they ought to be?&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;One thing that I found very surprising, and frankly rather disturbing, is that &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;mean(x&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;) takes ten times as long to calculate the mean value of the 50 real numbers in the vector &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;x&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; as the "manual" function&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;sum(x)/50:&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; x&amp;lt;-rnorm(50)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time(for(i in 1:100000){mean(x)})&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;1.522 &amp;nbsp; 0.000 &amp;nbsp; 1.523&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time(for(i in 1:100000){sum(x)/length(x)})&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;0.200 &amp;nbsp; 0.000 &amp;nbsp; 0.200&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time(for(i in 1:100000){sum(x)/50})&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;0.167 &amp;nbsp; 0.000 &amp;nbsp; 0.167&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time(for(i in 1:100000){ overn&amp;lt;-rep(1/50,50); x%*%overn })&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;0.678 &amp;nbsp; 0.000 &amp;nbsp; 0.677&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; overn&amp;lt;-rep(1/50,50); system.time(for(i in 1:100000){ x%*%overn })&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;0.164 &amp;nbsp; 0.000 &amp;nbsp; 0.164&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I guess that the R&amp;nbsp;development&amp;nbsp;core team have been focusing on making R an easy-to-use high level programming language rather than optimizing all functions, but the poor performance of&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;mean&lt;/span&gt;&amp;nbsp;is just embarrassing.&lt;br /&gt;&lt;br /&gt;Similarly, the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;var&lt;/span&gt; function can be greatly improved upon. Here are some of the many possibilites:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; x &amp;lt;- rnorm(50)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { var(x) } )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;4.921 &amp;nbsp; 0.000 &amp;nbsp; 4.925&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { sum((x-mean(x))^2)/{length(x)-1} } )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;2.322 &amp;nbsp; 0.000 &amp;nbsp; 2.325&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { {sum(x*x)-sum(x)*sum(x)/length(x)}/{length(x)-1} } )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;0.736 &amp;nbsp; 0.000 &amp;nbsp; 0.737&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { {sum(x*x)-sum(x)*sum(x)/50}/49 } )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;0.618 &amp;nbsp; 0.000 &amp;nbsp; 0.618&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { sx&amp;lt;-sum(x); {sum(x*x)-sx*sx/50}/49 } )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp;0.567 &amp;nbsp; 0.000 &amp;nbsp; 0.568&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;I changed all the uses of &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;mean&lt;/span&gt;&amp;nbsp;in my code to "&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;sum/n&lt;/span&gt;"&amp;nbsp;instead (and avoided using &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;var&lt;/span&gt; entirely) and found that this sped things up quite a bit.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another trick to speed up your computations is to create the vectors that you wish to change within a loop with the right number of elements. While&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a&amp;lt;-NA&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;for(j in 1:100) a[j]&amp;lt;-j&lt;/span&gt;&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;works just fine, it is actually quite a bit slower than&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a&amp;lt;-rep(NA,100)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;for(j in 1:100) a[j]&amp;lt;-j&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You could create &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a&lt;/span&gt; in other ways as well of course, for instance by&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;a&amp;lt;-vector(length=100).&lt;/span&gt;&amp;nbsp;Here are the numbers:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { a&amp;lt;-NA; for(j in 1:100) a[j]&amp;lt;-j })&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;37.383 &amp;nbsp; 0.092 &amp;nbsp;37.482&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { a&amp;lt;-rep(NA,100); for(j in 1:100) a[j]&amp;lt;-j })&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;25.866 &amp;nbsp; 0.065 &amp;nbsp;25.936&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( for(i in 1:100000) { a&amp;lt;-vector(length=100); for(j in 1:100) a[j]&amp;lt;-j })&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;25.517 &amp;nbsp; 0.022 &amp;nbsp;25.548&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;In my case, I'd been a bit sloppy with creating the vectors in my loops in the proper way, so I changed this in my code as well.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;In my simulation study, I simulate multivariate random variables, compute some test statistics and use these to estimate the powers of the normality tests against various alternatives. After doing the changes mentioned above, I compared the performance of my old code to that of the new code, for 1000 iterations of the procedure:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( source("oldCode.R") )&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;548.045 &amp;nbsp; 0.273 548.622&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;gt; system.time( source("newCode.R") )&lt;/span&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; user &amp;nbsp;system elapsed&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;93.138 &amp;nbsp; 0.002 &amp;nbsp;93.194&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;The improved code is almost 6 times faster than the old code. When you do ten million or so iterations, that matters. A lot.&lt;br /&gt;&lt;br /&gt;In conclusion, it's definitely possible to speed up your code significantly if you know of the pitfalls of R. I suspect that I'll be obsessed with finding more pitfalls in the next few weeks, so I'd be thankful for any hints about other weaknesses that R has.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;It should probably be mentioned that R is really fast when things are properly vectorized. Last year, a coworker that uses Matlab challenged me to perform a number of matrix computations faster in R than in Matlab. To his great surprise, R won.&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;As a final remark, I'm now facing a bit of a dilemma. Should I write readable code; &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: xx-small;"&gt;a^6&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;; or fast code; &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: xx-small;"&gt;a*a*a*a*a*a&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4522943003528921495-3663766272148339705?l=lookingatdata.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/3663766272148339705/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://lookingatdata.blogspot.com/2011/04/speeding-up-r-computations.html#comment-form' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/3663766272148339705'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/3663766272148339705'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/2011/04/speeding-up-r-computations.html' title='Speeding up R computations'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4522943003528921495.post-5285326102832115536</id><published>2011-02-15T21:55:00.000+01:00</published><updated>2011-02-15T21:55:15.725+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Epidemiology'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Applied mathematics'/><title type='text'>Narcolepsy and the swine flu vaccine</title><content type='html'>Earlier this year the Finnish National Institute for Health and Wellfare published a report about the &lt;a href="http://www.thl.fi/en_US/web/en/pressrelease?id=24103"&gt;increased risk of narcolepsy observed among children and adolescents vaccinated with PandemrixR&lt;/a&gt;. In short, the conclusion was that the swine flu vaccine seemed to have had an unexpected side effect; the risk of narcolepsy, a sleeping disorder disease, was larger for vaccinated children in the 4-19 year age group than for unvaccinated children in the same age group.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Pandemrix_002.jpg/800px-Pandemrix_002.jpg" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Pandemrix_002.jpg/800px-Pandemrix_002.jpg" width="320" /&gt;&lt;/a&gt;&lt;a href="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Pandemrix_002.jpg/800px-Pandemrix_002.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Steven Novella at &lt;a href="http://www.sciencebasedmedicine.org/?p=10720"&gt;Science-Based Medicine wrote a great piece&lt;/a&gt; about this last week, discussing how this should be interpreted. I'm not going to go into a discussion about the findings themselves, but I would like to discuss the following part of the press release:&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: inherit;"&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;In Finland during years 2009–10, 60 children and adolescents aged 4-19  years fell ill with narcolepsy. These figures base on data from  hospitals and primary care, and the review of individual patient records  by a panel of neurologists and sleep researchers. Of those fallen ill,  52 (almost 90 percent) had received Pandemrix® vaccine, while the  vaccine coverage in the entire age group was 70 percent. Based on the  preliminary analyses, the risk of falling ill with narcolepsy among  those vaccinated in the 4-19 years age group was 9-fold in comparison to  those unvaccinated in the same age group. &lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;br /&gt;Sceptical commenters on blogs and forums have questioned whether a 9-fold increase in risk really was observed. Here's the reasoning:&lt;br /&gt;&lt;br /&gt;The estimated risk within a group is (the number of observed cases of the disease)/(size of the group). That is,&lt;br /&gt;&lt;br /&gt;Risk for vaccinated child: 52/(&lt;i&gt;n&lt;/i&gt;*0.7) = 1/&lt;i&gt;n&lt;/i&gt; * 52/0.7&lt;br /&gt;Risk for unvaccinated child: 8/(&lt;i&gt;n&lt;/i&gt;*0.3) = 1/&lt;i&gt;n&lt;/i&gt; * 8/0.3&lt;br /&gt;&lt;br /&gt;where &lt;i&gt;n&lt;/i&gt; is the number of children in the 4-19 age group.&lt;br /&gt;&lt;br /&gt;So that the &lt;a href="http://en.wikipedia.org/wiki/Relative_risk"&gt;relative risk&lt;/a&gt;, i.e. the risk increase for the vaccinated children, is&lt;br /&gt;&lt;br /&gt;(52/0.7)/(8/0.3)=2.79 .&lt;br /&gt;&lt;br /&gt;Hang on a minute. 2.79? If a 9-fold increase in risk was observed the relative risk should be 9! It seems that the Finnish epidemiologists made a mistake.&lt;br /&gt;&lt;br /&gt;...or did they?&lt;br /&gt;&lt;br /&gt;Not necessarily. When analyzing this data, we need to take time into account. &lt;a href="http://www.thl.fi/thl-client/pdfs/f890b9f3-9922-4efe-889b-157fe2e03aa4"&gt;The report itself is only available in Finnish&lt;/a&gt;, but using &lt;a href="http://translate.google.com/"&gt;Google Translate&lt;/a&gt; I gathered that the unvaccinated group were studied from January 2009 to August 2010 whereas the vaccinated individuals were studied from the date of vaccination and eight months on. In other words, the unvaccinated group had a longer time span in which they could fall ill.&lt;br /&gt;&lt;br /&gt;That means that in order to calculate the relative risk, we need to divide the number of cases by the number of months that the groups were studied, to get the risk per month. That eliminates the time factor. After doing this, the relative risk becomes &lt;br /&gt;&lt;br /&gt;((52/8)/0.7)/((8/20)/0.3)=6.96.&lt;br /&gt;&lt;br /&gt;That's higher, but still not 9. Well, to complicate things a bit it seems that an individual was considered to  be a part of the unvaccinated group until the date of vaccination,  making the calculations a bit more difficult. When that is taken into account, along with other difficulties that no doubt occur when you have the actual data at hand, the relative risk probably becomes 9.&lt;br /&gt;&lt;br /&gt;The full report is not yet available, so I can't say how close the above approach is to the one that was actually used in the analysis. Nevertheless, I hope that this post can help shed some light on the statistics behind the statement about a 9-fold increase.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;A problem with this approach is that the number of months under which the unvaccinated group was studied might affect the results, just as in the &lt;a href="http://lookingatdata.blogspot.com/2011/02/sharks-sharks_10.html"&gt;shark attack example&lt;/a&gt; that I wrote about last week. Changing the time span for the unvaccinated group to January 2008 to August 2010, say, does however not change the conclusion in this case. The analysis seems to be pretty robust to the length of time under which the control group were studied.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;WHO issued some &lt;a href="http://www.who.int/vaccine_safety/topics/influenza/pandemic/h1n1_safety_assessing/narcolepsy_february2011/en/index.html"&gt;comments&lt;/a&gt; regarding the Finnish study that are well worth reading.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4522943003528921495-5285326102832115536?l=lookingatdata.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/5285326102832115536/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://lookingatdata.blogspot.com/2011/02/narcolepsy-and-swine-flu-vaccine.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/5285326102832115536'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/5285326102832115536'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/2011/02/narcolepsy-and-swine-flu-vaccine.html' title='Narcolepsy and the swine flu vaccine'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4522943003528921495.post-7686113043156441692</id><published>2011-02-10T10:56:00.000+01:00</published><updated>2011-02-10T10:56:00.334+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sharks'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Applied mathematics'/><category scheme='http://www.blogger.com/atom/ns#' term='Probability'/><title type='text'>Sharks! Sharks!</title><content type='html'>The International Shark Attack File, or ISAF for short, recently published their &lt;a href="http://www.flmnh.ufl.edu/fish/sharks/isaf/2010summary.html"&gt;2010 worldwide shark attack summary&lt;/a&gt; and the findings have &lt;a href="http://www.nydailynews.com/news/world/2011/02/08/2011-02-08_global_shark_attacks_on_the_rise_in_2010_us_again_leads_rest_of_world.html?r=news/national"&gt;been&lt;/a&gt; &lt;a href="http://www.google.com/hostednews/afp/article/ALeqM5gUDlcMjmW-LeT_QcfBG65smRcg1g?docId=CNG.8255edf428f2a222d5d2ec1bb4d9df34.931"&gt;reported&lt;/a&gt; &lt;a href="http://latimesblogs.latimes.com/outposts/2011/02/us-led-the-world-in-shark-attacks-last-year.html"&gt;by&lt;/a&gt; &lt;a href="http://www.msnbc.msn.com/id/41458324/ns/world_news-world_environment/"&gt;international&lt;/a&gt; &lt;a href="http://www.dn.se/nyheter/varlden/2010--aret-da-hajarna-attackerade"&gt;media&lt;/a&gt; the last couple of days. The main message has been that the number of unprovoked shark attacks last year, 79, was larger than usual.&lt;br /&gt;&lt;br /&gt;The question that is left unanswered in the articles that I've read is "just how unusual is 79 shark attacks in a year?" Let's find out!&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-DPE9zsXOymk/TVO0FCrZLgI/AAAAAAAAAFc/jE2n8cqjFNk/s1600/JAWS_Movie_poster.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-DPE9zsXOymk/TVO0FCrZLgI/AAAAAAAAAFc/jE2n8cqjFNk/s320/JAWS_Movie_poster.jpg" width="221" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Assume that a trial is performed &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; independent times (where &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; is large) and that at each trial the probability &lt;span style="font-style: italic;"&gt;p&lt;/span&gt; of some given event is small. If &lt;span style="font-style: italic;"&gt;X&lt;/span&gt; is the number of trials in which the event occurs, then &lt;span style="font-style: italic;"&gt;X&lt;/span&gt; will be &lt;a href="http://en.wikipedia.org/wiki/Poisson_distribution"&gt;Poisson distributed&lt;/a&gt;. This means that the probability that &lt;span style="font-style: italic;"&gt;X&lt;/span&gt; equals some value &lt;span style="font-style: italic;"&gt;k&lt;/span&gt; will be&lt;br /&gt;&lt;br /&gt;P&lt;span style="font-style: italic;"&gt;(X=k)=m&lt;/span&gt;&lt;sup style="font-style: italic;"&gt;k&lt;/sup&gt;&lt;span style="font-style: italic;"&gt;/k!*e&lt;/span&gt;&lt;sup style="font-style: italic;"&gt;-m&lt;/sup&gt;&lt;br /&gt;&lt;br /&gt;where &lt;span style="font-style: italic;"&gt;k!=k*(k-1)*(k-2)*...*3*2*1&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;m=n*p&lt;/span&gt; is the average number of times that the event will occur. This fact has not only been seen empirically, but can also be proved using "basic" tools of probability theory.&lt;br /&gt;&lt;br /&gt;Now, a large number of people, &lt;span style="font-style: italic;"&gt;n&lt;/span&gt;,  spend time in the sea each year, and for each such person there is a  small probability, &lt;span style="font-style: italic;"&gt;p&lt;/span&gt;, that the person is attacked by a shark. The people are more or less independent, and therefore we can argue that the number of shark attacks in a year should be (at least approximately) Poisson distributed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;For the mathematically minded, I should probably point out that I'm aware that the shark attack probabilities &lt;/span&gt;&lt;span style="font-size: x-small; font-style: italic;"&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;  differ between different areas. This does not really contradict the  assumption that shark attacks follow a Poisson process, as we can view  the global shark attacks as the union of (essentially) independent  Poisson processes with different intensities.&lt;/span&gt; &lt;br /&gt;&lt;br /&gt;If we want to calculate the probability of a certain number of shark attacks, we now need to estimate the average number of shark attacks in a year, &lt;span style="font-style: italic;"&gt;m&lt;/span&gt;. Well, from the &lt;a href="http://www.flmnh.ufl.edu/fish/sharks/statistics/statsw.htm"&gt;ISAF 2000-2010 statistics&lt;/a&gt; we see that there's been an average   of 71.5 shark attacks annually, or an average of 63.6 if we only look at   the 2000-2009 period. Let's assume that the intensity of shark attacks has been constant throughout the last eleven years and that deviations from the mean are random and not due to some trend. In that case we can use the above averages as our estimates of &lt;i&gt;m&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;Using the estimate &lt;span style="font-style: italic;"&gt;m&lt;/span&gt;=71.5 we get that the probability of at least 79 shark attacks in a year is 0.17, which means that we should expect more than 79 shark attacks roughly once in every six years. In the last eleven years there has been two such years, 2000 (80 attacks) and 2010 (79 attacks), which is more or less exactly what we would expect.&lt;br /&gt;&lt;br /&gt;If we instead use the lower estimate &lt;span style="font-style: italic;"&gt;m&lt;/span&gt;=63.6 we get that the probability is 0.026, which would mean that we can expect at least 79 shark attacks once in 38 years.&lt;br /&gt;&lt;br /&gt;Were we to use the 2001-2009 data only, our estimate would be &lt;span style="font-style: italic;"&gt;m&lt;/span&gt;=55.6 and the estimated probability of at least 79 sharks attacks would be as small as 0.001. In this scenario, 2010 would have been a one-in-a-thousand extreme when it comes to shark attacks!&lt;br /&gt;&lt;br /&gt;There's actually a valuable lesson here. By choosing different years to include when calculating our estimate of &lt;span style="font-style: italic;"&gt;m&lt;/span&gt; we arrived at completely different conclusions. It was easy for us to do in this example, and it's just as easy for anyone else to do when they want to present statistics. Lies, damn lies, ...&lt;br /&gt;&lt;br /&gt;People tend to be afraid of sharks, and it is therefore interesting to note that out of the 79 sharks attacks last year, only 6 were fatal. According to the CIA World Factbook, 56.6 million people died in 2010. That means that the risk of being killed by a shark is approximately 0.0000001! That's a pretty abstract number, but maybe the &lt;a href="http://www.flickr.com/photos/pseudoplacebo/4651625768/sizes/o/in/photostream/"&gt;"What's most likely to kill you?"&lt;/a&gt; infographic can help you visualize it. It illustrates the risks of various causes for death, but does unfortunately not include shark attacks... The &lt;a href="http://www.mnn.com/earth-matters/animals/stories/infographic-shark-attacks"&gt;ultimate shark infographic is probably this one&lt;/a&gt;, from last year.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;Actually, this &lt;a href="http://en.wikipedia.org/wiki/Shark#Fishery"&gt;global shark catch graph&lt;/a&gt; tells us that the sharks are the ones who need to fear the humans.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;On a side note, I teach a course called &lt;span style="font-style: italic;"&gt;Statistics for engineers&lt;/span&gt; this semester and when I introduced the Poisson distribution two weeks ago, I used shark attacks as an example of an application of the distribution (along with some engineering applications, of course). I was inspired by &lt;a href="http://www.stat.ualberta.ca/people/schmu/preprints/poisson.pdf"&gt;this paper&lt;/a&gt;, from which I also borrowed a data set about the number of points Wayne Gretzky scored in each game during his time in Edmonton Oilers. Funnily, I gave the lecture on January 26, which was Gretzky's 50th birthday. When I introduced the &lt;a href="http://en.wikipedia.org/wiki/Binomial_distribution"&gt;binomial distribution&lt;/a&gt; I used the predictions of the 2010 FIFA world cup oracle &lt;a href="http://en.wikipedia.org/wiki/Paul_the_Octopus"&gt;Paul the Octopus&lt;/a&gt; as an illustrating example, who happened to be born on the 26th of January as well. This allowed me to move seamlessly on to the &lt;a href="http://en.wikipedia.org/wiki/Birthday_problem"&gt;birthday problem&lt;/a&gt; ("what is the probability that there is at least one pair of people in this room that have the same birthday?"), which we could solve using the binomial and Poisson distributions. Sometimes &lt;a href="http://en.wikipedia.org/wiki/Fortuna"&gt;Fortuna&lt;/a&gt; is on your side...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;As I have more than 120 students in the course, the probability of at least on pair of people sharing a birthday was ridiculously close to 1.&lt;/span&gt;&lt;span style="font-size: x-small;"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4522943003528921495-7686113043156441692?l=lookingatdata.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/7686113043156441692/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://lookingatdata.blogspot.com/2011/02/sharks-sharks_10.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/7686113043156441692'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/7686113043156441692'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/2011/02/sharks-sharks_10.html' title='Sharks! Sharks!'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-DPE9zsXOymk/TVO0FCrZLgI/AAAAAAAAAFc/jE2n8cqjFNk/s72-c/JAWS_Movie_poster.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4522943003528921495.post-8280226611060477135</id><published>2011-02-03T14:06:00.001+01:00</published><updated>2011-02-03T14:09:34.306+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='Fun stuff'/><category scheme='http://www.blogger.com/atom/ns#' term='Art'/><title type='text'>Art as statistics / statistics as art</title><content type='html'>Have you ever felt that you would love to have a classic painting hanging on your wall, but that it just isn't scientific enough? After all, you wouldn't want people to think that you're anything less than a completely rational scientifically minded person.&lt;br /&gt;&lt;br /&gt;Well, fret no more, Arthur Buxton's &lt;a href="http://arthurbuxton.blogspot.com/2010/11/van-gogh-visualisation.html"&gt;van Gogh visualization&lt;/a&gt; might be the solution to your problem!&lt;br /&gt;&lt;br /&gt;&lt;a href="http://arthurbuxton.blogspot.com/2010/11/van-gogh-visualisation.html"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 226px; height: 320px;" src="http://4.bp.blogspot.com/_OVaWBQLF3Ws/TOvT6HvCgDI/AAAAAAAAAHU/5mmr6Jy9imM/s320/christmas%2Bcard.jpg" alt="" border="0" /&gt;&lt;/a&gt;Buxton's pie charts show the percentages of different colours used in different van Gogh paintings. It's art as statistics!&lt;br /&gt;&lt;br /&gt;Mario Klingemann uses pie charts in a similar way &lt;a href="http://www.flickr.com/photos/quasimondo/sets/72157624374940604/"&gt;to give famous paintings new life&lt;/a&gt; with his "pie packed" pictures. Here are Michelangelo's &lt;span style="font-style: italic;"&gt;Creation of Adam&lt;/span&gt; and Vermeer's &lt;span style="font-style: italic;"&gt;The Girl with a Pearl Earring&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/quasimondo/sets/72157624374940604/"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 500px; height: 247px;" src="http://farm5.static.flickr.com/4041/4658972629_9a52d15b94.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;a href="http://farm5.static.flickr.com/4060/4671433879_abc89bec45.jpg"&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/quasimondo/sets/72157624374940604/"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 351px; height: 500px;" src="http://farm2.static.flickr.com/1281/4669551487_34496c71aa.jpg" alt="" border="0" /&gt;&lt;/a&gt;The principle is the same here as with Buxton's pie charts - each pie shows the percentage of different colours in the area that it covers. It's statistics as art!&lt;br /&gt;&lt;br /&gt;And as I'm writing about pie charts that describe a picture, I can't help but bring up this old XKCD picture:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://imgs.xkcd.com/comics/self_description.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 238px; height: 180px;" src="http://4.bp.blogspot.com/_h7In_HkTPwY/TUqo8YYC88I/AAAAAAAAAFQ/xc1veCM2SBA/s320/xkcd.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5569449644485964738" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It's funny because it's true. Incidentally, XKCD also did pie charts relating to some of the &lt;a href="http://xkcd.com/197/"&gt;old masters&lt;/a&gt;.&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;br /&gt;The &lt;a href="http://lovestats.wordpress.com/"&gt;LoveStat&lt;/a&gt; blog recently wrote about both Buxton and Klingemann. My main reason for writing this post is to remind myself that I still haven't framed the van Gogh posters I bought in Amsterdam four years ago.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4522943003528921495-8280226611060477135?l=lookingatdata.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/8280226611060477135/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://lookingatdata.blogspot.com/2011/02/art-as-statistics-statistics-as-art.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/8280226611060477135'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/8280226611060477135'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/2011/02/art-as-statistics-statistics-as-art.html' title='Art as statistics / statistics as art'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_OVaWBQLF3Ws/TOvT6HvCgDI/AAAAAAAAAHU/5mmr6Jy9imM/s72-c/christmas%2Bcard.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4522943003528921495.post-777250324985293502</id><published>2010-12-20T16:14:00.000+01:00</published><updated>2010-12-20T16:15:01.801+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Robyn'/><category scheme='http://www.blogger.com/atom/ns#' term='Combinatorics'/><category scheme='http://www.blogger.com/atom/ns#' term='Applied mathematics'/><title type='text'>The mathematics of a beat machine</title><content type='html'>2010 has been Swedish pop star &lt;a href="http://www.robyn.com/"&gt;Robyn&lt;/a&gt;'s year. She's released a three-part album called &lt;span style="font-style: italic;"&gt;Body Talk&lt;/span&gt;, been nominated to countless awards and broken new ground with her &lt;a href="http://www.robyn.com/killingme/"&gt;3D music video with Twitter integration&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;One of my favourite tracks on the second installment in the &lt;span style="font-style: italic;"&gt;Body Talk&lt;/span&gt; series is &lt;span style="font-style: italic;"&gt;We dance to the beat&lt;/span&gt;, the lyrics of which describe how we dance to the beat of all sorts of wonderful things, such as&lt;span style="font-style: italic;"&gt; the beat of the continents shifting under our feet&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;I'm a mathematician, so needless to say the line &lt;span style="font-style: italic;"&gt;we dance to the beat of false math and unrecognized genius&lt;/span&gt; caught my attention. That line was one of the reasons that I listened to &lt;span style="font-style: italic;"&gt;Body Talk pt. 2&lt;/span&gt; over and over again (yes, I'm a bit of a geek). Luckily, the album is nothing short of amazing, just like parts 1 and 3, so the listening pleasure was far greater than what I would get simply from hearing a pop star mentioning mathematics.&lt;br /&gt;&lt;br /&gt;In what must surely be a future textbook example of &lt;a href="http://www.skepdic.com/jung.html"&gt;synchronicity&lt;/a&gt;, the words &lt;span style="font-style: italic;"&gt;false math&lt;/span&gt; had been stuck in my head for a couple of days when the guys from the digital production company &lt;a href="http://www.dinahmoe.com/"&gt;DinahMoe&lt;/a&gt; contacted me. As it turned out, &lt;span style="font-style: italic;"&gt;We dance to the beat &lt;/span&gt;was in need of some &lt;span style="font-style: italic;"&gt;real mathematics.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_h7In_HkTPwY/TQ9H_yr89OI/AAAAAAAAAEE/j0l7eROMX_0/s1600/robyn2.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 199px;" src="http://1.bp.blogspot.com/_h7In_HkTPwY/TQ9H_yr89OI/AAAAAAAAAEE/j0l7eROMX_0/s320/robyn2.jpg" alt="" id="BLOGGER_PHOTO_ID_5552736026834564322" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;DinahMoe were developing a first of its kind interactive music experience in collaboration with &lt;a href="http://www.blipboutique.com/home/"&gt;Blip Butique&lt;/a&gt;. Released a week ago, it allows users to remix &lt;span style="font-style: italic;"&gt;We dance to the beat&lt;/span&gt; and create a music video of sorts by combining sounds and visuals on the &lt;a href="http://www.robyn.com/wedancetothebeat"&gt;interactive beat machine site&lt;/a&gt;. The users search through a grid of video/sound clips and  choose a combination of four videos and four sound clips.  They can then interact  with the clips to manipulate both audio and video, in  perfect sync with the beat.&lt;br /&gt;&lt;br /&gt;Users can share their remixes on Twitter or Facebook and look at the ever-growing seamless &lt;a href="http://www.robyn.com/wedancetothebeat/#/stream"&gt;stream&lt;/a&gt; of remixes created by others. Every now and then the stream of remixes is interrupted by pre-programmed mixes with footage of Robyn, making the stream experience even cooler.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_h7In_HkTPwY/TQ9H_yr89OI/AAAAAAAAAEE/j0l7eROMX_0/s1600/robyn2.jpg"&gt;&lt;br /&gt;&lt;/a&gt;&lt;a href="http://1.bp.blogspot.com/_h7In_HkTPwY/TQ9ICUsWQCI/AAAAAAAAAEM/gV12CKUDhj0/s1600/robyn3.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 198px;" src="http://1.bp.blogspot.com/_h7In_HkTPwY/TQ9ICUsWQCI/AAAAAAAAAEM/gV12CKUDhj0/s320/robyn3.jpg" alt="" id="BLOGGER_PHOTO_ID_5552736070322765858" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;So, where do the mathematics enter into all of this? Well, there are 14 basic sound clips in the interactive beat machine and the creative team behind it wanted to construct a grid where each cell in the grid contained one clip. The users should be able find all possible combinations of 4 out of these 14 clips by looking at a subgrid with four cells (as in the picture above). The question, then, is how large the grid needs to be in order to fit all combinations and how the clips should be arranged so that no clip is ever paired with itself.&lt;br /&gt;&lt;br /&gt;In order to answer this, we need to find out in how many ways we can choose 4 out of 14 clips. We could do this by listing all possible combinations (clips 1, 2, 3, 4, clips 1, 2, 3, 5, clips 1, 2, 3, 6, ... clips 11, 12, 13, 14), but that would (a) take too long and (b) bore us to death.&lt;br /&gt;&lt;br /&gt;Not having a deathwish, I decided to make use of some tools from the mathematical branch of &lt;a href="http://en.wikipedia.org/wiki/Combinatorics"&gt;combinatorics&lt;/a&gt;. It just so happens that a classical combinatorial problem is the question "in how many different ways can we choose &lt;span style="font-style: italic;"&gt;k&lt;/span&gt; out of &lt;span style="font-style: italic;"&gt;n &lt;/span&gt;objects" (here &lt;span style="font-style: italic;"&gt;k &lt;/span&gt;and &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; are whole numbers and &lt;span style="font-style: italic;"&gt;k&lt;/span&gt; is less than or equal to &lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;/span&gt;).&lt;br /&gt;&lt;br /&gt;Luckily, others have &lt;a href="http://www.mathsisfun.com/combinatorics/combinations-permutations.html"&gt;solved this problem&lt;/a&gt; before us, and we can use their solution right away. The answer to the question can be written as what mathematicians arrogantly refer to as a "simple" formula, using what is called the binomial coefficient (the strange thing on the left hand side in the expression below). The formula is&lt;br /&gt;&lt;br /&gt;&lt;a href="http://upload.wikimedia.org/math/2/7/5/275efadb2cec0589bc066836c3cd46cc.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 269px; height: 48px;" src="http://upload.wikimedia.org/math/2/7/5/275efadb2cec0589bc066836c3cd46cc.png" alt="" border="0" /&gt;&lt;/a&gt;which in this case, where &lt;span style="font-style: italic;"&gt;n=14&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;k=4&lt;/span&gt; becomes &lt;span style="font-style: italic;"&gt;(14*13*12*11)/(4*3*2*1)=1001&lt;/span&gt;. That is, there are 1001 possible ways to choose 4 out of 14 clips.&lt;br /&gt;&lt;br /&gt;So, we need to create a grid where all 1001 combinations of four clips can be found by looking at a subgrid of size &lt;span style="font-style: italic;"&gt;2*2&lt;/span&gt;, that is, a grid with two rows and two columns. The clips much be placed in such a way that, for instance, clip number 1 never is found next to itself in the big grid - when the user looks at four cells there must be four different clips placed in them.&lt;br /&gt;&lt;br /&gt;Consider the following small grids. In each of them there are two &lt;span style="font-style: italic;"&gt;2*2&lt;/span&gt; subgrids that the user can look at.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_h7In_HkTPwY/TQ9hNSife3I/AAAAAAAAAEU/MJcuz2kDA6I/s1600/grid.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 89px;" src="http://2.bp.blogspot.com/_h7In_HkTPwY/TQ9hNSife3I/AAAAAAAAAEU/MJcuz2kDA6I/s320/grid.jpg" alt="" id="BLOGGER_PHOTO_ID_5552763746513812338" border="0" /&gt;&lt;/a&gt;In the grid to the left, the combinations of clips 1, 2, 4, 5 (the four cells to the left in the grid) and clips 2, 3, 5, 6 (the four cells to the right in the grid) are found. In the middle grid, the combinations are 1, 2, 4, 5 and 1, 2, 5, 6. In the grid to the right however, the combination 1, 1, 2, 4 is found, which isn't permitted. That's the kind of subgrids that we want to avoid.&lt;br /&gt;&lt;br /&gt;Before we try to create the grid, we need to know how large it should be. How many subgrids of size &lt;span style="font-style: italic;"&gt;2*2&lt;/span&gt; are there in a grid with &lt;span style="font-style: italic;"&gt;a&lt;/span&gt; rows and &lt;span style="font-style: italic;"&gt;b&lt;/span&gt; columns? Let's look at some examples:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_h7In_HkTPwY/TQ9tp2vFDdI/AAAAAAAAAEs/eKfylkcT6a8/s1600/ab.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 369px; height: 327px;" src="http://2.bp.blogspot.com/_h7In_HkTPwY/TQ9tp2vFDdI/AAAAAAAAAEs/eKfylkcT6a8/s400/ab.jpg" alt="" id="BLOGGER_PHOTO_ID_5552777431406153170" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;From examples like this we can deduce that there is a general formula for the number of &lt;span style="font-style: italic;"&gt;2*2&lt;/span&gt; subgrids: in a grid with &lt;span style="font-style: italic;"&gt;a &lt;/span&gt;rows and &lt;span style="font-style: italic;"&gt;b&lt;/span&gt; columns there are &lt;span style="font-style: italic;"&gt;(a-1)*(b-1)&lt;/span&gt; subgrids of size &lt;span style="font-style: italic;"&gt;2*2&lt;/span&gt;. We need a grid with at least 1001 such subgrids in order to be able to fit all 1001 combinations in the grid.&lt;br /&gt;&lt;br /&gt;It seemed reasonable to have a grid where &lt;span style="font-style: italic;"&gt;a=b&lt;/span&gt;, that is, a grid with the same number of rows and columns. Then &lt;span style="font-style: italic;"&gt;(a-1)*(b-1)=(a-1)*(a-1)=(a-1)&lt;sup&gt;2&lt;/sup&gt;&lt;span style="font-style: italic;"&gt;=&lt;/span&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;a&lt;/span&gt;&lt;sup style="font-style: italic;"&gt;2&lt;/sup&gt;&lt;span style="font-style: italic;"&gt;-2*a+1&lt;/span&gt;. When &lt;span style="font-style: italic;"&gt;a=32&lt;/span&gt; we get that&lt;span style="font-style: italic;"&gt; a&lt;/span&gt;&lt;sup style="font-style: italic;"&gt;2&lt;/sup&gt;&lt;span style="font-style: italic;"&gt;-2*a+1=961&lt;/span&gt; and when &lt;span style="font-style: italic;"&gt;a=33&lt;/span&gt; we get that &lt;span style="font-style: italic;"&gt;a&lt;/span&gt;&lt;sup style="font-style: italic;"&gt;2&lt;/sup&gt;&lt;span style="font-style: italic;"&gt;-2*a+1=1024&lt;/span&gt;, which is larger than 1001. Thus 33 is the smallest value of &lt;span style="font-style: italic;"&gt;a &lt;/span&gt;that results in a grid where all combinations of 4 out the 14 clips can be placed.&lt;br /&gt;&lt;br /&gt;In summary, we've found that there are 1001 combinations of 4 out 14 clips and that we need a grid that has at least 33 rows and 33 columns if we want all of these combinations to be found in &lt;span style="font-style: italic;"&gt;2*2&lt;/span&gt; subgrids. Furthermore, we know what kind of subgrids we wish to avoid.&lt;br /&gt;&lt;br /&gt;At this point, placing the clips in the grid is like solving a massive &lt;span style="font-style: italic;"&gt;33*33&lt;/span&gt; sudoku with a strange set of rules. I did a bit of programming, using &lt;a href="http://www.r-project.org/"&gt;my favourite scripting language R&lt;/a&gt;, to let my computer do this for us. The resulting grid is one of the hidden cogs that is the magic behind the beat machine.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_h7In_HkTPwY/TQ9H84eXGaI/AAAAAAAAAD8/rVpWF7V3Wos/s1600/robyn1.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 201px;" src="http://4.bp.blogspot.com/_h7In_HkTPwY/TQ9H84eXGaI/AAAAAAAAAD8/rVpWF7V3Wos/s320/robyn1.jpg" alt="" id="BLOGGER_PHOTO_ID_5552735976848562594" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;(To complicate things a bit, the grid we used is "infinite" in the sense that the left border of the grid fits with the right border, so that if you go 34 steps to the right or left, or 34 steps up or down you will come back to the same 4 clips that you started with.)&lt;br /&gt;&lt;br /&gt;And then there is the matter of how to place the video clips in the grid... But that's another story.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;Robyn recently performed at the &lt;a href="http://nobelpeaceprize.org/concert/"&gt;Nobel peace prize concert&lt;/a&gt;. DinahMoe make awesomeness happen with interactive sounds and adaptive music. Check out &lt;a href="http://www.dinahmoe.com/"&gt;their website&lt;/a&gt; for more examples!&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4522943003528921495-777250324985293502?l=lookingatdata.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/777250324985293502/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://lookingatdata.blogspot.com/2010/12/mathematics-of-beat-machine.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/777250324985293502'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/777250324985293502'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/2010/12/mathematics-of-beat-machine.html' title='The mathematics of a beat machine'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_h7In_HkTPwY/TQ9H_yr89OI/AAAAAAAAAEE/j0l7eROMX_0/s72-c/robyn2.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4522943003528921495.post-3106371313276083490</id><published>2010-12-17T14:24:00.000+01:00</published><updated>2010-12-17T14:24:35.807+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data visualization'/><title type='text'>Facebook friendships and urbanization</title><content type='html'>Last Tuesday Paul Butler, an intern on Facebook's data infrastructure engineering team, posted &lt;a href="http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919"&gt;a map visualizing the "locality of friendship"&lt;/a&gt; on Facebook. Butler used data from friendships between the 500 million Facebook users to create a stunning visualization of the world's largest social network. The colour of the blue lines in represent friendship connections between cities and the white dots are the cities themselves (literally glowing from the density of the friendships within the city).&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_h7In_HkTPwY/TQthVpW1rwI/AAAAAAAAADU/vkF1hcwieNA/s1600/163413_479288597199_9445547199_5658562_14158417_n.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 159px;" src="http://1.bp.blogspot.com/_h7In_HkTPwY/TQthVpW1rwI/AAAAAAAAADU/vkF1hcwieNA/s320/163413_479288597199_9445547199_5658562_14158417_n.jpg" alt="" id="BLOGGER_PHOTO_ID_5551637990171062018" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;As interesting as the brightly shining cities and interconnecting lines are (think for a minute about what the lines represent!), many of us found the dark areas to be even more interesting. As is often the case, the most interesting questions came not from what we &lt;i&gt;could&lt;/i&gt; see in the data set, but from what we &lt;i&gt;couldn't&lt;/i&gt; see.&lt;br /&gt;&lt;br /&gt;To that end, Thorsten Gätz produced a &lt;a href="http://www.flickr.com/photos/57029495@N05/5261467662/#/photos/thorstengaetz/5261467662/"&gt;beautiful map&lt;/a&gt; where the Facebook friends map was put on top of a world map where areas with a high population density were marked in red. He described it as "[a] map which puts the connections of facebook users into context and shows where other social networks have the upper hand."&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_h7In_HkTPwY/TQthlTRiTpI/AAAAAAAAADc/FNVSxThjbes/s1600/5261467662_e8aa6841ee.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 160px;" src="http://1.bp.blogspot.com/_h7In_HkTPwY/TQthlTRiTpI/AAAAAAAAADc/FNVSxThjbes/s320/5261467662_e8aa6841ee.jpg" alt="" id="BLOGGER_PHOTO_ID_5551638259121147538" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;I am however not sure that population density is the right variable when trying to put the Facebook map into context. Parts of the third world are densely populated but not particularly urbanized. Urbanization should be highly correlated with social network usage, so perhaps that would be a more natural variable to look for in a comparison.&lt;br /&gt;&lt;br /&gt;Ten years ago, NASA used data from meteorological satellites to produce &lt;a href="http://earthobservatory.nasa.gov/IOTD/view.php?id=896"&gt;a picture of Earth's city lights&lt;/a&gt;. Brighter areas are more urbanized, so this map should be useful when we try to put the Facebook map into context. Luckily, NASA's map uses the same &lt;a href="http://en.wikipedia.org/wiki/Map_projection"&gt;projection&lt;/a&gt; as the Facebook map, so we can easily superimpose them on each other.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_h7In_HkTPwY/TQths0lTYGI/AAAAAAAAADk/QfWjekhoabA/s1600/earth_lights_lrg.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 160px;" src="http://4.bp.blogspot.com/_h7In_HkTPwY/TQths0lTYGI/AAAAAAAAADk/QfWjekhoabA/s320/earth_lights_lrg.jpg" alt="" id="BLOGGER_PHOTO_ID_5551638388321509474" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;My graphical software skills are sadly lacking when compared to those of Gätz, but after trying my best I came up with this picture:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_h7In_HkTPwY/TQthxMv3MdI/AAAAAAAAADs/5UQ8xacXVu8/s1600/both.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 159px;" src="http://1.bp.blogspot.com/_h7In_HkTPwY/TQthxMv3MdI/AAAAAAAAADs/5UQ8xacXVu8/s320/both.jpg" alt="" id="BLOGGER_PHOTO_ID_5551638463527727570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The red (and glowing) dots and lines are from the Facebook map whereas the blue parts come from NASA's map.&lt;br /&gt;&lt;br /&gt;The conclusions that we can draw from this map differs somewhat from those drawn from Gätz's map. The most urbanized regions of Africa seems to coincide with the African regions on the Facebook map. Northeast Brazil, Russia and parts of eastern Europe, the middle east and China are either underrepresented or missing completely in the Facebook map. South Korea and Japan are clearly visible on the Facebook map, but even more so on the city lights map.&lt;br /&gt;&lt;br /&gt;The picture that emerges when looking at the Facebook and urbanization maps together coincides with that from a &lt;a href="http://www.vincos.it/world-map-of-social-networks/"&gt;world map of social network&lt;/a&gt; that was recently published over at Vincos blog, giving further support to the idea that urbanization is the right context for this map.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_h7In_HkTPwY/TQth3iaBrjI/AAAAAAAAAD0/6cqB5mQTNp4/s1600/WMSN1210-570.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 223px;" src="http://1.bp.blogspot.com/_h7In_HkTPwY/TQth3iaBrjI/AAAAAAAAAD0/6cqB5mQTNp4/s320/WMSN1210-570.png" alt="" id="BLOGGER_PHOTO_ID_5551638572420935218" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Dark areas of maps have always driven mankind to push further and explore the unknown. Today we can explore those dark areas from the comfort of our homes, with a little help of powerful computers and networks of satellites...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;a href="http://blog.revolutionanalytics.com/2010/12/facebooks-social-network-graph.html"&gt;Revolutions&lt;/a&gt; and &lt;a href="http://flowingdata.com/2010/12/13/facebook-worldwide-friendships-mapped/"&gt;FlowingData&lt;/a&gt; drew my attention to Butler's Facebook map.&lt;br /&gt;&lt;br /&gt;On a side note, it's also interesting to compare these pictures to &lt;a href="http://www.chrisharrison.net/projects/InternetMap/index.html"&gt; world connection density and city-to-city router connections map.&lt;/a&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4522943003528921495-3106371313276083490?l=lookingatdata.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://lookingatdata.blogspot.com/feeds/3106371313276083490/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://lookingatdata.blogspot.com/2010/12/facebook-friendships-and-urbanization.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/3106371313276083490'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4522943003528921495/posts/default/3106371313276083490'/><link rel='alternate' type='text/html' href='http://lookingatdata.blogspot.com/2010/12/facebook-friendships-and-urbanization.html' title='Facebook friendships and urbanization'/><author><name>Måns</name><uri>http://www.blogger.com/profile/01238303946435935100</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_h7In_HkTPwY/TQthVpW1rwI/AAAAAAAAADU/vkF1hcwieNA/s72-c/163413_479288597199_9445547199_5658562_14158417_n.jpg' height='72' width='72'/><thr:total>4</thr:total></entry></feed>
