Comments on "Looking at data: Speeding up R computations" (Måns Thulin)

skan (2015-03-13):
To notice the difference clearly you need to increase the size of the vector.

x <- rnorm(500000)

system.time(for(i in 1:100000){mean(x)})
system.time(for(i in 1:100000){.Internal(mean(x))})

On my computer these take almost the same time: 97.42 s vs 95.96 s.

I've also tried data.table, and it's slower:

DT <- data.table(xx = rnorm(500000))
system.time(for(i in 1:100000){DT[, mean(xx)]})
# 320 s

What's the problem? Isn't it supposed to be faster? Unless I move the loop inside, but even then it's almost the same as the first version.

Now somebody should also compare this to dplyr, and to versions using snow, foreach, Rcpp, cmpfun, ... My computer doesn't allow me to install the compiler package.

Kenn Konstabel (2013-01-10):
Using mean.default instead of the generic mean will save you time too. Choosing the right method takes some time, but if you think about it, you don't really need to choose the right method 100000 times when you know the data are all of the same type. The rest of the difference comes from processing the extra arguments (na.rm and trim).

> x <- rnorm(100)
> system.time(for(i in 1:100000){mean(x)})
   user  system elapsed 
   2.59    0.00    2.59 
> system.time(for(i in 1:100000){sum(x)/length(x)})
   user  system elapsed 
   0.39    0.00    0.39 
> system.time(for(i in 1:100000){mean.default(x)})
   user  system elapsed 
    0.6     0.0     0.6 

But then look at the code for mean.default: there's a good hint at the very end.

> system.time(for(i in 1:100000){.Internal(mean(x))})
   user  system elapsed 
   0.19    0.00    0.19 

That is about two times faster than your custom function.
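Kenn's comparison is easy to reproduce. The sketch below (the benchmark wrapper and its name are mine, not from the comments; absolute timings will of course differ by machine) contrasts the four variants discussed in this thread. Note that .Internal() skips method dispatch and argument checking, and is not permitted in package code:

```r
x <- rnorm(100)

# Time a zero-argument function over many repetitions
benchmark <- function(f, reps = 1e5) {
  system.time(for (i in seq_len(reps)) f())["elapsed"]
}

benchmark(function() mean(x))              # generic: S3 dispatch + argument checks
benchmark(function() mean.default(x))      # skips dispatch only
benchmark(function() sum(x) / length(x))   # skips na.rm/trim handling, wrong with NAs
benchmark(function() .Internal(mean(x)))   # straight to the C code, no checks at all
```

Wrapping each variant in a closure adds a small constant overhead to every call, but since it is the same for all four, the relative ordering is still informative.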
Key (2011-04-10):
When computational power goes up, the will to optimize usually fades, resulting in blocked clusters and ridiculous amounts of unnecessary data. So keep up the optimization! =)

jc (2011-04-08):
Also, regarding your exponentiation findings: I was wrong that x^2 and x^3 are special cases where the exponent is slower, or at least I was wrong in implying that this is always true. I believe it's machine-dependent, since it relies on the implementation of the pow() function in C. On my laptop I discovered that x^2 is just as fast as x*x for pretty much any array. After that, x^n ran at a fixed speed (no difference for fairly high n, up to 20), while x*x*... of course got slower with every added x. The jump from x^2 to x^3 was very large, so for x^3 you wouldn't want to use the exponent, but if the exponent can grow large you most definitely would.

Måns (2011-04-08):
Thanks jc, that was really helpful. Now I have lots of functions that I want to investigate more closely :)

jc (2011-04-08):
Måns, generic functions like mean() may not reveal much code when you type 'mean' at the command line. However, you can get an idea of what you need to type by checking methods(mean). This lists all the mean methods you currently have, one of which will be mean.default(). That's the one to check the code of. The same is true for other generic functions.

Måns (2011-04-07):
Using the .Internal method, my improved code is now 7 times faster than it was before I started to look into this. Nice!

Måns (2011-04-07):
Great link, Julyan. Looks like I may have to brush up on my C skills :)

julyan arbel (2011-04-07):
Another way to speed up R code is to interface C code from within it. It's quite easy; see here for a simple example: http://statisfaction.wordpress.com/2011/02/04/speed-up-your-r-code-with-c/

Måns (2011-04-07):
I received a notification about a comment that I can't see in the list above, but it had an interesting link that I thought I'd share: http://www.johndcook.com/blog/2008/11/05/how-to-calculate-pearson-correlation-accurately/

Also, it was pointed out in that comment that my post mainly concerns known pitfalls. So just to be clear: I'm not claiming to have discovered new caveats, but rather wanted to comment on some things that were new to me.

Owe (2011-04-07):
My results:

> x <- rnorm(50)
> y <- sum(x)
> z <- length(x)
> Mean <- function(x){sum(x)/length(x)}
> system.time(for(i in 1:100000){mean(x)})
   user  system elapsed 
   2.61    0.05    2.74 
> system.time(for(i in 1:100000){Mean(x)})
   user  system elapsed 
   0.52    0.00    0.52 
> system.time(for(i in 1:100000){sum(x)/50})
   user  system elapsed 
   0.25    0.00    0.25 
> system.time(for(i in 1:100000){y/z})
   user  system elapsed 
   0.15    0.02    0.17 

So defining your own simple function saves about 80% of the time, and you can cut that by another two thirds by calculating the components beforehand. But nice one about the .Internal command.
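Owe's last two timings generalize to a standard trick: hoist loop-invariant work out of the loop. A minimal sketch of the idea (the function names are mine, not from the comments):

```r
x <- rnorm(50)

# Naive: recompute quantities that never change on every iteration
naive <- function(reps = 1e5) {
  for (i in seq_len(reps)) m <- sum(x) / length(x)
  m
}

# Hoisted: compute the loop-invariant parts once, outside the loop
hoisted <- function(reps = 1e5) {
  s <- sum(x)
  n <- length(x)
  for (i in seq_len(reps)) m <- s / n
  m
}

system.time(naive())    # pays for sum() and length() on every pass
system.time(hoisted())  # only a division remains inside the loop
```

This only applies, of course, when x really is unchanged inside the loop; in a jackknife or bootstrap, where the data vary each iteration, the sums must be updated rather than reused.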
Måns (2011-04-07):
Thanks for the great comments, people!

And thanks for the tips about .Internal. I've never used it before, but it really seems to be the way to go here. I'm a bit surprised that the documentation for mean() fails to mention it.

eduardo: Thanks, that was interesting to read!

jc: That a^6 thing is funny; before publishing the blog post I thought to myself that I ought to check whether the exponent was still slower for higher powers. Clearly I forgot to. :)

Sean X: Right, you certainly have a number of valid points. Since mean() is a high-level function I expected it to be a bit slower, but not THAT much slower, which is what I was trying to say. Sorry if "embarrassing" came off sounding too strong; that's always a danger for someone like me, who's not a native speaker. I tried to look at the source code for mean() in R (by simply typing the function's name), but that only says UseMethod("mean") and I didn't know where to go from there. I guess I have to go directly to the C source to find out how mean() works?

jebyrnes (2011-04-07):
The .Internal method is really quite striking. See the following:

a <- rnorm(100000000)

> system.time(for(a in 1:100000) mean(a))
   user  system elapsed 
  1.319   0.019   1.338 
> system.time(for(a in 1:100000) mean.default(a))
   user  system elapsed 
  0.478   0.001   0.480 
> system.time(for(a in 1:100000) .Internal(mean(a)))
   user  system elapsed 
  0.030   0.001   0.031 

Sean X (2011-04-06):
The slowdown is not "embarrassing". It's expected, because mean() does a lot more than sum(x)/length(x), and it should. For example, it ensures its arguments are numeric and warns when they aren't, and it removes NA values if na.rm=TRUE. Also, your length() denominator won't work when x contains NAs.

As others have said, calling .Internal(mean(x)) is much faster than any of your alternatives, because it calls directly into the C code. In fact, it calls the SAME C function as sum() (do_summary), just with different flags.

Since R makes it easy to view the source code, you could have done so and determined that whatever you're passing cannot be non-numeric and doesn't need the na.rm or trim features of mean(). In that case you should indeed use .Internal(mean()) rather than mean().

jc (2011-04-06):
OK, now I'm up to three comments, but I really meant to suggest generalizing my method for mean. You can calculate the variance really fast, if you know the data going in, with .Internal(cov(x, NULL, 1, FALSE)). It's 20x faster than var().

jc (2011-04-06):
Your dilemma about easy-to-write code is easily solved for your specific example: x^6 is faster than x*x*x*x*x*x. x*x is a special case, one of only two where repeated multiplication beats the exponent (x*x*x works as well).

jc (2011-04-06):
Try mean.default() instead of plain old mean(). That will take you from 1/20th of the speed to 1/2 of the speed. Then look at the code of mean.default to see where the rest of the slowdown comes from. When you do, you'll see the simplest call to mean(), the one most comparable to the much simpler sum(). If you try .Internal(mean(x)), you'll be twice as fast as sum(x)/length(x).

Anonymous (2011-04-06):
I noticed the slowness of the built-in functions when I had to compute a large number of jackknife correlations in a biggish gene expression data set. Looping cor() was incredibly slow, and the jackknife function of the bootstrap (?) package was a disaster. I managed to work around it, though my solution is probably far from optimal (biologist!), by MacGyvering my own cor function out of the quite fast rowSums/rowMeans functions. In the end I sped things up by a ton and got the analysis done overnight instead of in a week.

Cheers from Lund

eduardo (2011-04-06):
Be careful with the numerical instabilities that arise, e.g., when calculating variances: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

Disgruntled PhD (2011-04-06):
Wow, that's really interesting (to me, at least). Thanks for the post.

That being said, I suspect the poor performance of mean and var comes both from their need to check the length of the vector and from the checks they presumably run for NAs. Then again, I think mean fails when you supply it with NAs without specifying the action to take (unless you change the default options). It does seem somewhat surprising that the call to length can make that much difference, though.
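The NA behaviour that Sean X and Disgruntled PhD mention is easy to demonstrate; a small sketch with made-up values:

```r
x <- c(1, 2, NA, 4)

mean(x)                 # NA: mean() propagates missing values by default
mean(x, na.rm = TRUE)   # 2.3333: drops the NA and divides by 3

# The naive shortcut silently gets the denominator wrong once you
# strip NAs yourself, which is the pitfall Sean X points out above:
sum(x, na.rm = TRUE) / length(x)   # 1.75: divides by 4, not 3

# A fast version that is still NA-safe:
ok <- !is.na(x)
sum(x[ok]) / sum(ok)    # 2.3333: matches mean(x, na.rm = TRUE)
```

This is exactly the kind of checking that mean() buys you and .Internal(mean(x)) does not; skipping it is only safe when you know the data contain no NAs.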