I'll start this post with a very brief introduction to parallelization using the doMC package in R, in which I introduce a toy example with matrix inversions. I will, however, not go into detail about how to use the doMC package. If you are already familiar with doMC, you'll probably still want to glance at the matrix inversion functions before you skip ahead to the four points about what I've learned from my experiments, as these functions will occur again later in the text.
Parallel programming in R
Modern processors have several cores, which essentially are independent CPUs that can work simultaneously. When your code is parallelized, several computations are carried out in parallel on the different cores. This allows you to, for instance, run several steps of a for-loop in parallel. In R it is simple to do this using e.g. the doMC package.
As an example, imagine that you wish to generate and invert several random 200 by 200 matrices and save the inverted matrices in a list. Using a standard for loop, you can do this using the following function:
# A standard function:
myFunc<-function(B,size)
{
a<-vector("list", B)
for(i in 1:B)
{
myMatrix<-matrix(runif(size^2),size,size)
a[[i]]<-solve(myMatrix)
}
return(a)
}
Using the doMC package, we can write a parallelized version of this function as follows:
library(doMC) # Use doMC package
registerDoMC() # Determine number of cores to use, default = #cores/2
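# getDoParWorkers() reports how many workers have been registered;
# use e.g. registerDoMC(cores=4) to set the number of cores explicitly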
# A parallelized function:
myParFunc<-function(B,size)
{
foreach(i = 1:B) %dopar%
{
myMatrix<-matrix(runif(size^2),size,size)
solve(myMatrix)
}
}
The parallelized foreach %dopar% loop always returns a list. If you are used to for-loops in which the output is manually stored in vectors or matrices, this may take some getting used to. The list structure is actually very useful, and by using unlist, data.frame and Reduce in a clever way you probably won't have to make a lot of changes to your old code.
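For example, here is one way to collapse such a list into more familiar structures (a small sketch of my own with toy data, not code from the example above; foreach also has a .combine argument that does the combining for you):
library(doMC) # as before
registerDoMC()
# Each loop step returns a short vector; foreach collects them in a list
res <- foreach(i = 1:4) %dopar%
{
runif(3)
}
resVec <- unlist(res)        # a single long vector
resMat <- Reduce(rbind, res) # a matrix with one row per loop step
resDF <- data.frame(resMat)  # or a data frame
# Alternatively, let foreach combine the results itself:
resMat2 <- foreach(i = 1:4, .combine=rbind) %dopar% runif(3)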
Comparing the two functions, we find that the parallelized function is substantially faster:
> system.time(myFunc(1000,200))
user system elapsed
19.026 0.148 19.186
> system.time(myParFunc(1000,200))
user system elapsed
24.982 8.448 5.576
The improvement is even greater for larger matrices. Parallelization can save you a lot of time! Unfortunately, there are situations where it actually makes your code slower, as we will see next.
1. Parallelization is only useful when you have computation-heavy loop steps
Each parallelization has an overhead cost, in that it takes some time to assign the different loop steps to the different cores. If each step of your loop executes quickly, the overhead cost might actually be larger than the gain from parallelization! As an example, if you wish to generate and invert several small matrices, say of size 25 by 25, the overhead cost is too large for parallelization to be of any use:
> system.time(myFunc(1000,25))
user system elapsed
0.264 0.004 0.266
> system.time(myParFunc(1000,25))
user system elapsed
0.972 1.480 1.137
In this case, the parallelized function is much slower than the standard unparallelized version. Parallelization won't magically make all your loops faster!
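To see the overhead in isolation, you can time a loop whose body does next to nothing; virtually all of the elapsed time in the %dopar% version is then spent distributing the work (a small sketch of my own, and the exact timings will of course vary between machines):
# The loop body is trivial, so the for loop is almost free,
# while the %dopar% loop still pays the scheduling overhead
system.time(for(i in 1:10000) sqrt(i))
system.time(foreach(i = 1:10000) %dopar% sqrt(i))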
2. Only parallelize the outer loops
If you are using nested loops, you should probably resist the urge to replace all your for-loops with parallelized loops. If you have parallelized loops within parallelized loops, you tend to lose much of the gain from the outer parallelization, as the overhead costs increase. This is a special case of point 1 above: it is probably the entire inner loop that makes each step of the outer loop computation-heavy, rather than any single step of the inner loop.
Returning to the matrix inversion example, let's now assume that we wish to use an inner loop to generate a new type of random matrix, by repeatedly multiplying by other matrices with uniformly distributed elements. The unparallelized and parallelized versions of this are as follows:
# No parallelization:
myFunc<-function(B,size)
{
a<-vector("list", B)
for(i in 1:B)
{
myMatrix<-matrix(runif(size^2),size,size)
for(j in 1:B)
{
myMatrix<-myMatrix %*% matrix(runif(size^2),size,size)
}
a[[i]]<-solve(myMatrix)
}
return(a)
}
# Outer loop parallelized:
myParFunc<-function(B,size)
{
foreach(i = 1:B) %dopar%
{
myMatrix<-matrix(runif(size^2),size,size)
for(j in 1:B)
{
myMatrix<-myMatrix %*% matrix(runif(size^2),size,size)
}
solve(myMatrix)
}
}
# Both loops parallelized:
myParFunc2<-function(B,size)
{
foreach(i = 1:B) %dopar%
{
myMatrix<-matrix(runif(size^2),size,size)
myMatrix<-myMatrix %*% Reduce("%*%",foreach(j = 1:B) %dopar%
{
matrix(runif(size^2),size,size)
})
solve(myMatrix)
}
}
When the outer loop is parallelized, the performance is much better than when it isn't:
> system.time(myFunc(1000,25))
user system elapsed
89.305 0.012 89.381
> system.time(myParFunc(1000,25))
user system elapsed
46.767 0.780 5.239
However, parallelizing the inner loop as well slows down the execution:
> system.time(myParFunc2(1000,25))
user system elapsed
970.809 428.371 51.302
If you have several levels of nested loops, it may be beneficial to parallelize the two or three outermost loops. How far you should parallelize depends on what is going on inside each loop. The best way to find out how far you should go is simply to try different levels of parallelization.
3. More is not always merrier when it comes to cores
Using a larger number of cores means spending more time on assigning tasks to the different cores. When I tested the code for a large-scale simulation study with different numbers of cores on our 64-core number cruncher, it turned out that the code executed much faster when I used only 16 cores instead of 32 or 64. Using 16 cores was, I might add, definitely faster than using 8 cores. Then again, in another simulation, using 32 cores turned out to be faster than using 16. Less is not always merrier either.
An example where a larger number of cores slows down the execution is the following, which is a modified version of myParFunc2 above, in which the multiplication has been replaced by addition to avoid issues with singular matrices:
myParFunc3<-function(B,size)
{
foreach(i = 1:B) %dopar%
{
myMatrix<-matrix(runif(size^2),size,size)
myMatrix<-myMatrix %*% Reduce("+",foreach(j = 1:B) %dopar%
{
matrix(runif(size^2),size,size)
})
solve(myMatrix)
}
}
> registerDoMC(cores=16)
> system.time(myParFunc3(1000,75))
user system elapsed
2095.571 1340.748 137.397
> registerDoMC(cores=32)
> system.time(myParFunc3(1000,75))
user system elapsed
2767.289 2336.450 104.773
> registerDoMC(cores=64)
> system.time(myParFunc3(1000,75))
user system elapsed
2873.555 3522.984 112.809
In this example, using 32 cores is faster than using either 16 or 64 cores. It pays off to make a small trial in order to investigate how many cores you should use for a particular parallelized function!
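Such a trial is easy to set up: re-register the backend with different numbers of cores and time the same call (a small sketch of my own; the core counts below are just examples, adjust them to your machine):
# Time the same function with different numbers of cores
for(k in c(8, 16, 32, 64))
{
registerDoMC(cores=k)
cat("cores =", k, "\n")
print(system.time(myParFunc3(1000,75)))
}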
4. Make fair comparisons to unparallelized code
The doMC package includes the option of running foreach loops without parallelization, by simply typing %do% instead of %dopar%. This is nice, as it lets you compare the performance of parallelized and unparallelized loops without having to rewrite the code too much. It is, however, important to know that the %do% loop behaves differently from the standard for loop that you would probably use if you weren't going to parallelize your code. An example is given by the matrix-inversion functions from point 1 above. The %do% version of the function is:
# A standard (?) function:
myFunc2<-function(B,size)
{
foreach(i = 1:B) %do%
{
myMatrix<-matrix(runif(size^2),size,size)
solve(myMatrix)
}
}
But this function is noticeably slower than the standard function described above:
> system.time(myFunc(1000,25))
user system elapsed
0.264 0.004 0.266
> system.time(myFunc2(1000,25))
user system elapsed
0.952 0.044 0.994
When comparing parallelized and unparallelized loops, avoid %do% loops, as these make the parallelized loops seem better than they actually are.
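If you want a more systematic comparison than single system.time() calls, a package such as microbenchmark can run both versions several times and summarize the results (a small sketch of my own, assuming that microbenchmark is installed; the problem sizes and number of repetitions are just examples):
library(microbenchmark)
# Compare the ordinary for-loop function with the %dopar% version
# from point 1, leaving %do% out of the comparison
microbenchmark(
standard = myFunc(100,200),
parallel = myParFunc(100,200),
times = 5
)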
I'm honestly not sure why there is such a large discrepancy between the two unparallelized for-loops. I posted a question on this matter on Stack Overflow last year, but to date it has not received any satisfactory answers. If you happen to know wherein the difference lies, I'd much appreciate if you would let me know.
That's all for now. These are my four insights from playing around with parallelization in R. It is entirely possible that they contradict some well-known principles of parallel programming - I confess that I am almost completely ignorant about the literature on parallelization. Please let me know in the comments if you disagree with any of the points above or if you have counterexamples!
For more on parallelization in R, check out R bloggers!
doMC is only one of many packages for parallelization in R.
I haven't looked into the code of "foreach", but is the returned list actually preallocated to the loop size (i.e. vector("list", 100)) or is it grown in each step (i.e. list()), which would slow things down considerably?
Cheers,
Andrej
That could be it actually! I would however expect the people at Revolution Analytics (who wrote foreach) to know R well enough to properly allocate a correctly sized list... Hmm. I'll have to look into the code for foreach more closely!
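For what it's worth, the cost of growing a list instead of preallocating it is easy to check directly (a small sketch of the idea, not an inspection of foreach's internals; how big the difference is will depend on your R version):
growList <- function(B)
{
a <- list()
for(i in 1:B) a[[i]] <- runif(10) # the list grows at every step
a
}
preallocList <- function(B)
{
a <- vector("list", B) # the list is allocated up front
for(i in 1:B) a[[i]] <- runif(10)
a
}
system.time(growList(100000))
system.time(preallocList(100000))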