Tuesday, January 8, 2013

Speeding up R computations Pt III: parallelization

In two previous posts, I have written about how you can speed up your R computations either by using strange notation and non-standard functions or by compiling your code. Last year my department bought a 64-core computational server, which allowed me to get some seriously improved performance by using parallelization. This post is about four insights that I've gained from my experiments with parallel programming in R. These are, I presume, pretty basic and well-known, but they are things that I didn't know before and at least the last two of them came as something of a surprise to me.

I'll start this post with a very brief introduction to parallelization using the doMC package in R, in which I introduce a toy example with matrix inversions. I will, however, not go into detail about how to use the doMC package. If you are already familiar with doMC, you'll probably still want to glance at the matrix inversion functions before you skip ahead to the four points about what I've learned from my experiments, as these functions will occur again later in the text.

Parallel programming in R
Modern processors have several cores, which are essentially independent CPUs that can work simultaneously. When your code is parallelized, several computations are carried out in parallel on the different cores. This allows you to, for instance, run several steps of a for-loop in parallel. In R it is simple to do this using, for example, the doMC package.
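If you want to check how many cores your machine has and how many workers doMC will use, something along the following lines should do the trick (registering four cores here is just an arbitrary choice):
library(doMC)            # parallel backend for foreach loops
library(parallel)        # provides detectCores()
detectCores()            # total number of cores on the machine
registerDoMC(cores = 4)  # register four workers (arbitrary choice)
getDoParWorkers()        # confirm how many workers foreach will use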

As an example, imagine that you wish to generate and invert several random 200 by 200 matrices and save the inverted matrices in a list. Using a standard for loop, you can do this with the following function:
# A standard function:
myFunc<-function(B,size)
{
   a<-vector("list", B)                           # list to hold the B inverted matrices
   for(i in 1:B)
   {
      myMatrix<-matrix(runif(size^2),size,size)   # generate a random size by size matrix
      a[[i]]<-solve(myMatrix)                     # invert it and store the result
   }
   return(a)
}
Using the doMC package, we can write a parallelized version of this function as follows:
library(doMC) # Use doMC package
registerDoMC() # Determine number of cores to use, default = #cores/2
# A parallelized function:
myParFunc<-function(B,size)
{
    foreach(i = 1:B) %dopar% 
    {
        myMatrix<-matrix(runif(size^2),size,size)
        solve(myMatrix)
    }
}
A parallelized foreach %dopar% loop always returns a list. If you are used to for-loops in which the output is manually stored in vectors or matrices, this may take some getting used to. The list structure is actually very useful, and by using unlist, data.frame and Reduce in a clever way you probably won't have to make a lot of changes to your old code.
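As an illustration, here is a rough sketch of how the list output can be collected into more familiar structures (the loop body, which returns a vector of length 3, is just a toy example):
# Each loop step returns a vector of length 3
results<-foreach(i = 1:10) %dopar%
{
    runif(3)
}
resultMatrix<-do.call(rbind, results)   # stack the vectors into a 10 by 3 matrix
resultFrame<-data.frame(resultMatrix)   # ...or turn them into a data frame
total<-Reduce("+", results)             # element-wise sum of all the vectors
foreach also has a .combine argument (e.g. .combine=rbind) that combines the results for you as the loop runs.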

Comparing the two functions, we find that the parallelized function is substantially faster:
>  system.time(myFunc(1000,200))
   user  system elapsed 
 19.026   0.148  19.186 
>  system.time(myParFunc(1000,200))
   user  system elapsed 
 24.982   8.448   5.576
The improvement is even greater for larger matrices. Parallelization can save you a lot of time! Unfortunately, there are situations where it actually makes your code slower, as we will see next.

1. Parallelization is only useful when you have computation-heavy loop steps
Each parallelization has an overhead cost: it takes some time to assign the different loop steps to the different cores. If each step of your loop executes quickly, the overhead cost might actually be larger than the gain from the parallelization! As an example, if you wish to generate and invert several small matrices, say of size 25 by 25, the overhead cost is too large for parallelization to be of any use:
>  system.time(myFunc(1000,25))
   user  system elapsed 
  0.264   0.004   0.266 
>  system.time(myParFunc(1000,25))
   user  system elapsed 
  0.972   1.480   1.137
In this case, the parallelized function is much slower than the standard unparallelized version. Parallelization won't magically make all your loops faster!
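If the individual loop steps are quick but you still want to parallelize, one workaround is to let each parallel task handle a chunk of iterations instead of a single one, so that the overhead is paid once per chunk rather than once per step. A rough sketch, where the chunk size of 100 is an arbitrary choice that I haven't tuned:
# Each parallel task inverts a chunk of small matrices
myChunkedParFunc<-function(B,size,chunkSize=100)
{
    chunks<-split(1:B, ceiling((1:B)/chunkSize))   # split the B steps into chunks
    res<-foreach(ch = chunks) %dopar%
    {
        lapply(ch, function(i) solve(matrix(runif(size^2),size,size)))
    }
    unlist(res, recursive=FALSE)                   # flatten back into a list of B matrices
}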

2. Only parallelize the outer loops
If you are using nested loops, you should probably resist the urge to replace all your for-loops with parallelized loops. If you have parallelized loops within parallelized loops, you tend to lose much of the gain from the outer parallelization, as the overhead costs increase. This is a special case of point 1 above: it is probably the entire inner loop that makes each step of the outer loop computation-heavy, rather than any single step of the inner loop.

Returning to the matrix inversion example, let's now assume that we wish to use an inner loop to generate a new type of random matrix by multiplying with other matrices with uniformly distributed elements. The unparallelized and parallelized versions of this are as follows:
# No parallelization:
myFunc<-function(B,size)
{
   a<-vector("list", B)
   for(i in 1:B)
   {
      myMatrix<-matrix(runif(size^2),size,size)
      for(j in 1:B)
      {
         myMatrix<-myMatrix %*% matrix(runif(size^2),size,size)   # inner loop: repeated multiplication
      }
      a[[i]]<-solve(myMatrix)
   }
   return(a)
}

# Outer loop parallelized:
myParFunc<-function(B,size)
{
    foreach(i = 1:B) %dopar% 
    {
        myMatrix<-matrix(runif(size^2),size,size)
        for(j in 1:B)
        {
         myMatrix<-myMatrix %*% matrix(runif(size^2),size,size)
        }
        solve(myMatrix)
    }
}

# Both loops parallelized:
myParFunc2<-function(B,size)
{
    foreach(i = 1:B) %dopar% 
    {
        myMatrix<-matrix(runif(size^2),size,size)
        myMatrix<-myMatrix %*% Reduce("%*%",foreach(j = 1:B) %dopar%
        {
         matrix(runif(size^2),size,size)
        })
        solve(myMatrix)
    }
}
When the outer loop is parallelized, the performance is much better than when it isn't:
>  system.time(myFunc(1000,25))
   user  system elapsed 
 89.305   0.012  89.381 
>  system.time(myParFunc(1000,25))
   user  system elapsed 
 46.767   0.780   5.239 
However, parallelizing the inner loop as well slows down the execution:
>   system.time(myParFunc2(1000,25))
   user  system elapsed 
970.809 428.371  51.302 
If you have several levels of nested loops, it may be beneficial to parallelize only the two or three outermost loops. How far down you should parallelize depends on what is going on inside each loop. The best way to find out is simply to try different levels of parallelization.
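An alternative that I haven't benchmarked here is the %:% nesting operator from the foreach package, which collapses nested foreach loops into a single stream of tasks that is spread over the cores, instead of starting parallel workers inside parallel workers. A minimal sketch of what that would look like for a doubly nested loop:
# Rough sketch: a nested loop collapsed with foreach's %:% operator
nestedPar<-function(B1,B2,size)
{
    foreach(i = 1:B1) %:%
        foreach(j = 1:B2) %dopar%
        {
            solve(matrix(runif(size^2),size,size))
        }
}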

3. More is not always merrier when it comes to cores
Using a larger number of cores means spending more time on assigning tasks to the different cores. When I tested the code for a large-scale simulation study with different numbers of cores on our 64-core number cruncher, it turned out that the code executed much faster when I used only 16 cores instead of 32 or 64. Using 16 cores was, however, definitely faster than using 8. Then again, in another simulation, using 32 cores turned out to be faster than using 16. Less is not always merrier either.

An example where a larger number of cores slows down the execution is the following, which is a modified version of myParFunc2 above, in which the multiplications in the inner loop have been replaced by additions to avoid issues with singular matrices:

myParFunc3<-function(B,size)
{
    foreach(i = 1:B) %dopar% 
    {
        myMatrix<-matrix(runif(size^2),size,size)
        myMatrix<-myMatrix %*% Reduce("+",foreach(j = 1:B) %dopar%
        {
         matrix(runif(size^2),size,size)
        })
        solve(myMatrix)
    }
}

> registerDoMC(cores=16)
> system.time(myParFunc3(1000,75))
    user   system  elapsed 
2095.571 1340.748  137.397 

> registerDoMC(cores=32)
> system.time(myParFunc3(1000,75))
    user   system  elapsed 
2767.289 2336.450  104.773 

> registerDoMC(cores=64)
> system.time(myParFunc3(1000,75))
    user   system  elapsed 
2873.555 3522.984  112.809

In this example, using 32 cores is faster than using either 16 or 64 cores. It pays off to run a small trial to investigate how many cores you should use for a particular parallelized function!
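Such a trial can be as simple as timing the same function with a few candidate core counts, along these lines (the candidate counts are arbitrary):
# Timing the same parallelized function with different numbers of cores
for(nCores in c(8, 16, 32, 64))
{
    registerDoMC(cores = nCores)
    cat(nCores, "cores:\n")
    print(system.time(myParFunc3(1000, 75)))
}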

4. Make fair comparisons to unparallelized code
The doMC package includes the option of running foreach loops without parallelization by simply typing %do% instead of %dopar%. This is nice, as it lets you compare the performance of parallelized and unparallelized loops without having to rewrite the code too much. It is, however, important to know that the %do% loop behaves differently from the standard for loop that you would probably use if you weren't going to parallelize your code. An example is given by the matrix-inversion functions from point 1 above. The %do% version of the function is:
# A standard (?) function:
myFunc2<-function(B,size)
{
    foreach(i = 1:B) %do% 
    {
        myMatrix<-matrix(runif(size^2),size,size)
        solve(myMatrix)
    }
}
But this function is noticeably slower than the standard function described above:

>  system.time(myFunc(1000,25))
   user  system elapsed 
  0.264   0.004   0.266
>  system.time(myFunc2(1000,25))
   user  system elapsed 
  0.952   0.044   0.994

When comparing parallelized and unparallelized loops, avoid %do% loops, as these make the parallelized loops seem better than they actually are.

I'm honestly not sure why there is such a large discrepancy between the two unparallelized for-loops. I posted a question on this matter on Stack Overflow last year, but to date it has not received any satisfactory answers. If you happen to know wherein the difference lies, I'd much appreciate if you would let me know.

That's all for now. These are my four insights from playing around with parallelization in R. It is entirely possible that they contradict some well-known principles of parallel programming - I confess that I am almost completely ignorant about the literature on parallelization. Please let me know in the comments if you disagree with any of the points above or if you have counterexamples!

For more on parallelization in R, check out R bloggers!

doMC is only one of many packages for parallelization in R.
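For instance, the parallel package that ships with R provides mclapply, which could handle the matrix-inversion example from the beginning of this post in roughly the following way (the choice of four cores is arbitrary):
# Rough sketch: the matrix-inversion example using parallel::mclapply instead of doMC
library(parallel)
myMclFunc<-function(B,size,cores=4)
{
    mclapply(1:B,
             function(i) solve(matrix(runif(size^2),size,size)),
             mc.cores=cores)
}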